Robert made a key point about all those email alerts. Too many emails just means too much noise and not enough signal. Too few, and it becomes a game of "did the email server fail, or was there no output?" You can't win there.
I use custom scripts with Zabbix to mitigate some of this. No email means all is well, but Zabbix does receive the periodic "all is still ok" updates. If no update of any kind (good or bad) has been received in a while, that is flagged as an error condition and triggers an alert. It likely means the script stopped running correctly for whatever reason, so I investigate that. The email server could still be the one that failed, so I keep the Zabbix dashboard open in a browser full-time and glance at it throughout the day as I do other work a few tabs over.
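To illustrate the "silence is an error" idea, here is a minimal sketch in Python. It is not my actual Zabbix setup; the function name `heartbeat_state` and the 30-minute threshold are made up for the example. The point is only that a missing update is treated as its own alert condition, separate from a bad update.

```python
from datetime import datetime, timedelta


def heartbeat_state(last_update, last_status_ok, now,
                    max_silence=timedelta(minutes=30)):
    """Classify a check based on its most recent update.

    last_update: datetime of the last report received, or None if never seen.
    last_status_ok: True if that last report said "all is still ok".
    max_silence: hypothetical threshold; tune it to the script's schedule.
    """
    # No report at all, or nothing within the allowed window: the script
    # (or its transport) has probably stopped working.
    if last_update is None or now - last_update > max_silence:
        return "stale"
    # A report arrived in time; pass its good/bad verdict through.
    return "ok" if last_status_ok else "problem"


now = datetime(2024, 1, 1, 12, 0)
print(heartbeat_state(now - timedelta(minutes=5), True, now))   # recent, good
print(heartbeat_state(now - timedelta(hours=2), True, now))     # too quiet
print(heartbeat_state(now - timedelta(minutes=5), False, now))  # recent, bad
```

Both "stale" and "problem" would raise an alert; only "ok" stays quiet.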
You also settle into a routine. I expect to receive a certain number of daily emails from Veeam as it runs its jobs. I've made a habit of looking for X emails in the Veeam folder when I come to work and glancing to make sure they all say "[Success]".
If there are fewer than X emails in the folder, investigate if the backup job is just running late or something else happened. Those are rare events and don't take much time.
The ones with warning or failure are coloured differently in the email client so they stand out. Those get actioned. I do this for a few critical parts that I must be assured ran each day. It takes less than a minute. Just part of the routine sweeps.
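If you ever wanted to script that morning sweep instead of eyeballing it, the logic is roughly the following. This is a sketch only: the function name `review_backup_mail` is hypothetical, the subject lines are invented, and it assumes each report's subject starts with its status tag. Fetching the subjects from the mail folder is left out.

```python
def review_backup_mail(subjects, expected_count):
    """Summarise one day's backup report subjects.

    subjects: list of subject lines from the backup folder (assumed to
              start with a status tag such as "[Success]").
    expected_count: the X reports normally received each day.
    """
    # Anything that does not start with the success tag needs a look.
    flagged = [s for s in subjects if not s.startswith("[Success]")]
    return {
        "ok": len(subjects) - len(flagged),
        "flagged": flagged,
        # Fewer reports than expected: a job is late or never ran.
        "missing": max(expected_count - len(subjects), 0),
    }


# Invented example: three reports expected, two arrived, one with a warning.
summary = review_backup_mail(
    ["[Success] Job A", "[Warning] Job B"], expected_count=3
)
print(summary)
```

A non-empty "flagged" list or a non-zero "missing" count is the cue to go and investigate.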
Automation is key, though. If you have a few servers, or if you get paid by the hour, you can afford to do it manually. If I were to patch each server I manage manually, it would become my full-time job. I have too many to worry about. I can't be stuck in Update Mode 5 days a week and evenings. By the time I caught up, it would be "next month" and I'd have to start all over again. I'd never get any other work done.
I used to be against the whole idea of letting servers update automatically, but it came to a point where patching manually wasn't a feasible option anymore. I can't hire a person whose only job is to run updates all day long, year in and year out, so something had to give on that front. It does mean things can break and stay broken for longer before anyone notices. You just add your monitoring tools to help spot it sooner. Sometimes it's an unwinnable situation. We are all forced to do more work with fewer people, so we have to work smarter, and we're only going to work as smart as the bugs in our tools and scripts allow us to.
Edit: clean up and fix some typos.