Nobody wants to work monitoring alert tickets. It's a theme I've seen repeated ad nauseam for as long as I've been working in technology. The tickets that get created on monitoring boards are repetitive and boring, the kind of thing people only get to when everything else is done for the day. And let's be honest with one another, by the time everything else is complete, nobody is feeling overly enthusiastic about investigating the cause of a low disk space warning.

The problem is, when left alone and unaddressed, these alerts have a tendency to bite us in the ass. Hard. Sure, sometimes it's John Smith filling up his OS drive for the 8th time this year with personal photos he knows he shouldn't have on his company laptop. But other times, these alerts are indicative of real problems that need to be addressed promptly before they escalate. The challenge is that it can be quite hard to distinguish between the two without proper investigation.

Unimportant Until It's Urgent:

Prioritization is hard, particularly as a System Administrator. We have about a thousand things vying for our attention at any given time, and it's not always clear what needs to be dealt with first. End users with problems that should be going to the help desk, bosses who refuse to send requests through proper channels, C-suite execs who urgently need their sports betting website allowed in the company DNS filter, and don't forget that pet project you've been trying to find time to finally start working on. We've got a lot on our plates at any given time, and it's hard when so many of the tasks coming across our desks are deemed time-sensitive. If everything is time-sensitive, then nothing is.

All of that is to say, finding time to investigate an alert that OneDrive isn't properly syncing can be a challenge. Maybe that user just reset their password and their cached credentials haven't updated yet. Or maybe the user's been working on a critical project for the company and, without knowing it, doesn't have a single copy saved anywhere other than their laptop with a silently failing SSD. The problem is, you don't know. And without looking into the alert, you can't know. Alerts like this don't always feel important, and it can be easy to say you'll look at it tomorrow. But what if tomorrow comes and that silently failing SSD finally kicks the bucket? Now you've got a whole team of people asking why you can't recover that critical project and why it wasn't backed up. And if you're honest with yourself, the questions they're asking are fair. That document should have been backed up! It's deceptively easy to point fingers at the countless contributing factors; after all, nothing happens in a vacuum. But at the end of the day, your boss isn't thinking about contributing factors; they're simply pissed at you.

Getting Buy-In:

Obviously, I'm painting something of a worst case scenario for what's likely a fairly benign alert. But that's really my point: it's only benign until it's suddenly not. There's likely a good reason somebody set up alerting for when OneDrive sync isn't working properly. In fact, those alerts were probably born out of a problem not so dissimilar from what I just described. Someone probably lost an important document, and the result after the dust cleared was enabling alerting to prevent that sort of issue from happening again. So why is it still so easy to push that alert off until tomorrow?!

It starts with getting buy-in from the people around you. A culture where everyone understands the extraordinary value in preventing issues before they actually become a problem is a hard one to cultivate. If you're the lone crusader taking on a backlog of alerts, let's face it, you won't get that far without support. So how do we convey the urgency of something that, right now, may not actually seem urgent? Send them to this blog post! I jest, but in reality it depends on the audience. Is your company focused on data and KPIs? Put together a display showing how many hours of reactive work could have been prevented with properly managed alerting. Is a particular recurring problem plaguing the accounting department? Show them the value of monitoring by setting up an alert on the QuickBooks database service that's always having problems. Now you can be aware of, and maybe even working to resolve, the issue before a ticket ever comes in. Demonstrate the value of being on top of things in whatever way resonates best with the powers that be.
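To make that service check concrete, here's a minimal sketch in Python. It assumes the psutil package is available and that the QuickBooks Database Manager service is named QuickBooksDB30 (the name varies by QuickBooks version); in practice, your RMM's built-in service monitor is probably the better place to do this.

```python
# Minimal service-check sketch. Assumes the psutil package and a
# QuickBooks Database Manager service named "QuickBooksDB30" -- check
# services.msc on the host for the real name for your version.
import psutil

SERVICE_NAME = "QuickBooksDB30"  # assumption; varies by QuickBooks release


def quickbooks_db_is_healthy() -> bool:
    """Return True if the QuickBooks database service is running."""
    try:
        service = psutil.win_service_get(SERVICE_NAME)
    except psutil.NoSuchProcess:
        return False  # service isn't installed or the name is wrong
    return service.status() == "running"


if __name__ == "__main__":
    if not quickbooks_db_is_healthy():
        # Hand this off however your stack alerts: an RMM monitor failure,
        # a webhook to the ticketing system, an email, etc.
        print("ALERT: QuickBooks database service is not running")
```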

How Automation Plays Into All This:

If we monitor everything and there are tens, hundreds, or even thousands of alerts to slog through, it's hard to pick out the real problems from the noise. That's not to say don't monitor anything, of course, but there is value in being strategic about what we monitor, which of those monitors can generate alerts, and when. Better yet, we can leverage the magic of automation to supercharge all of this. The only thing better than resolving an alert for a potential problem is a problem that silently solves itself and never bothers you to begin with. Self-healing is a key factor in monitoring and alerting, and the more we can take advantage of automation to facilitate that, the easier everything will be.
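As a rough sketch of that self-heal-first pattern (the function names here are illustrative stand-ins, not any particular RMM's API): check the condition, attempt the remediation, and only escalate to a human if the problem survives the fix.

```python
# Illustrative self-heal-before-alert pattern. check_health(),
# attempt_self_heal(), and open_alert() are stand-ins for whatever your
# monitoring/RMM stack actually exposes.
import time


def handle_monitor(check_health, attempt_self_heal, open_alert,
                   retry_delay_seconds=300):
    """Only open an alert if the self-heal doesn't clear the problem."""
    if check_health():
        return  # nothing to do

    attempt_self_heal()
    time.sleep(retry_delay_seconds)  # give the remediation time to land

    if not check_health():
        # The fix didn't take -- this one actually needs a human.
        open_alert("Self-heal failed; manual investigation required")
```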

To borrow from a real-life example, we used to get a few tickets a day created on our monitoring board regarding problematic EDR agents. At least 70 percent of the time, the only thing that needed to happen was a reboot of the endpoint, and the alert would clear. After enough manually scheduled ad-hoc reboot windows for devices with failed EDR agents, I started to dig into how we were monitoring for these issues to begin with. I learned that while the monitor for the EDR agent's status did have the ability to self-heal, it couldn't do so on its own in most scenarios because the fix required a device reboot. So, I reconfigured our EDR agent monitoring and added the ability for our RMM to automatically place a device with a problematic agent into an overnight reboot window. From there, I simply adjusted the alerting thresholds: instead of opening an alert if the monitor was unhealthy for more than an hour, it now waits 24 hours. These two relatively simple changes drastically reduced the number of EDR agent alerts we receive. The best part is, when I see an EDR alert come in now, I know there really is a problem with that agent that will require some investigation.
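The actual change lived in the RMM's monitor configuration rather than in a script, but expressed as code, the reworked logic looks roughly like this (schedule_overnight_reboot and open_ticket are hypothetical stand-ins, not real RMM functions):

```python
# Sketch of the reworked EDR monitor logic. schedule_overnight_reboot()
# and open_ticket() are hypothetical stand-ins for RMM features; the real
# change was configuration, not code.
from datetime import datetime, timedelta

ALERT_THRESHOLD = timedelta(hours=24)  # was 1 hour before the change


def evaluate_edr_monitor(device, unhealthy_since,
                         schedule_overnight_reboot, open_ticket):
    """Try a reboot first; only open a ticket after 24 unhealthy hours."""
    if unhealthy_since is None:
        return  # agent is healthy

    if not device.get("reboot_scheduled"):
        # First response: queue the device for the overnight reboot window.
        schedule_overnight_reboot(device["id"])
        device["reboot_scheduled"] = True

    if datetime.now() - unhealthy_since >= ALERT_THRESHOLD:
        # Still unhealthy a day later, so the reboot didn't fix it --
        # this agent genuinely needs a human to look at it.
        open_ticket(device["id"], "EDR agent unhealthy for 24+ hours")
```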

For some alerts, though, self-healing doesn't work. In rare cases, human intervention is always required, which takes self-healing off the table altogether. Even then, we can still make use of automation to improve things.

Borrowing from another real-life example, we typically have a few low disk space alerts generate per day. Our RMM is already set up to run a self-heal automation that executes a disk space cleanup, but sometimes that's not enough and a ticket still gets generated. In those scenarios, we make use of WizTree to get a visual understanding of what is using up space on a given disk. We have an automation that silently generates a report on the device and emails us a download link to the resulting CSV file. The problem is those reports can take a while to generate, and time spent waiting adds up quickly. So, I created an automation that looks for new low disk space tickets on our monitoring board. If it finds any, it uses the data on the alert to interface with our RMM's API and kicks off the automation that generates the WizTree report. Once it's complete, the report gets automatically uploaded to the ticket. Now, instead of a technician wasting valuable time waiting for the report to generate, it's already waiting for them right there on the ticket.
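Here's a rough sketch of that ticket-watcher, assuming a generic REST-style PSA and RMM; every endpoint path, field name, and credential below is made up for illustration and will differ from whatever your vendors actually expose.

```python
# Sketch of the ticket-watcher automation. All endpoint paths, field
# names, and credentials here are hypothetical -- they are not any
# specific PSA/RMM vendor's API.
import requests

PSA_BASE = "https://psa.example.com/api"       # hypothetical ticketing/PSA API
RMM_BASE = "https://rmm.example.com/api"       # hypothetical RMM API
HEADERS = {"Authorization": "Bearer <token>"}  # however your vendor authenticates


def find_new_low_disk_tickets():
    """Pull unprocessed low disk space tickets from the monitoring board."""
    resp = requests.get(
        f"{PSA_BASE}/boards/monitoring/tickets",
        params={"summary_contains": "Low Disk Space", "status": "New"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def kick_off_wiztree_report(device_id):
    """Ask the RMM to run the existing WizTree report automation on the device."""
    resp = requests.post(
        f"{RMM_BASE}/devices/{device_id}/automations/wiztree-report/run",
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    for ticket in find_new_low_disk_tickets():
        # The alert payload on the ticket carries the device identifier the
        # RMM needs; once the report finishes, a separate step attaches the
        # resulting CSV back onto the ticket.
        device_id = ticket["alert"]["device_id"]
        kick_off_wiztree_report(device_id)
```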

Where Does This Leave Us:

Realistically, still, nobody wants to work monitoring alert tickets. But when you have buy-in from leadership on their importance and put in the effort to automate away much of the noise, the situation starts to feel a lot more bearable. When you know an alert was only generated because a self-heal couldn't automatically resolve it, it's a lot harder to pretend that alert isn't important. Monitoring and alerting are key tools in a good System Administrator's toolkit. Manage them properly and you'll be able to prevent countless problems before anyone notices anything is amiss to begin with.