Monitoring Microsoft Entra Connect

Quick Note on Naming

While the service has been renamed to Microsoft Entra Connect, I'm going to refer to it in this post as Azure AD Connect, or just AD Connect. At the time of writing, Microsoft's documentation uses the new name, but the actual download page, the application itself, and the Windows event log still refer to it as AD Connect. Gotta love Microsoft and their naming eccentricities... If you're reading this blog post, I'm sure you feel my pain. Anyways, I'll save the ranting about Microsoft for its own post. Let's get on to monitoring AD Connect.

A Preexisting Setup

A few weeks ago, a pesky RMM alert kept popping up in our monitoring system at work for a failed AD Connect service on one of our clients' systems. The first time it alerted, I did some basic troubleshooting, mainly stopping and restarting the services, and that seemed to resolve the RMM alert. The trouble was, the alert kept coming back and then self-healing if not addressed quickly enough. After a full reboot of the server in question, the alert was still flickering and I knew it was time to dig deeper. I've always found that it's easier to diagnose issues like this with the knowledge of how things work under the hood, so I got to work dissecting how the monitor was set up.

As with most of the custom monitors in our RMM, I opened up the Automation Policy (AMP) to find a bunch of PowerShell under the hood; I expected that much. What I didn't expect was how incredibly wordy the code was. While I've been with my current company for a number of years now, there are plenty of things that have been in place for longer than I've been on the team. My first question was who had written this script, as the only other person on the team who works on the RMM is a highly experienced PowerShell scripter who wouldn't write code like this. I chatted with this colleague later and they confirmed that they had not written that script and didn't remember where it had come from.

The Original Script

While I won't copy the full ~50 lines of the original script here, I'll add in an excerpt that will give a good idea of what was going on. As an aside, I do apologize for the image in lieu of raw text. Ghost, this site's CMS, doesn't have a great way of embedding formatted PowerShell into a post. Not that I'd recommend copying and pasting this code anyways.

Excerpt from the old script. Author unknown.

If you've worked in PowerShell before, I suspect you can easily pick out a few different areas for improvement within the short bit of code in the image above. For the sake of whoever wrote this script, I won't go into every pain point, but I will highlight the major bug that necessitated the script be rewritten. For each of the eleven strings in the $eventIDs array, a new Get-EventLog query is run to see if the latest event log entry's ID matches the current $eventID being checked. From there, eleven if-statements are evaluated on the data returned from Get-EventLog.

At first glance, this is a slightly wordy way of getting the job done. However, the major bug is that Get-EventLog is only pulling the single most recent entry from the source 'ADSync'. It's pulling the same entry eleven times in a row and evaluating whether it matches any of the known-bad IDs. This is why the monitor was flickering: the script would only alert if the single most recent entry in the source happened to match one of the IDs in the array, and it would self-heal as soon as any newer entry was logged.
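To make the bug concrete, here is a minimal sketch of the pattern. It is not the original code. The log entries are hypothetical sample objects standing in for Get-EventLog output, the ID 9999 and the benign ID 1000 are made-up placeholders, and only 6100 comes from the real list.

```powershell
# Hypothetical sample entries, newest first, standing in for the 'ADSync' source.
$log = @(
    [pscustomobject]@{ EventID = 1000; TimeGenerated = [datetime]'2024-01-01T12:05:00' }  # placeholder benign ID
    [pscustomobject]@{ EventID = 6100; TimeGenerated = [datetime]'2024-01-01T12:00:00' }  # known-bad ID
)
$eventIDs = 6100, 9999   # 9999 is a made-up placeholder

# Old pattern: every pass of the loop looks at the same single newest entry.
$oldAlert = $false
foreach ($id in $eventIDs) {
    $latest = $log | Select-Object -First 1   # stands in for a newest-entry-only query
    if ($latest.EventID -eq $id) { $oldAlert = $true }
}

# Checking every entry in the set instead of only the newest one.
$hits = @($log | Where-Object { $_.EventID -in $eventIDs })

"Newest-entry check alerts: $oldAlert"
"Entries matching a known-bad ID: $($hits.Count)"
```

With this sample data, the newest-entry check stays quiet because a benign entry was logged after the error, while the full scan still finds the 6100 entry.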

How Can This Be Improved?

Initially, I began making some changes to clean up some of the repeated if-statements, but after reading closer and discovering the major bug, I realized I'd need to take the script back to the drawing board. The first question I answered was whether querying the event log was still the best approach. After some research, there didn't seem to be a better way to monitor the status of AD Connect from the AD Connect server itself, which was the goal of this monitor. The next question was how to query the event log to return only the entries from the latest completed synchronization cycle. This was important to ensure that the monitor could self-heal if the most recent completed sync cycle was free of any errors. It also prevents premature self-heals if the script happens to run while a synchronization is in progress and any potential errors haven't yet been thrown.

To determine when the last full sync cycle was completed, the script searches the output of the command Get-ADSyncRunProfileResult for the most recent result whose RunProfileName is 'Export'. This was probably the most difficult part of the new script to plan out, as that command doesn't seem to be terribly well known and determining how to get the information needed took some trial and error. Once I had the time stamp of when the last full synchronization cycle completed, it was fairly easy to get the associated event log entries and search them for any of the known-bad IDs. If speed were a bigger concern, the foreach loop could be broken on the first error ID found, but for our purposes it was more useful to list out all errors found to aid in troubleshooting.
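The two steps described above can be sketched roughly as follows. This is not the published script, just a minimal illustration that runs without the ADSync module. The function names are hypothetical, the sample objects stand in for Get-ADSyncRunProfileResult and Get-EventLog output, and the property names (RunProfileName, IsRunComplete, EndDate) reflect my understanding of what that cmdlet returns, so check them with Get-Member on your own server. Using the export's EndDate as the cutoff is also an assumption about how the window is defined.

```powershell
function Get-LastCompletedExport {
    param([object[]]$RunResult)
    # Most recent finished run whose profile name is 'Export'.
    $RunResult |
        Where-Object { $_.RunProfileName -eq 'Export' -and $_.IsRunComplete } |
        Sort-Object -Property EndDate -Descending |
        Select-Object -First 1
}

function Find-KnownBadEvent {
    param([object[]]$EventEntry, [datetime]$Since, [int[]]$KnownBadId)
    # No early break: every match is returned to aid troubleshooting.
    foreach ($entry in $EventEntry) {
        if ($entry.TimeGenerated -ge $Since -and $entry.EventID -in $KnownBadId) {
            $entry
        }
    }
}

# Hypothetical sample data; on a real server this comes from the cmdlets.
$t0 = [datetime]'2024-01-01T12:00:00'
$runs = @(
    [pscustomobject]@{ RunProfileName = 'Export';       IsRunComplete = $true;  EndDate = $t0.AddMinutes(-30) }
    [pscustomobject]@{ RunProfileName = 'Delta Import'; IsRunComplete = $true;  EndDate = $t0.AddMinutes(-2) }
    [pscustomobject]@{ RunProfileName = 'Export';       IsRunComplete = $true;  EndDate = $t0 }
    [pscustomobject]@{ RunProfileName = 'Export';       IsRunComplete = $false; EndDate = $null }  # still running
)
$events = @(
    [pscustomobject]@{ EventID = 6100; TimeGenerated = $t0.AddMinutes(-10) }  # before the cutoff, ignored
    [pscustomobject]@{ EventID = 6100; TimeGenerated = $t0.AddMinutes(1) }
    [pscustomobject]@{ EventID = 1000; TimeGenerated = $t0.AddMinutes(2) }    # placeholder benign ID
)

$lastExport = Get-LastCompletedExport -RunResult $runs
$found = @(Find-KnownBadEvent -EventEntry $events -Since $lastExport.EndDate -KnownBadId 6100)
"Known-bad entries since last export: $($found.Count)"
```

On the AD Connect server, the sample arrays would be replaced with the output of Get-ADSyncRunProfileResult and a Get-EventLog query using -Source 'ADSync' (against the Application log, as far as I know) with -After set to the cutoff. Filtering on IsRunComplete is what keeps an in-progress sync from being treated as the latest cycle.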

Again, apologies for the image in lieu of raw text. As someone who reads plenty of articles, I know it's annoying when you can't copy and paste code. There is a link to the most recent version of the script on my GitHub page at the bottom of the post.

Brief self-promotion: If you're curious about the module SAToolkit and/or the Write-Syslog function, please check out the blog post I wrote about the development of that module; it's something I'm truly proud of: https://gtio.tech/developing-a-powershell-module. Okay, self-promotion complete, appreciate you sticking with me there.

Takeaways

While I believe that my version of this script is a big improvement in both efficiency and accuracy, it's not perfect. I really wish there was a better, ideally built-in, way to monitor AD Connect on the server other than querying the Windows event log. Public information on the event log IDs, and even on the event log source 'ADSync', is scarce at best and proved inaccurate in my research. While I know at least one of the known-bad IDs I inherited from the prior script is still valid (6100 was what was causing the old monitor to alert), I'm not convinced this list of IDs is complete or entirely accurate. Since I don't know who authored the original script, the information on where those event log IDs came from isn't available to me.

As of about a week ago, the new version of the script is live in our RMM and so far it hasn't alerted my team to any issues. Going forward, there are definitely some improvements that could be made. The first thing I plan to focus on is ensuring that the list of event IDs is accurate and complete. I believe there is no such thing as a truly complete script and I'm sure there are fellow administrators out there whose additions would be valuable.

With all of that said, please feel free to use my version of the script in your own environment (with credit) and feel free to adapt it to your needs. If you make any improvements, please open a pull request to the GitHub source and I'll be sure to integrate your additions.