One of the key value propositions for Operations Management Suite is that it gives you real-time operational insight across your datacenters and public clouds through analysis of logs generated by your apps, services, and devices. Using log analysis, you can discover incidents in your systems that you care about and want to address. When an incident occurs you typically want to quickly and reliably fix the underlying issue to restore the system to optimum health. This is where OMS Automation becomes a valuable tool in your management strategy.
Automation is a service within OMS that enables both process and configuration automation. Configuration automation, in the form of Desired State Configuration (DSC), has been covered in a previous post. In this post, we will talk about process automation and the role that automated workflows can play in managing your systems.
Process automation is advantageous in that it allows you to codify your management processes in the form of runbooks and then run these runbooks repeatedly, knowing that the same process will be run each time. This ultimately saves you time and money and increases reliability. It also allows processes to be run automatically in response to events, alerts, or on schedules, freeing up personnel to work on other pressing issues that provide business value.
With Automation runbooks you can integrate with any system that has a public internet API. Runbooks based on PowerShell can take advantage of the large, public library of integration modules that enable easy process integration with all of the apps and services that you use in your datacenters and clouds. For instance, integration modules exist for Azure, AWS, System Center, and most other common systems.
OMS alert remediation with Automation
OMS Automation can be used to automatically remediate issues that are found through log analysis. Automation runbooks can also be part of a larger troubleshooting and remediation process that includes gathering more data on an issue, communicating with stakeholders and incident-management systems, and, after approval, applying a remediation for the problem.
OMS Automation is currently integrated into OMS through alerts on saved log searches. When you create an alert that will trigger based on a data pattern found in a log search, you can configure an Automation runbook to run when the alert is triggered. With this capability, you can create runbooks that will automatically do work for you – including quickly remediating the issue to get your system working correctly again.
The process of setting up automated alert remediation involves these steps:
- Determine the scenario you want to manage.
- Ensure that the machines you want to manage have the Microsoft Monitoring Agent (MMA) installed and a hybrid runbook worker configured.
- Ensure that the essential logs from the managed machines are being collected in OMS.
- Create a log search query that will discover the particular issue.
- Ensure that the Automation solution is added to your OMS workspace from the Solutions Gallery.
- Create an Automation runbook that will remediate the issue.
- Create an alert that will trigger the runbook whenever the log search finds the issue.
Determine the scenario you want to manage
The first step is to determine the scenario that you want to monitor and remediate. For example, say you have an app that has a memory leak that slowly fills memory and degrades performance of the machine it is on. In this case, the scenario is that you want to detect when the machine crosses a memory threshold and then restart the app or reboot the machine to reset the memory consumption. In another example, you have a service that doesn’t automatically restart after the machine it is on reboots after a system update. In this case, the scenario is that you want to detect when the service is stopped and then restart it.
Once you understand the scenario you are trying to monitor and remediate, you can proceed to the next step: ensuring that the correct logs are being collected by OMS so that you can detect the issue.
Ensure that the managed machines have the MMA agent and hybrid runbook worker group
If you are managing machines with OMS, you have already configured your Connected Sources and installed the MMA agent on each machine so that the logs can be uploaded to OMS. The next important step is to configure the Automation hybrid runbook worker group on each machine. When you already have the MMA agent installed on a machine, you just need to run a couple of simple PowerShell commands on the machine to get the hybrid runbook worker group ready to use.
Having an Automation hybrid runbook worker group installed on each managed machine is key to the ability to easily manage the machine using Automation runbooks. During the hybrid worker group installation process a secure communication channel is established between the managed machine and the Automation service. This means that once the hybrid worker group is installed and registered, the runbooks you have in the Automation account can run within the hybrid worker and perform actions against the managed machine without any further authorization. As we will see later, this will be used to automate the remediation of issues discovered by log analytics.
Ensure that essential logs are being collected
In the OMS portal, go to the Settings page and select the Data tab. On this page you can select the logs that will be imported to OMS for analysis. For example, for the memory leak scenario you would want to ensure that, in the Windows Performance Counters area, the “Memory(*)\% Committed Bytes In Use” counter is collected. For the stopped service scenario you would want to ensure that, in the Windows Event Logs area, the “System” logs are being collected.
Create the log search query
In OMS, the analysis of logs to find incidents involves creating search queries against the log data. The details of a query are determined by the information pattern that is being looked for. You can learn all sorts of details about constructing search queries in this blog post series and in this help article.
For example, for the memory leak scenario, to detect the percent memory consumption you can do a search like this:
Type=Perf ObjectName=Memory CounterName="% Committed Bytes In Use" Computer="computer1.contoso.com"
In the stopped service scenario, to detect when the service is stopped you can do a search like this:
Type=Event EventLog=System Source="Service Control Manager" Computer="computer2.contoso.com" "stopped" AND "SomeServiceDisplayName"
In these examples, the queries are tightly scoped to find a particular state on a particular computer. For your scenarios, you will need to determine how tightly scoped you want your search query to be. In general, you will probably get more search results if you create a query that is more general. For example, in the above queries if you remove the Computer expression you will get search results across all of your computers, not just one. Keep this in mind, for it will be important when you create your alert-remediation runbook – it will determine how general or specific the runbook logic needs to be.
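To make the scoping trade-off concrete, here is a minimal Python sketch that composes search strings in the same style as the two example queries. The helper build_search_query is hypothetical and is not part of OMS. It only shows how dropping the Computer expression turns a single-machine query into one that spans every machine.

```python
def build_search_query(filters, keywords=(), computer=None):
    """Compose a search string in the style of the example queries.

    Hypothetical helper, not an OMS API. Passing computer=None leaves the
    query unscoped, so it would match every machine that reports the data.
    """
    parts = []
    for field, value in filters.items():
        text = str(value)
        # Quote any value that is not a single alphanumeric token.
        parts.append(f"{field}={text}" if text.isalnum() else f'{field}="{text}"')
    if computer:
        parts.append(f'Computer="{computer}"')
    if keywords:
        parts.append(" AND ".join(f'"{word}"' for word in keywords))
    return " ".join(parts)


memory_filters = {
    "Type": "Perf",
    "ObjectName": "Memory",
    "CounterName": "% Committed Bytes In Use",
}

# Tightly scoped: one counter on one computer.
scoped = build_search_query(memory_filters, computer="computer1.contoso.com")
# General: the same counter across all computers.
broad = build_search_query(memory_filters)

print(scoped)
print(broad)
```

A runbook attached to the broad query has to read the computer name from each search result, whereas a runbook attached to the scoped query could assume a single target machine.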
Add the Automation solution to your OMS workspace
In order to use Automation runbooks in alert remediation, you first need to do two things:
- Create an Azure Automation account in your Azure subscription.
- From the OMS Solutions Gallery, add the Automation solution to your OMS workspace, and configure the solution to integrate with the Automation account.
Once you have your Automation account you can start creating runbooks, and once the Automation solution is added to your OMS workspace you can start configuring your runbooks to run from alerts.
Create a remediation runbook
The runbook that you create in your Automation account will contain the logic for remediating or acting on the alert condition. You need to create and publish this runbook before you create the alert, because it is during alert creation that you associate a runbook with the alert.
In Automation there are several types of runbooks you can create: PowerShell, Graphical, and PowerShell Workflow. Unless you need advanced features, like checkpointing, we suggest that you create either Graphical or PowerShell runbooks.
One of the key tasks that your runbook will need to do is to parse the data that is passed to it in input parameters. In particular, for runbooks started by OMS alerts, the search results data is passed to the runbook in a parameter called $WebhookData. You will need to parse this object to get the properties of each search result that triggered the alert. Details on the schema of the $WebhookData object and an example PowerShell script to parse it can be found in this blog post.
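The parsing step can be illustrated with a short Python sketch. This is not the PowerShell script referenced above, and the payload shape is an assumption: it supposes that $WebhookData carries a RequestBody JSON string whose SearchResults.value array holds one record per search result. Check the schema documentation for the exact field names before relying on them.

```python
import json


def extract_search_results(webhook_data):
    """Return the search-result records carried by an alert webhook.

    Assumes (not confirmed by this post) that the records live under
    RequestBody -> SearchResults -> value.
    """
    body = webhook_data.get("RequestBody") or "{}"
    # The body normally arrives as a JSON string; tolerate a parsed dict too.
    payload = json.loads(body) if isinstance(body, str) else body
    return payload.get("SearchResults", {}).get("value", [])


# Hypothetical payload, shaped after the assumed schema.
sample_webhook_data = {
    "WebhookName": "ExampleAlertWebhook",
    "RequestBody": json.dumps({
        "SearchResults": {
            "value": [
                {
                    "Computer": "computer2.contoso.com",
                    "EventLog": "System",
                    "Source": "Service Control Manager",
                }
            ]
        }
    }),
}

for record in extract_search_results(sample_webhook_data):
    print(record["Computer"], record["Source"])
```

Returning an empty list when the body is missing lets the runbook exit quietly if it is ever started without alert data.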
Below are examples of graphical runbooks that handle the scenario of a stopped service. In this case, the OMS alert triggers the HandleStoppedServiceAlert runbook when a stopped service is found in the log search. The HandleStoppedServiceAlert runbook then parses the $WebhookData to determine which service is stopped and on which managed machine. It then starts a child runbook, StartServiceOnHybrid, on the hybrid worker group on the managed machine, and the child runbook restarts the stopped service. The parent runbook then gets the results from the child runbook and sends email to interested parties with notification of the remediation results.
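The parent and child flow described above can be sketched in Python as follows. This is an illustration of the control flow only, not the graphical runbooks themselves. The start_service and notify callables are hypothetical stand-ins for the StartServiceOnHybrid child runbook and the email step, and both the RenderedDescription field name and the event wording matched by the pattern are assumptions.

```python
import re

# Assumed wording of the Service Control Manager "service stopped" event.
STOPPED_PATTERN = re.compile(
    r"The (?P<service>.+?) service entered the stopped state"
)


def handle_stopped_service_alert(records, start_service, notify):
    """Restart each stopped service found in the records, then report.

    start_service(computer, service) stands in for the child runbook and
    returns True on success; notify(outcomes) stands in for the email step.
    """
    outcomes = []
    for record in records:
        match = STOPPED_PATTERN.search(record.get("RenderedDescription", ""))
        if match is None:
            continue  # not a "service stopped" event, nothing to remediate
        computer = record.get("Computer", "unknown")
        service = match.group("service")
        restarted = start_service(computer, service)
        outcomes.append(
            {"computer": computer, "service": service, "restarted": restarted}
        )
    notify(outcomes)
    return outcomes


# Usage with fake collaborators, so nothing is actually restarted or emailed.
sent = []
records = [
    {
        "Computer": "computer2.contoso.com",
        "RenderedDescription": "The SomeServiceDisplayName service entered the stopped state.",
    },
    {"Computer": "computer2.contoso.com", "RenderedDescription": "Unrelated event."},
]
result = handle_stopped_service_alert(
    records,
    start_service=lambda computer, service: True,
    notify=sent.append,
)
print(result)
```

Passing the restart and notification steps in as callables keeps the parsing logic testable without touching a real machine, much as the parent runbook delegates the restart to a child runbook.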
Create an OMS log search alert
Now that everything is ready to go, the final step is to create the alert. This is a relatively simple operation in the OMS portal. Once you have set the rules that govern when the alert will trigger you will finally select a runbook to start. This will be the runbook that you created in your Automation account to handle this alert.
When you save the new alert, a webhook is created for the runbook. A webhook is just a special URL that can be used to start the runbook. Each time the OMS alert triggers, it sends a POST request to the webhook URL to start the runbook and passes the search results to the runbook in the $WebhookData object.
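For readers who want to picture that request, here is a minimal Python sketch that builds (but does not send) a comparable POST. The URL is a placeholder, and the body layout with SearchResults.value is an assumption for illustration rather than the exact payload OMS produces.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, search_results):
    """Build a POST request carrying search results as a JSON body.

    The body layout is an assumed shape, not the documented OMS payload.
    """
    body = json.dumps({"SearchResults": {"value": search_results}}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: a real webhook URL embeds a secret token and should be
# treated like a password.
request = build_webhook_request(
    "https://example.invalid/webhooks?token=PLACEHOLDER",
    [{"Computer": "computer2.contoso.com"}],
)
print(request.get_method(), request.full_url)
```

Calling urllib.request.urlopen(request) would actually send it. That is also a convenient way to test a remediation runbook with a hand-built payload before wiring it to a live alert.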
How it all works together – one example
To wrap this post up, the diagram below illustrates how the pieces all work together in the “stopped service” scenario. It is actually quite simple. The managed machine posts logs to OMS (via the MMA agent), and an OMS search query analyzes the logs to find an incident. The resulting alert triggers a runbook in the Automation service, which runs a remediation runbook on the hybrid runbook worker on the managed machine. That runbook resolves the incident, and you receive a notification. And all while you were working on something else! If this all looks good to you, check out our free trial.