The Tool Sprawl Problem in Monitoring

One of the biggest KPIs in the DevOps space is monitoring. There are many tools that help an organization complete its monitoring picture, but no single tool does everything, so most organizations combine several. Mashing tools together often creates a problem of its own: the tool sprawl problem.
In modern computing, what matters most is not how much data you collect and report, or how efficient or durable your monitoring solution is. Sure, those are all important considerations, but it’s how effective and useful your monitoring is that makes the difference: how much value it creates for the business, and how well the data can be exploited to identify and resolve critical issues. Monitoring is never a completed effort.
It evolves. It is enhanced by tools and by integrations. Often enough, the journey to improve monitoring is what creates and accentuates the tool sprawl problem. In this article, I’d like to examine how monitoring tool sprawl can become a serious issue for modern, engineering-driven companies.

Monitoring Challenges
The task of monitoring modern IT environments is too complex to properly handle without tools. The days of allowing logs to sit on servers and fishing through them to find answers are long gone. Alerting on an operating system issue and manually clearing out all the noise from old vendor solutions for sysadmins (think HP, Dell, IBM) no longer scales in the world of cloud computing.
Luckily, there are plenty of modern tools to solve modern issues. But like any type of software, every monitoring tool has strengths and weaknesses of its own. Organizations will often patch together multiple monitoring tools based on their strengths and just deal with the sprawl.
So what are the modern problems to solve and tools to solve them?

Logs
Log data is considered an extremely valuable data source for monitoring and troubleshooting both applications and the infrastructure they are installed on. Most log management tools on the market provide analysis capabilities. Some provide advanced analytics such as machine learning and anomaly detection. Most of these tools now include plugins and integrations with cloud vendors to provide greater insight into cloud-based applications.
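As a rough illustration of the kind of analysis such tools run over log data, the sketch below counts error lines per minute and flags minutes whose count sits far above the average. The log format, the `ERROR` level name, and the threshold are assumptions made up for this example; commercial and open source tools use far more sophisticated models.

```python
import re
import statistics
from collections import Counter

# Assumed (hypothetical) log format: "2019-05-01T12:03:44Z ERROR message..."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}Z\s+(\w+)\s")

def errors_per_minute(lines):
    """Count ERROR lines, keyed by the minute in which they were logged."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "ERROR":
            counts[match.group(1)] += 1
    return counts

def anomalous_minutes(counts, threshold=2.0):
    """Return minutes whose count is more than `threshold` standard
    deviations above the mean of all minutes that logged an error."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return sorted(m for m, c in counts.items() if (c - mean) / stdev > threshold)
```

Note that minutes with no errors at all never appear in the counts, so the baseline here is "minutes that had at least one error". A real pipeline would usually fill in the empty minutes before computing the baseline.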
The world’s leading open source log management tool is, of course, the ELK Stack, an extremely popular and powerful platform but one that often requires more engineering effort and expertise to scale.

Metrics
Metrics, or time-series data, are another type of telemetry data used for monitoring. Used primarily for APM (Application Performance Monitoring), ITIM (IT Infrastructure Monitoring) and NPM (Network Performance Monitoring), metrics introduce a different kind of challenge: they are more verbose in nature and require more elaborate data storage and retention strategies as well as analysis features.
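To make the storage and retention point concrete, one common strategy is to keep raw samples only briefly and roll older data up into coarser averages. The sketch below shows that idea in plain Python. The sample layout (Unix timestamp, value) and the 60-second bucket are assumptions for illustration, not a description of how any particular time series database works.

```python
from collections import defaultdict

def downsample(samples, bucket_seconds=60):
    """Roll (timestamp, value) samples up into per-bucket averages.

    Each bucket is labelled with the timestamp at which it starts.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# Six raw samples taken 20 seconds apart collapse into two 1-minute points.
raw = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 40.0), (80, 50.0), (100, 60.0)]
rolled_up = downsample(raw)
print(rolled_up)  # [(0, 20.0), (60, 50.0)]
```

Averaging is only one possible rollup. Keeping the minimum, maximum, or a percentile per bucket trades storage for the ability to still see spikes in older data.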
Open source solutions often consist of a time series database such as Prometheus, InfluxDB or Graphite, with Grafana playing the role of the analysis and visualization layer. Plenty of SaaS vendors offer their own APM and monitoring solutions, including premade dashboards for monitoring specific services or platforms.

Security
The increase in cyber threats means organizations must operate with security in mind. A big part of security is active monitoring and reactive controls. Triggering alarms on root or administrator login is one example. Another is signaling a Puppet run, as an automated response to a security incident, when a security-controlled configuration is changed. Building this kind of solution requires a very specific kind of tool, usually falling under the category of SIEM or Security Analytics. Again, there are both open source and proprietary solutions on the market, but the skills gap is proving to be as big a challenge as integrating and deploying these solutions.

Compliance
SOC, PCI, HIPAA, SOX, GDPR, ISO, and CODA are just a few of the regulatory and compliance certifications companies must contend with to remain in business. All of them require some level of auditable data to show that their required checks and controls are being maintained. This means companies must find tools to capture, store, and retrieve data for compliance. Some tools excel at configuring controls or capturing security data but aren’t as strong at capturing application logs and transforming them into formats that mesh well with security logs to give a combined picture.

Alerting/Reporting
Again, most tools provide canned reports, and most also allow you to build your own. The key difference is that some providers’ reports will be more relevant to an organization than others. An example of where tool sprawl can become real is an organization with a security team that prefers the tailored security event reports from Alert Logic, an operations team that uses Datadog’s metrics for capacity planning, and developers who use the ELK Stack to diagnose API performance issues. All three tools can create all three reports, but none of them specializes in all three. This key difference is what creates a tool sprawl challenge, in this case for reporting and alerting.

Multiple solutions mean what?
After reading the previous section, it is easy to see how companies choose multiple tools and vendors to solve their monitoring needs. In the following section, I’d like to examine some of the issues that can result from having multiple monitoring solutions.

Multiple panes of glass
Having security data flow to one tool, systems performance data to another, and application data to a third makes correlation much more difficult. Even if you are able to have data sources feed multiple frontend tools, it still requires additional “stitching” to deliver the data in a meaningful way and the systems still present information differently. This can force the need to build translation jobs between solutions, or lengthy exports and manual correlation in spreadsheets. Nobody wants to do that.

Administration (and cost) is heavier
Each additional tool means managing permissions through RBAC, customizing data feed sources, managing plug-ins, and maintaining supporting infrastructure. The resource and cost burden can become extremely heavy pretty quickly when designing for scale, high availability, and storage.

Additional automation

Every age