Top 5 Best Practices for DevOps Monitoring - DevOps.com (2024)

Today, organizations are expected to deliver higher levels of customer satisfaction for their online services. Many, however, support these initiatives with an interrupt-driven approach, reacting to fix things only after they break. To become more proactive and meet demanding service level agreements (SLAs), organizations can reduce unscheduled downtime by adopting a continuous delivery (CD) model for their development efforts.

What is critical in the CD model is the ability to monitor and manage systems in a structured way, getting early detection of problems so organizations can make changes before the system is impacted. While unexpected failures make interrupt-driven work unavoidable at times, organizations can become more proactive by examining their current approach and toolset against business needs to help create a path to continuous service delivery optimization.

As a starting point, organizations need control of and visibility into their DevOps environment, which means collecting and instrumenting everything. Given the volume of data involved, this can seem like an insurmountable challenge for most organizations. To get started, follow these five best practices to perform DevOps monitoring efficiently and quickly in a measurable and scalable way.

Step 1: Collect the Data. You can’t manage what you don’t measure! It is important to take inventory of what is being collected today and to work with business and executive teams to agree on the aim of the services delivered. Analyze the inventory of collected metrics with questions such as: “Why aren’t we collecting this? How does this fit into our goals? Have you ever seen a failure in X? How often should we be measuring this? How long should we keep it for? Is that important?” Teams should also evaluate how they are collecting the information and consider the best architectural approach for collecting their metrics, including whether push or pull collection methods are better. Once goals are understood and the inventory of data is collected, look at what else your organization should consider collecting.
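The push versus pull choice can be sketched in a few lines. This is a minimal in-memory illustration, not a real collection agent: MetricStore, PullCollector and the metric names are hypothetical, and a production setup would use an actual metrics backend.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[str, float, float]  # (metric name, value, unix timestamp)


class MetricStore:
    """Hypothetical in-memory stand-in for a metrics backend."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def push(self, name: str, value: float, ts: Optional[float] = None) -> None:
        # Push model: the monitored service decides when to send a value.
        self.samples.append((name, value, time.time() if ts is None else ts))


class PullCollector:
    """Pull model: the collector polls registered sources on its own schedule."""

    def __init__(self, store: MetricStore) -> None:
        self.store = store
        self.sources: Dict[str, Callable[[], float]] = {}

    def register(self, name: str, read: Callable[[], float]) -> None:
        self.sources[name] = read

    def scrape(self, ts: Optional[float] = None) -> None:
        # One scrape reads every registered source with a shared timestamp.
        now = time.time() if ts is None else ts
        for name, read in self.sources.items():
            self.store.samples.append((name, read(), now))


store = MetricStore()
store.push("requests", 3.0, ts=1.0)                  # service pushes a value
collector = PullCollector(store)
collector.register("disk_used_pct", lambda: 71.5)    # collector will poll this
collector.scrape(ts=2.0)
```

With push, the service controls timing and must know where to send data; with pull, the collector controls timing and must know what to poll. Which is better depends on the environment being monitored.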

Step 2: Correlate and Triage. Correlation of data is needed to understand it, but data comes in at different frequencies, in different time frames and from different sources. Work to normalize the data to understand it, perform comparisons against the different incoming metrics and establish baselines for basic service availability. Since organizations aim to go beyond basic service level agreements (SLAs) and offer a high-performing solution, constantly ask what the organization is missing from a data perspective and how it relates to business initiatives. Asking that question from the collection, correlation and triage perspective is critical.
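As a rough sketch of the normalization described above, the following averages samples that arrive at different frequencies into fixed one-minute buckets so two metrics can be compared side by side. The function names, the bucket width and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def normalize(samples: Iterable[Tuple[float, float]],
              bucket_seconds: int = 60) -> Dict[int, float]:
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        start = int(ts // bucket_seconds) * bucket_seconds
        buckets[start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def correlate(a: Dict[int, float],
              b: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Pair values from two normalized series for buckets present in both."""
    return [(t, a[t], b[t]) for t in sorted(a.keys() & b.keys())]


# Illustrative data: CPU sampled every 30 s, latency roughly every 60 s.
cpu = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0)]
latency = [(5, 100.0), (65, 200.0), (125, 300.0)]

paired = correlate(normalize(cpu), normalize(latency))
```

Averaging is only one way to align series; depending on the metric, taking the maximum or the last value per bucket may be more appropriate.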

Step 3: Identify Trends. Organizations need to examine historical data to identify trends and take action before issues arise and customers complain. Establish alerting thresholds by outlining what a normal day looks like from a monitoring performance and customer perspective, and then identifying examples of what makes an abnormal day. This ties in with managing infrastructure inventory and understanding safety thresholds for each of the components that potentially could impact the service offering. It’s critical to communicate these findings with teams and business-line managers to prevent service delivery problems from happening and optimize offerings based on identified trends.
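One simple way to turn "a normal day" into an alerting threshold is to take the mean of historical values plus a multiple of their standard deviation. This is an assumed technique chosen for illustration, not one the article prescribes, and the sample values and the multiplier k are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def alert_threshold(normal_values: Sequence[float], k: float = 3.0) -> float:
    """Threshold = mean of normal-day values plus k sample standard deviations."""
    return mean(normal_values) + k * stdev(normal_values)


def abnormal(values: Sequence[float], threshold: float) -> List[float]:
    """Return the readings that exceed the threshold."""
    return [v for v in values if v > threshold]


# Illustrative response times (ms) from days considered normal.
normal_day = [100, 102, 98, 101, 99]
threshold = alert_threshold(normal_day)

# Readings from the day under review; only values above the threshold are flagged.
flagged = abnormal([101, 104, 110], threshold)
```

A fixed multiple of the standard deviation assumes roughly stable behavior; metrics with strong daily or weekly cycles usually need per-period baselines instead.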

Step 4: Notify and Act (Automation). In manual mode, a notification is delivered and then the team reacts. But teams are continually pushed to do things faster, and automation can help. To get to that point, organizations must understand where best to add more automation. How do you gather the right telemetry that delivers consistent answers that the machine can operate rules against, and do you leverage an automated process or notify a person? The desire to improve to a faster process requires a shift from manual to automated practices.
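The question of whether to run an automated process or notify a person can be expressed as rules evaluated against incoming readings. The sketch below is hypothetical throughout: the Rule class, the metric names and the remediation and notification callables are placeholders for whatever a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    metric: str
    threshold: float
    # If set, the rule is handled automatically; if None, a person is notified.
    remediate: Optional[Callable[[float], str]] = None


def evaluate(rules: List[Rule],
             readings: Dict[str, float],
             notify: Callable[[str, float], str]) -> List[str]:
    """Apply each rule to the latest readings and return what was done."""
    outcomes: List[str] = []
    for rule in rules:
        value = readings.get(rule.metric)
        if value is None or value <= rule.threshold:
            continue  # metric missing or within limits: nothing to do
        if rule.remediate is not None:
            outcomes.append(rule.remediate(value))
        else:
            outcomes.append(notify(rule.metric, value))
    return outcomes


rules = [
    Rule("queue_depth", 1000, remediate=lambda v: "scaled out workers"),
    Rule("error_rate", 0.05),  # no safe automation yet, so a person decides
]
readings = {"queue_depth": 1500, "error_rate": 0.2}
outcomes = evaluate(rules, readings,
                    notify=lambda m, v: f"paged on-call about {m}")
```

Moving a rule from notification to automation is then a matter of attaching a remediation once the telemetry has proven consistent enough to trust.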

Step 5: Predict (What-if Analysis). If you don’t take the time to go through the first four steps in a methodical way, it’s very difficult to reach this final step, and you are likely to remain stuck in manual mode. To balance costs and availability, it’s critical to discuss with the executive team how to predict customer service consumption (revenue) against the amount your services are going to cost the business.

For example, developing a service that can alert customers of a potential service disruption due to disk space can be accomplished only by clearly setting business goals, then using the right metrics and events platform leveraging a time series database. In this case, the business goal is to ensure no service disruption due to failures such as low disk space; the metric monitoring is for disk space for each customer instance with the appropriate threshold alert set off by an automated trigger; and the action is an email to the customer letting them know of the situation and what action they can take (reduce load or upgrade for more disk space). In this example, it is important to marry business logic with the monitoring practice to make service a successful experience for customers. Furthermore, this approach will help teams to not only predict user experience but also help to predict OpEx and CapEx better into the future.
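The disk space example can be sketched as a small prediction plus notification step. This version fits a straight line to recent usage by least squares and estimates when the disk would fill. The linear model, the function names, the 14-day warning window and the sample figures are all assumptions for illustration; the message is returned as a string rather than actually emailed.

```python
from typing import List, Optional, Tuple


def days_until_full(usage: List[Tuple[float, float]],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days from the last sample until capacity is reached.

    usage is a list of (day, used_gb) points. Returns None when usage is
    flat or shrinking, or when all samples share the same day.
    """
    n = len(usage)
    mean_x = sum(day for day, _ in usage) / n
    mean_y = sum(used for _, used in usage) / n
    var_x = sum((day - mean_x) ** 2 for day, _ in usage)
    if var_x == 0:
        return None
    slope = sum((day - mean_x) * (used - mean_y) for day, used in usage) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - usage[-1][0])


def customer_notice(customer: str, days_left: Optional[float],
                    warn_days: float = 14.0) -> Optional[str]:
    """Build the customer message, or None if no warning is needed yet."""
    if days_left is None or days_left > warn_days:
        return None
    return (f"Hello {customer}, at the current growth rate your disk is "
            f"projected to fill in about {days_left:.0f} days. You can reduce "
            f"load or upgrade for more disk space.")


# Illustrative instance: 100 GB disk growing by 2 GB per day.
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
days_left = days_until_full(usage, capacity_gb=100.0)
notice = customer_notice("acme", days_left, warn_days=30.0)
```

The same projection, summed across customer instances, gives a starting point for the capacity and cost planning mentioned above.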

That’s the benefit of the ability to predict with the right monitoring approach—to make informed business decisions based on results. Customers appreciate that, and proactive support is much better than being reactive.

Emerging trends such as microservices, containerization, elastic storage, software-defined networking and hybrid clouds are pushing the boundaries of DevOps monitoring. The right monitoring plan can identify and resolve problems before they affect critical business processes and enable customers to plan for upgrades before outdated systems begin to cause failures and outages for users.

About the Author / Mark Herring

Mark Herring is Chief Marketing Officer (CMO) at InfluxData. He is a well-rounded Silicon Valley executive with proven experience in taking complex technology and making it understandable to a broader audience. His deep developer roots are never far from his mind as he looks at trends and asks the tough questions about whether a technology is here to stay or just another fad. Follow him on Twitter and connect with him on LinkedIn.
