DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Low-Code Development: Leverage low and no code to streamline your workflow so that you can focus on higher priorities.

DZone Security Research: Tell us your top security strategies in 2024, influence our research, and enter for a chance to win $!

Launch your software development career: Dive head first into the SDLC and learn how to build high-quality software and teams.

Open Source Migration Practices and Patterns: Explore key traits of migrating open-source software and its impact on software development.

Related

  • Too Many Tools? Streamline Your Stack With AIOps
  • Understanding the New SEC Rules for Disclosing Cybersecurity Incidents
  • Scaling SRE Teams: The Challenges and How To Build a Successful Scaling Framework
  • 10 Best Practices to Launch APIs in 2023

Trending

  • The AI Revolution: Empowering Developers and Transforming the Tech Industry
  • Unleashing the Power of Redis for Vector Database Applications
  • Handling “Element Is Not Clickable at Point” Exception in Selenium
  • A Comprehensive Guide To Building and Managing a White-Label Platform
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Maintenance
  4. How To Reduce MTTR

How To Reduce MTTR

This article discusses ten ideas that can help reduce Mean Time To Recover for your critical production services.

By 
Krishna Vinnakota user avatar
Krishna Vinnakota
·
May. 22, 24 · Opinion
Like (6)
Save
Tweet
Share
1.8K Views

Join the DZone community and get the full member experience.

Join For Free

As a Site Reliability Engineer, one of the key metrics that I use to track the effectiveness of incident management is Mean Time To Recover (MTTR). Based on Wikipedia, MTTR is defined as the average time that a service or system will take to recover from any failure. Trying to achieve a low MTTR is key to achieving service level objectives and in turn, service level agreements of any critical production service.

10 Things That Can Help Reduce the Mean Time to Recovery (MTTR)

1. Clearly Defined SLIs

Service level indicators or SLIs are the key indicators that measure the health of your service. A few examples of SLIs are error rate, latency, throughput, etc.

2. Actionable Alerts Based on SLIs

The alert strategy should include improving the signal-to-noise ratio of the alerts. The goal with alerting is that every alert that your team gets should be actionable. Sending too many alerts will cause alert fatigue and will have the risk of the on-call person ignoring alerts that indicate real issues with the service.

3. Troubleshooting Guides Associated With Alerts

Every alert should have a clearly defined troubleshooting guide on how to triage and mitigate the issue the alert identifies. A good methodology to use while writing these troubleshooting guides is the USE methodology, suggested by Brendan Gregg in his book, "Systems Performance." USE stands for Usage, Saturation, and Errors.

4. Practice Troubleshooting Guides

Practicing troubleshooting guides periodically will help mitigate incidents when they occur. It will also help identify gaps with the TSGs since services evolve over time. A few examples of a good time to practice troubleshooting guides is when a new team member joins the team so that they can give a fresh perspective of the TSG. This will reduce assumptions about the knowledge of the system.

5. Usable Dashboards

The observability strategy should include creating easy-to-use dashboards. The dashboards should have panels to include the key metrics of the services and the health of dependent services such as upstream and downstream services. A few examples of important metrics that should be included in the dashboards are the golden signals suggested by the Google SRE book such as latency, throughput, error rate, and saturation metrics.

6. Automated Actions To Mitigate Issues

Automating certain actions based on the metrics and events is key to reducing MTTR. An example of this is taking certain servers out of rotation if packet loss is observed from these servers. This will help reduce the impact on user experience and reduce MTTR.

7. Failovers Rehearsals

In the case of multi-data center architectures, it is crucial to have failover plans defined to make sure to recover from an outage of a specific data center quickly.  Practicing these failover scenarios periodically will help to quickly execute them during an outage. This will also help in identifying any gaps in the failover plans and give the chance to update and fix the failover plans.

8. Automated Failovers

Once the failover plans are defined, implemented, and practiced, the next step is to automate these failover scenarios based on the health checks of the service on a given data center. This will help to mitigate the issues faster and thus reduce the MTTR.

9. Change Management Process

Changes to production systems are a major cause of outages. It is important to have a well-thought-out change management process in place. A few key elements of the change management process should include clearly defined checklists, change review and approval procedures, automated deployment pipelines with built-in monitoring, and the ability to quickly roll back the changes if any issues are observed.

10. Easy To Identify Change List and Automated Rollbacks

There can be multiple changes continuously done in distributed systems where services are designed as microservices.  Having a central system where one can easily identify which changes have been done during a given period of time will help to identify if a specific change has caused an outage and is thus easy to roll back.

Conclusion

In this article, I have discussed 10 things that can help reduce the Mean Time To Recovery of any critical production service. This is not an exhaustive list, but a list of best practices based on my years of experience working as a Site Reliability Engineer on services such as TikTok, Microsoft Teams, Xbox, and Microsoft Dynamics.

Incident management Site reliability engineering Data (computing) systems teams

Opinions expressed by DZone contributors are their own.

Related

  • Too Many Tools? Streamline Your Stack With AIOps
  • Understanding the New SEC Rules for Disclosing Cybersecurity Incidents
  • Scaling SRE Teams: The Challenges and How To Build a Successful Scaling Framework
  • 10 Best Practices to Launch APIs in 2023

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: