Best Practices When the $#IT Hits the Fan
Notes from #HEWeb2015 Technical Academy
These are notes from a talk by Dave Cliffe during the HighEdWeb 2015 Technology Academy. Dave is from PagerDuty which ingests alerts from a lot of monitoring projects.
- Running a service in production
- At PagerDuty, they inject failure every Friday to test their process. It also desensitizes their initial response to failure (Love that idea, ‘Don’t Panic’).
- [DevOps] The hard part is other human, social
- Tools can help you automate, measure, and share, but they cannot define your culture.
- Sharing Operational Responsibility
- Why: Giving developers operational responsibilities improves the quality of what they build. “You build it, you run it.” - Werner Vogels, CTO Amazon
- Eliminate the idea that it’s someone else’s problem.
- Goals: Keep customers happy, this is more than just uptime (user experience); Keep engineers happy, don’t burn people out, responsibility is a positive.
- How: Engineers and Operations share Monitoring and Incident Management
- Impacts: Monitoring strategy; what to alert on; Who’s on call; Incident Response
- Choose the right tool for the job, don’t try to find a do-all suite
- Optimizing Incident Response
- Triage: What is the business/customer impact (operationalize your business metrics)? Is my stuff busted? Should I escalate, if yes to whom?
- If Management only sees operations as a cost center, it will only ever be about uptime. Learn to “speak their terminology”.
- Collaboration: ChatOps
- Act: OODA Loop - Observe, Orient, Decide, Act
- Learn: failure should make you better
- Individual Events - Was the alert actionable? Right Context? Right Person?
- Individual Incidents - What was surprising about the event? Indicated more help was needed? How do we collaborate as a team?
- “Remember that failure is an events, not a person.” - Zig Ziglar
- Learn from trends in your operational metrics
- Responding quicker to issues tended to lead to teams that were performing better over time.
- Value on-call quality of life, this leads to people managing the system better and lead to higher quality.
- Practice failure often; find out how well your application responds and how well the team responds to an incident.
Peace
– KS
Web Developer at Benedictine University near Chicago