Awesome Site Reliability Engineering

A curated list of awesome Site Reliability and Production Engineering resources.

What is Site Reliability Engineering?

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE

Contributing

Please take a look at the contribution guidelines first. Contributions are always welcome!

Culture
Education
Books
Hiring
Reliability
Monitoring & Observability & Alerting
On-Call
Post-Mortem
Capacity Planning
Service Level Agreement
Performance
Programming
Misc Articles
Real-time Messaging
Blogs
Newsletters
Conferences & Meetups
Twitter
SRE Tools
SRE Podcasts

Culture

Education

Books

Hiring

Reliability

Monitoring & Observability & Alerting

On-Call

Post-Mortem

Capacity Planning

Service Level Agreement

Performance

Programming

Misc Articles

Real-time Messaging

#sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
#incident_response channel at Hangops Slack - Discussion about Incident Response.
USENIX SREcon Slack

Blogs

Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
High Scalability - Technical Blog Posts About Systems Architecture.
rachelbythebay - Technical Blog Posts.
Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
SysAdvent - One article for each day of December, ending on the 25th.
Stephen Thorne's Blog - Blog Posts About SRE.
Increment - A digital magazine about how teams build and operate software systems at scale.
GopherSRE - Blog Posts about Go and SRE.
Cindy Sridharan - Blog posts about distributed systems and their management.
Blameless Blog - Blog posts about SRE culture and practices.
Resilience Roundup - Weekly analysis of Resilience Engineering and Human Factors research, designed for people who work on software systems.
Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.
FireHydrant Blog - Posts about complex systems, incident response, and SRE best practices.
Rootly Blog - Incident management best practices and guides.
incident.io Blog - Guides, advice and resources on incident management and response.
Logit.io Blog - Resources on log management, SRE and DevOps.

Newsletters

DevOpsLinks - A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.
KubeWeekly - A weekly newsletter for all things Kubernetes, curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.
SRE Weekly - Weekly Site Reliability Newsletter.
O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
ChaosEngineering.news - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!
Monitoring Weekly - What's new in monitoring? Curated monitoring articles to your inbox each week.
Observability news - Updates around observability (o11y) with a special focus on open source.

Conferences & Meetups

SREcon Conferences - The Official SRE Conference.
LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
SRE Tech Talks - SRE Talks Hosted by Google.
South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of the Oktoberfest city.
ADDO - All Day DevOps - A 24-hour conference that is completely online and free.
Site Reliability Engineering Paris, France - SRE Meetup in the City of Light.
Site Reliability Engineering India - SRE Meetup in India.

Twitter

Google SRE Twitter Account - Google's SRE Twitter Account.
SREBook - The Official Twitter Account of the Site Reliability Engineering Book.
SREcon - SREcon's Official Twitter Account.
SREWorkbook - The Official Twitter Account of The Site Reliability Workbook.
The SRE Dev - SRE-related Posts from dev.to.
Twitter SRE - The Official Twitter Account of Twitter's SRE team.
Twitter SRE Weekly - The Official Twitter Account of SRE Weekly Newsletter.
USENIX Association - The Official USENIX Twitter Account.

SRE Tools

Awesome SRE Tools - A curated list of Site Reliability and Production Engineering tools.
List of Continuous Integration services
SRE cheat sheet - A cheat sheet for Site Reliability Engineering principles and numbers.
