Awesome
Talk: Observing Distributed Systems
Presented at:
Observability is the ability to understand what is going on with your systems, not only from the point of view of how the system looks from outside, but been able to answer more granular questions, like where did this message goes?
Metrics
, Logging
and Tracing
are called the three pillars of Observability.
In this presentation we will see how we can use these tools and how they are related to be able to observe our systems.
-
Logging and Metrics
-
OpenTracing API
-
Demo: Tweets App
Tools
-
JDK 8
-
Docker (Docker-Machine, host: docker-vm)
-
Logging: Fluentd, Elasticsearch Kibana
-
Metrics: Prometheus
-
Tracing: OpenTracing, Jaeger, Zipkin
-
Frameworks/Libraries: Dropwizard, JOOQ, Kafka Clients, HTTP Client, Elasticsearch, Postgresql.
Key takeaways
-
Distributed Tracing is just one more tool for your toolkit, and is not mean to replace metrics and logging, and it could be seen as an abstraction of them.
-
OpenTracing is an effort to standarize how to instrument your applications, so you can upgrade/migrate your infrastructure without changing your implementations.
-
OpenTracing is a young project, go ahead, try out and give feedback to the community, or contribute to make it better.
Resources
Papers:
- dapper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
- canopy http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
- automating failure testing research at internet scale https://people.ucsc.edu/~palvaro/socc16.pdf
- data on the outside vs data on the inside http://cidrdb.org/cidr2005/papers/P12.pdf
- pivot tracing http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/122-mace.pdf
Blog posts:
- ok log https://peter.bourgon.org/ok-log/
- logs - 12 factor application https://12factor.net/logs
- the problem with logging https://blog.codinghorror.com/the-problem-with-logging/
- logging v. instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
- logs and metrics https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
- measure anything, measure everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
- metrics, tracing and logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
- monitoring and observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
- monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
- sre book https://landing.google.com/sre/book/index.html
- distributed tracing at uber https://eng.uber.com/distributed-tracing/
- spigo and simianviz https://github.com/adrianco/spigo
- observability: what’s in a name? https://honeycomb.io/blog/2017/08/observability-whats-in-a-name/
- wtf is operations? #serverless https://charity.wtf/2016/05/31/wtf-is-operations-serverless/
- event foo: what should i add to an event https://honeycomb.io/blog/2017/08/event-foo-what-should-i-add-to-an-event/
- “The Verification of A Distributed System” - Caitie McCaffrie https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem
- “Testing in Production” by Charity Majors https://opensource.com/article/17/8/testing-production
- “Data on the outside vs Data on the inside - Review” by Adrian Colyer https://blog.acolyer.org/2016/09/13/data-on-the-outside-versus-data-on-the-inside/
- Google’s approach to Observability https://medium.com/@rakyll/googles-approach-to-observability-frameworks-c89fc1f0e058
- Microservices and Observability https://medium.com/@rakyll/microservices-observability-26a8b7056bb4
- Best Practices for Observability https://honeycomb.io/blog/2017/11/best-practices-for-observability/
- https://thenewstack.io/dev-ops-doesnt-matter-need-observability/
talks
- "Observability for Emerging Infra: What Got You Here Won't Get You There" by Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE
- “The Verification of a Distributed System” by Caitie McCaffrey https://www.youtube.com/watch?v=kDh5BrqiGhI
- “Mastering Chaos - A Netflix Guide to Microservices” by Josh Evans https://www.youtube.com/watch?v=CZ3wIuvmHeM
- “Monitoring Microservices” by Tom Wilkie https://www.youtube.com/watch?v=emaPPg_zxb4
- “Microservice application tracing standards and simulations” by Adrian Cole and Adrian Cockcroft https://www.slideshare.net/adriancockcroft/microservices-application-tracing-standards-and-simulators-adrians-at-oscon
- “Intuition Engineering at Netflix” by Justin Reynolds https://vimeo.com/173607639
- Distributed Tracing: Understanding how your all your components work together by José Carlos Chávez https://speakerdeck.com/jcchavezs/distributed-tracing-understanding-how-your-all-your-components-work-together
- “Monitoring isn't just an accident” https://docs.google.com/presentation/d/1IEJIaQoCjzBsVq0h2Y7qcsWRWPS5lYt9CS2Jl25eurc/edit#slide=id.g327c9fd948_0_534
- Orchestrating Chaos Applying Database Research in the Wild - Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q
Books:
-
Martin Kleppmann - “Design Data-Intensive Applications” https://dataintensive.net/
-
Google - "Site Reliability Engineering” https://landing.google.com/sre/book/index.html