Awesome Observability

Monitoring, as defined by the Oxford dictionary, is to "Observe and check the progress or quality of (something) over a period of time; keep under systematic review".

For systems monitoring, that means being able to give an overview of the state of a system by exposing key metrics about it. Monitoring can be implemented in several different ways.

Furthermore, observability can be seen as a superset of monitoring: monitoring is one part of giving visibility into the system, while observability as a whole provides the ability to reason about system health in a better way.

Observability can be said to consist of three parts:

Metrics, Logs and Traces: The Golden Triangle of Observability in Monitoring

This repo is not only for monitoring. As Adrian Cole said in his talk "Observability 3 Ways", we are going to focus on the three types of systems necessary to understand how your applications behave: Logging, Metrics & Tracing.

Contents

1. Best Practices

2. General Tools

Before starting with a huge observability solution, consider whether you just need to control some aspects of an application, visualize how your system is working, or identify a problem. In that case it may be useful to start with one application, or a small collection of applications, that gives you this information in an easy and cheap way.

In addition, starting with tools that tell you whether your system is working well can help you define the final stack if you later want to install a corporate solution for a project. I know some stories about people who installed, configured and even evolved monitoring tools as a corporate solution and, once the solution was in production, realized that the tools did not cover everything needed to control their applications :-D

Below is an interesting post from Netflix, written by Brendan Gregg, that shows this very clearly.

https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55

In the article you can see how, with a few tools and in a short time, you can get a lot of information about your system ;-)

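 $ uptime                  # load averages
 $ dmesg | tail            # most recent kernel messages
 $ vmstat 1                # run queue, memory, swap and CPU, once per second
 $ mpstat -P ALL 1         # CPU time breakdown per CPU
 $ pidstat 1               # CPU usage per process
 $ iostat -xz 1            # block device (disk) I/O
 $ free -m                 # memory usage in megabytes
 $ sar -n DEV 1            # network interface throughput
 $ sar -n TCP,ETCP 1       # TCP connection rates and retransmits
 $ top                     # overall overview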

There are many more commands and methodologies you can apply to drill deeper.
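
As a rough sketch of how the checklist above could be scripted, the following wraps each command so that it takes a fixed number of samples and then exits. The function names `run_check` and `checklist` and the `SAMPLES` and `RUN_CHECKLIST` variables are hypothetical and not part of the article; the flags assume the Linux `procps` and `sysstat` versions of these tools.

```shell
#!/bin/sh
# Sketch: a bounded, non-interactive run of the checklist.
# SAMPLES and RUN_CHECKLIST are hypothetical knobs introduced for this example.
SAMPLES="${SAMPLES:-5}"

# run_check CMD [ARGS...]: print a header, then run CMD if it is installed.
run_check() {
    printf '== %s ==\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        printf 'skipped: %s not installed\n' "$1"
    fi
}

# checklist: same commands as above, each limited to SAMPLES one-second samples.
checklist() {
    run_check uptime
    run_check sh -c 'dmesg | tail'
    run_check vmstat 1 "$SAMPLES"
    run_check mpstat -P ALL 1 "$SAMPLES"
    run_check pidstat 1 "$SAMPLES"
    run_check iostat -xz 1 "$SAMPLES"
    run_check free -m
    run_check sar -n DEV 1 "$SAMPLES"
    run_check sar -n TCP,ETCP 1 "$SAMPLES"
    run_check top -b -n 1    # batch mode, a single iteration
}

# Only run the full checklist when explicitly asked to.
if [ "${RUN_CHECKLIST:-0}" = 1 ]; then
    checklist
fi
```

Saved as a file, it could be run with something like `RUN_CHECKLIST=1 sh checklist.sh > report.txt`. Tools that are missing are reported as skipped rather than stopping the run.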

3. Collect

Get any data – metrics, events, logs, traces – from everywhere – systems, sensors, queues, databases and networks.
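
To make the idea of collecting a metric from a system concrete, here is a minimal sketch that reads the load averages on Linux and prints them as `name value` lines, loosely modelled on the Prometheus text exposition format. The function name `emit_load_metrics` and the `demo_load*` metric names are hypothetical, chosen only for this illustration.

```shell
#!/bin/sh
# Sketch: turn a /proc/loadavg-style file into "name value" metric lines.
# The demo_load* metric names are hypothetical, used for this example only.

emit_load_metrics() {
    # Default to the Linux load-average file; a path argument allows testing.
    src="${1:-/proc/loadavg}"
    [ -r "$src" ] || return 1
    # The first three fields are the 1, 5 and 15 minute load averages.
    read -r load1 load5 load15 _rest < "$src"
    printf '# TYPE demo_load1 gauge\ndemo_load1 %s\n' "$load1"
    printf '# TYPE demo_load5 gauge\ndemo_load5 %s\n' "$load5"
    printf '# TYPE demo_load15 gauge\ndemo_load15 %s\n' "$load15"
}
```

Calling `emit_load_metrics` with no argument reads `/proc/loadavg`. A real agent would collect many more sources and ship the result to a collector instead of printing it.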

Metrics

Tracing

Logging

Events & Problems

4. Load Generators and Synthetic Traffic

5. Transport

The transport tools simply serve as transport pipelines for data. This includes messaging systems, proprietary protocols and exchange formats.

6. Collector

Receive data from the agents or instrumentation frameworks. The received data is usually persisted to some kind of storage or piped to another tool.

Depending on the collector type, performance data enhancement and modification is also possible inside of the collector.

In addition, collectors can have other responsibilities. For example, some expose the data access API, configuration points for the agents or user interface for interaction with the stored data.
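
The receive, enhance and persist steps described above can be sketched in a few lines. This toy collector reads `name value` lines from standard input, adds a timestamp and a host label, and appends each record to a file. The function name `collect`, the `COLLECTOR_HOST` variable and the output layout are assumptions made for this example, not the behaviour of any particular tool.

```shell
#!/bin/sh
# Sketch: a toy collector that receives, enriches and persists metric lines.
# COLLECTOR_HOST is a hypothetical override for the host label.

# collect OUTFILE: read "name value" lines from stdin and append enriched
# records ("timestamp host=<host> name value") to OUTFILE.
collect() {
    out="$1"
    host="${COLLECTOR_HOST:-$(uname -n)}"
    while read -r name value; do
        [ -n "$name" ] || continue    # skip blank lines
        printf '%s host=%s %s %s\n' "$(date +%s)" "$host" "$name" "$value" >> "$out"
    done
}
```

For example, `printf 'cpu_user 12.5\n' | collect metrics.log` appends one enriched record to `metrics.log`. Real collectors typically listen on a network socket and write to a time series database rather than a flat file.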

Metrics

Logging

Events

7. Storage

Time Series Database

Time Series Cache

"Meta Projects" (data storage, multi-tenant, aggregation, high availability, etc)

Tracing

Search Engine

Graph Database

SQL Database

NoSQL Database (The Others :-P)

8. Visualization

General & Tools

Dashboarding

Tracing

Graph of Nodes

Uptime

9. Process, Analyze and Act

Tools for processing the system data.

Processing

Alerts

Triggers

Anomaly Detection

10. Application Performance Monitoring Solutions (APM)

11. Service Mesh

12. Observability as a Service

13. Examples and Sandboxes

14. References

15. License

CC0

16. Contributing

Contributions welcome! Read the contribution guidelines first.

Thank you!