kconmon - Monitoring connectivity between your Kubernetes nodes

A Kubernetes node connectivity tool that performs frequent tests (TCP, UDP and DNS), and exposes Prometheus metrics enriched with the node name and locality information (such as zone), enabling you to correlate issues between availability zones or nodes.

The idea is that this information supplements any other L7 monitoring you use, such as Istio observability or Kube State Metrics, to help you get to the root cause of a problem faster.

It's really performant considering the number of tests it's doing; on my clusters of 75 nodes, the agents have a resource request of just 60m CPU / 40MB RAM.

Once you've got it up and going, you can plot some pretty dashboards like this:

[Grafana dashboard screenshot]

PS. I've included a sample dashboard here to get you going
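
If you'd rather build your own panels, a query along these lines is a reasonable starting point; it's a sketch that reuses the kconmon_tcp_results_total metric and the zone labels shown in the alerting example further down this page:

sum(rate(kconmon_tcp_results_total{result="fail"}[5m])) by (source_zone, destination_zone)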

Known Issues:

Architecture

The application consists of two components.

Agent

The agent runs as a DaemonSet on the Kubernetes cluster and requires minimal permissions. The agent's purpose is to periodically run tests against the other agents, and expose the results as metrics.

The agent also spawns with an initContainer, which sets some sysctl TCP optimisations. You can disable this behaviour in the helm values file.
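
For illustration, the initContainer is conceptually equivalent to the snippet below; the exact sysctl keys and values kconmon applies may differ, so treat this as a sketch rather than the actual manifest:

initContainers:
- name: sysctl
  image: busybox
  securityContext:
    privileged: true
  command:
  - sh
  - -c
  # illustrative TCP tuning only - the real chart may set different keys and values
  - sysctl -w net.ipv4.tcp_fin_timeout=30 net.ipv4.tcp_tw_reuse=1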

Controller

In order to discover other agents and enrich the agent information with metadata about the node and availability zone, the controller constantly watches the Kubernetes API and maintains the current state in memory. The agents connect to the controller when they start to get their own metadata, and then every 5 seconds to get an up-to-date agent list.

NOTE: Your cluster needs RBAC enabled, as the controller uses in-cluster service-account authentication with the Kubernetes API.
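
As a rough sketch, the kind of read-only access the controller needs looks something like the following; the exact resources and verbs in the packaged RBAC may differ:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kconmon-controller
rules:
# read-only access to pods (to discover agents) and nodes (for zone/locality metadata)
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]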

Testing

kconmon does a variety of different tests and exposes the results as Prometheus metrics enriched with the node and locality information. The test interval is configurable in the helm chart config, and each run is subject to a 50-500ms jitter to spread the load.

UDP Testing

By default, kconmon agents will perform 5 x 4-byte UDP packet tests against every other agent, every 5 seconds. Each test waits for a response from the destination agent. The RTT timeout is 250ms; anything longer than that and we consider the packets lost in the abyss. The metrics output from UDP tests are:

TCP Testing

kconmon agents will perform a single HTTP GET request against every other agent, every 5 seconds. Each connection is terminated with Connection: close, and Nagle's Algorithm is disabled to ensure consistency across tests.

The metrics output from TCP tests are:

DNS Testing

kconmon agents will perform DNS tests by default every 5 seconds. It's a good idea to have tests for a variety of different resolvers (e.g. kube-dns, a public resolver, etc).
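
The lookups to run are configured through the helm values; the real schema lives in the chart's values.yaml, but a hypothetical configuration might look like this (the key names here are illustrative, not the chart's actual keys):

# hypothetical values.yaml snippet - check the chart's values.yaml for the real keys
dns:
  tests:
  - name: kube-dns
    host: kubernetes.default.svc.cluster.local
  - name: public
    host: www.google.com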

The metrics output from DNS tests are:

Prometheus Metrics

The agents expose a metrics endpoint on :8080/metrics, which you'll need to configure Prometheus to scrape. Here is an example scrape config:

- job_name: 'kconmon'
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - kconmon
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_label_component]
    action: keep
    regex: "(kconmon;agent)"
  - source_labels: [__address__]
    action: replace
    regex: ([^:]+)(?::\d+)?
    replacement: $1:8080
    target_label: __address__
  metric_relabel_configs:
  - regex: "(instance|pod)"
    action: labeldrop
  - source_labels: [__name__]
    regex: "(kconmon_.*)"
    action: keep

Alternatively, if you're using the Prometheus Operator, you can install the helm chart with --set prometheus.enableServiceMonitor=true. This will create a Service and a ServiceMonitor for you.

Alerting

You could configure some alerts too, like this one, which fires when there are consistent TCP test failures between zones for 2 minutes:

groups:
- name: kconmon.alerting-rules
  rules:
  - alert: TCPInterZoneTestFailure
    expr: |
      sum(increase(kconmon_tcp_results_total{result="fail"}[1m])) by (source_zone, destination_zone) > 0
    for: 2m
    labels:
      severity: warning
      source: '{{ "{{" }}$labels.source_zone{{ "}}" }}'
    annotations:
      instance: '{{ "{{" }}$labels.destination_zone{{ "}}" }}'
      description: >-
        TCP Test Failures detected between one or more zones
      summary: Inter Zone L7 Test Failure

Deployment

The easiest way to install kconmon is with Helm. Head over to the releases page to download the latest chart. Check out the values.yaml for all the available configuration options.
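
For example (the chart file name below is an assumption - use whatever you downloaded from the releases page):

helm upgrade --install kconmon ./kconmon-<version>.tgz \
  --namespace kconmon --create-namespace \
  --set prometheus.enableServiceMonitor=true   # optional, for the Prometheus Operator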