# Tartiflette

Near Real-Time Anomaly Detection from RIPE Atlas Stream

## Participants

## Consultants & Groupies

The IMC submission on which this is based is *Pinpointing Delay and Forwarding Anomalies Using Large-Scale Traceroute Measurements* by Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush.

The goal was to use the RIPE Atlas streaming data to analyse and detect anomalies using the Tartiflette code from Romain.

The project GitHub

We started from Daniel's data collector, which attaches to Massimo's stream; Romain's code from the IMC paper, which used static data; and Romain's code for the webpage.

The analysis code wanted raw traceroute data. Some fun was had interpreting what the Atlas stream delivered.
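Concretely, each stream message carries one full traceroute result. Below is a minimal sketch of pulling per-hop IPs and RTTs out of such a message, assuming the documented Atlas traceroute result layout (a `result` list of hops, each with up to three replies, timeouts rendered as `{"x": "*"}`); it is our reconstruction, not the project's actual parser.

```python
import statistics

def hops_from_result(msg):
    """Yield (hop_number, ip, median_rtt) for each responding hop."""
    for hop in msg.get("result", []):
        # Keep only replies that actually came back; timeouts are {"x": "*"}.
        replies = [r for r in hop.get("result", []) if "from" in r]
        if not replies:
            continue
        rtts = [r["rtt"] for r in replies if "rtt" in r]
        yield hop["hop"], replies[0]["from"], (statistics.median(rtts) if rtts else None)
```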

Facebook's anchor was down (since fixed), so we chose Comcast, which had two anchors up. For the record, the IP addresses of the Comcast anchors are:

| Probe | IPv4          | IPv6               |
|-------|---------------|--------------------|
| 6072  | 76.26.120.98  | 2001:558:6010:2::2 |
| 6080  | 76.26.115.194 | 2001:558:6000:4::2 |

The Atlas Streaming API would not let us filter by "all traceroutes which pass through one or more links in AS X," so we had to accept the full stream and do our own filtering on the client side. For that we gathered the list of prefixes in Comcast's ASes. Jason gave us a list of Comcast prefixes; it was highly un-aggregated, but we aggregated it.
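A minimal sketch of the aggregation and the client-side filter, using Python's `ipaddress` module; the one-CIDR-per-line file format and the function names are our assumptions, not the project's code.

```python
import ipaddress

def load_prefixes(path):
    """Read one CIDR per line and aggregate adjacent/covered prefixes."""
    with open(path) as f:
        nets = [ipaddress.ip_network(line.strip(), strict=False)
                for line in f if line.strip()]
    # collapse_addresses() only takes one address family at a time.
    v4 = ipaddress.collapse_addresses(n for n in nets if n.version == 4)
    v6 = ipaddress.collapse_addresses(n for n in nets if n.version == 6)
    return list(v4) + list(v6)

def touches_prefixes(msg, prefixes):
    """True if any hop IP in the traceroute falls inside one of the prefixes."""
    for hop in msg.get("result", []):
        for reply in hop.get("result", []):
            if "from" not in reply:
                continue
            try:
                addr = ipaddress.ip_address(reply["from"])
            except ValueError:
                continue  # a "from" field that is not an address
            if any(addr in net for net in prefixes):
                return True
    return False
```

A radix tree (py-radix, for instance) would beat this linear scan for prefix matching, though as the next paragraphs show, the bottleneck turned out to be elsewhere.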

Where do we store the results? For starters, just in memory. This is one of 42 things that the next stages could improve. But we decided to take the minimal non-damaging path to results.
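The store, in sketch, was nothing more than a dict in the collector process; something like:

```python
from collections import defaultdict

# Filtered traceroute results, keyed by probe ID. Nothing is persisted;
# a real database is one of the obvious later improvements.
results = defaultdict(list)

def store(msg):
    results[msg["prb_id"]].append(msg)
```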

With ten processes, in 13 seconds we extracted ten Comcast traceroute results from the full stream. Daniel and Massimo convinced us that this was not going to stand up to peak loads; the front of the funnel was getting on the order of 50,000 traceroutes per minute.
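In sketch, the ten-process setup was one reader fanning the fire-hose out to filtering workers over a queue; the fan-out structure is our guess at how the ten processes divided the work, `read_stream` is a stand-in for the collector, and `touches_prefixes` is the filter sketched above.

```python
import multiprocessing as mp

def worker(inbox, prefixes, outbox):
    """Filter raw stream messages down to those touching our prefixes."""
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: shut down
            break
        if touches_prefixes(msg, prefixes):
            outbox.put(msg)

def run(prefixes, nworkers=10):
    inbox, outbox = mp.Queue(maxsize=10000), mp.Queue()
    workers = [mp.Process(target=worker, args=(inbox, prefixes, outbox))
               for _ in range(nworkers)]
    for w in workers:
        w.start()
    for msg in read_stream():    # the full fire-hose, ~50k traceroutes/min at peak
        inbox.put(msg)
```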

It seems the network, below the socket level, was the bottleneck between the Atlas producer and our client consumer. So Massimo hacked the server-side producer to filter on a prefix list, though we had to load it one prefix at a time. We migrated to this.
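The migration, in the style of the documented socketIO-client examples for the Atlas stream, looked roughly like the loop below. `passThroughPrefix` is our label for the parameter Massimo's hacked producer accepted, so treat that name (and the file name) as assumptions; `load_prefixes` and `store` are the sketches above.

```python
from socketIO_client import SocketIO

prefixes = load_prefixes("comcast-prefixes.txt")   # hypothetical file name

def on_result(result):
    store(result)

socketIO = SocketIO("atlas-stream.ripe.net", 80, resource="stream")
socketIO.on("atlas_result", on_result)
for prefix in prefixes:
    # The producer takes a single prefix per subscription, so loop over the list.
    socketIO.emit("atlas_subscribe",
                  {"stream_type": "result", "type": "traceroute",
                   "passThroughPrefix": str(prefix)})
socketIO.wait()
```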

We had to decide whether to leave the inherited code dealing with RTTs and path changes binning every hour. We could have adjusted the bin size, say to ten or twenty minutes, but going to a sliding-window stream would be a non-trivial code change. We decided to leave the 60-minute bin size and come back to it later.
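The binning itself is just modular arithmetic on the result timestamp, which is why shrinking the bin is cheap and a sliding window is not; a minimal sketch:

```python
from collections import defaultdict

BIN_SECONDS = 3600       # inherited one-hour bins; 600 or 1200 are the easy knobs

bins = defaultdict(list)  # bin start (unix time) -> results in that bin

def bin_result(msg):
    start = msg["timestamp"] - msg["timestamp"] % BIN_SECONDS
    bins[start].append(msg)
```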

## Things to do Later