Community ID Flow Hashing

When processing flow data from a variety of monitoring applications (such as Zeek and Suricata), it's often desirable to pivot quickly from one dataset to another. While the required flow tuple information is usually present in the datasets, the details of such "joins" can be tedious, particularly in corner cases. This spec describes "Community ID" flow hashing, which standardizes the production of a string identifier representing a given network flow and so reduces the pivot to a simple string comparison.

Pseudo code

function community_id_v1(ipaddr saddr, ipaddr daddr, port sport, port dport, int proto, int seed=0)
{
    # Get seed and all tuple parts into network byte order
    seed = pack_to_nbo(seed); # 2 bytes
    saddr = pack_to_nbo(saddr); # 4 or 16 bytes
    daddr = pack_to_nbo(daddr); # 4 or 16 bytes
    sport = pack_to_nbo(sport); # 2 bytes
    dport = pack_to_nbo(dport); # 2 bytes

    # Abstract away directionality: flip the endpoints as needed
    # so the smaller IP:port tuple comes first.
    saddr, daddr, sport, dport = order_endpoints(saddr, daddr, sport, dport);

    # Produce 20-byte SHA1 digest. "." means concatenation. The
    # proto value is one byte in length and followed by a 0 byte
    # for padding.
    sha1_digest = sha1(seed . saddr . daddr . proto . 0 . sport . dport)

    # Prepend version string to base64 rendering of the digest.
    # v1 is currently the only one available.
    return "1:" + base64(sha1_digest)
}

function community_id_icmp(ipaddr saddr, ipaddr daddr, int type, int code, int seed=0)
{
    port sport, dport;

    # ICMP / ICMPv6 endpoint mapping directly inspired by Zeek
    sport, dport = map_icmp_to_ports(type, code);

    # ICMP is IP protocol 1; for ICMPv6, pass 58 instead
    return community_id_v1(saddr, daddr, sport, dport, 1, seed);
}
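As an illustration, the pseudo code for community_id_v1() can be sketched in Python using only the standard library. This is a minimal sketch for port-based flows such as TCP and UDP, not the reference implementation: it assumes same-family addresses given as strings and integer ports, and it leaves out the ICMP port mapping entirely. For real use, prefer the implementations listed below.

```python
import base64
import hashlib
import ipaddress
import struct


def community_id_v1(saddr, daddr, sport, dport, proto, seed=0):
    """Sketch of Community ID v1 for port-based flows (e.g. TCP, UDP)."""
    # Pack addresses into network byte order: 4 bytes (IPv4) or 16 (IPv6).
    src = ipaddress.ip_address(saddr).packed
    dst = ipaddress.ip_address(daddr).packed

    # Abstract away directionality: the smaller (address, port) endpoint
    # comes first. Comparing packed addresses byte-wise and ports
    # numerically matches a comparison of their network-byte-order forms.
    if (src, sport) > (dst, dport):
        src, dst = dst, src
        sport, dport = dport, sport

    # seed (2 bytes) . saddr . daddr . proto (1 byte) . 0 (1 byte padding)
    # . sport (2 bytes) . dport (2 bytes), all in network byte order.
    data = (
        struct.pack("!H", seed)
        + src
        + dst
        + struct.pack("!BBHH", proto, 0, sport, dport)
    )

    # 20-byte SHA1 digest, base64-encoded, with the version prefix.
    digest = hashlib.sha1(data).digest()
    return "1:" + base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Both directions of the same TCP flow yield the same identifier.
    print(community_id_v1("128.232.110.120", "66.35.250.204", 34855, 80, 6))
    print(community_id_v1("66.35.250.204", "128.232.110.120", 80, 34855, 6))
```

Because the endpoints are ordered before hashing, the two calls print the same string, which is what makes the pivot between datasets a plain string comparison.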

Technical details

Reference implementation

A complete implementation is available in the pycommunityid package. It includes a range of tests to verify correct computation for the various protocols. We recommend it to guide new implementations.

A smaller implementation is also available via the community-id.py script in this repository, including the byte layout of the hashed values (see packet_get_comm_id()). See --help and make.sh to get started:

  $ ./community-id.py --help
  usage: community-id.py [-h] [--seed NUM] PCAP [PCAP ...]

  Community flow ID reference

  positional arguments:
    PCAP         PCAP packet capture files

  optional arguments:
    -h, --help   show this help message and exit
    --seed NUM   Seed value for hash operations
    --no-base64  Don't base64-encode the SHA1 binary value
    --verbose    Show verbose output on stderr

For troubleshooting, the implementation supports omitting the base64 operation (--no-base64) and can provide additional detail about the exact sequence of bytes going into the SHA1 hash computation.
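When comparing a new implementation against that byte sequence, it can help to print the hash input field by field. The following is a small standalone sketch of that idea; the helper name hash_input_fields() is hypothetical and is not part of community-id.py or pycommunityid. It follows the layout given in the pseudo code and assumes a port-based flow with same-family addresses.

```python
import ipaddress
import struct


def hash_input_fields(saddr, daddr, sport, dport, proto, seed=0):
    """Return (label, bytes) pairs for the data fed into SHA1.

    Hypothetical debugging helper, following the pseudo code's layout.
    """
    src = ipaddress.ip_address(saddr).packed
    dst = ipaddress.ip_address(daddr).packed

    # Order endpoints so the smaller (address, port) pair comes first.
    if (src, sport) > (dst, dport):
        src, dst = dst, src
        sport, dport = dport, sport

    return [
        ("seed", struct.pack("!H", seed)),
        ("saddr", src),
        ("daddr", dst),
        ("proto", struct.pack("!B", proto)),
        ("pad", b"\x00"),
        ("sport", struct.pack("!H", sport)),
        ("dport", struct.pack("!H", dport)),
    ]


if __name__ == "__main__":
    fields = hash_input_fields("10.0.0.1", "10.0.0.2", 1234, 80, 6)
    for label, value in fields:
        print(f"{label:>5}: {value.hex()}")
    # Full input for this IPv4/TCP flow, as one hex string:
    print(b"".join(value for _, value in fields).hex())
    # -> 00000a0000010a000002060004d20050
```

For an IPv4 flow the input is 16 bytes long; for IPv6 it grows to 40 bytes because each address takes 16 bytes instead of 4.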

Reference data

The baseline directory in this repo contains datasets to help you verify that your implementation of Community ID functions correctly.

Reusable modules/libraries

Production implementations

Feature requests in other projects

Talks

Blog posts and other resources

Discussion

Feel free to discuss aspects of the Community ID via GitHub here: https://github.com/corelight/community-id-spec/issues