Common Index File Format

The Common Index File Format (CIFF) is an attempt to build a binary data exchange format that lets open-source search engines interoperate by sharing index structures. For more details, check out:

All data are contained in a single file, with the extension .ciff. The file comprises a sequence of delimited protobuf messages defined here, exactly as follows:

See our design rationale for additional discussion.

Explained in terms of xkcd, we're trying to avoid this. Instead, CIFF aims to be this.

Getting Started

After cloning this repo, build CIFF with Maven:

mvn clean package appassembler:assemble

Reference Lucene Indexes

Currently, this repo provides a utility to export CIFF from Lucene, via Anserini. For reference, we provide exports from the Robust04 and ClueWeb12-B13 collections:

| Collection | Configuration | Size | MD5 | Download |
|---|---|---|---|---|
| Robust04 | CIFF export, complete | 162M | 01ce3b9ebfd664b48ffad072fbcae076 | [Dropbox] |
| Robust04 | CIFF export, queries only | 16M | 0a8ea07b6a262639e44ec959c4f53d44 | [Dropbox] |
| Robust04 | Source Lucene index | 135M | b993045adb24bcbe292d6ed73d5d47b6 | [Dropbox] |
| ClueWeb12-B13 | CIFF export, complete | 25G | 8fff3a57b9625eca94a286a61062ac82 | [Dropbox] |
| ClueWeb12-B13 | CIFF export, queries only | 1.2G | 45063400bd5823b7f7fec2bc5cbb2d36 | [Dropbox] |
| ClueWeb12-B13 | Source Lucene index | 21G | 6ad327c9c837787f7d9508462e5aa822 | [Dropbox] |
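
After downloading one of the files listed above, it may be worth checking it against the published MD5 before use. The following is a minimal sketch using only the Python standard library; the function name `md5_of_file` and the chunk size are illustrative choices, not part of this repo.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read chunk by chunk so multi-gigabyte downloads need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file` on the local path of a download and comparing the result with the MD5 value listed for that file indicates whether the transfer completed intact.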

The following invocation can be used to examine an export:

target/appassembler/bin/ReadCIFF -input robust04-complete-20200306.ciff.gz

We provide a full guide on how to replicate the above results here.

CIFF Importers

A CIFF export can be ingested into a number of different search systems.

Tips for writing your own CIFF Importer / Exporter

The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format. Most of the data/structures within the CIFF are quite straightforward and self-documenting. However, there are a few important details which should be noted.

  1. The default CIFF exports come from Anserini. Those exports are engineered to encode document identifiers as deltas (d-gaps). Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion here.

  2. Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the DocRecord structure are approximate - see the discussion here.

  3. Multiple records are stored in a single file using Java protobuf's parseDelimitedFrom() and writeDelimitedTo() methods. Unfortunately, these methods are not available in the bindings for other languages. They can be trivially reimplemented by reading/writing the byte size of each record as a varint - see the discussion here.
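
Tips 1 and 3 can be illustrated with a short Python sketch. It assumes the usual protobuf base-128 varint encoding for the length prefix and treats each record as opaque bytes, so it runs without the protobuf package. All function names here (`read_varint`, `write_varint`, `read_delimited`, `write_delimited`, `decode_gaps`) are hypothetical helpers written for this example, not part of any CIFF tool, and `decode_gaps` assumes the first stored value in a postings list is the gap from zero.

```python
from itertools import accumulate


def read_varint(stream):
    """Read one base-128 varint; return None at a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # no more records
            raise EOFError("stream ended in the middle of a varint")
        value = byte[0]
        result |= (value & 0x7F) << shift  # low 7 bits carry the payload
        if not value & 0x80:               # high bit clear: last byte
            return result
        shift += 7


def write_varint(stream, value):
    """Write a non-negative integer as a base-128 varint."""
    while True:
        bits = value & 0x7F
        value >>= 7
        if value:
            stream.write(bytes([bits | 0x80]))  # more bytes follow
        else:
            stream.write(bytes([bits]))
            return


def read_delimited(stream):
    """Yield the raw bytes of each length-prefixed record in the stream."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended in the middle of a record")
        yield payload


def write_delimited(stream, payload):
    """Write one record as a varint byte size followed by the record bytes."""
    write_varint(stream, len(payload))
    stream.write(payload)


def decode_gaps(gaps):
    """Recover document identifiers from d-gaps with a prefix sum."""
    return list(accumulate(gaps))
```

For example, `decode_gaps([3, 2, 5, 1])` returns `[3, 5, 10, 11]`. In a real importer, each payload yielded by `read_delimited` would be handed to the message classes that `protoc` generates from the CIFF schema (for instance via `ParseFromString`), and a gzip-compressed export could be opened with `gzip.open(path, "rb")` to obtain the binary stream.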