kff-tools

This repository contains a set of tools for manipulating kff files. The kff file format is described here.

kff-tools is a single program that bundles several small tools for manipulating kff files. Each section below describes one of these tools.

Install

git clone https://github.com/Kmer-File-Format/kff-tools.git --recursive
cd kff-tools && mkdir build && cd build && cmake .. && make -j 4

kff-tools instr

Convert a text kmer file or a text sequence file into a kff file. ACTG encoding is used. Kmer files must contain 1 kmer per line. Example (k=3):

  ACC
  CCT
  GTA

Sequence files must contain 1 sequence of compacted kmers per line. Example (k=3):

  ACCT   # contains the 2 kmers ACC and CCT
  TGC

If data_size is nonzero, instr looks for n integers at the end of each line, where n is the number of kmers in the sequence. The sequence and the data are separated by a one-character delimiter (default ' '). The values for the individual kmers are separated by another one-character delimiter (default ','). Example (k=3):

  ACCC 12,31
  CCT 1
  GTAA 42,3
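
The line format above can be sketched in a few lines of Python. This is only an illustration of how a line maps to kmers and values, not the parser used by kff-tools; the function name `parse_sequence_line` and its parameters are hypothetical.

```python
def parse_sequence_line(line, k, data_delim=" ", value_delim=","):
    """Return a list of (kmer, value) pairs for one text line.

    Hypothetical helper: value is None when the line carries no data.
    """
    # The sequence and its data are separated by a one-character delimiter.
    seq, _, data = line.strip().partition(data_delim)
    # A sequence of length L contains L - k + 1 overlapping kmers.
    n = len(seq) - k + 1
    if n < 1:
        raise ValueError(f"sequence {seq!r} is shorter than k={k}")
    kmers = [seq[i:i + k] for i in range(n)]
    if not data:
        return [(kmer, None) for kmer in kmers]
    values = [int(v) for v in data.split(value_delim)]
    if len(values) != n:
        raise ValueError(f"expected {n} values, found {len(values)}")
    return list(zip(kmers, values))


print(parse_sequence_line("ACCC 12,31", k=3))
```

With k=3, the line `ACCC 12,31` yields the kmer ACC with value 12 and the kmer CCC with value 31.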

Parameters:

Usage:

  # Read 1 kmer per line
  kff-tools instr -i kmers.txt -o kmers.kff -k 12
  # Read 1 kmer per line with its count (up to 255)
  kff-tools instr -i counts.txt -o counts.kff -k 12 -c -d 1
  # Read sequences and split them if they contain more than 256 kmers
  kff-tools instr -i sequences.txt -o sequences.kff -k 12 -m 256
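
To make the last usage line concrete, here is a minimal Python sketch of splitting a long sequence into pieces of at most a given number of kmers. It assumes that consecutive pieces overlap by k-1 nucleotides so that no kmer is lost; the function name `split_sequence` is hypothetical and the code is not taken from kff-tools.

```python
def split_sequence(seq, k, max_kmers):
    """Cut seq into chunks holding at most max_kmers kmers each (hypothetical helper)."""
    n = len(seq) - k + 1  # number of kmers in the full sequence
    chunks = []
    for start in range(0, n, max_kmers):
        count = min(max_kmers, n - start)
        # count kmers span count + k - 1 nucleotides, so neighbouring
        # chunks share k - 1 nucleotides.
        chunks.append(seq[start:start + count + k - 1])
    return chunks


print(split_sequence("ACCTGA", k=3, max_kmers=2))
```

Here ACCTGA (4 kmers) becomes ACCT and CTGA, each holding 2 kmers.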

kff-tools outstr

Read a kff file and print the kmers and their data to stdout as strings (one kmer per line).

Parameters:

Usage:

  kff-tools outstr -i file.kff

kff-tools validate

Read a kff file and exit with an error if file corruption is detected. In verbose mode, details of the file are printed.

Parameters:

Usage:

  kff-tools validate -i file.kff -v

kff-tools bucket

A tool that splits each raw section into multiple minimizer sections. Each minimizer section contains only kmers sharing the same minimizer (i.e. the same substring of size m that is smallest in the encoding order).
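
The minimizer definition can be illustrated with a short Python sketch. It assumes that nucleotides are ranked by their position in the encoding string (A < C < T < G for ACTG) and that ties go to the leftmost candidate; the actual tool may differ on both points. The names `minimizer` and `bucket` are hypothetical.

```python
def minimizer(kmer, m, encoding="ACTG"):
    """Return the substring of size m that is smallest in the encoding order."""
    # Assumption: a nucleotide's rank is its index in the encoding string.
    rank = {nuc: i for i, nuc in enumerate(encoding)}
    candidates = [kmer[i:i + m] for i in range(len(kmer) - m + 1)]
    # min() keeps the first (leftmost) candidate when several are equal.
    return min(candidates, key=lambda sub: [rank[c] for c in sub])


def bucket(kmers, m, encoding="ACTG"):
    """Group kmers by minimizer, one list per minimizer."""
    buckets = {}
    for kmer in kmers:
        buckets.setdefault(minimizer(kmer, m, encoding), []).append(kmer)
    return buckets


print(bucket(["ACCT", "CCTG", "GTAA"], m=2))
```

Each key of the returned dictionary plays the role of one minimizer section.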

kff-tools compact

Compact kmers into super-kmers (groups of overlapping kmers sharing a minimizer). One block is written per generated super-kmer. Only the kmers inside minimizer sections are compacted, and each minimizer section is compacted separately. The compaction runs in linear time and needs an amount of memory proportional to the largest minimizer section (largest in terms of number of kmers).

Parameters:

Usage:

  kff-tools compact -i to_compact.kff -o compacted.kff
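
As a rough illustration of the idea, the Python sketch below greedily chains the kmers of one minimizer section whenever the last k-1 nucleotides of one kmer equal the first k-1 nucleotides of another. It is a simplified stand-in, not the algorithm implemented in kff-tools: it takes for granted that all input kmers already share a minimizer and it ignores the minimizer position and any attached data. The name `compact` is used here only for the example.

```python
def compact(kmers):
    """Greedily chain overlapping kmers of one bucket into longer sequences."""
    kmers = list(dict.fromkeys(kmers))  # drop duplicates, keep input order
    by_prefix = {}    # (k-1)-prefix -> kmers starting with it
    suffixes = set()  # every (k-1)-suffix present in the bucket
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
        suffixes.add(kmer[1:])
    unused = set(kmers)

    def extend(start):
        # Walk to the right while an unused overlapping kmer exists.
        unused.discard(start)
        parts, current = [start], start
        while True:
            candidates = by_prefix.get(current[1:], [])
            nxt = next((c for c in candidates if c in unused), None)
            if nxt is None:
                return "".join(parts)
            unused.discard(nxt)
            parts.append(nxt[-1])
            current = nxt

    sequences = []
    # First start from kmers that no other kmer can precede.
    for kmer in kmers:
        if kmer in unused and kmer[:-1] not in suffixes:
            sequences.append(extend(kmer))
    # Then handle what is left (branches and cycles).
    for kmer in kmers:
        if kmer in unused:
            sequences.append(extend(kmer))
    return sequences


print(compact(["ACC", "CCT", "GTA"]))
```

Every input kmer ends up in exactly one output sequence, so the total number of kmers is unchanged.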

kff-tools disjoin

The disjoin tool is the opposite of the compact tool. Each block containing a sequence of n kmers is split into n blocks of 1 kmer. The number of kmers inside each section is preserved.

Parameters:

Usage:

  kff-tools disjoin -i input.kff -o disjoin.kff

kff-tools split

Split a kff file into one kff file per section.

Parameters:

Usage:

  kff-tools split -i to_split.kff -o split_dir/

kff-tools merge

Merge a list of kff files into a single file. The order of the input files is preserved in the merged output.

Parameters:

Usage:

  kff-tools merge -i to_merge_1.kff to_merge_2.kff to_merge_3.kff -o merged.kff

kff-tools translate

Read and rewrite a kff file changing the nucleotide encoding.

Parameters:

Usage:

  kff-tools translate -i to_encode.kff -o encoded.kff -e AGTC
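
Changing the encoding amounts to remapping each 2-bit nucleotide code. The Python sketch below shows the remapping on a list of codes. It assumes that an encoding string such as ACTG assigns each nucleotide the value of its position (A=0, C=1, T=2, G=3); that reading of the `-e` argument is an assumption, and the helper names `encode` and `translate_codes` are hypothetical.

```python
def encode(seq, encoding):
    """Turn a nucleotide string into 2-bit codes (assumed: code = index in encoding)."""
    return [encoding.index(nuc) for nuc in seq]


def translate_codes(codes, old_encoding, new_encoding):
    """Rewrite codes expressed in old_encoding as codes in new_encoding."""
    # table[old_code] is the code of the same nucleotide in the new encoding.
    table = [new_encoding.index(nuc) for nuc in old_encoding]
    return [table[code] for code in codes]


codes = encode("ACCT", "ACTG")
print(codes, translate_codes(codes, "ACTG", "AGTC"))
```

Translating the codes gives the same result as encoding the original sequence directly with the new encoding.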

kff-tools data-rm

Read a kff file and write the same file with a data size of 0. All the data are removed and the output file only preserves the sequences.

Parameters:

Usage:

  kff-tools data-rm -i file.kff -o file_nodata.kff

Testing the code

Run the functional tests from the root of the project:

  python3 -m unittest discover -s tests/

Run the unit tests from the root of the project:

  ./bin/tests