sylph-utils: utility scripts and taxonomy for sylph

This repository contains scripts for incorporating taxonomic information into the output of sylph.

Taxonomy integration - available databases

The following databases are currently supported (with pre-built sylph databases available here). Their taxonomy metadata files can be found in the prokaryote, eukaryote, and virus subfolders within this repository.

  1. GTDB-R220 (April 2024) - prokaryote/gtdb_r220_metadata.tsv.gz
  2. GTDB-R214 (April 2023) - prokaryote/gtdb_r214_metadata.tsv.gz
  3. OceanDNA - prokaryote/ocean_dna_metadata.tsv.gz
  4. Soil MAGs (SMAG) from Ma et al. - prokaryote/smag_metadata.tsv.gz
  5. RefSeq fungi representative genomes - eukaryote/fungi_refseq_2024-07-25_metadata.tsv.gz
  6. TARA eukaryotic SMAGs - eukaryote/tara_SMAGs_metadata.tsv.gz
  7. IMG/VR 4.1 high-confidence viral OTU genomes - virus/IMGVR_4.1_metadata.tsv.gz

[!IMPORTANT] (2024-11-05): The fungi directory has been renamed to eukaryote. TARA eukaryote taxonomy is now available. See CHANGELOG.md for details.

Requirements/Install

Run pip install pandas if pandas is not installed.

sylph_to_taxprof.py - obtaining taxonomic profiles from sylph's output

python sylph_to_taxprof.py -m database1_metadata.tsv.gz database2_metadata.tsv.gz -s sylph_output.tsv -o prefix_or_folder/

Use the metadata file(s) corresponding to the database(s) used. For example, if you use the GTDB-R220 database with sylph, you must use the gtdb_r220_metadata.tsv.gz file.

See here for more information on:

  1. taxonomy metadata files definitions
  2. the output format
  3. how to create taxonomy metadata for customized genome databases

[!TIP] In python/pandas, you can read the output with pd.read_csv('output.sylphmpa',sep='\t', comment='#').
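
As a minimal sketch of the tip above, the snippet below reads a profile with pandas and keeps only phylum-level rows. The inline example data, the `#SampleID` comment line, the column names, and the `rank_prefix` helper are assumptions for illustration; check them against your own .sylphmpa file.

```python
import io

import pandas as pd

# Hypothetical miniature profile standing in for a real .sylphmpa file.
# The comment line and column names are assumed, not taken from the tool.
example = (
    "#SampleID\tsample1.fastq.gz\n"
    "clade_name\trelative_abundance\tsequence_abundance\n"
    "d__Archaea\t1.1\t0.9\n"
    "d__Archaea|p__Methanobacteriota\t1.1\t0.9\n"
    "d__Bacteria\t98.9\t99.1\n"
)

# comment='#' skips lines that start with '#', so the first
# non-comment line is used as the header.
profile = pd.read_csv(io.StringIO(example), sep="\t", comment="#")


def rank_prefix(clade_name: str) -> str:
    """Return the rank letter of the deepest level, e.g. 'p' for a phylum."""
    return clade_name.split("|")[-1].split("__")[0]


profile["rank"] = profile["clade_name"].map(rank_prefix)
phyla = profile[profile["rank"] == "p"]
print(phyla[["clade_name", "relative_abundance"]])
```

For a real file, replace `io.StringIO(example)` with the path to the profile.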

merge_sylph_taxprof.py - merge multiple taxonomic profiles

Merge multiple taxonomic profiles from sylph_to_taxprof.py into a single TSV table. Pass exactly one of ANI, relative_abundance, or sequence_abundance to --column.

python merge_sylph_taxprof.py *.sylphmpa --column {ANI, relative_abundance, sequence_abundance} -o output_table.tsv

Output format

clade_name                        sample1.fastq.gz  sample2.fastq.gz
d__Archaea                        0.0               1.1
d__Archaea|p__Methanobacteriota   0.0               0.0965
...
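
The merged table can be loaded the same way. The sketch below assumes the file is tab-separated with clade_name as the first column, as in the excerpt above; the inline text stands in for output_table.tsv, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Stand-in for output_table.tsv, using the rows shown in the excerpt above.
merged_text = (
    "clade_name\tsample1.fastq.gz\tsample2.fastq.gz\n"
    "d__Archaea\t0.0\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0\t0.0965\n"
)

# Index the table by clade so that each column is one sample.
merged = pd.read_csv(io.StringIO(merged_text), sep="\t", index_col="clade_name")

# Keep clades that have a non-zero value in at least one sample.
present = merged.loc[(merged > 0).any(axis=1)]

# Taxonomic depth is the number of '|' separators in the clade name;
# depth 0 corresponds to domain-level rows.
depth = merged.index.str.count(r"\|")
domains = merged[depth == 0]
print(domains)
```

Filtering by depth in this way is one option for comparing samples at a single taxonomic rank.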