Home

Awesome

csvtk - a cross-platform, efficient and practical CSV/TSV toolkit

Introduction

Similar to FASTA/Q format in field of Bioinformatics, CSV/TSV formats are basic and ubiquitous file formats in both Bioinformatics and data science.

People usually use spreadsheet software like MS Excel to process table data. However this is all by clicking and typing, which is not automated and is time-consuming to repeat, especially when you want to apply similar operations with different datasets or purposes.

You can also accomplish some CSV/TSV manipulations using shell commands, but more code is needed to handle the header line. Shell commands do not support selecting columns with column names either.

csvtk is convenient for rapid data investigation and also easy to integrate into analysis pipelines. It could save you lots of time in (not) writing Python/R scripts.

Table of Contents

<!-- START doctoc generated TOC please keep comment here to allow auto update --> <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> <!-- END doctoc generated TOC please keep comment here to allow auto update -->

Features

Subcommands

54 subcommands in total.

Information

Format conversion

Set operations

Edit

Transform

Ordering

Ploting

Misc

Installation

Download Page

csvtk is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.

Method 1: Download binaries (latest stable/dev version)

Just download compressed executable file of your operating system, and decompress it with tar -zxvf *.tar.gz command or other tools. And then:

  1. For Linux-like systems

    1. If you have root privilege simply copy it to /usr/local/bin:

       sudo cp csvtk /usr/local/bin/
      
    2. Or copy to anywhere in the environment variable PATH:

       mkdir -p $HOME/bin/; cp csvtk $HOME/bin/
      
  2. For windows, just copy csvtk.exe to C:\WINDOWS\system32.

Method 2: Install via conda (latest stable version) Anaconda Cloud

# >= v0.31.0
conda install -c conda-forge csvtk

# <= v0.31.0
conda install -c bioconda csvtk

Method 3: Install via homebrew

brew install csvtk

Method 4: For Go developer (latest stable/dev version)

go get -u github.com/shenwei356/csvtk/csvtk

Method 5: For ArchLinux AUR users (may be not the latest)

yaourt -S csvtk

Command-line completion

Bash:

# generate completion shell
csvtk genautocomplete --shell bash

# configure if never did.
# install bash-completion if the "complete" command is not found.
echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
echo "source ~/.bash_completion" >> ~/.bashrc

Zsh:

# generate completion shell
csvtk genautocomplete --shell zsh --file ~/.zfunc/_csvtk

# configure if never did
echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
echo "autoload -U compinit; compinit" >> ~/.zshrc

fish:

csvtk genautocomplete --shell fish --file ~/.config/fish/completions/csvtk.fish

Compared to csvkit

csvkit, attention: this table wasn't updated for many years.

FeaturescsvtkcsvkitNote
Read GzipYesYesread gzip files
Fields rangesYesYese.g. -f 1-4,6
Unselect fieldsYes--e.g. -1 for excluding first column
Fuzzy fieldsYes--e.g. ab* for columns with name prefix "ab"
Reorder fieldsYesYesit means -f 1,2 is different from -f 2,1
Rename columnsYes--rename with new name(s) or from existed names
Sort by multiple keysYesYesbash sort like operations
Sort by numberYes--e.g. -k 1:n
Multiple sortYes--e.g. -k 2:r -k 1:nr
Pretty outputYesYesconvert CSV to readable aligned table
Unique dataYes--unique data of selected fields
frequencyYes--frequencies of selected fields
SamplingYes--sampling by proportion
Mutate fieldsYes--create new columns from selected fields
ReplaceYes--replace data of selected fields

Similar tools:

Examples

More examples and tutorial.

Attention

  1. By default, csvtk assumes input files have header row, if not, switch flag -H on.

  2. By default, csvtk handles CSV files, use flag -t for tab-delimited files.

  3. Column names should be unique.

  4. By default, lines starting with # will be ignored, if the header row starts with #, please assign flag -C another rare symbol, e.g. $.

  5. Do not mix use field (column) numbers and names to specify columns to operate.

  6. The CSV parser requires all the lines have same numbers of fields/columns. Even lines with spaces will cause error. Use -I/--ignore-illegal-row to skip these lines if neccessary. You can also use "csvtk fix" to fix files with different numbers of columns in rows.

  7. If double-quotes exist in fields not enclosed with double-quotes, e.g.,

     x,a "b" c,1
    

    It would report error:

     bare `"` in non-quoted-field.
    

    Please switch on the flag -l or use csvtk fix-quotes to fix it.

  8. If somes fields have only a double-quote either in the beginning or in the end, e.g.,

     x,d "e","a" b c,1
    

    It would report an error:

     extraneous or missing " in quoted-field
    

    Please use csvtk fix-quotes to fix it, and use csvtk del-quotes to reset to the original format as needed.

Examples

  1. Pretty result

     $ csvtk pretty names.csv
     id   first_name   last_name   username
     --   ----------   ---------   --------
     11   Rob          Pike        rob
     2    Ken          Thompson    ken
     4    Robert       Griesemer   gri
     1    Robert       Thompson    abc
     NA   Robert       Abel        123
    
     $ csvtk pretty names.csv -S 3line
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      id   first_name   last_name   username
     ----------------------------------------
      11   Rob          Pike        rob
      2    Ken          Thompson    ken
      4    Robert       Griesemer   gri
      1    Robert       Thompson    abc
      NA   Robert       Abel        123
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
     $ csvtk pretty names.csv -S bold -w 5 -m 1-
     ┏━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
     ┃  id   ┃ first_name ┃ last_name ┃ username ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃  11   ┃    Rob     ┃   Pike    ┃   rob    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   2   ┃    Ken     ┃ Thompson  ┃   ken    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   4   ┃   Robert   ┃ Griesemer ┃   gri    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   1   ┃   Robert   ┃ Thompson  ┃   abc    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃  NA   ┃   Robert   ┃   Abel    ┃   123    ┃
     ┗━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━┛
    
  2. Summary of selected numeric fields, supporting "group-by"

     $ cat testdata/digitals2.csv \
         | csvtk summary -i -f f4:sum,f5:sum -g f1,f2 \
         | csvtk pretty
     f1    f2     f4:sum   f5:sum
     bar   xyz    7.00     106.00
     bar   xyz2   4.00     4.00
     foo   bar    6.00     3.00
     foo   bar2   4.50     5.00
    
  3. Select fields/columns (cut)

    • By index: csvtk cut -f 1,2
    • By names: csvtk cut -f first_name,username
    • Unselect: csvtk cut -f -1,-2 or csvtk cut -f -first_name
    • Fuzzy fields: csvtk cut -F -f "*_name,username"
    • Field ranges: csvtk cut -f 2-4 for column 2,3,4 or csvtk cut -f -3--1 for discarding column 1,2,3
    • All fields: csvtk cut -f 1- or csvtk cut -F -f "*"
  4. Search by selected fields (grep) (matched parts will be highlighted as red)

    • By exactly matching: csvtk grep -f first_name -p Robert -p Rob
    • By regular expression: csvtk grep -f first_name -r -p Rob
    • By pattern list: csvtk grep -f first_name -P name_list.txt
    • Remore rows containing missing data (NA): csvtk grep -F -f "*" -r -p "^$" -v
  5. Rename column names (rename and rename2)

    • Setting new names: csvtk rename -f A,B -n a,b or csvtk rename -f 1-3 -n a,b,c
    • Replacing with original names by regular express: csvtk rename2 -f 1- -p "(.*)" -r 'prefix_$1' for adding prefix to all column names.
  6. Edit data with regular expression (replace)

    • Remove Chinese charactors: csvtk replace -F -f "*_name" -p "\p{Han}+" -r ""
  7. Create new column from selected fields by regular expression (mutate)

    • In default, copy a column: csvtk mutate -f id
    • Extract prefix of data as group name (get "A" from "A.1" as group name): csvtk mutate -f sample -n group -p "^(.+?)\." --after sample
  8. Sort by multiple keys (sort)

    • By single column : csvtk sort -k 1 or csvtk sort -k last_name
    • By multiple columns: csvtk sort -k 1,2 or csvtk sort -k 1 -k 2 or csvtk sort -k last_name,age
    • Sort by number: csvtk sort -k 1:n or csvtk sort -k 1:nr for reverse number
    • Complex sort: csvtk sort -k region -k age:n -k id:nr
    • In natural order: csvtk sort -k chr:N
  9. Join multiple files by keys (join)

    • All files have same key column: csvtk join -f id file1.csv file2.csv
    • Files have different key columns: csvtk join -f "username;username;name" names.csv phone.csv adress.csv -k
  10. Filter by numbers (filter)

    • Single field: csvtk filter -f "id>0"
    • Multiple fields: csvtk filter -f "1-3>0"
    • Using --any to print record if any of the field satisfy the condition: csvtk filter -f "1-3>0" --any
    • fuzzy fields: csvtk filter -F -f "A*!=0"
  11. Filter rows by awk-like arithmetic/string expressions (filter2)

    • Using field index: csvtk filter2 -f '$3>0'
    • Using column names: csvtk filter2 -f '$id > 0'
    • Both arithmetic and string expressions: csvtk filter2 -f '$id > 3 || $username=="ken"'
    • More complicated: csvtk filter2 -H -t -f '$1 > 2 && $2 % 2 == 0'
  12. Plotting

    • plot histogram with data of the second column:

        csvtk -t plot hist testdata/grouped_data.tsv.gz -f 2 | display
      

      histogram.png

    • plot boxplot with data of the "GC Content" (third) column, group information is the "Group" column.

        csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" \
            -f "GC Content" --width 3 --title "Box plot" | display
        
      

      boxplot.png

    • plot horiz boxplot with data of the "Length" (second) column, group information is the "Group" column.

       csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" -f "Length"  \
           --height 3 --width 5 --horiz --title "Horiz box plot" | display
      

    boxplot2.png

    • plot line plot with X-Y data

        csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group | display
        
      

      lineplot.png

    • plot scatter plot with X-Y data

        csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group --scatter | display
        
      

      scatter.png

Acknowledgements

We are grateful to Zhiluo Deng and Li Peng for suggesting features and reporting bugs.

Thanks Albert Vilella for feature suggestions, which makes csvtk feature-rich。

Contact

Create an issue to report bugs, propose new functions or ask for help.

Or leave a comment.

License

MIT License

Starchart

<img src="https://starchart.cc/shenwei356/csvtk.svg" alt="Stargazers over time" style="max-width: 100%">