🪓 hck

<p align="center"> <a href="https://github.com/sstadick/hck/actions?query=workflow%3ACheck"><img src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build Status"></a> <img src="https://img.shields.io/crates/l/hck.svg" alt="license"> <a href="https://crates.io/crates/hck"><img src="https://img.shields.io/crates/v/hck.svg?colorB=319e8c" alt="Version info"></a><br> A sharp <i>cut(1)</i> clone. </p>

hck is a shortening of hack, a rougher form of cut.

A close-to-drop-in replacement for cut that can use a regex delimiter instead of a fixed string. Additionally, this tool allows for specifying the order of the output columns using the same column selection syntax as cut (see below for examples).

No single feature of hck on its own makes it stand out over awk, cut, xsv, or other such tools. Where hck excels is in making common things easy, such as reordering output fields or splitting records on an unusual delimiter. It is meant to be simple and easy to use while exploring datasets. Think of it as filling the gap between cut and awk.

hck is dual-licensed under MIT or the UNLICENSE.

Features

- Regex or string-literal field delimiters
- Reordering of output columns via the field selection order
- Selection and exclusion of columns by index or by header regex
- Configurable output record separator
- Automagic decompression of common compression formats
- Speed in the same ballpark as cut, awk, and tsv-utils

Non-goals

Install

# Homebrew
brew tap sstadick/hck
brew install hck

# Conda (note: this version lags by about a day)
conda install -c conda-forge hck

# MacPorts (note: version may lag behind latest)
sudo port selfupdate
sudo port install hck

# Debian / Ubuntu
curl -LO https://github.com/sstadick/hck/releases/download/<latest>/hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb

* Built with profile-guided optimizations

# With cargo (assumes you are on stable Rust)
export RUSTFLAGS='-C target-cpu=native'
cargo install hck

# From source via just
# NOTE: this won't work on Windows, see CI for the linked issue
cargo install just
git clone https://github.com/sstadick/hck
cd hck
just install-native

Packaging status


Examples

Splitting with a string literal

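# -L treats the delimiter as a string literal instead of a regex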
❯ hck -Ld' ' -f1-3,5- ./README.md | head -n4
#       🪓      hck

<p      align="center">
                <a      src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build      Status"></a>

Splitting with a regex delimiter

# note, '\s+' is the default
❯ ps aux | hck -f1-3,5- | head -n4
USER    PID     %CPU    VSZ     RSS     TTY     STAT    START   TIME    COMMAND
root    1       0.0     169452  13472   ?       Ss      Jun21   0:19    /sbin/init      splash
root    2       0.0     0       0       ?       S       Jun21   0:00    [kthreadd]
root    3       0.0     0       0       ?       I<      Jun21   0:00    [rcu_gp]

Reordering output columns

❯ ps aux | hck -f2,1,3- | head -n4
PID     USER    %CPU    %MEM    VSZ     RSS     TTY     STAT    START   TIME    COMMAND
1       root    0.0     0.0     169452  13472   ?       Ss      Jun21   0:19    /sbin/init      splash
2       root    0.0     0.0     0       0       ?       S       Jun21   0:00    [kthreadd]
3       root    0.0     0.0     0       0       ?       I<      Jun21   0:00    [rcu_gp]

Excluding output columns

❯ ps aux | hck -e3,5 | head -n4
USER    PID     %MEM    RSS     TTY     STAT    START   TIME    COMMAND
root    1       0.0     14408   ?       Ss      Jun21   0:27    /sbin/init      splash
root    2       0.0     0       ?       S       Jun21   0:01    [kthreadd]
root    3       0.0     0       ?       I<      Jun21   0:00    [rcu_gp]

Excluding output columns by header regex

❯  ps aux | hck -r -E "CPU" -E "^ST.*" | head -n4
USER    PID     %MEM    VSZ     RSS     TTY     TIME    COMMAND
root    1       0.0     170224  14408   ?       0:27    /sbin/init      splash
root    2       0.0     0       0       ?       0:01    [kthreadd]
root    3       0.0     0       0       ?       0:00    [rcu_gp]

Changing the output record separator

❯ ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0

Select columns with regex

# Note: the output order matches the order of the -F args
❯ ps aux | hck -r -F '^ST.*' -F '^USER$' | head -n4
STAT    START   USER
Ss      Jun21   root
S       Jun21   root
I<      Jun21   root

Automagic decompression

❯ gzip ./README.md
❯ hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
#       🪓      hck

<p      align="center">
                <a      src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build      Status"></a>

Splitting on multiple characters

# with string literal
❯ printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
❯ hck -Ld'$;$' -f3,4 ./test.txt
a       test
3       four
# with an interesting regex
❯ printf 'this123__is456--a789-test\na129_-b849-_3109_-four\n' > test.txt
❯ hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a       test
3       four

Splitting by-index and by-header

This one requires some explaining first. Basically, by-index and by-header selections each have their own "order", and the two orders are then merged. For example:

❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -f3 -F 'b' -F 'a'
b:c:a
2:3:1

In the by-index group, we've specified column 3 to be in output position 0. In the by-header group, we've specified that column b be in position 0. The by-index and by-header selections are then merged; when two fields are specified for the same output position, they are output in input order (input meaning the order of the columns in the input data).

This can lead to unexpected outcomes, such as the following example, where a now comes first in the output compared to the example above.

❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -f3 -F 'a'
a:c
1:3

Takeaway: be careful when a specific output order is desired and you are mixing and matching by-index and by-header field selections.
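When order matters, one workaround is to stick to a single selection type. For instance, consistent with the regex-selection example above (output follows the order of the -F args), reversing the columns with only by-header selections should behave as sketched here:

❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -F 'c' -F 'b' -F 'a'
c:b:a
3:2:1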

Benchmarks

This set of benchmarks is simply meant to show that hck is in the same ballpark as other tools. These benchmarks aim to capture real-world usage of the tools, so in the multi-space-delimiter benchmark for cut, for example, we use tr to squeeze the space runs down to a single space and then pipe to cut.

Note this is not meant to be an authoritative set of benchmarks, it is just meant to give a relative sense of performance of different ways of accomplishing the same tasks.

Hardware

Ubuntu 20, AMD Ryzen 9 3950X 16-core processor, 64 GB DDR4 memory, 1 TB NVMe drive

Data

The all_train.csv data is used.

This is a CSV dataset with 7 million lines. We test it both using `,` as the delimiter, and then also using `\s\s\s` as a delimiter.
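The exact preprocessing used to produce the multi-space variant is not documented here; a hypothetical one-liner for deriving it (file names and spacing are illustrative) might be:

# Hypothetical: replace each comma with a run of spaces to build the multi-space file
sed 's/,/   /g' ./all_train.csv > ./hyper_data_multichar.txt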

PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.

Tools

- cut
- mawk
- xsv
- tsv-utils
- choose

Single character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.198 ± 0.015 | 1.185 | 1.215 | 1.00 |
| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.349 ± 0.029 | 1.320 | 1.389 | 1.13 ± 0.03 |
| `hck -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.649 ± 0.023 | 1.624 | 1.673 | 1.38 ± 0.03 |
| `hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.869 ± 0.019 | 1.842 | 1.894 | 1.56 ± 0.02 |
| `tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null` | 1.702 ± 0.021 | 1.687 | 1.734 | 1.42 ± 0.02 |
| `choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null` | 4.285 ± 0.092 | 4.214 | 4.428 | 3.58 ± 0.09 |
| `xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null` | 5.693 ± 0.042 | 5.635 | 5.745 | 4.75 ± 0.07 |
| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null` | 4.993 ± 0.029 | 4.959 | 5.030 | 4.17 ± 0.06 |
| `cut -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 7.541 ± 1.250 | 6.827 | 9.769 | 6.30 ± 1.05 |

Multi-character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 1.718 ± 0.003 | 1.715 | 1.722 | 1.00 |
| `hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 2.191 ± 0.072 | 2.135 | 2.291 | 1.28 ± 0.04 |
| `hck -d' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.180 ± 0.029 | 2.135 | 2.208 | 1.27 ± 0.02 |
| `hck -d' ' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.542 ± 0.014 | 2.529 | 2.565 | 1.48 ± 0.01 |
| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 8.597 ± 0.023 | 8.575 | 8.631 | 5.00 ± 0.02 |
| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 8.890 ± 0.013 | 8.871 | 8.903 | 5.17 ± 0.01 |
| `hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 10.014 ± 0.247 | 9.844 | 10.449 | 5.83 ± 0.14 |
| `hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 10.173 ± 0.035 | 10.111 | 10.193 | 5.92 ± 0.02 |
| `choose -f ' ' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 6.537 ± 0.148 | 6.452 | 6.799 | 3.80 ± 0.09 |
| `choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 10.656 ± 0.219 | 10.484 | 10.920 | 6.20 ± 0.13 |
| `choose -f '\s' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 37.238 ± 0.153 | 37.007 | 37.383 | 21.67 ± 0.10 |
| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null` | 6.673 ± 0.064 | 6.595 | 6.734 | 3.88 ± 0.04 |
| `awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 5.947 ± 0.098 | 5.896 | 6.121 | 3.46 ± 0.06 |
| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 11.080 ± 0.215 | 10.881 | 11.376 | 6.45 ± 0.13 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| cut -d ' ' -f1,8,19 > /dev/null` | 7.471 ± 0.066 | 7.397 | 7.561 | 4.35 ± 0.04 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| xsv select -d ' ' 1,8,19 --no-headers > /dev/null` | 6.172 ± 0.068 | 6.071 | 6.235 | 3.59 ± 0.04 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| hck -Ld' ' -f1,8,19 > /dev/null` | 6.171 ± 0.112 | 5.975 | 6.243 | 3.59 ± 0.07 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tsv-select -d ' ' -f 1,8,19 > /dev/null` | 6.202 ± 0.130 | 5.984 | 6.290 | 3.61 ± 0.08 |

Decompression

The following table indicates the file extension / binary pairs that are used to try to decompress a file when the -z option is specified:

| Extension | Binary | Type |
|:---|:---|:---|
| `*.gz` | Native | gzip |
| `*.tgz` | `gzip -d -c` | gzip |
| `*.bz2` | `bzip2 -d -c` | bzip2 |
| `*.tbz2` | `bzip2 -d -c` | bzip2 |
| `*.xz` | `xz -d -c` | xz |
| `*.txz` | `xz -d -c` | xz |
| `*.lz4` | `lz4 -d -c` | lz4 |
| `*.lzma` | `xz --format=lzma -d -c` | lzma |
| `*.br` | `brotli -d -c` | brotli |
| `*.zst` | `zstd -d -c` | zstd |
| `*.zstd` | `zstd -q -d -c` | zstd |
| `*.Z` | `uncompress -c` | uncompress |

When a file with one of the extensions above is found, hck will open a subprocess running the decompression tool listed above and read from that tool's output. If the binary can't be found, hck will try to read the compressed file as-is. See grep_cli for the source code. The end goal is to add a preprocessor similar to ripgrep's. Where there are multiple binaries for a given type, they are tried in the order listed above.
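For example, any extension in the table works the same way as the earlier gzip example; a zstd sketch, recreating the test.txt file from the multi-character splitting example (output shown is what the table implies, not a verified run):

❯ printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
❯ zstd ./test.txt
❯ hck -Ld'$;$' -f3,4 -z ./test.txt.zst
a       test
3       four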

Profile Guided Optimization

See the pgo*.sh scripts for how to build with optimizations. You will need to install the llvm tools via rustup component add llvm-tools-preview for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; e.g. it seems to have a larger effect on macOS, and a greater effect on the regex codepath.
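For reference, a minimal sketch of the generic cargo PGO workflow (the pgo*.sh scripts in the repo are authoritative; the paths and training workload here are illustrative):

# 1. Install the LLVM tools used to merge profiles
rustup component add llvm-tools-preview
# 2. Build an instrumented binary that writes profile data
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
# 3. Exercise the hot paths with representative workloads
./target/release/hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null
./target/release/hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null
# 4. Merge the raw profiles (llvm-profdata lives in the rustup toolchain's lib dir)
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
# 5. Rebuild using the merged profile
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release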

TODO

- More packages and builds, using https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml as a reference

References