smash
A CLI tool to smash through files and find duplicates efficiently by slicing a file (or blob) into multiple segments and computing a hash using a fast non-cryptographic algorithm such as xxhash or murmur3.
Amongst the highlights of smash:
- Super fast analysis of large files thanks to slicing.
- Suited to finding duplicates on bandwidth-constrained networks or devices and in very large files, but plenty capable on smaller ones!
- Supports a variety of non-cryptographic algorithms (see algorithms supported).
- Read-only view of the underlying filesystem when analysing.
- Reports on duplicate files & empty (0 byte) files.
- Outputs a report in JSON that you can process with tools like jq (see examples below or the vhs tapes).
- Used to dedupe multiple terabytes of astrophysics datasets, images and video content, and run regularly to report duplicates.
smash does not support pruning of duplicates or empty files natively, and you are encouraged to vet the output report before pruning via automated tools.
The name comes from a prototype tool called SmartHash, written many years ago in C/ASM (its source is now lost and it was too hard to modernise). It operated on a similar concept of slicing and hashing (with CRC32, then later MD5).
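The slicing idea can be sketched in a few lines of Python. This is only an illustration, not smash's actual implementation (smash is written in Go): the function name `sliced_digest` is hypothetical, `hashlib.blake2b` stands in for xxhash (which is not in Python's standard library), the evenly spaced slice offsets are an assumption, and the default values simply mirror the --slices, --slice-size and --slice-threshold flags listed under Usage.

```python
import hashlib
import os


def sliced_digest(path, slices=4, slice_size=8192, threshold=102400):
    """Hash a few evenly spaced slices of a file instead of every byte.

    Hypothetical helper: slice placement is an assumption, and blake2b
    stands in for a fast non-cryptographic hash such as xxhash.
    """
    size = os.path.getsize(path)
    digest = hashlib.blake2b(digest_size=16)
    # Mix in the size so that length differences are reflected in the digest.
    digest.update(str(size).encode())
    with open(path, "rb") as handle:
        if size < max(threshold, slice_size) or slices < 2:
            # Small files are cheap to read, so hash them in full.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        else:
            # Spread the slices from the first byte to the final slice_size bytes.
            step = (size - slice_size) // (slices - 1)
            for index in range(slices):
                handle.seek(index * step)
                digest.update(handle.read(slice_size))
    return digest.hexdigest()
```

In this sketch, two large files that differ only in a region no slice covers produce the same digest. That is the trade-off slicing makes for speed, and hashing the full file (as --disable-slicing does) avoids it.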
Installation
You can download the latest binaries from GitHub Releases or use our simple installer script, which currently supports Linux, macOS, FreeBSD & Windows:
bash <(curl -s https://raw.githubusercontent.com/thushan/smash/main/install.sh)
It will download the latest version & extract it to its own folder for you.
Alternatively, you can install it via go:
go install github.com/thushan/smash@latest
smash has been developed on Linux (Pop!_OS & Fedora) and tested on macOS, FreeBSD & Windows.
Usage
[!IMPORTANT]
Starting from v0.9.0, smash will only look for duplicates in the current folder. To smash sub-folders, use the --recurse or -r switch.
Usage:
  smash [flags] [locations-to-smash]

Flags:
      --algorithm algorithm    Algorithm to use to hash files. Supported: xxhash, murmur3, md5, sha512, sha256 (full list, see readme) (default xxhash)
      --base strings           Base directories to use for comparison Eg. --base=/c/dos,/c/dos/run/,/run/dos/run
      --disable-autotext       Disable detecting text-files to opt for a full hash for those
      --disable-meta           Disable storing of meta-data to improve hashing mismatches
      --disable-slicing        Disable slicing & hash the full file instead
      --exclude-dir strings    Directories to exclude separated by comma Eg. --exclude-dir=.git,.idea
      --exclude-file strings   Files to exclude separated by comma Eg. --exclude-file=.gitignore,*.csv
  -h, --help                   help for smash
      --ignore-empty           Ignore empty/zero byte files (default true)
      --ignore-hidden          Ignore hidden files & folders Eg. files/folders starting with '.' (default true)
      --ignore-system          Ignore system files & folders Eg. '$MFT', '.Trash' (default true)
  -L, --max-size int           Maximum file size to consider for hashing (in bytes)
  -p, --max-threads int        Maximum threads to utilise (default 16)
  -w, --max-workers int        Maximum workers to utilise when smashing (default 16)
  -G, --min-size int           Minimum file size to consider for hashing (in bytes)
      --nerd-stats             Show nerd stats
      --no-output              Disable report output
      --no-progress            Disable progress updates
      --no-top-list            Hides top x duplicates list
  -o, --output-file string     Export analysis as JSON (generated automatically otherwise)
      --profile                Enable Go Profiler - see localhost:1984/debug/pprof
      --progress-update int    Update progress every x seconds (default 5)
  -r, --recurse                Recursively search directories for files
      --show-duplicates        Show full list of duplicates
      --show-top int           Show the top x duplicates (default 10)
  -q, --silent                 Run in silent mode
      --slice-size int         Size of a Slice (in bytes) (default 8192)
      --slice-threshold int    Threshold to use for slicing (in bytes) - if file is smaller than this, it won't be sliced (default 102400)
      --slices int             Number of Slices to use (default 4)
      --verbose                Run in verbose mode
  -v, --version                Show version information
See the full list of algorithms supported.
Examples
Examples are given in Unix format, but apply to Windows as well.
[!TIP]
To recursively smash through directories, use the --recurse or -r switch. By default, smash will only look in the current folder (from v0.9+).
Basic
To check for duplicates in a single path (e.g. ~/media/photos) and output the report to report.json:
$ ./smash ~/media/photos -r -o report.json
You can then look at report.json with jq to check duplicates:
$ jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
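If jq is not available, the same listing can be produced with a short Python script. This is a sketch that assumes only what the jq filter above implies: an analysis.dupes collection whose entries carry location, path and filename fields. The function name `duplicate_paths` and the sample entry are made up for illustration, and the real report may contain further fields that are ignored here.

```python
import json


def duplicate_paths(report):
    """Join location, path and filename for every entry under analysis.dupes."""
    entries = report["analysis"]["dupes"]
    # Like jq's `.[]`, accept either an array or an object of entries.
    if isinstance(entries, dict):
        entries = entries.values()
    return ["/".join([e["location"], e["path"], e["filename"]]) for e in entries]


# Made-up sample shaped after the jq filter; real reports come from `smash -o`.
sample = json.loads(
    '{"analysis": {"dupes": ['
    '{"location": "/home/me", "path": "media/photos", "filename": "a.jpg"}'
    "]}}"
)
for full_path in duplicate_paths(sample):
    print(full_path)
```

To run it against a real report, load the file with `json.load(open("report.json"))` and pass the result to `duplicate_paths`.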
Show Empty Files
By default, smash ignores empty files but can report on them with the --ignore-empty=false argument:
$ ./smash ~/media/photos -r --ignore-empty=false -o report.json
You can then look at report.json with jq to check empty files:
$ jq '.analysis.empty[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
Show Top 50 Duplicates
By default, smash shows the top 10 duplicate files in the CLI and leaves the rest for the report. You can change that with the --show-top=50 argument to show the top 50 instead:
$ ./smash ~/media/photos -r --show-top=50
Multiple Directories
To check across multiple directories, which can be on different drives or mounts (e.g. ~/media/photos and /mnt/my-usb-drive/photos):
$ ./smash -r ~/media/photos /mnt/my-usb-drive/photos
Smash will find and report all duplicates within any number of directories passed in.
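Conceptually, finding duplicates across several directories comes down to grouping files by their digest, whichever directory they live in. The sketch below illustrates that grouping under simplifying assumptions: it hashes whole files with SHA-256 rather than slicing, skips empty files (loosely mirroring the --ignore-empty default), and the names `file_digest` and `find_duplicates` are hypothetical. It is not how smash is implemented.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots, recurse=True):
    """Group files found under several roots by content digest."""
    groups = defaultdict(list)
    for root in roots:
        for folder, subfolders, files in os.walk(root):
            if not recurse:
                subfolders.clear()  # stop os.walk from descending further
            for name in files:
                path = os.path.join(folder, name)
                # Skip broken links and empty files.
                if not os.path.isfile(path) or os.path.getsize(path) == 0:
                    continue
                groups[file_digest(path)].append(path)
    # Only digests seen more than once represent duplicates.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Because the grouping key is the digest alone, it makes no difference whether two matching files sit in the same folder or on different mounts.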
Exclude Files or Directories
You can exclude certain directories or files with the --exclude-dir and --exclude-file switches, including wildcard characters:
$ ./smash -r --exclude-dir=.git,.svn --exclude-file='.gitignore,*.csv' ~/media/photos
For example, to exclude specific hidden folders on Unix (those that start with ., such as .config or .gnome):
$ ./smash -r --exclude-dir=.config,.gnome ~/media/photos
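To get a feel for how a comma-separated list of wildcard patterns can be applied to file names, here is a small sketch using shell-style matching from Python's standard library. The helper `is_excluded` is hypothetical, and smash's own matching rules may differ (for example in case sensitivity or in which part of the path is compared), so treat this as an approximation only.

```python
from fnmatch import fnmatchcase


def is_excluded(name, patterns):
    """Return True when name matches any pattern in a comma-separated list.

    Assumption: patterns follow shell-style wildcards (*, ?, [seq]) and
    are compared case-sensitively against the bare file name.
    """
    return any(fnmatchcase(name, pattern) for pattern in patterns.split(","))


print(is_excluded("data.csv", ".gitignore,*.csv"))
print(is_excluded("photo.jpg", ".gitignore,*.csv"))
```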
Disabling Slicing & Getting Full Hash
By default, smash slices a file into multiple segments and hashes only those parts of the file.
If you prefer not to use slicing for a run, you can disable slicing with:
$ ./smash -r --disable-slicing ~/media/photos
Changing Hashing Algorithms
By default, smash uses xxhash, an extremely fast non-cryptographic hash algorithm. However, you can choose from a variety of algorithms as documented.
To use another supported algorithm, use the --algorithm switch:
$ ./smash -r --algorithm=murmur3 ~/media/photos
Acknowledgements
This project was possible thanks to the following projects or folks.
- @jqlang/jq - without jq we'd be a bit lost!
- @wader/fq - countless nights of inspecting binary blobs!
- @cespare/xxhash - xxhash implementation
- @spaolacci/murmur3 - murmur3 implementation
- @puzpuzpuz/xsync - Amazingly efficient map implementation
- @pterm/pterm - Amazing TUI framework used
- @spf13/cobra - CLI Magic with Cobra
- @golangci/golangci-lint - Go Linter
- @dkorunic/betteralign - Go alignment checker
Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH, SpencerB, EmadA, ChrisE, AngelaB, LisaA, YousefI, JeffG, MattP
License
Copyright (c) Thushan Fernando and licensed under Apache License 2.0