smash

smash is a CLI tool that finds duplicate files efficiently by slicing a file (or blob) into multiple segments and computing a hash with a fast non-cryptographic algorithm such as xxhash or murmur3.
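The slice-and-hash idea can be illustrated with a minimal Go sketch. It is not smash's actual implementation: the function name `sliceHash`, the evenly spaced segment layout, and the use of the standard library's FNV-1a (standing in for xxhash or murmur3 so that no external module is needed) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sliceHash (hypothetical helper) hashes `slices` evenly spaced segments of
// `sliceSize` bytes instead of every byte of data. FNV-1a is a stand-in for
// xxhash or murmur3; the segment layout is an assumption, not smash's own.
func sliceHash(data []byte, slices, sliceSize int) uint64 {
	h := fnv.New64a()
	// Inputs too small to slice are hashed in full.
	if slices < 2 || sliceSize < 1 || len(data) <= slices*sliceSize {
		h.Write(data)
		return h.Sum64()
	}
	// Spread segment starts evenly between the first byte and the last
	// offset at which a whole segment still fits.
	last := len(data) - sliceSize
	for i := 0; i < slices; i++ {
		start := i * last / (slices - 1)
		h.Write(data[start : start+sliceSize])
	}
	return h.Sum64()
}

func main() {
	a := make([]byte, 1<<20) // two identical 1 MiB blobs
	b := make([]byte, 1<<20)
	fmt.Println(sliceHash(a, 4, 8192) == sliceHash(b, 4, 8192)) // prints true

	b[0] = 'X' // change a byte inside the first sampled segment
	fmt.Println(sliceHash(a, 4, 8192) == sliceHash(b, 4, 8192)) // prints false
}
```

Only 4 × 8192 bytes of each 1 MiB blob are fed to the hash here, which is where the speed-up in this sketch comes from.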

Amongst the highlights of smash:

smash does not natively support pruning of duplicates or empty files. You are encouraged to vet the output report before pruning with automated tools.

<p align="center"> <img src="https://vhs.charm.sh/vhs-6UTX5Yc6CIQ6Y3lzulLKYF.gif" alt="Made with VHS"><br/> <sub> <sup>Find duplicates in the <a href="https://github.com/torvalds/linux">linux/drivers</a> source tree with <code>smash</code> (see our <a href="docs/demos.md">🍿 other demos</a>). Made with <a href="https://vhs.charm.sh" target="_blank">vhs</a>!</sup> </sub> </p>

The name comes from a prototype tool called SmartHash, written many years ago in C/ASM, whose source is now lost and would be too hard to modernise. It operated on a similar concept of slicing and hashing (with CRC32, then later MD5).

Installation

Operating Systems

You can download the latest binaries from GitHub Releases or use our simple installer script, which currently supports Linux, macOS, FreeBSD & Windows:

bash <(curl -s https://raw.githubusercontent.com/thushan/smash/main/install.sh)

It will download the latest version & extract it to its own folder for you.

Alternatively, you can install it via go:

go install github.com/thushan/smash@latest

smash has been developed on Linux (Pop!_OS & Fedora), tested on macOS, FreeBSD & Windows.

Usage

[!IMPORTANT]

From v0.9.0 onwards, smash only looks for duplicates in the current folder. To smash sub-folders, use the --recurse or -r switch.

Usage:
  smash [flags] [locations-to-smash]

Flags:
      --algorithm algorithm    Algorithm to use to hash files. Supported: xxhash, murmur3, md5, sha512, sha256 (full list, see readme) (default xxhash)
      --base strings           Base directories to use for comparison Eg. --base=/c/dos,/c/dos/run/,/run/dos/run
      --disable-autotext       Disable detecting text-files to opt for a full hash for those
      --disable-meta           Disable storing of meta-data to improve hashing mismatches
      --disable-slicing        Disable slicing & hash the full file instead
      --exclude-dir strings    Directories to exclude separated by comma Eg. --exclude-dir=.git,.idea
      --exclude-file strings   Files to exclude separated by comma Eg. --exclude-file=.gitignore,*.csv
  -h, --help                   help for smash
      --ignore-empty           Ignore empty/zero byte files (default true)
      --ignore-hidden          Ignore hidden files & folders Eg. files/folders starting with '.' (default true)
      --ignore-system          Ignore system files & folders Eg. '$MFT', '.Trash' (default true)
  -L, --max-size int           Maximum file size to consider for hashing (in bytes)
  -p, --max-threads int        Maximum threads to utilise (default 16)
  -w, --max-workers int        Maximum workers to utilise when smashing (default 16)
  -G, --min-size int           Minimum file size to consider for hashing (in bytes)
      --nerd-stats             Show nerd stats
      --no-output              Disable report output
      --no-progress            Disable progress updates
      --no-top-list            Hides top x duplicates list
  -o, --output-file string     Export analysis as JSON (generated automatically otherwise)
      --profile                Enable Go Profiler - see localhost:1984/debug/pprof
      --progress-update int    Update progress every x seconds (default 5)
  -r, --recurse                Recursively search directories for files
      --show-duplicates        Show full list of duplicates
      --show-top int           Show the top x duplicates (default 10)
  -q, --silent                 Run in silent mode
      --slice-size int         Size of a Slice (in bytes) (default 8192)        
      --slice-threshold int    Threshold to use for slicing (in bytes) - if file is smaller than this, it won't be sliced (default 102400)
      --slices int             Number of Slices to use (default 4)
      --verbose                Run in verbose mode
  -v, --version                Show version information

See the full list of algorithms supported.
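To make the interaction of the --slices, --slice-size and --slice-threshold defaults concrete, here is a rough Go sketch that estimates how many bytes get hashed for a given file size. It assumes, for simplicity, that a sliced file contributes exactly `slices × slice-size` bytes; smash's real read pattern may differ, and `bytesToHash` is a hypothetical helper, not part of smash.

```go
package main

import "fmt"

// Defaults copied from the flag listing above.
const (
	defaultSlices         = 4      // --slices
	defaultSliceSize      = 8192   // --slice-size, in bytes
	defaultSliceThreshold = 102400 // --slice-threshold, in bytes
)

// bytesToHash (hypothetical) estimates how many bytes are hashed for a file.
// Files smaller than the threshold, or any file when slicing is disabled,
// are hashed in full.
func bytesToHash(fileSize int64, slicingDisabled bool) int64 {
	if slicingDisabled || fileSize < defaultSliceThreshold {
		return fileSize
	}
	return defaultSlices * defaultSliceSize
}

func main() {
	for _, size := range []int64{50_000, 102_400, 4 << 30} {
		fmt.Printf("%d-byte file: %d bytes hashed\n", size, bytesToHash(size, false))
	}
}
```

Under these assumptions a 4 GiB file and a 100 KiB file both cost 32768 bytes of hashing, while a 50 KB file is read in full.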

Examples

Examples are given in Unix format, but apply to Windows as well.

[!TIP]

To recursively smash through directories, use the --recurse or -r switch.

By default, smash only looks in the current folder (from v0.9+).

Basic

To check for duplicates in a single path (e.g. ~/media/photos) and write the report to report.json:

$ ./smash ~/media/photos -r -o report.json

You can then look at report.json with jq to check duplicates:

$ jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l

Show Empty Files

By default, smash ignores empty files but can report on them with the --ignore-empty=false argument:

$ ./smash ~/media/photos -r --ignore-empty=false -o report.json

You can then look at report.json with jq to check empty files:

$ jq '.analysis.empty[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l

Show Top 50 Duplicates

By default, smash shows the top 10 duplicate files in the CLI and leaves the rest for the report. You can change that with the --show-top argument, for example --show-top=50 to show the top 50 instead:

$ ./smash ~/media/photos -r --show-top=50

Multiple Directories

To check across multiple directories - which can be different drives, or mounts (Eg. ~/media/photos and /mnt/my-usb-drive/photos):

$ ./smash -r ~/media/photos /mnt/my-usb-drive/photos

smash will find and report all duplicates across any number of directories passed in.

Exclude Files or Directories

You can exclude certain directories or files with the --exclude-dir and --exclude-file switches including wildcard characters:

$ ./smash -r --exclude-dir=.git,.svn --exclude-file=".gitignore,*.csv" ~/media/photos

For example, to exclude specific hidden folders on Unix (those that start with a ., such as .config or .gnome):

$ ./smash -r --exclude-dir=.config,.gnome ~/media/photos
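As an illustration of how comma-separated wildcard patterns such as `.gitignore,*.csv` can be evaluated, the following Go sketch uses `filepath.Match` from the standard library. Whether smash uses the same matching rules is an assumption; `excluded` is a hypothetical helper written for this example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// excluded (hypothetical) reports whether name matches any pattern in a
// comma-separated list. Malformed patterns are treated as non-matching.
func excluded(name, patterns string) bool {
	for _, p := range strings.Split(patterns, ",") {
		if ok, err := filepath.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{".gitignore", "sales.csv", "photo.jpg"} {
		fmt.Printf("%-12s excluded=%v\n", name, excluded(name, ".gitignore,*.csv"))
	}
}
```

Here `.gitignore` and `sales.csv` are reported as excluded, while `photo.jpg` is not.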

Disabling Slicing & Getting Full Hash

By default, smash slices a file into multiple segments and hashes only those segments, which is more efficient than hashing the whole file.

If you prefer not to use slicing for a run, you can disable slicing with:

$ ./smash -r --disable-slicing ~/media/photos

Changing Hashing Algorithms

By default, smash uses xxhash, an extremely fast non-cryptographic hash algorithm. However, you can choose from a variety of algorithms as documented.

To use another supported algorithm, use the --algorithm switch:

$ ./smash -r --algorithm=murmur3 ~/media/photos
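Selecting a hash by name is commonly done with a lookup table of constructors. The Go sketch below shows the pattern using only standard-library algorithms (md5, sha256, sha512); xxhash and murmur3 live in third-party modules and are left out to keep the example self-contained. The `algorithms` table and `newHasher` function are hypothetical and are not taken from smash's source.

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"crypto/sha512"
	"fmt"
	"hash"
	"strings"
)

// algorithms maps a name, as it might be given on the command line, to a
// constructor. Only stdlib hashes are listed in this sketch.
var algorithms = map[string]func() hash.Hash{
	"md5":    md5.New,
	"sha256": sha256.New,
	"sha512": sha512.New,
}

// newHasher (hypothetical) returns a fresh hasher for the named algorithm.
func newHasher(name string) (hash.Hash, error) {
	ctor, ok := algorithms[strings.ToLower(name)]
	if !ok {
		return nil, fmt.Errorf("unsupported algorithm %q", name)
	}
	return ctor(), nil
}

func main() {
	h, err := newHasher("sha256")
	if err != nil {
		panic(err)
	}
	h.Write([]byte("smash"))
	fmt.Printf("%x\n", h.Sum(nil)) // hex digest of the input
}
```

Because every entry satisfies `hash.Hash`, the rest of a program can stay unaware of which algorithm was chosen.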

Acknowledgements

This project was made possible thanks to the following projects and folks.

Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH, SpencerB, EmadA, ChrisE, AngelaB, LisaA, YousefI, JeffG, MattP

License

Copyright (c) Thushan Fernando and licensed under Apache License 2.0