rling - a better rli

rling is similar to the rli utility found in hashcat-utils, but much, much faster.

Table of contents

* General info
* Technologies
* Setup
* Examples
* Features
* Status
* Inspiration
* Contact

General info

In July 2020, @tychotithonus asked a simple question: could rli, in theory, be faster? Answering that question took the CynoSurePrime team down several roads, looking for "the better ways" to handle the problem, and it ultimately turned into a "how many nanoseconds" race!

The essential task of removing lines from a file (or a database) has been fundamental to computing since the earliest days, and rli seems "good enough" for most purposes. But when the files get large, rli's RAM use is high and its performance is not sufficient to the task at hand. @tychotithonus also wanted a few new features.
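
As a rough, single-threaded illustration only (not rling's actual implementation), the core task can be sketched with standard awk; the file names here are placeholders:

awk 'NR==FNR { seen[$0]; next } !($0 in seen)' remove.txt big.txt > out.txt

This holds every line of remove.txt in memory, then streams big.txt and prints only the lines it has not seen. rling attacks the same job with hashing (xxHash), multi-threaded sorting (qsort_mt), and a thread pool (yarn), as credited in the Inspiration section below.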

The performance of rling is impressive (these measurements were taken on a Power8 system with 80 cores). 1billion.txt is a ~10 GB file containing 1,000,000,000 lines. rem is a file containing 6 lines that match lines scattered throughout 1billion.txt.

Program    Input         Remove        Memory  Time
rli        1billion.txt  rem           59.7g   12m37s
rli        1billion.txt  1billion.txt  59.7g   22m14s
rling      1billion.txt  rem           38.0g   22s
rling      1billion.txt  1billion.txt  38.0g   1m15s
rling -b   1billion.txt  rem           17.0g   55s
rling -b   1billion.txt  1billion.txt  17.0g   1m36s

Technologies

Setup

There are several precompiled binaries included with the distribution. If one of them matches your system, you are done. If not, there are a few things to watch out for when building from source.
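
If you do need to build from source, a minimal sketch, assuming the distribution ships a Makefile (an assumption; the actual targets for your platform may differ):

make   # hypothetical build step; check the included sources for platform-specific targets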

Examples

There are many common, and not-so-common, uses for rling.
rling big-file.txt new-file.txt /path/to/old-file.txt /path/to/others/*
This will read in big-file.txt, remove any duplicate lines, then check /path/to/old-file.txt and all files matching /path/to/others/*. Any line found in those files that also exists in big-file.txt will be removed. Once all files are processed, new-file.txt is written with the lines that don't match. This is a great way to remove lines from a new dictionary file if you already have them in your existing dictionary lists.
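
A tiny, hypothetical demonstration of this behavior (the file contents are invented for illustration):

printf 'alpha\nbeta\nalpha\ngamma\n' > big-file.txt
printf 'beta\n' > old-file.txt
rling big-file.txt new-file.txt old-file.txt

Afterwards, new-file.txt contains only alpha and gamma: the duplicate alpha is dropped, and beta is removed because it appears in old-file.txt.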

rling -nb last-names.txt new-names.txt /path/to/names/[a-f]*
This will read last-names.txt, skip duplicate removal (the -n switch), and use binary search (-b) to remove any last names that match lines in the files /path/to/names/[a-f]*.
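
A small, made-up example showing both switches together:

printf 'smith\njones\nsmith\n' > last-names.txt
printf 'jones\n' > remove-these.txt
rling -nb last-names.txt new-names.txt remove-these.txt

new-names.txt then holds smith twice (-n keeps the duplicate) with jones removed. As the table above shows, -b also roughly halves memory use, at some cost in speed.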

rling clean-list.txt clean-list.txt
This will read clean-list.txt, remove all duplicate lines, and re-write it (in original order) back to clean-list.txt. This use is permitted (maybe not recommended, but permitted) because the entire input file is read into memory before the output file is opened for writing. Great if you are short on disk space, too.
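
For example, with invented contents:

printf 'one\ntwo\none\nthree\n' > clean-list.txt
rling clean-list.txt clean-list.txt

clean-list.txt afterwards contains one, two, three, in that order.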

find /path/to/names -type f -print0 | xargs -0 gzcat | rling stdin stdout | gzip -9 > all-names.txt.gz
This will look in /path/to/names for all files, use gzcat to decompress or access them, pipe the result to rling which will then de-duplicate them all (keeping original order), and then pipe the resultant output to gzip -9 so as to create a new, de-duplicated name-list in compressed format.
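
Because stdin and stdout are accepted as file names, a quick smoke test with made-up input is:

printf 'a\nb\na\n' | rling stdin stdout

which prints a and b, dropping the duplicate a.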

rling -c all-names.txt matching.txt /path/to/names/[a-f]*
This will read in all-names.txt, then output only the lines that are present both in the input file and in one or more of the /path/to/names/[a-f]* files. If there are no matching lines, no data is output to matching.txt.
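
A minimal, made-up illustration:

printf 'anna\nbob\ncarl\n' > all-names.txt
printf 'bob\nzed\n' > b-names.txt
rling -c all-names.txt matching.txt b-names.txt

matching.txt then contains only bob, the one line common to all-names.txt and b-names.txt.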

Features

We're pretty happy with how rling works right now, but I'm looking forward to feedback from the community on new features and options.

There are some "hidden features" in rling.

Status

The project is in progress, and is in "beta" release. We don't think there are any bugs left, but I'm sure there will be new features.

Inspiration

Key inspiration for this project came from @tychotithonus. He suggested the project, and it quickly developed into a "who's smaller?" measuring contest between Waffle, blazer, and hops. Breaking the 5-minute mark on 1,000,000,000 lines was easy, but the 60-second mark was first broken by blazer/hops. blazer actually has a different algorithm (using Bloom filters and Judy arrays) that is substantially better than this one; I figure this one is "good enough" for now.

Thank you to blazer for the qsort_mt code!
Thank you to hops for the xxHash integration, and to Cyan4973 for xxHash.
And a substantial thank you to Mark Adler, for his yarn code. That's made my life better for more than 10 years now.

Contact

Created by Waffle.