<center><img src="https://github.com/xnl-h4ck3r/urless/blob/main/urless/images/title.png"></center>

About - v2.1

This is a tool used to de-clutter a list of URLs. As a starting point, I took the amazing tool uro by Somdev Sangwan, but I wanted to change a few things, make some improvements (like dealing with GUIDs) and make it more customizable.

Installation

urless supports Python 3.

Install urless in the default (global) Python environment:

pip install urless

OR

pip install git+https://github.com/xnl-h4ck3r/urless.git -v

You can upgrade with

pip install --upgrade urless

pipx

Quick setup in an isolated Python environment using pipx:

pipx install git+https://github.com/xnl-h4ck3r/urless.git

Usage

| Argument | Long Argument | Description |
| --- | --- | --- |
| -i | --input | A file of URLs to de-clutter. |
| -o | --output | The output file that will contain the de-cluttered list of URLs (default: output.txt). If piped to another program, output will be written to STDOUT instead. |
| -fk | --filter-keywords | A comma separated list of keywords to exclude links (if there are no parameters). This will override the FILTER_KEYWORDS list specified in config.yml. |
| -fe | --filter-extensions | A comma separated list of file extensions to exclude. This will override the FILTER_EXTENSIONS list specified in config.yml. |
| -rp | --remove-params | A comma separated list of case sensitive parameters to remove from ALL URLs. This will override the REMOVE_PARAMS list specified in config.yml. This can be useful to remove cache buster parameters, for example. |
| -ks | --keep-slash | A trailing slash at the end of a URL in input will not be removed. Therefore there may be identical URLs output, one with and one without a trailing slash. |
| -khw | --keep-human-written | By default, any URL with a path part that contains 3 or more dashes (-) is removed because it is assumed to be human written content (e.g. a blog post), and not interesting. Passing this argument will keep them in the output. |
| -kym | --keep-yyyymm | By default, any URL with a path containing /YYYY/MM (where YYYY is a year and MM a month) is removed because it is assumed to be blog/news content, and not interesting. Passing this argument will keep them in the output. |
| -rcid | --regex-custom-id | USE WITH CAUTION! Regex for a Custom ID that your target uses. Ensure the value is passed in quotes. See the Custom Regex section below for more details on this. |
| -iq | --ignore-querystring | Remove the query string (including URL fragments #) so output is unique paths only. |
| -fnp | --fragment-not-param | Don't treat URL fragments # in the same way as parameters. For example, if a link has a filter keyword and a fragment (or param) the link is usually kept, but if this argument is passed and a link has a filter keyword and a fragment, the link will be removed. Also, if this argument is passed and -iq / --ignore-querystring is used, the fragment will NOT be removed from links that have no query string. |
| -lang | --language | If passed and there are multiple URLs with different language codes as a part of the path, only one version of the URL will be output. The codes are specified in the LANGUAGE section of config.yml. |
| -nb | --no-banner | Hides the tool banner output (it is hidden by default if you pipe input to urless). |
| | --version | Show current version number. |
| -v | --verbose | Verbose output. |
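To make the two default path heuristics behind -khw and -kym more concrete, here is a minimal Python sketch of how such filters could work. It is not urless's actual code: the function names, the `YYYYMM_RE` pattern and the example.com URLs are all assumptions made for illustration, based only on the descriptions in the table above.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern for a /YYYY/MM pair in a path (e.g. /2023/07).
# This is illustrative only and is not taken from urless itself.
YYYYMM_RE = re.compile(r"/(?:19|20)\d{2}/(?:0[1-9]|1[0-2])(?:/|$)")


def looks_human_written(url, min_dashes=3):
    """Return True if any single path part contains at least `min_dashes` dashes."""
    path = urlsplit(url).path
    return any(part.count("-") >= min_dashes for part in path.split("/"))


def looks_like_dated_content(url):
    """Return True if the path contains a /YYYY/MM sequence."""
    return YYYYMM_RE.search(urlsplit(url).path) is not None


def declutter(urls, keep_human_written=False, keep_yyyymm=False):
    """Drop URLs matching either heuristic unless the matching keep flag is set."""
    kept = []
    for url in urls:
        if not keep_human_written and looks_human_written(url):
            continue
        if not keep_yyyymm and looks_like_dated_content(url):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/blog/how-to-bake-a-cake",
    "https://example.com/news/2023/07/launch",
    "https://example.com/api/v1/users",
]
print(declutter(urls))
print(declutter(urls, keep_human_written=True, keep_yyyymm=True))
```

With the defaults only the `/api/v1/users` URL survives; with both keep flags set, all three URLs are returned, which mirrors what passing -khw and -kym is described as doing.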

What does it do exactly?

You pass a list of URLs in (from a file, or piped from STDIN) and get a de-cluttered file of URLs out. But in what way are they de-cluttered? I'll explain this below, but first here are some terms that will be used:

Here's what happens:

Examples

Basic use

cat target_urls.txt | urless

or

urless -i target_urls.txt

Capture output

cat target_urls.txt | urless > output.txt

or

urless -i target_urls.txt -o output.txt

config.yml

The config.yml file has the keys which can be updated to suit your needs:

Custom Regex

There are currently automatic regex checks for a path part being a Globally Unique ID (GUID) or an Integer ID, but the -rcid / --regex-custom-id argument lets you provide a regular expression to identify a custom ID. If a target has a specific ID format (that isn't a GUID or Integer), you can specify a regex for it, and then only one of the URLs with that ID will be returned in the output if the rest of the URL is the same. For example:
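The idea of collapsing URLs that differ only by an ID can be sketched in Python as follows. This is a hedged illustration rather than urless's implementation: `dedupe_by_id`, the placeholder names, the `ORD-[A-Z0-9]{6}` pattern and the example.com URLs are hypothetical, and matching the custom regex against a whole path part with `fullmatch` is an assumption.

```python
import re
from urllib.parse import urlsplit

# Built-in ID shapes described above: a GUID and an integer.
GUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
INT_RE = re.compile(r"^\d+$")


def dedupe_by_id(urls, custom_id_regex=None):
    """Keep the first URL of each group that differs only by an ID path part."""
    custom_re = re.compile(custom_id_regex) if custom_id_regex else None
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url)
        normalised = []
        for part in parts.path.split("/"):
            # Replace each recognised ID with a placeholder for its ID type.
            if GUID_RE.match(part):
                normalised.append("{guid}")
            elif INT_RE.match(part):
                normalised.append("{int}")
            elif custom_re and custom_re.fullmatch(part):
                normalised.append("{custom}")
            else:
                normalised.append(part)
        key = (parts.scheme, parts.netloc, "/".join(normalised), parts.query)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


urls = [
    "https://example.com/orders/ORD-AB12CD/details",
    "https://example.com/orders/ORD-ZX98YV/details",
    "https://example.com/orders/1001/details",
    "https://example.com/orders/1002/details",
]
print(dedupe_by_id(urls))
print(dedupe_by_id(urls, custom_id_regex=r"ORD-[A-Z0-9]{6}"))
```

Without a custom regex, both `ORD-...` URLs are kept because nothing marks them as IDs, while the two integer URLs collapse to one. With the custom regex supplied, the two `ORD-...` URLs collapse as well, leaving one URL per ID type.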

IMPORTANT REGEX NOTES:

Issues

If you come across any problems at all, or have ideas for improvements, please feel free to raise an issue on GitHub. If there is a problem, it will be useful if you can provide the exact command you ran and a detailed description of the problem. If possible, run with -v to reproduce the problem and let me know about any error messages that are shown.

TODO

And finally...

Good luck and good hunting! If you really love the tool (or any others), or they helped you find an awesome bounty, consider BUYING ME A COFFEE! ☕ (I could use the caffeine!)

🤘 /XNL-h4ck3r

<p> <a href='https://ko-fi.com/B0B3CZKR5' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi2.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>