Awesome
fasttld
fasttld is a high performance top level domains (TLD) extraction module based on the compressed trie data structure
implemented with the builtin python dict()
.
Background
The goal of fasttld is to extract top level domains (TLDs) from URLs efficiently. In the other words, we extract com
from URLs like www.google.com
or https://maps.google.com:8080/a/long/path/?query=42
.
Running something like ".".join(domain.split('.')[1::]) is not a viable solution, for example, maps.baidu.com.cn
would give us the wrong result baidu.com.cn
instead of com.cn
.
The fasttld module solves this problem by using the regularly-updated Mozilla Public Suffix List and the trie data structure to efficiently extract subdomains, hostnames, and TLDs from URLs.
fasttld also supports extraction of private domains listed in the Mozilla Public Suffix List like 'blogspot.co.uk' and 'sinaapp.com'.
Installation
You can install fasttld from PyPI.
pip install fasttld
or build from source
git clone https://github.com/jophy/fasttld.git && cd fasttld
python setup.py install
Usage
>>> from fasttld import FastTLDExtract
>>> t = FastTLDExtract()
>>> res = t.extract("https://some-user@a.long.subdomain.ox.ac.uk:5000/a/b/c/d/e/f/g/h/i?id=42")
>>> scheme, userinfo, subdomain, domain, suffix, port, path, domain_name = res
>>> scheme, userinfo, subdomain, domain, suffix, port, path, domain_name
('https://', 'some-user', 'a.long.subdomain', 'ox', 'ac.uk', '5000', 'a/b/c/d/e/f/g/h/i?id=42', 'ox.ac.uk')
extract() returns a tuple (scheme, userinfo, subdomain, domain, suffix, port, path, domain_name)
.
Update the Mozilla Public Suffix List local copy
Whenever fasttld is called, it will automatically update the local copy of the Mozilla Public Suffix List if it is more than 3 days old. You can also run the update process manually via the following commands.
>>> import fasttld
>>> fasttld.update()
or
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract().update()
This option can be disabled setting the environment flag FASTTLD_NO_AUTO_UPDATE
to 1
.
Specify Mozilla Public Suffix List file
You can also specify your own public suffix list file.
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(file_path='/path/to/psl/file').extract('domain', subdomain=False)
Disable subdomain output
If you do not need to extract subdomains, you can disable subdomain output with subdomain=False
.
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract().extract('domain', subdomain=False) # set subdomain=False
Optional: Exclude private domains
According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.co.uk
and sinaapp.com
because some registered domain owners wish to delegate subdomains to mutually-untrusting parties, and find that being added to the PSL gives their solution more favourable security properties.
By default, fasttld treats private domains as TLDs (i.e. exclude_private_suffix=False
)
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(exclude_private_suffix=False).extract('news.blogspot.co.uk')
>>> ('', '', '', 'news', 'blogspot.co.uk', '', '', 'news.blogspot.co.uk') # blogspot.co.uk is treated as a TLD
>>> FastTLDExtract().extract('news.blogspot.co.uk') # this is the default behaviour
>>> ('', '', '', 'news', 'blogspot.co.uk', '', '', 'news.blogspot.co.uk') # same output as above
You can instruct fasttld to exclude private domains by setting exclude_private_suffix=True
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(exclude_private_suffix=True).extract('news.blogspot.co.uk') # set exclude_private_suffix=True
>>> ('', '', 'news', 'blogspot', 'co.uk', '', '', 'blogspot.co.uk') # notice that co.uk is now recognised as the TLD instead of blogspot.co.uk
Speed Comparison
Similar modules include tldextract and tld.
Test conditions
Initialize the module class once, then call its extract function ten million times. Measure the time taken.
Test environment
Python 3.9.12, AMD Ryzen 7 5800X 3.8 GHz 8 cores 16 threads, 48GB RAM
Test results
module\case | jophy.com | www.baidu.com.cn | jo.noexist | https://maps.google.com.ua/a/long/path?query=42 | 1.1.1.1 | https://192.168.55.1 |
---|---|---|---|---|---|---|
fasttld | 7.60s | 9.90s | 5.28s | 5.67s | 5.06s | 5.30s |
tldextract | 22.96s | 29.32s | 25.06s | 31.69s | 33.89s | 35.15s |
tld | 26.75s | 29.00s | 23.01s | 27.55s | 22.79s | 22.55s |
Excluding subdomains (i.e. subdomain=False
)
module\case | jophy.com | www.baidu.com.cn | jo.noexist | https://maps.google.com.ua/a/long/path?query=42 | 1.1.1.1 | https://192.168.55.1 |
---|---|---|---|---|---|---|
fasttld | 7.55s | 8.98s | 5.20s | 5.52s | 5.13s | 5.25s |
On average, fasttld is 4 to 5 times faster than the other modules. It retains its performance advantage even when parsing long URLs like https://maps.google.com.ua/a/long/path?query=42
Acknowledgements
- Some code borrowed from the tldextract module