WiToKit
Welcome to WiToKit, a Python toolkit to download and generate preprocessed Wikipedia dumps for all languages.
WiToKit can be used to convert a Wikipedia archive into a single .txt file, with one (tokenized) sentence per line.
Note: WiToKit currently only supports xx-pages-articles.xml.xx.bz2 Wikipedia archives, which correspond to articles, templates, media/file descriptions, and primary meta-pages.
Install
After a git clone, run:
python3 setup.py install
Use
Download
To download a .bz2-compressed Wikipedia XML dump, do:
witokit download \
--lang lang_wp_code \
--date wiki_date \
--output /abs/path/to/output/dir/where/to/store/bz2/archives \
--num-threads num_cpu_threads
For example, to download the latest English Wikipedia, do:
witokit download --lang en --date latest --output /abs/path/to/output/dir --num-threads 2
The --lang parameter expects the WP (language) code corresponding to the desired Wikipedia archive.
Check out the full list of Wikipedias with their corresponding WP codes here.
The --date parameter expects a string corresponding to one of the dates listed on the Wikimedia dump site for a given Wikipedia (e.g. https://dumps.wikimedia.org/enwiki/ for the English Wikipedia).
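To see how the --lang and --date values relate to the dump site, here is a minimal Python sketch that builds the directory URL where the dated dumps for a given Wikipedia are listed. The helper name dump_index_url is hypothetical and not part of WiToKit. The sketch assumes simple language codes such as en or fr, and the date string 20190120 only illustrates the YYYYMMDD format; it is not a statement about which dumps are available.

```python
def dump_index_url(lang: str, date: str = "latest") -> str:
    """Return the Wikimedia dump directory URL for a WP code and a date.

    Hypothetical helper for illustration only; not part of WiToKit's API.
    """
    return f"https://dumps.wikimedia.org/{lang}wiki/{date}/"


# English Wikipedia, latest dump directory
print(dump_index_url("en"))
# French Wikipedia, with an illustrative (assumed) date string
print(dump_index_url("fr", "20190120"))
```

Opening the resulting URL in a browser is a quick way to check that a given date exists before passing it to --date.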
Important: Keep --num-threads <= 3 to avoid rejection by the Wikimedia servers.
Extract
To extract the content of the downloaded .bz2 archives, do:
witokit extract \
--input /abs/path/to/downloaded/wikipedia/bz2/archives \
--num-threads num_cpu_threads
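For readers curious about what extraction amounts to, the following is a rough Python sketch of decompressing a single .bz2 archive with the standard library. It is not WiToKit's actual implementation; the function name decompress_bz2 and the choice to write the output next to the archive are assumptions made for illustration.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(archive: Path) -> Path:
    """Decompress `archive` next to itself, dropping the .bz2 suffix.

    Hypothetical helper for illustration only; not part of WiToKit's API.
    """
    target = archive.with_suffix("")  # e.g. dump.xml.bz2 -> dump.xml
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        # Copy in chunks so large dumps never have to fit in memory.
        shutil.copyfileobj(src, dst)
    return target
```

Streaming the copy matters here because Wikipedia dumps can be much larger than available RAM once decompressed.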
Process
To preprocess the content of the extracted XML archives and output a single .txt file, tokenized, with one sentence per line, do:
witokit process \
--input /abs/path/to/wikipedia/extracted/xml/archives \
--output /abs/path/to/single/output/txt/file \
--lower \
--num-threads num_cpu_threads
The --lower flag is optional: if set, the text is lowercased.
Preprocessing for all languages is performed with Polyglot.
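To make the output format concrete, here is a small Python sketch that turns raw text into one tokenized sentence per line, with optional lowercasing. It deliberately swaps in naive regular expressions for sentence and word segmentation instead of Polyglot, so it only illustrates the shape of the output, not WiToKit's actual tokenization. The function name to_lines is hypothetical.

```python
import re
from typing import List


def to_lines(text: str, lower: bool = False) -> List[str]:
    """Return one space-separated, tokenized sentence per list item.

    Naive regex stand-in for Polyglot segmentation; illustration only.
    """
    # Split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # Words, or single punctuation marks, become separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if not tokens:
            continue
        line = " ".join(tokens)
        lines.append(line.lower() if lower else line)
    return lines


print("\n".join(to_lines("Wikipedia is free. Anyone can edit it!", lower=True)))
# wikipedia is free .
# anyone can edit it !
```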
Sample
You can also use WiToKit to sample the content of a preprocessed .txt file, using:
witokit sample \
--input /abs/path/to/witokit/preprocessed/txt/file \
--percent percent_of_total_lines_to_keep \
--balance
If --balance is set, sampling is balanced; otherwise, only the top n sentences are kept.
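The difference between the two sampling modes can be sketched in Python as follows. This is an assumption about what "balanced" means, namely picking lines at evenly spaced positions across the file, and it may differ from WiToKit's actual behavior. The function name sample_lines is hypothetical.

```python
from typing import List


def sample_lines(lines: List[str], percent: float, balance: bool = False) -> List[str]:
    """Keep `percent` percent of `lines`.

    Illustration only. Assumes "balanced" means evenly spaced picks;
    without balance, the first n lines are kept.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    keep = int(len(lines) * percent / 100)
    if keep <= 0:
        return []
    if not balance:
        return lines[:keep]  # top n lines only
    step = len(lines) / keep  # spacing between picked lines
    return [lines[int(i * step)] for i in range(keep)]


lines = [str(i) for i in range(10)]
print(sample_lines(lines, 20))                # ['0', '1']
print(sample_lines(lines, 20, balance=True))  # ['0', '5']
```

Under this reading, balanced sampling draws from the whole file rather than only its beginning, which may matter if article order in the dump is not random.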