Wiktextract
This is a utility and Python package for extracting data from Wiktionary.
Please report issues on GitHub and we'll try to address them reasonably soon.
Data extracted from some Wiktionary editions is available for browsing and downloading at https://kaikki.org. The website is updated every few days.
Note: extracting all data for all languages from the English Wiktionary may take from an hour to several days, depending on your computer. Expanding Lua modules is not cheap, but it enables superior extraction quality and maintainability! You may want to look at the data downloads instead of running it yourself.
Overview
This is a Python package and tool for extracting information from various Wiktionary data dumps, most notably and completely the English edition (enwiktionary). Note that an edition of Wiktionary contains extensive dictionaries and inflectional information for many languages, not just the language it has been written in.
One thing that distinguishes this tool from any system we're aware of is that this tool expands templates and Lua macros in Wiktionary. That enables much more accurate rendering and extraction of glosses, word senses, inflected forms, and pronunciations. It also makes the system much easier to maintain. All this results in much higher extraction quality and accuracy.
The English edition extraction 'module' extracts glosses, parts-of-speech, declension/conjugation information when available, translations for all languages when available, pronunciations (including audio file links), qualifiers including usage notes, word forms, links between words including hypernyms, hyponyms, holonyms, meronyms, related words, derived terms, compounds, alternative forms, etc. Links to Wikipedia pages, Wikidata identifiers, and other such data are also extracted when available. For many classes of words, a word sense is annotated with specific information such as what word it is a form of, what is the RGB value of the color it represents, what is the numeric value of a number, what SI unit it represents, etc.
Other editions are less complete (or the Wiktionary edition itself doesn't necessarily have the same breadth of data), but we try to cover the basics.
This tool extracts information for all languages that have data in the Wiktionary edition. It also extracts translingual data and information about characters (anything that has an entry in Wiktionary).
This tool reads a <language-code>wiktionary-<date>-pages-articles.xml.bz2 dump file and outputs JSONL-format (JSON objects separated with newlines) dictionaries containing most of the information in Wiktionary. The dump files can be downloaded from https://dumps.wikimedia.org.
This utility will be useful for many natural language processing, semantic parsing, machine translation, and language generation applications both in research and industry.
The tool can be used to extract machine translation dictionaries, language understanding dictionaries, semantically annotated dictionaries, and morphological dictionaries with declension/conjugation information (where this information is available for the target language). Dozens of languages have extensive vocabulary in enwiktionary, and several thousand languages have partial coverage.
The wiktwords script makes extracting the information for use by other tools trivial without writing a single line of code. It extracts the information specified by command options for languages specified on the command line, and writes the extracted data to a file or standard output in JSONL format (JSON objects separated with newlines) for processing by other tools.
As far as we know, this is the most comprehensive tool available for extracting information from Wiktionary as of December 2020.
If you find this tool and/or the pre-extracted data helpful, please give the project a star on GitHub!
Pre-extracted data
For most people, it may be easiest to just download pre-expanded data. Please see https://kaikki.org/dictionary/rawdata.html. The raw wiktextract data, extracted category tree, extracted templates and modules, as well as a bulk download of audio files for pronunciations in both .ogg and .mp3 formats are available.
There is also a download link at the bottom of every page and a button to view the JSON produced for each page. You can download all data, data for a specific language, data for just a single word, or data for a list of related words (e.g., a particular part-of-speech or words relating to a particular topic or having a particular inflectional form). All downloads are in JSON Lines format (each line is a separate JSON object). The bigger downloads are also available in compressed form.
Some people have asked for the full data as a single JSON object (instead of the current one JSON object per line format). I've decided to keep it as a JSON object per line, because loading all the data into Python requires about 120 GB of memory. It is much easier to process the data line-by-line, especially if you are only interested in a part of the information. You can easily read the files using the following code:
import json

with open("filename.json", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        ...  # parse the data in this record
If you want to collect all the data into a list, you can read the file into a list with:
import json

lst = []
with open("filename.json", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        lst.append(data)
You can also easily pretty-print the data into a more human-readable form using:
print(json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False))
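Because each line is an independent JSON object, records can also be filtered while streaming, without holding the whole file in memory. The following is a minimal sketch and not part of wiktextract: the helper name iter_entries and the in-memory sample records are made up for illustration, and only the word, lang_code, and pos keys described in this document are assumed.

```python
import io
import json
from typing import Iterator, Optional, TextIO


def iter_entries(
    f: TextIO, lang_code: Optional[str] = None, pos: Optional[str] = None
) -> Iterator[dict]:
    """Yield records from a JSONL stream, optionally filtered by language and POS."""
    for line in f:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        data = json.loads(line)
        if lang_code is not None and data.get("lang_code") != lang_code:
            continue
        if pos is not None and data.get("pos") != pos:
            continue
        yield data


# Tiny in-memory stand-in for a downloaded file (illustrative records only).
sample = io.StringIO(
    '{"word": "thrill", "lang_code": "en", "pos": "verb"}\n'
    '{"word": "thrill", "lang_code": "en", "pos": "noun"}\n'
    '{"word": "riemu", "lang_code": "fi", "pos": "noun"}\n'
)
english_nouns = [e["word"] for e in iter_entries(sample, lang_code="en", pos="noun")]
print(english_nouns)
```

With a real download, the same generator can be given the file object returned by open("filename.json", encoding="utf-8"). Note that a language filter also drops redirect records, since those carry no lang_code.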
Here is a pretty-printed example of an extracted word entry for the word thrill as an English verb (only one part-of-speech is shown here):
{
"categories": [
"Emotions"
],
"derived": [
{
"word": "enthrill"
}
],
"forms": [
{
"form": "thrills",
"tags": [
"present",
"simple",
"singular",
"third-person"
]
},
{
"form": "thrilling",
"tags": [
"present"
]
},
{
"form": "thrilled",
"tags": [
"participle",
"past",
"simple"
]
}
],
"head_templates": [
{
"args": {},
"expansion": "thrill (third-person singular simple present thrills, present participle thrilling, simple past and past participle thrilled)",
"name": "en-verb"
}
],
"lang": "English",
"lang_code": "en",
"pos": "verb",
"senses": [
{
"glosses": [
"To suddenly excite someone, or to give someone great pleasure; to electrify; to experience such a sensation."
],
"tags": [
"ergative",
"figuratively"
]
},
{
"glosses": [
"To (cause something to) tremble or quiver."
],
"tags": [
"ergative"
]
},
{
"glosses": [
"To perforate by a pointed instrument; to bore; to transfix; to drill."
],
"tags": [
"obsolete"
]
},
{
"glosses": [
"To hurl; to throw; to cast."
],
"tags": [
"obsolete"
]
}
],
"sounds": [
{
"ipa": "/\u03b8\u0279\u026al/"
},
{
"ipa": "[\u03b8\u027e\u032a\u030a\u026a\u026b]",
"tags": [
"UK",
"US"
]
},
{
"ipa": "[\u03b8\u027e\u032a\u030a\u026al]",
"tags": [
"Ireland"
]
},
{
"ipa": "[t\u032a\u027e\u032a\u030a\u026al]",
"tags": [
"Ireland"
]
},
{
"rhymes": "-\u026al"
},
{
"audio": "en-us-thrill.ogg",
"mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/d/db/En-us-thrill.ogg/En-us-thrill.ogg.mp3",
"ogg_url": "https://upload.wikimedia.org/wikipedia/commons/d/db/En-us-thrill.ogg",
"tags": [
"US"
],
"text": "Audio (US)"
}
],
"translations": [
{
"code": "nl",
"lang": "Dutch",
"sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
"word": "opwinden"
},
{
"code": "fi",
"lang": "Finnish",
"sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
"word": "syk\u00e4hdytt\u00e4\u00e4"
},
{
"code": "fi",
"lang": "Finnish",
"sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
"word": "riemastuttaa"
},
...
{
"code": "tr",
"lang": "Turkish",
"sense": "slight quivering of the heart that accompanies a cardiac murmur",
"word": "\u00e7arp\u0131nt\u0131"
}
],
"wikipedia": [
"thrill"
],
"word": "thrill"
}
Getting started
Installing
Use container:
$ podman run -it --rm ghcr.io/tatuylonen/wiktextract --help
Install from source:
On Linux (example from Ubuntu 20.04), you may need to first install the build-essential and python3-dev packages with apt update && apt install build-essential python3-dev python3-pip lbzip2.
git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
If you want to update both packages' code with git pull, use the pip install command's --force-reinstall and -e options to reinstall the wikitextprocessor package from source in editable mode.
Running tests
This package includes tests written using the unittest framework. The test dependencies can be installed with the command python -m pip install -e .[dev].
To run the tests, use the following command in the top-level directory:
make test
Expected performance
Extracting all data for all languages from English Wiktionary takes about 1.25 hours on a 128-core dual AMD EPYC 7702 system. The performance is expected to be approximately linear with the number of processor cores, provided you have enough memory (about 10GB/core or 5GB/hyperthread recommended).
As the extractor expands, these times will change.
You can control the number of parallel processes to use with the --num-processes option; the default is to use the number of available cores/hyperthreads.
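As a rough aid for choosing a --num-processes value, the memory guideline above can be turned into simple arithmetic. This is only a sketch: the function name suggest_num_processes is hypothetical, the 10 GB per process default is taken from the recommendation above (the option list elsewhere in this document mentions 4 GB per process, so treat the figure as a tunable assumption), and real memory use depends on the dump and the wiktextract version.

```python
import os
from typing import Optional


def suggest_num_processes(
    total_memory_gb: float,
    gb_per_process: float = 10.0,
    cpu_count: Optional[int] = None,
) -> int:
    """Return a process count limited by both CPU count and available memory."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # How many processes fit in memory at the assumed per-process budget.
    by_memory = int(total_memory_gb // gb_per_process)
    return max(1, min(cpu_count, by_memory))


# A hypothetical 16-core machine with 64 GB of RAM.
print(suggest_num_processes(64, cpu_count=16))  # 6
```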
You can download the full pre-extracted data from kaikki.org. The pre-extraction is updated regularly with the latest Wiktionary dump. Using the pre-extracted data may be the easiest option unless you have special needs or want to modify the code.
Using the command-line tool
The wiktwords script is the easiest way to extract data from Wiktionary. Just download the data dump file from dumps.wikimedia.org and run the script. The correct dump file has the name enwiktionary-<date>-pages-articles.xml.bz2.
An example of a typical invocation for extracting all data would be:
wiktwords --all --all-languages --out data.json enwiktionary-20230801-pages-articles.xml.bz2
If you wish to modify the code or test processing individual pages, the following may also be useful:
- Pass a path for saving a database file that you can use for quickly processing individual pages:
wiktwords --db-path en_20230801.db enwiktionary-20230801-pages-articles.xml.bz2
- To process a single page and produce a human-readable output file for debugging:
wiktwords --db-path en_20230801.db --all --all-languages --out outfile --page page_title
The following command-line options can be used to control its operation:
- --out FILE: specifies the name of the file to write (specifying "-" as the file writes to stdout)
- --all-languages: extract words for all available languages
- --language-code LANGUAGE_CODE: extracts the given language (this option may be specified multiple times; defaults to dump file language code and mul (Translingual))
- --language-name LANGUAGE_NAME: similar to --language-code except this option accepts a language name
- --dump-file-language-code LANGUAGE_CODE: specifies the language code for the Wiktionary edition that the dump file is for (defaults to "en"; "zh" is supported and others are being added)
- --skip-extraction: Used to create a database file from the dump file without waiting for the extraction process to complete.
- --all: causes all data to be captured for the selected languages
- --translations: causes translations to be captured
- --pronunciation: causes pronunciation information to be captured
- --linkages: causes linkages (synonyms etc.) to be captured
- --examples: causes usage examples to be captured
- --etymologies: causes etymology information to be captured
- --descendants: causes descendants information to be captured
- --inflections: causes inflection tables to be captured
- --redirects: causes redirects to be extracted
- --pages-dir DIR: save all wiktionary pages under this directory (mostly for debugging)
- --db-path PATH: save/use database from this path (for debugging)
- --page FILE or TITLE: read page from file or database; can be specified multiple times (first line of a file must be "TITLE: pagetitle"; file should use UTF-8 encoding)
- --num-processes PROCESSES: use this many parallel processes (needs 4GB/process)
- --human-readable: print human-readable JSON with indentation (no longer machine-readable)
- --override PATH: override pages with files in this directory (first line of the file must be TITLE: pagetitle)
- --templates-file: extract Template namespace to this tar file
- --modules-file: extract Module namespace to this tar file
- --categories-file: extract Wiktionary category tree into this file as JSON (see description below)
- --inflection_tables_file: extract and expand tables into this file as wikitext; use this to create tests
- --help: displays help text (with some more options than listed here)
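When wiktwords is driven from another program, it can help to assemble the argument list programmatically. The sketch below only builds the command from options listed above and does not run it; the helper name build_wiktwords_command and its parameters are hypothetical and not part of wiktextract.

```python
import shlex
from typing import List, Optional, Sequence


def build_wiktwords_command(
    dump_path: str,
    out_path: str,
    language_codes: Optional[Sequence[str]] = None,
    capture: Sequence[str] = ("all",),
    num_processes: Optional[int] = None,
) -> List[str]:
    """Build an argv list for wiktwords using only options documented above."""
    cmd = ["wiktwords", "--out", out_path]
    if language_codes is None:
        cmd.append("--all-languages")
    else:
        for code in language_codes:
            cmd += ["--language-code", code]
    for name in capture:
        cmd.append(f"--{name}")  # e.g. --all, --translations, --pronunciation
    if num_processes is not None:
        cmd += ["--num-processes", str(num_processes)]
    cmd.append(dump_path)
    return cmd


cmd = build_wiktwords_command(
    "enwiktionary-20230801-pages-articles.xml.bz2",
    "data.json",
    language_codes=["en", "fi"],
    capture=("translations", "pronunciation"),
    num_processes=4,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with file names.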
Calling the library
While this package has been mostly intended to be used via the wiktwords command, it is also possible to call it as a library. Underneath, this uses the wikitextprocessor module. For more usage examples please read the wiktwords.py and wiktionary.py files.
This code can be called from an application as follows:
from wiktextract import (
    WiktextractContext,
    WiktionaryConfig,
    parse_wiktionary,
)
from wikitextprocessor import Wtp

config = WiktionaryConfig(
    dump_file_lang_code="en",
    capture_language_codes=["en", "mul"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_compounds=True,
    capture_redirects=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_descendants=True,
    capture_inflections=True,
)
wxr = WiktextractContext(Wtp(), config)

# Only pages in these namespaces are processed.
RECOGNIZED_NAMESPACE_NAMES = [
    "Main",
    "Category",
    "Appendix",
    "Project",
    "Thesaurus",
    "Module",
    "Template",
    "Reconstruction",
]
namespace_ids = {
    wxr.wtp.NAMESPACE_DATA.get(name, {}).get("id")
    for name in RECOGNIZED_NAMESPACE_NAMES
}

# Path to the downloaded dump file (adjust to match your download).
dump_path = "enwiktionary-20230801-pages-articles.xml.bz2"

with open("output.json", "w", encoding="utf-8") as f:
    parse_wiktionary(wxr, dump_path, None, False, namespace_ids, f)
The capture arguments default to True, so they only need to be set if some values are not to be captured (note that the wiktwords program sets them to False unless the --all or specific capture options are used).
parse_wiktionary()
def parse_wiktionary(
    wxr: WiktextractContext,
    dump_path: str,
    num_processes: Optional[int],
    phase1_only: bool,
    namespace_ids: Set[int],
    out_f: TextIO,
    human_readable: bool = False,
    override_folders: Optional[List[str]] = None,
    skip_extract_dump: bool = False,
    save_pages_path: Optional[str] = None,
) -> None:
The parse_wiktionary function writes one JSON object to out_f for each word and redirect found in the Wiktionary dump. Each object holds information about a single word and part-of-speech as a dictionary and may include several word senses. It may also be a redirect (indicated by the presence of a "redirect" key in the dictionary). The data is in the same format as the JSONL-formatted dictionaries returned by the wiktwords tool.
Its arguments are as follows:
- wxr (WiktextractContext) - a Wiktextract-level processing context containing fields that point to a Wtp context and a WiktionaryConfig object (below).
  - wxr.wtp (Wtp) - a wikitextprocessor processing context. The number of parallel processes to use can be given as the num_threads argument to the constructor, and a database file path can be provided as the db_path argument.
  - wxr.config (WiktionaryConfig) - a configuration object describing what to extract (see below)
- dump_path (str) - path to a Wiktionary dump file (*-pages-articles.xml.bz2)
- phase1_only - if this is set to True, then only a cache file will be created but no extraction will take place. In this case the Wtp constructor should probably be given the db_path argument when creating wxr.wtp.
- namespace_ids - a set of namespace ids; pages with namespace ids that are not included in this set won't be processed. Available id values can be found in the wikitextprocessor project's data/en/namespaces.json file and the Wiktionary *.xml.bz2 dump file.
- out_f - output file object.
- human_readable - if set to True, the output JSON will be formatted with indentation.
- override_folders - override pages with files in these directories.
- skip_extract_dump - skip extracting the dump file if the database exists.
- save_pages_path - path for storing extracted pages.
This call gathers statistics in wxr.config. This function will automatically parallelize the extraction. page_cb will be called in the parent process, however.
parse_page()
def parse_page(
    wxr: WiktextractContext, page_title: str, page_text: str
) -> List[Dict[str, str]]:
- wxr (WiktextractContext) - a wiktextract context containing:
  - wxr.wtp (Wtp) - a wikitextprocessor context
  - wxr.config (WiktionaryConfig) - specifies what to capture and is also used for collecting statistics
- page_title (str) - the title to use for the page
- page_text (str) - contents of the page (wikitext)
PARTS_OF_SPEECH
This is a constant set of all part-of-speech values (pos key) that may occur in the extracted data. Note that the list is somewhat larger than what a conventional part-of-speech list would be.
class WiktextractContext(object)
The WiktextractContext object is used to hold the wikitextprocessor-specific Wtp context object and wiktextract's WiktionaryConfig object. In the future it will hold actual context that doesn't belong in Wtp, and WiktionaryConfig will most probably be integrated into the WiktextractContext object proper.
The constructor is called simply by supplying a Wtp and WiktionaryConfig object:
# Blank slate, usually for testing
wxr = WiktextractContext(Wtp(), WiktionaryConfig())
or
# separately initialized config with a bunch of arguments, as in the
# example in the "class WiktionaryConfig(object)" section below
wxr = WiktextractContext(wtp, config)
if it is more convenient.
class WiktionaryConfig(object)
The WiktionaryConfig object is used for specifying what data to collect from Wiktionary and is also used for collecting statistics during extraction. Currently, it is a field of the WiktextractContext context object.
The constructor:
def __init__(
    self,
    dump_file_lang_code="en",
    capture_language_codes=["en", "mul"],
    capture_translations=True,
    capture_pronunciation=True,
    capture_linkages=True,
    capture_compounds=True,
    capture_redirects=True,
    capture_examples=True,
    capture_etymologies=True,
    capture_inflections=True,
    capture_descendants=True,
    verbose=False,
    expand_tables=False,
):
The arguments are as follows:
- capture_language_codes (list/tuple/set of strings) - codes of languages for which to capture data. It defaults to ["en", "mul"]. To capture all languages, set it to None.
- capture_translations (boolean) - set to False to disable capturing translations. Translation information seems to be most widely available for the English language, which has translations into other languages.
- capture_pronunciation (boolean) - set to False to disable capturing pronunciations. Typically, pronunciations include IPA transcriptions and any audio files included in the word entries, along with other information (including dialectal tags). The type and amount of pronunciation information varies widely between languages.
- capture_linkages (boolean) - set to False to disable capturing linkages between words, such as hypernyms, antonyms, synonyms, etc.
- capture_compounds (boolean) - set to False to disable capturing compound words containing the word. Compound word capturing is not currently fully implemented.
- capture_redirects (boolean) - set to False to disable capturing redirects. Redirects are not associated with any specific language and thus requesting them returns them for all words in all languages.
- capture_examples (boolean) - set to False to disable capturing usage examples.
- capture_etymologies (boolean) - set to False to disable capturing etymologies.
- capture_descendants (boolean) - set to False to disable capturing descendants.
- capture_inflections (boolean) - set to False to disable capturing inflection tables.
Format of extracted redirects
Some pages in Wiktionary are redirects. For these, the extracted data is in a special format. In this case, the dictionary will have a redirect key, which will contain the page title that the entry redirects to. The title key contains the word/term that contains the redirect. Redirect entries do not have pos or any of the other fields. Redirects also are not associated with any language, so all redirects are always returned regardless of the captured languages (if extracting redirects has been requested).
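Since redirect records share the output stream with word entries, a consumer usually separates the two first. The following is a minimal sketch assuming only the redirect and title keys described above; the helper names split_redirects and resolve, and the sample redirect title, are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def split_redirects(records: Iterable[dict]) -> Tuple[Dict[str, str], List[dict]]:
    """Separate redirect records (title -> target) from ordinary word entries."""
    redirects: Dict[str, str] = {}
    entries: List[dict] = []
    for data in records:
        if "redirect" in data:
            redirects[data["title"]] = data["redirect"]
        else:
            entries.append(data)
    return redirects, entries


def resolve(title: str, redirects: Dict[str, str]) -> str:
    """Follow redirects until a non-redirect title is reached (cycle-safe)."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


# Made-up records for illustration; real data comes from the JSONL output.
records = [
    {"title": "example-redirect", "redirect": "thrill"},
    {"word": "thrill", "pos": "verb", "lang_code": "en"},
]
redirects, entries = split_redirects(records)
print(resolve("example-redirect", redirects))
```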
Format of the extracted word entries
Information returned for each word is a dictionary. The dictionary has the following keys (others may also be present or added later):
- word - the word form
- pos - part-of-speech, such as "noun", "verb", "adj", "adv", "pron", "determiner", "prep" (preposition), "postp" (postposition), and many others. The complete list of possible values returned by the package can be found in wiktextract.PARTS_OF_SPEECH.
- lang - name of the language this word belongs to (e.g., English)
- lang_code - Wiktionary language code corresponding to the lang key (e.g., en)
- senses - list of word senses (dictionaries) for this word/part-of-speech (see below)
- forms - list of inflected or alternative forms specified for the word (e.g., plural, comparative, superlative, roman script version). This is a list of dictionaries, where each dictionary has a form key and a tags key. The tags identify what type of form it is. It may also contain "ipa", "roman", and "source" fields. The form can be "-" when the word is marked as not having that form (some of those will be word-specific, while others are language-specific; post-processing can drop such forms when no word has a value for that tag combination).
- sounds - list of dictionaries containing pronunciation, hyphenation, rhyming, and related information. Each dictionary may have a tags key containing tags that clarify what kind of form that entry is. Different types of information are stored in different fields: ipa is IPA pronunciation, enPR is enPR pronunciation, audio is name of sound file in Wikimedia Commons.
- categories - list of non-disambiguated categories for the word
- topics - list of non-disambiguated topics for the word
- translations - non-disambiguated translation entries (see below)
- etymology_text - etymology section as cleaned text
- etymology_templates - templates and their arguments and expansions from the etymology section. These can be used to easily parse etymological relations. Certain common templates that do not signify etymological relations are not included.
- etymology_number - for words with multiple numbered etymologies, this contains the number of the etymology under which this entry appeared
- descendants - descendants of the word (see below)
- synonyms - non-disambiguated synonym linkages for the word (see below)
- antonyms - non-disambiguated antonym linkages for the word (see below)
- hypernyms - non-disambiguated hypernym linkages for the word (see below)
- holonyms - non-disambiguated linkages indicating being part of something (see below) (not systematically encoded)
- meronyms - non-disambiguated linkages indicating having a part (see below) (fairly rare)
- derived - non-disambiguated derived word linkages for the word (see below)
- related - non-disambiguated related word linkages for the word (see below)
- coordinate_terms - non-disambiguated coordinate term linkages for the word (see below)
- wikidata - non-disambiguated Wikidata identifier
- wikipedia - non-disambiguated page title in Wikipedia (possibly prefixed by language id)
- head_templates - part-of-speech specific head tags for the word. This basically just captures the templates (their name and arguments) as a list of dictionaries. Most applications may want to ignore this.
- inflection_templates - conjugation and declension templates found for the word, as dictionaries. These basically capture the language-specific inflection template for the word. Note that for some languages inflection information is also contained in head_templates. In the very near future, we will start parsing inflections from the inflection tables into forms, so there is usually no need to use the inflection_templates data.
There may also be other fields.
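As an illustration of the forms key, the sketch below selects forms by tag and skips the "-" placeholder that marks a missing form. The helper name forms_with_tags is hypothetical; the sample entry reuses the forms of the thrill example shown earlier.

```python
from typing import Iterable, List


def forms_with_tags(entry: dict, required: Iterable[str]) -> List[str]:
    """Return forms whose tags include every required tag, skipping "-" forms."""
    wanted = set(required)
    return [
        form["form"]
        for form in entry.get("forms", [])
        if form.get("form", "-") != "-" and wanted <= set(form.get("tags", []))
    ]


thrill = {
    "word": "thrill",
    "pos": "verb",
    "forms": [
        {"form": "thrills", "tags": ["present", "simple", "singular", "third-person"]},
        {"form": "thrilling", "tags": ["present"]},
        {"form": "thrilled", "tags": ["participle", "past", "simple"]},
    ],
}
print(forms_with_tags(thrill, ["past"]))
print(forms_with_tags(thrill, ["present", "third-person"]))
```

Every key is read with .get() because any field other than word may be absent from a given entry.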
Word senses
Each word entry may have multiple senses under the senses key. Each sense is a dictionary that may contain the following keys (among others, and more may be added in the future):
- glosses - list of gloss strings for the word sense (usually only one). This has been cleaned, and should be straightforward text with no tagging.
- raw_glosses - list of gloss strings for the word sense, with less cleaning than glosses. In particular, parenthesized parts that have been parsed from the gloss into tags and topics are still present here. This version may be easier for humans to interpret.
- tags - list of qualifiers and tags for the gloss. This is a list of strings, and may include words such as "archaic", "colloquial", "present", "participle", "plural", "feminine", and many others (new words may appear arbitrarily).
- categories - list of sense-disambiguated category names extracted from (a subset of) the Category links on the page
- topics - list of sense-disambiguated topic names (kind of similar to categories but determined differently)
- alt_of - list of words that this sense is an alternative form of; this is a list of dictionaries, with field word containing the linked word and optionally extra containing additional text
- form_of - list of words that this sense is an inflected form of; this is a list of dictionaries, with field word containing the linked word and optionally extra containing additional text
- translations - sense-disambiguated translation entries (see below)
- synonyms - sense-disambiguated synonym linkages for the word (see below)
- antonyms - sense-disambiguated antonym linkages for the word (see below)
- hypernyms - sense-disambiguated hypernym linkages for the word (see below)
- holonyms - sense-disambiguated linkages indicating being part of something (see below) (not systematically encoded)
- meronyms - sense-disambiguated linkages indicating having a part (see below) (fairly rare)
- coordinate_terms - sense-disambiguated coordinate_terms linkages (see below)
- derived - sense-disambiguated derived word linkages for the word (see below)
- related - sense-disambiguated related word linkages for the word (see below)
- senseid - list of textual identifiers collected for the sense. If there is a QID for the entry (e.g., Q123), those are stored in the wikidata field.
- wikidata - list of QIDs (e.g., Q123) for the sense
- wikipedia - list of Wikipedia page titles (with optional language code prefix)
- examples - list of usage examples, each example being a dictionary with text field containing the example text, optional ref field containing a source reference, optional english field containing English translation, optional type field containing example type (currently example or quotation if present), optional roman field containing romanization (for some languages written in non-Latin scripts), and optional (rare) note field containing an English-language parenthesized note from the beginning of a non-English example.
- english - if the word sense has a qualifier that could not be parsed, that qualifier is put in this field (rare). Most qualifiers are parsed into tags and/or topics. The gloss with the qualifier still present can be found in raw_glosses.
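The sense keys above can be combined in simple ways, for example to drop obsolete senses or to map an inflected form back to its base word. The following is a sketch under the assumption that only the glosses, tags, and form_of keys are used as described; the helper names are hypothetical, and the thrilled record is hand-written for illustration rather than copied from real output.

```python
from typing import List, Sequence


def current_glosses(entry: dict, skip_tags: Sequence[str] = ("obsolete", "archaic")) -> List[str]:
    """Collect glosses from senses that carry none of the skipped tags."""
    glosses: List[str] = []
    for sense in entry.get("senses", []):
        if set(skip_tags) & set(sense.get("tags", [])):
            continue
        glosses.extend(sense.get("glosses", []))
    return glosses


def base_words(entry: dict) -> List[str]:
    """Return the words that this entry's senses are inflected forms of."""
    return [
        link["word"]
        for sense in entry.get("senses", [])
        for link in sense.get("form_of", [])
        if "word" in link
    ]


thrill = {
    "word": "thrill",
    "pos": "verb",
    "senses": [
        {"glosses": ["To (cause something to) tremble or quiver."], "tags": ["ergative"]},
        {"glosses": ["To hurl; to throw; to cast."], "tags": ["obsolete"]},
    ],
}
# Hand-written record for an inflected form; real extractor output may differ.
thrilled = {
    "word": "thrilled",
    "pos": "verb",
    "senses": [
        {
            "glosses": ["simple past and past participle of thrill"],
            "tags": ["form-of", "participle", "past"],
            "form_of": [{"word": "thrill"}],
        }
    ],
}
print(current_glosses(thrill))
print(base_words(thrilled))
```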
Pronunciation
Pronunciation information is stored under the sounds key. It is a list of dictionaries, each of which may contain the following keys, among others:
- ipa - pronunciation specifications as an IPA string /.../ or [...]
- enpr - pronunciation in English pronunciation respelling
- audio - name of a sound file in Wikimedia Commons
- ogg_url - URL for an OGG Vorbis format sound file
- mp3_url - URL for an MP3 format sound file
- audio-ipa - IPA string associated with the audio file, generally giving IPA transcription of what is in the sound file
- homophones - list of homophones for the word
- hyphenation - list of hyphenations
- tags - other labels or context information attached to the pronunciation entry (e.g., might indicate regional variant or dialect)
- text - text associated with an audio file (often not very useful)
Note that Wiktionary audio files are available for bulk download at https://kaikki.org/dictionary/rawdata.html. Files in the download are named with the last component of the URL in ogg_url and/or mp3_url. Downloading them individually takes several days and puts unnecessary load on Wikimedia servers.
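To match sounds entries against files in the bulk audio download, the last URL component can be taken from ogg_url and mp3_url as described above. A minimal sketch follows; the helper names audio_filenames and ipa_strings are hypothetical, and the sample data is a subset of the thrill example.

```python
import posixpath
from typing import List, Optional
from urllib.parse import urlparse


def audio_filenames(entry: dict) -> List[str]:
    """Return the last URL path component of every ogg_url/mp3_url in sounds."""
    names: List[str] = []
    for sound in entry.get("sounds", []):
        for key in ("ogg_url", "mp3_url"):
            url = sound.get(key)
            if url:
                names.append(posixpath.basename(urlparse(url).path))
    return names


def ipa_strings(entry: dict, tag: Optional[str] = None) -> List[str]:
    """Return IPA strings, optionally only those carrying the given tag."""
    return [
        sound["ipa"]
        for sound in entry.get("sounds", [])
        if "ipa" in sound and (tag is None or tag in sound.get("tags", []))
    ]


thrill = {
    "word": "thrill",
    "sounds": [
        {"ipa": "/\u03b8\u0279\u026al/"},
        {"ipa": "[\u03b8\u027e\u032a\u030a\u026al]", "tags": ["Ireland"]},
        {"ipa": "[t\u032a\u027e\u032a\u030a\u026al]", "tags": ["Ireland"]},
        {
            "audio": "en-us-thrill.ogg",
            "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/d/db/En-us-thrill.ogg/En-us-thrill.ogg.mp3",
            "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/d/db/En-us-thrill.ogg",
            "tags": ["US"],
        },
    ],
}
print(audio_filenames(thrill))
print(ipa_strings(thrill, tag="Ireland"))
```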
Translations
Translations are stored under the translations key in the word's data (if not sense-disambiguated) or in the word sense (if sense-disambiguated). They are stored in a list of dictionaries, where each dictionary has the following keys (and possibly others):
- alt - optional alternative form of the translation (e.g., in a different script)
- code - Wiktionary's 2 or 3-letter language code for the language the translation is for
- english - English text, generally clarifying the target sense of the translation
- lang - the language name that the translation is for
- note - optional text describing or commenting on the translation
- roman - optional romanization of the translation (when in non-Latin characters)
- sense - optional sense indicating the meaning for which this is a translation (this is a free-text string, and may not match any gloss exactly)
- tags - optional list of qualifiers for the translations, e.g., gender
- taxonomic - optional taxonomic name of an organism mentioned in the translation
- word - the translation in the specified language (may be missing when note is present)
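Because translations may sit at the word level or inside individual senses, a consumer that wants a simple lookup table has to gather both. The sketch below groups translated words by language code and skips note-only entries; the function name translations_by_language is hypothetical, the first three sample records come from the thrill example, and the note-only record is made up.

```python
from collections import defaultdict
from typing import Dict, List


def translations_by_language(entry: dict) -> Dict[str, List[str]]:
    """Group translation words by language code, from both word and sense level."""
    items = list(entry.get("translations", []))
    for sense in entry.get("senses", []):
        items.extend(sense.get("translations", []))
    table: Dict[str, List[str]] = defaultdict(list)
    for tr in items:
        code, word = tr.get("code"), tr.get("word")
        if code is None or word is None:
            continue  # e.g. entries that only carry a note
        if word not in table[code]:
            table[code].append(word)
    return dict(table)


thrill = {
    "word": "thrill",
    "translations": [
        {"code": "nl", "lang": "Dutch", "word": "opwinden"},
        {"code": "fi", "lang": "Finnish", "word": "syk\u00e4hdytt\u00e4\u00e4"},
        {"code": "fi", "lang": "Finnish", "word": "riemastuttaa"},
        {"code": "xx", "lang": "Example", "note": "made-up note-only entry"},
    ],
}
table = translations_by_language(thrill)
print(table)
```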
Etymologies
Etymological information is stored under the etymology_text and etymology_templates keys in the word's data. When multiple parts-of-speech are listed under the same etymology, the same data is copied to each part-of-speech entry under that etymology.
The etymology_text field contains the contents of the whole etymology section cleaned into human-readable text (i.e., templates have been expanded and HTML tags removed, among other things).
The etymology_templates field contains a list of templates from the etymology section. Some common templates considered not relevant for etymological information have been removed (e.g., redlink category and isValidPageName). The list also includes nested templates referenced from templates directly used in the etymology description. Each template in the list is a dictionary with the following keys:
- name - name of the template
- args - dictionary mapping argument names to their cleaned values. Positional arguments have keys that are numeric strings, starting with "1".
- expansion - the (cleaned) text the template expands to.
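Since positional arguments are keyed by numeric strings, they need to be read back in numeric order. The sketch below does that and then pulls (source language, term) pairs out of a few templates. It rests on assumptions that should be checked against real data: the template names inh, der, and bor and the convention that their second and third positional arguments are the source language code and the source term come from Wiktionary's template conventions rather than from this document, and the sample records are hand-written.

```python
from typing import List, Sequence, Tuple


def positional_args(template: dict) -> List[str]:
    """Return positional arguments ("1", "2", ...) in order as a list."""
    args = template.get("args", {})
    values: List[str] = []
    index = 1
    while str(index) in args:
        values.append(args[str(index)])
        index += 1
    return values


def etymology_sources(
    entry: dict, names: Sequence[str] = ("inh", "der", "bor")
) -> List[Tuple[str, str]]:
    """Return (source language code, source term) pairs from selected templates."""
    sources: List[Tuple[str, str]] = []
    for template in entry.get("etymology_templates", []):
        if template.get("name") not in names:
            continue
        values = positional_args(template)
        if len(values) >= 3:  # assumed layout: target lang, source lang, term
            sources.append((values[1], values[2]))
    return sources


# Hand-written illustration, not actual extractor output.
entry = {
    "word": "thrill",
    "etymology_templates": [
        {"name": "inh", "args": {"1": "en", "2": "enm", "3": "thrillen"},
         "expansion": "Middle English thrillen"},
        {"name": "m", "args": {"1": "enm", "2": "thirlen"}, "expansion": "thirlen"},
    ],
}
print(etymology_sources(entry))
```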
Descendants
If a word has a "Descendants" section, the descendants key will appear in the word's data. It contains a list of objects corresponding to each line in the section, where each object has the following keys:
- depth: The level of indentation of the current line. This can be used to track the hierarchical structure of the list.
- templates: An array of objects corresponding to templates that appear on the line. The structure of each of these objects is the same as the structure of each object in etymology_templates.
- text: The expanded and cleaned line text, akin to etymology_text.
descendants data will also appear for the special case of "Derived terms" and "Extensions" sections for words that are roots in reconstructed languages, as these sections have the same format.
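The depth values can be used to rebuild the nesting of the descendants list with a stack. The following is a sketch with a hypothetical helper name (build_tree) and placeholder line texts; it relies only on deeper lines having larger depth values, so it works whether the top level is numbered 0 or 1.

```python
from typing import List


def build_tree(descendants: List[dict]) -> List[dict]:
    """Nest a flat descendants list into {"text", "children"} nodes by depth."""
    root = {"text": None, "children": []}
    stack = [(-1, root)]  # (depth, node); the root sits above every real line
    for item in descendants:
        depth = max(0, item.get("depth", 0))
        node = {"text": item.get("text", ""), "children": []}
        # Pop back up to the nearest line that is shallower than this one.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root["children"]


# Placeholder lines; real entries also carry a "templates" list.
flat = [
    {"depth": 0, "text": "A"},
    {"depth": 1, "text": "A1"},
    {"depth": 2, "text": "A1a"},
    {"depth": 1, "text": "A2"},
    {"depth": 0, "text": "B"},
]
tree = build_tree(flat)
print([node["text"] for node in tree])
```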
Linkages to other words
Linkages (synonyms, antonyms, hypernyms, holonyms, meronyms, derived, related, coordinate_terms) are stored in the word's data if not sense-disambiguated, and in the word sense if sense-disambiguated. They are lists of dictionaries, where each dictionary can contain the following keys, among others:
- alt - optional alternative form of the target (e.g., in a different script)
- english - optional English text associated with the sense, usually identifying the linked target sense
- roman - optional romanization of a linked word in a non-Latin script
- sense - text identifying the word sense or context (e.g., "to rain very heavily")
- tags - qualifiers specified for the sense (e.g., field of study, region, dialect, style)
- taxonomic - optional taxonomic name associated with the linkage
- topics - list of topic descriptors for the linkage (e.g., military)
- word - the word this links to (string)
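A consumer that does not care whether a linkage was sense-disambiguated can merge both levels while remembering where each link came from. This is a minimal sketch: the helper name collect_linkages is hypothetical, the derived link is taken from the thrill example, and the sense-level synonym is made up for illustration.

```python
from typing import List, Optional, Tuple


def collect_linkages(entry: dict, key: str) -> List[Tuple[Optional[int], str]]:
    """Return (sense index or None, linked word) pairs for one linkage type."""
    found = [(None, link) for link in entry.get(key, [])]
    for index, sense in enumerate(entry.get("senses", [])):
        found.extend((index, link) for link in sense.get(key, []))
    return [(index, link["word"]) for index, link in found if "word" in link]


thrill = {
    "word": "thrill",
    "derived": [{"word": "enthrill"}],
    "senses": [
        # The sense-level synonym below is a made-up example.
        {"glosses": ["To suddenly excite someone."], "synonyms": [{"word": "electrify"}]},
        {"glosses": ["To (cause something to) tremble or quiver."]},
    ],
}
print(collect_linkages(thrill, "derived"))
print(collect_linkages(thrill, "synonyms"))
```

An index of None marks a word-level (non-disambiguated) link; an integer is the position of the sense in the senses list.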
Category tree data format
The --categories-file option extracts the Wiktionary category tree as JSON into the specified file. The data is extracted from the Wiktionary Lua modules by evaluating them.
The data written to the JSON file is a dictionary, with the top-level keys roots and nodes.
Roots is a list of top-level nodes that are not children of other nodes. Fundamental is the normal top-level node; other roots may reflect errors in the hierarchy in Wiktionary. While not a root, the category all topics contains the subhierarchy of topical categories (e.g., food and drink, nature, sciences, etc.).
Nodes is a dictionary mapping lowercased category name to a dictionary containing data about the category. For each category, the following fields may be present:
- name (always present): non-lowercased name of the category (note, however, that many categories are originally lowercase in the Wiktionary hierarchy)
- desc: optional description of the category
- clean_desc: optional cleaned description of the category, with wikitext formatting cleaned to human-readable text, except {{{langname}}} (and possibly other similar tags) are left intact.
- children: optional list of child categories of the category
- sort: optional list of sorts (types of subcategories?).
The categories are returned as they are in the original Wiktionary category data. Language-specific categories are generally not included. However, there is a category {{{langcat}}} that appears to contain a lot of the categories that have language-specific variants. Also, the category tree data does not contain language prefixes (the tree is defined in Wiktionary without prefixes and then replicated for each language).
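The roots/nodes layout lends itself to a simple traversal. The sketch below walks the subtree under a category using the lowercased lookup described above. The helper name subtree_names is hypothetical, and the tiny tree is made up: in particular, whether all topics is a direct child of Fundamental, and whether children are stored in original or lowercased form, are assumptions (the code lowercases child names before lookup so that either works).

```python
from typing import List


def subtree_names(tree: dict, name: str) -> List[str]:
    """Return category names under (and including) name, depth-first."""
    nodes = tree["nodes"]
    seen = set()
    order: List[str] = []
    stack = [name.lower()]
    while stack:
        key = stack.pop()
        if key in seen or key not in nodes:
            continue  # guard against cycles and dangling children
        seen.add(key)
        order.append(nodes[key]["name"])
        # Push children in reverse so they are visited in listed order.
        for child in reversed(nodes[key].get("children", [])):
            stack.append(child.lower())
    return order


# Made-up miniature tree in the documented shape.
tree = {
    "roots": ["Fundamental"],
    "nodes": {
        "fundamental": {"name": "Fundamental", "children": ["all topics"]},
        "all topics": {
            "name": "all topics",
            "children": ["food and drink", "nature", "sciences"],
        },
        "food and drink": {"name": "food and drink"},
        "nature": {"name": "nature"},
        "sciences": {"name": "sciences"},
    },
}
print(subtree_names(tree, "Fundamental"))
```

With real data, the tree would be loaded with json.load() from the file given to --categories-file.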
Related packages
The wikitextprocessor is a generic module for extracting data from Wiktionary, Wikipedia, and other Wikimedia dump files. wiktextract is built using this module.
When using a version of wiktextract from GitHub, please also set up wikitextprocessor so that they have rough parity. The PyPI versions of these packages are usually out-of-date, and mixing a newer version with an older one will lead to bugs. These packages are being developed in parallel.
The wiktfinnish package can be used to interpret Finnish noun declensions and verb conjugations and for generating Finnish inflected word forms.
Publications
If you use Wiktextract or the extracted data in academic work, please cite the following article:
Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured data, Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022.
Linking to https://kaikki.org or the relevant sub-pages would also be greatly appreciated.
Related tools
A few other tools also exist for parsing Wiktionaries. These include Dbnary, Wikiparse, and DKPro JWKTL.
Contributing and reporting bugs
Please report bugs and other issues on GitHub. I also welcome suggestions for improvement.
Please email ylo at clausal.com if you wish to contribute or have patches or suggestions.
License
Copyright (c) 2018-2020 Tatu Ylonen. This package is free for both commercial and non-commercial use. It is licensed under the MIT license. See the file LICENSE for details. (Certain files have different open source licenses.)