Lemmatization Lists

These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion in full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked, etc.
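As a rough illustration of the query expansion described above, the following Python sketch builds a lemma-to-tokens index and expands a search term. The function names and the sample pairs are hypothetical and only for demonstration; they are not part of the lists or of any search engine's API.

```python
from collections import defaultdict


def build_index(pairs):
    """Map each lemma to the set of tokens listed for it."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index


def expand_query(term, index):
    """Return the term itself followed by every token recorded for it as a lemma."""
    # index.get() does not create a new entry for unknown terms.
    tokens = index.get(term, set()) - {term}
    return [term] + sorted(tokens)


# Hypothetical sample data; the real lists are far larger.
SAMPLE_PAIRS = [("walk", "walking"), ("walk", "walked"), ("walk", "walks")]

index = build_index(SAMPLE_PAIRS)
expanded = expand_query("walk", index)
print(expanded)
```

A term with no entry in the index is returned unchanged, so the expansion can be applied to every search term without special-casing unknown words.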

These are plain text files (zipped). Each line contains one lemma/token pair, separated by a tab character, in this order: lemma, tab, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.
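A minimal Python sketch of reading this format follows. The helper names, the archive path, and the member name are assumptions for illustration only, and the use of the "utf-8-sig" codec (which also accepts a leading byte-order mark) is a precaution rather than something the files are known to require.

```python
import io
import zipfile


def read_pairs(text_stream):
    """Yield (lemma, token) tuples from lines of the form lemma<TAB>token."""
    for raw_line in text_stream:
        line = raw_line.rstrip("\r\n")  # drop the Windows-style line break
        if not line:
            continue  # skip blank lines, e.g. at the end of the file
        lemma, token = line.split("\t", 1)
        yield lemma, token


def read_pairs_from_zip(zip_path, member_name):
    """Yield pairs from one text file inside a zip archive.

    zip_path and member_name are hypothetical; use the names of the
    archive and file you actually downloaded.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            # Assumption: "utf-8-sig" reads plain UTF-8 and tolerates a BOM.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            yield from read_pairs(text)


# Small in-memory sample in the same format, so the sketch runs without a download.
sample = io.StringIO("walk\twalking\r\nwalk\twalked\r\n", newline="")
pairs = list(read_pairs(sample))
print(pairs)
```

Reading line by line keeps memory use low for the larger lists; the pairs can then be grouped by lemma for lookup.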

Licence

Sources