Lemmatization Lists
These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion in full-text search: if a user searches for the lemma walk, the query is expanded to also match the tokens walking, walked, etc.
The lists are plain text files (zipped). Each line contains one lemma/token pair in this order: lemma, tab character, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.
- Asturian (ast) (108,792 pairs)
- Bulgarian (bg) (30,323 pairs)
- Catalan (ca) (591,534 pairs)
- Czech (cs) (36,400 pairs)
- English (en) (41,760 pairs)
- Estonian (et) (80,536 pairs)
- French (fr) (224,002 pairs)
- Galician (gl) (392,856 pairs)
- German (de) (358,473 pairs)
- Hungarian (hu) (39,898 pairs)
- Irish (ga) (415,502 pairs)
- Italian (it) (341,074 pairs)
- Manx Gaelic (gv) (67,177 pairs)
- Persian/Farsi (fa) (6,273 pairs)
- Polish (pl) (3,296,232 pairs)
- Portuguese (pt) (850,264 pairs)
- Romanian (ro) (314,810 pairs)
- Russian (ru) (537,810 pairs)
- Scottish Gaelic (gd) (51,624 pairs)
- Slovak (sk) (858,414 pairs)
- Slovene (sl) (99,063 pairs)
- Spanish (es) (497,560 pairs)
- Swedish (sv) (675,137 pairs)
- Ukrainian (uk) (193,703 pairs)
- Welsh (cy) (359,224 pairs)
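As a rough illustration of the file format and the query expansion described above, here is a minimal Python sketch. It is not the code used in the Global Glossary project; the function names `load_pairs` and `expand_query` are made up for this example, and the in-memory sample stands in for an unzipped list file.

```python
import io
from collections import defaultdict


def load_pairs(stream):
    """Build a lemma -> set-of-tokens index from 'lemma<TAB>token' lines."""
    index = defaultdict(set)
    for line in stream:
        line = line.rstrip("\r\n")  # the files use Windows-style (CRLF) line breaks
        if not line:
            continue  # ignore blank lines
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # ignore malformed lines that contain no tab
        index[lemma].add(token)
    return index


def expand_query(lemma, index):
    """Return the lemma together with every token listed for it, sorted."""
    return sorted({lemma} | index.get(lemma, set()))


# Tiny in-memory sample standing in for an unzipped list file (hypothetical data).
sample = io.StringIO("walk\twalking\r\nwalk\twalked\r\nwalk\twalks\r\n", newline="")
index = load_pairs(sample)
print(expand_query("walk", index))  # ['walk', 'walked', 'walking', 'walks']
```

To read a real list, unzip it and pass an open file instead of the sample, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever you saved the file. Keeping the index keyed by lemma matches the lemma-first column order, so expansion is a single dictionary lookup.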
Licence
- Available under the Open Database License
Sources
- Various Hunspell dictionaries from the OpenOffice.org website
- Deutsches Morphologie-Lexikon by Daniel Naber
- Lexique by Boris New and Christophe Pallier
- e_lemma.txt by Yasumasa Someya
- Multext East (only those morphological lexicons that are under a free licence are used)
- Morphological dictionaries from FreeLing
- SALDO morphological lexicon
- Irish National Morphology Database
- Various lists by Kevin Scannell
- OpenRussian.org