go-wiktionary-parse

This is a tool that parses language dumps from Wiktionary and stores the results in a SQLite database.

Quickstart

git clone https://github.com/macdub/go-wiktionary-parse
cd go-wiktionary-parse
wget https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2
bzip2 -d enwiktionary-latest-pages-articles.xml.bz2
go install .
go-wiktionary-parse -file enwiktionary-latest-pages-articles.xml -threads 20 -database test.db

Usage

Usage of wiktionary-parser:
    -cache_file string
        Use this as the cache file (default "xmlCache.gob")
    -database string
        Database file to use (default "database.db")
    -file string
        XML file to parse
    -lang string
        Language to target for parsing (default "English")
    -log_file string
        Log to this file
    -make_cache
        Make a cache file of the parsed XML
    -threads int
        Set the number of threads to use for parsing (default 5)
    -use_cache
        Use a 'gob' of the parsed XML file
    -purge
        Purge the existing database provided by the database flag
    -verbose
        Use verbose logging
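
As an illustration of how the flags above fit together, here is a minimal sketch that declares the same flag names and defaults with Go's standard flag package. The Options struct and the parseOptions helper are hypothetical names introduced for this example; they are not taken from the project's source, which may organize its flags differently.

```go
package main

import (
	"flag"
	"fmt"
)

// Options mirrors the flags listed in the usage text.
// The struct itself is an assumption made for this sketch.
type Options struct {
	CacheFile string
	Database  string
	File      string
	Lang      string
	LogFile   string
	MakeCache bool
	Threads   int
	UseCache  bool
	Purge     bool
	Verbose   bool
}

// parseOptions is a hypothetical helper that parses args into Options
// using the flag names and defaults shown in the usage text.
func parseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("wiktionary-parser", flag.ContinueOnError)
	fs.StringVar(&o.CacheFile, "cache_file", "xmlCache.gob", "Use this as the cache file")
	fs.StringVar(&o.Database, "database", "database.db", "Database file to use")
	fs.StringVar(&o.File, "file", "", "XML file to parse")
	fs.StringVar(&o.Lang, "lang", "English", "Language to target for parsing")
	fs.StringVar(&o.LogFile, "log_file", "", "Log to this file")
	fs.BoolVar(&o.MakeCache, "make_cache", false, "Make a cache file of the parsed XML")
	fs.IntVar(&o.Threads, "threads", 5, "Set the number of threads to use for parsing")
	fs.BoolVar(&o.UseCache, "use_cache", false, "Use a 'gob' of the parsed XML file")
	fs.BoolVar(&o.Purge, "purge", false, "Purge the existing database provided by the database flag")
	fs.BoolVar(&o.Verbose, "verbose", false, "Use verbose logging")
	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	return o, nil
}

func main() {
	// Same arguments as the Quickstart invocation.
	opts, err := parseOptions([]string{
		"-file", "enwiktionary-latest-pages-articles.xml",
		"-threads", "20",
		"-database", "test.db",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

Flags that are not passed keep the defaults from the usage text, so the Quickstart invocation still targets English and uses xmlCache.gob as the cache file name.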

Build

Dependencies

Build

$ go build -o wiktionary-parser main.go

Current Limitations

Database

Structure

COLUMN          TYPE
id              integer
word            text
lemma           text
etymology_no    integer
definition_no   integer
definition      text
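
To make the column layout concrete, the sketch below models one row as a Go struct and groups a word's definitions by etymology. It assumes that etymology_no and definition_no number the etymologies and definitions within a single word, which the table does not state outright. The Row type, the groupByEtymology helper, and the sample rows are all hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Row mirrors one record of the table described above.
// The Go type is an assumption; only the column names come from the table.
type Row struct {
	ID           int    // id
	Word         string // word
	Lemma        string // lemma
	EtymologyNo  int    // etymology_no
	DefinitionNo int    // definition_no
	Definition   string // definition
}

// groupByEtymology returns the definitions of the given rows keyed by
// etymology number, with each group ordered by definition number.
func groupByEtymology(rows []Row) map[int][]string {
	sorted := make([]Row, len(rows))
	copy(sorted, rows)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].EtymologyNo != sorted[j].EtymologyNo {
			return sorted[i].EtymologyNo < sorted[j].EtymologyNo
		}
		return sorted[i].DefinitionNo < sorted[j].DefinitionNo
	})
	grouped := make(map[int][]string)
	for _, r := range sorted {
		grouped[r.EtymologyNo] = append(grouped[r.EtymologyNo], r.Definition)
	}
	return grouped
}

func main() {
	// Made-up sample rows, not real output from the tool.
	rows := []Row{
		{ID: 2, Word: "example", Lemma: "example", EtymologyNo: 0, DefinitionNo: 1, Definition: "second sense"},
		{ID: 1, Word: "example", Lemma: "example", EtymologyNo: 0, DefinitionNo: 0, Definition: "first sense"},
		{ID: 3, Word: "example", Lemma: "example", EtymologyNo: 1, DefinitionNo: 0, Definition: "unrelated sense"},
	}
	for etym, defs := range groupByEtymology(rows) {
		fmt.Println(etym, defs)
	}
}
```

Reading rows from the actual database would additionally need a SQLite driver for database/sql, which is left out here to keep the sketch within the standard library.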

Statistics