Vortlisto

The goal of "vortlisto" ("word list") is to produce a machine-readable vocabulary for English-speaking learners of Esperanto.

The source is based on a curated word list produced as part of a UK high school (GCSE) Esperanto syllabus that ran until 1996. This source has several advantages: there is some academic consensus around the selection; the words are grouped by linguistic categories; "basic" words are tagged; and, perhaps most importantly, the list already exists online in a somewhat machine-readable state, thanks to Bill Walker, who created a giant HTML page of it all with translations.

Contents

  1. gcselist.htm - this is the original HTML document
  2. vortlisto.htm - the original with some additional sub-categories and additional words
  3. csv/*.csv - a series of CSV files containing Esperanto and English words; each file name reflects the category and level
  4. vortlisto.rb - a Ruby program to convert the fixed HTML doc into the CSV files
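
The file names under csv/ follow a category-level pattern (for example general_color-basic.csv). A minimal sketch of how another program might recover the two parts is shown here; the method name parse_list_name is hypothetical and is not part of vortlisto.rb, and the pattern is inferred only from the file names listed on this page.

```ruby
# Split a CSV file name such as "general_color-basic.csv" into its
# category and level. The naming pattern is inferred from the file
# names in this repository; parse_list_name is a hypothetical helper.
def parse_list_name(path)
  base = File.basename(path, ".csv")
  # Split on the LAST hyphen, so hyphens inside a category would survive.
  category, _separator, level = base.rpartition("-")
  { category: category, level: level }
end

info = parse_list_name("csv/general_color-basic.csv")
puts "#{info[:category]} (#{info[:level]})"
```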

Usage

If you wish to use vortlisto.rb's download capability, first run gem install httparty.

Run with ./vortlisto.rb; the run time is about half a second.

Use in other applications

Cerego

So far, I've successfully uploaded a CSV file directly to Cerego, a spaced repetition learning site and app. I'm a fan of Cerego since it really seems to work, the UI and UX are pretty good, and they publish an API.

The first word list I uploaded was the "General Basic" list, which contained 396 words. It turns out Cerego struggles with a word list this big; the app spins while processing all the data (there's a lot of work per word for its learning algorithm). Cerego recommends lists of fewer than 100 associations. The other issue is that Cerego presents the words in the order they're uploaded, which currently means grinding through a giant alphabetized list. Cerego is already full of progress markers, so alphabetizing only seems to be a disadvantage.
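
As a rough illustration of the size limit, the sketch below picks out the lists that reach 100 entries or more. The method name oversized_lists and the LIMIT constant are hypothetical; the sample counts are copied from the Statistics section of this page, and each line of a CSV file is assumed to hold one word pair.

```ruby
# Report which word lists are too long for comfortable use in Cerego.
# LIMIT reflects the guidance of keeping lists under 100 associations;
# oversized_lists is a hypothetical helper, not part of vortlisto.rb.
LIMIT = 100

def oversized_lists(counts, limit = LIMIT)
  counts.select { |_name, count| count >= limit }   # keep lists at or over the limit
        .sort_by { |_name, count| -count }          # largest first
        .map(&:first)                               # return file names only
end

# Sample counts taken from the Statistics section of this page.
counts = {
  "shopping-basic.csv" => 95,
  "holidays-basic.csv" => 107,
  "general-basic.csv"  => 185
}

puts oversized_lists(counts)
# prints:
# general-basic.csv
# holidays-basic.csv
```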

Next steps

The "General Basic" list and some other lists need to be made much smaller: fewer than 100 items each. Two options are to arrange the words into chunks of, say, 40 words by observed frequency (so the most common words are learned first), or to split them by additional categories like "color", "prepositions", etc. Those subcategories could also be ordered by frequency, if only to break up the alphabetizing.
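
The frequency-based option could be sketched roughly as follows. This is only an illustration under stated assumptions: the method name chunk_by_frequency is hypothetical, the frequency counts are made up, and no real frequency data ships with this project. Words missing from the frequency table are treated as frequency 0 and so end up last.

```ruby
# Order word pairs by a frequency score (highest first) and cut the
# result into fixed-size chunks. chunk_by_frequency is a hypothetical
# helper; ties are broken alphabetically by the Esperanto word.
def chunk_by_frequency(pairs, frequency, size = 40)
  pairs
    .sort_by { |esperanto, _english| [-frequency.fetch(esperanto, 0), esperanto] }
    .each_slice(size)
    .to_a
end

pairs = [["akvo", "water"], ["kaj", "and"], ["la", "the"], ["hundo", "dog"]]
frequency = { "la" => 900, "kaj" => 700, "akvo" => 40 } # made-up counts

# A chunk size of 2 keeps the demonstration small; 40 is the default.
chunks = chunk_by_frequency(pairs, frequency, 2)
chunks.each_with_index do |chunk, i|
  puts "chunk #{i + 1}: #{chunk.map(&:first).join(', ')}"
end
# prints:
# chunk 1: la, kaj
# chunk 2: akvo, hundo
```

Each chunk could then be written out as its own CSV file, which would keep every upload well under the 100-item mark.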

So far I have created some additional sub-categories within "General Basic", which has reduced that list to about half its original size.

Statistics

Here is a list of word counts for each section and level:

$ wc -l *-basic.csv | sort -n
      12 general_color-basic.csv
      15 house_and_home-basic.csv
      19 general_pronouns-basic.csv
      21 general_position-basic.csv
      23 money-basic.csv
      32 weather-basic.csv
      39 places-basic.csv
      41 general_numbers_and_measurements-basic.csv
      42 affixes-basic.csv
      46 services-basic.csv
      50 health_and_welfare-basic.csv
      51 language-basic.csv
      53 general_correlatives-basic.csv
      58 general_time-basic.csv
      73 travel-basic.csv
      92 life_at_home-basic.csv
      95 shopping-basic.csv
     106 free_time_entertainment-basic.csv
     107 holidays-basic.csv
     119 education_and_career-basic.csv
     142 relations_with_others-basic.csv
     158 food_and_drink-basic.csv
     185 general-basic.csv
     266 personal_identification-basic.csv
    1845 total
$ wc -l *-advanced.csv | sort -n
       9 house_and_home-advanced.csv
      27 places-advanced.csv
      30 money-advanced.csv
      30 services-advanced.csv
      30 weather-advanced.csv
      31 life_at_home-advanced.csv
      57 language-advanced.csv
      57 shopping-advanced.csv
      65 health_and_welfare-advanced.csv
      65 travel-advanced.csv
      71 education_and_career-advanced.csv
      71 holidays-advanced.csv
      91 food_and_drink-advanced.csv
      93 free_time_entertainment-advanced.csv
     107 personal_identification-advanced.csv
     152 relations_with_others-advanced.csv
     986 total

Credits