SeedLing

Building and using a seed corpus for the Human Language Project (Abney and Bird, 2010).

The SeedLing corpus in this repository includes data from:

The SeedLing API includes scripts to access data/information from:

FAQs:


Usage

To access the SeedLing data from the various sources:

from seedling import udhr, omniglot, odin

# Accessing ODIN IGTs:
>>> for lang, igts in odin.igts():
...     for igt in igts:
...         print lang, igt

# Accessing Omniglot phrases:
>>> for lang, sent, trans in omniglot.phrases():
...     print lang, sent, trans

# Accessing UDHR sentences:
>>> for lang, sent in udhr.sents():
...     print lang, sent
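
For instance, the three generators above can be combined to get a rough per-language overview of the corpus. The snippet below is only a minimal sketch: it assumes nothing beyond the udhr.sents(), omniglot.phrases() and odin.igts() calls shown above, and the counting logic itself is not part of the SeedLing API.

from collections import Counter

from seedling import udhr, omniglot, odin

# Rough per-language item counts, built only from the generator calls
# documented above; the counting logic is illustrative.
udhr_counts = Counter(lang for lang, sent in udhr.sents())
omniglot_counts = Counter(lang for lang, sent, trans in omniglot.phrases())

odin_counts = Counter()
for lang, igts in odin.igts():
    odin_counts[lang] += len(igts)

# Print one line per language: UDHR sentences, Omniglot phrases, ODIN IGTs.
for lang in sorted(set(udhr_counts) | set(omniglot_counts) | set(odin_counts)):
    print lang, udhr_counts[lang], omniglot_counts[lang], odin_counts[lang]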

To access the SIL and WALS information:

from seedling import miniwals, miniethnologue

# Accessing SIL ISO codes.
>>> sil = miniethnologue.MiniSIL()
>>> print sil.ISO6393['eng']
{'iso6391': u'en', 'name': u'English', 'iso6392t': u'eng', 'invert': u'English', 'ismacro': False, 'scope': 'Individual', 'type': 'Living', 'iso6392b': u'eng'}

# Accessing WALS information
>>> wals = miniwals.MiniWALS()
>>> print wals['eng']
{u'glottocode': u'stan1293', u'name': u'English', u'family': u'Indo-European', u'longitude': u'0.0', u'sample 200': u'True', u'latitude': u'52.0', u'genus': u'Germanic', u'macroarea': u'Eurasia', u'sample 100': u'True'}

Detailed usage of the API can also be found in demo.py.
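
As a further illustration, the WALS metadata can be joined with the corpus data. The sketch below assumes that the language keys yielded by udhr.sents() match the codes accepted by MiniWALS and that a missing language raises a KeyError; neither assumption is guaranteed by this README.

from seedling import udhr, miniwals

# ASSUMPTION: udhr.sents() language keys match the MiniWALS keys, and a
# missing language raises KeyError.
wals = miniwals.MiniWALS()

families = {}
seen = set()
for lang, sent in udhr.sents():
    if lang in seen:
        continue
    seen.add(lang)
    try:
        family = wals[lang]['family']
    except KeyError:
        family = 'unknown'
    families.setdefault(family, []).append(lang)

# Print the number of UDHR languages found per WALS family.
for family, langs in sorted(families.items()):
    print family, len(langs)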


Getting Wikipedia

There are two ways to access the Wikipedia data:

  1. Plant your own Wiki
  2. Access it from our cloud storage

Plant your own Wiki

We encourage SeedLing users to take part in building the Wikipedia data from the SeedLing corpus. A fruitful experience, you will find.

Please ENSURE that you have sufficient space on your hard disk (~50-70 GB); downloading and cleaning ALL languages available in Wikipedia might take up to a week.

For the lazy: run the script plant_wiki.py and it will produce the cleaned plaintext Wikipedia data as presented in the SeedLing publication:

$ python plant_wiki.py &

For more detailed, step-by-step instructions:

import os
import codecs

from seedling.wikipedia import clean

extracted_wiki_dir = "/home/yourusername/path/to/extracted/wiki/"
cleaned_wiki_dir = "/home/yourusername/path/to/cleaned/wiki/"

# Clean every extracted Wikipedia file and write the plaintext result
# to the corresponding file in the cleaned directory.
for filename in os.listdir(extracted_wiki_dir):
    infile = os.path.join(extracted_wiki_dir, filename)
    outfile = os.path.join(cleaned_wiki_dir, filename)
    with codecs.open(infile, 'r', 'utf8') as fin, codecs.open(outfile, 'w', 'utf8') as fout:
        fout.write(clean(fin.read()))
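
After the loop finishes, a quick sanity check of the cleaned output can be useful. The snippet below is only an illustrative sketch; it reuses the cleaned_wiki_dir path from above and does not use the SeedLing API at all.

import os
import codecs

cleaned_wiki_dir = "/home/yourusername/path/to/cleaned/wiki/"

# Print rough per-file statistics (line and character counts) so that
# obviously empty or truncated files stand out.
for filename in sorted(os.listdir(cleaned_wiki_dir)):
    path = os.path.join(cleaned_wiki_dir, filename)
    with codecs.open(path, 'r', 'utf8') as fin:
        text = fin.read()
    print filename, len(text.splitlines()), len(text)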

Please feel free to contact the collaborators on the SeedLing project if you encounter problems getting the Wikipedia data.

Access it from our cloud storage

To be updated.


Cite

To cite the SeedLing corpus:

Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL). Baltimore, USA.

in bibtex:

@InProceedings{seedling2014,
  author    = {Guy Emerson and Liling Tan and Susanne Fertmann and Alexis Palmer and Michaela Regneri},
  title     = {SeedLing: Building and using a seed corpus for the Human Language Project},
  booktitle = {Proceedings of the Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL)},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {},
  url       = {}
}

References

Steven Abney and Steven Bird. 2010. The Human Language Project: Building a Universal Corpus of the World's Languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010). Uppsala, Sweden.