english-corpus-nepal
A monolingual corpus scraped from English-language newspapers in Nepal. The main purpose is to collect Nepal-related content in English for use in domain-specific natural language processing in the Nepali language.
Files
The source articles are zipped inside the source folder, and the consolidated sentence-level files are in the root directory.
The lists of links that have already been crawled are inside the crawl-lists folder. If you need more data, you can skip these links in your own crawls.
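As a minimal sketch of how the crawl lists could be used to skip links that are already collected, the Python below assumes each file in crawl-lists is plain text with one URL per line. That file format, and the helper names load_crawled_links and filter_new_links, are assumptions made for illustration; check the actual files before relying on it.

```python
from pathlib import Path


def load_crawled_links(list_dir):
    """Collect every non-empty line from the files in list_dir.

    Assumption: each file holds one URL per line.
    """
    crawled = set()
    for path in sorted(Path(list_dir).glob("*")):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                crawled.update(line.strip() for line in handle if line.strip())
    return crawled


def filter_new_links(candidates, crawled):
    """Keep only candidate URLs not already in the crawled set, preserving order."""
    return [url for url in candidates if url not in crawled]
```

A new crawl could call load_crawled_links("crawl-lists") once and pass its own candidate URLs through filter_new_links before fetching anything.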
Crawls
- October 9, 2019 (The Kathmandu Post): 3849 article items, 115890 sentences after removing repetitions
- October 10, 2019 (The Annapurna Express): 385 article items, 15263 sentences
- October 10, 2019 (Republica): 6087 article items, 121858 sentences
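The sentence counts above are reported after removing repetitions, but the exact procedure is not described here. One simple possibility, shown only as a sketch, is exact-match deduplication that keeps the first occurrence of each sentence. The function name remove_repeated_sentences is hypothetical, and the whitespace stripping is an assumption.

```python
def remove_repeated_sentences(sentences):
    """Drop empty lines and exact repeats, keeping the first occurrence of each sentence."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.strip()  # assumption: surrounding whitespace is not significant
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Applying this to the lines of a sentence-level file would give a list whose length is comparable to the counts listed above, provided the original deduplication was also exact-match.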