Awesome
World Factbook Corpus
The CIA World Factbook is a Public Domain data set comprising of geographical, economic and political data on every country in the world.
Data types include free text, currency, percentages, longitude & latitude, altitude, taxonomies, and as such it makes a viable test & demonstration corpus for search applications, on top of the intrinsic value of the data.
Since the Factbook is not available in an easily machine-readable format, we've created a crawler to extract the data in a way that should be easier to consume.
Implementation
The crawler was written using Node.js and outputs in both XML and JSON. Pre-generated output is provided.
Run the crawler
The command below will extract data from the dataset in ./factbook-crawler/data and export it to ./data
node factbook-crawler/index.js
Use the data
var fs = require('fs'),
path = require('path');
fs.readdirSync('./data/json').forEach(function(file){
var country = JSON.parse(fs.readFileSync('./data/json/'+file));
console.log( country.name )
});