Home

Awesome

<div align="center"> <div><b>wtf_wikipedia</b></div> <img src="https://user-images.githubusercontent.com/399657/68222691-6597f180-ffb9-11e9-8a32-a7f38aa8bded.png"/> <div>parse data from wikipedia</div> <div><code>npm install wtf_wikipedia</code></div> <div align="center"> <sub> by <a href="https://github.com/spencermountain">Spencer Kelly</a> and <a href="https://github.com/spencermountain/wtf_wikipedia/graphs/contributors"> many contributors </a> </sub> </div> <img height="25px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div> <div align="center"> <div> <a href="https://npmjs.org/package/wtf_wikipedia"> <img src="https://img.shields.io/npm/v/wtf_wikipedia.svg?style=flat-square" /> </a> <a href="https://bundlephobia.com/result?p=wtf_wikipedia"> <img src="https://img.shields.io/bundlephobia/min/wtf_wikipedia" /> </a> </div> </div> <!-- spacer --> <img height="15px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <div align="left"> <img height="30px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <img height="30px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> it is <a href="https://osr.cs.fau.de/wp-content/uploads/2017/09/wikitext-parser.pdf">very</a>, <a href="https://utcc.utoronto.ca/~cks/space/blog/programming/ParsingWikitext">very</a> hard. &nbsp; &nbsp; <span> &nbsp; &nbsp; we're <a href="https://en.wikipedia.org/wiki/Wikipedia_talk:Times_that_100_Wikipedians_supported_something">not</a> <a href="https://twitter.com/ftrain/status/1036060636587978753">joking</a>.</span> </div> <!-- einstein sentence --> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/43598341-75ca8f94-9652-11e8-9b91-cabae4fb1dce.png"/> </div> <div align="center"> <img height="30px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> </div> <div > why do we always do this? <br/> <img height="30px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> we put our information where we can't take it out. </div> <!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>
import wtf from 'wtf_wikipedia'

let doc = await wtf.fetch('Toronto Raptors')
let coach = doc.infobox().get('coach')
coach.text() //'Darko Rajaković'
<div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div> <!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

.text()

get clean plaintext:

let str = `[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall. <ref>Field of our Fathers: By Richard Johnson</ref>`
wtf(str).text()
// "Boston's baseball field has a 37ft wall."
let doc = await wtf.fetch('Glastonbury', 'en')
doc.sentences()[0].text()
// 'Glastonbury is a town and civil parish in Somerset, England, situated at a dry point ...'
<div align="right"> <a href="https://observablehq.com/@spencermountain/wtf-wikipedia-text">.text() docs</a> </div> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221837-0d142480-ffb8-11e9-9d30-90669f1b897c.png"/> </div>

.json()

get all the data from a page:

let doc = await wtf.fetch('Whistling')

doc.json()
// { categories: ['Oral communication', 'Vocal skills'], sections: [{ title: 'Techniques' }], ...}

the default .json() output is really verbose, but you can cherry-pick data by poking-around like this:

// get just the links:
doc.links().map((link) => link.json())
//[{ page: 'Theatrical superstitions', text: 'supersitions' }]

// just the images:
doc.images()[0].json()
// { file: 'Image:Duveneck Whistling Boy.jpg', url: 'https://commons.wiki...' }

// json for a particular section:
doc.section('see also').links()[0].json()
// { page: 'Slide Whistle' }
<div align="right"> <a href="https://observablehq.com/@spencermountain/wtf-wikipedia-json">.json() docs</a> </div> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div> <!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

run it on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  wtf.fetch('Radiohead', { 'Api-User-Agent': 'Name your script here' }, function (err, doc) {
    let members = doc.infobox().get('current members')
    members.links().map((l) => l.page())
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  })
</script>

or the server-side:

import wtf from 'wtf_wikipedia'
// or,
const wtf = require('wtf_wikipedia')
<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221837-0d142480-ffb8-11e9-9d30-90669f1b897c.png"/> </div>

full wikipedia dumps

With this library, in conjunction with dumpster-dive, you can parse the whole english wikipedia in an aftertoon.

npm install -g dumpster-dive
<img height="280px" src="https://user-images.githubusercontent.com/399657/40262198-a268b95a-5ad3-11e8-86ef-29c2347eec81.gif"/> <div align="right"> <a href="https://github.com/spencermountain/dumpster-dive/">dumpster docs</a> </div> <!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

Ok first, 🛀

Wikitext is no small thing.

Consider:

this library supports many recursive shenanigans, depreciated and obscure template variants, and illicit wiki-shorthands.

What it does:

What doesn't do:

It is built to be as flexible as possible. In all cases, tries to fail in considerate ways.

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

How about html scraping..?

Wikimedia's official parser turns wikitext ➔ HTML.

<!-- You can even get html from the api [like this](https://en.wikipedia.org/w/api.php?format=json&origin=*&action=parse&prop=text&page=Whistling). -->

if you prefer this screen-scraping workflow, you can pluck at parts of a page like that.

that's cool!

getting structured data this way is still a complex, weird process. Manually spelunking the html is sometimes just as tricky and error-prone as scanning the wikitext itself.

The contributors to this library have come to that conclusion, as many others have.

This library is gracious to the Parsoid contributors.

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

okay,

flip your wikitext into a Doc object

import wtf from 'wtf_wikipedia'

let txt = `
==Wood in Popular Culture==
* Harry Potter's wand
* The Simpson's fence
`
wtf(txt)
// Document {text(), json(), lists()...}

doc.links()

let txt = `Whistling is featured in a number of television shows, such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
wtf(txt)
  .links()
  .map((l) => l.page())
// [ 'Lassie (1954 TV series)',  'The X-Files' ]

doc.text()

returns nice plain-text of the article

let txt =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>"
wtf(txt).text()
//"Boston's baseball field has a 37ft wall."

doc.sections():

a section is a heading '==Like This=='

wtf(page).sections()[1].children() //traverse nested sections
wtf(page).section('see also').remove() //delete one

doc.sentences()

let s = wtf(page).sentences()[4]
s.links()
s.bolds()
s.italics()
s.text()
s.wikitext()

doc.categories()

await wtf.fetch('Whistling').categories()
//['Oral communication', 'Vocal music', 'Vocal skills']

doc.images()

let img = wtf(page).images()[0]
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..
<!-- spacer --> <img height="15px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

Fetch

You can grab and parse articles from any wiki api. This includes any language, any wiki-project, and most 3rd-party wikis.

// 3rd-party wiki
let doc = await wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy')

// wikipedia français
doc = await wtf.fetch('Tony Hawk', 'fr')
doc.sentence().text() // 'Tony Hplawk est un skateboarder professionnel et un acteur ...'

// accept an array, or wikimedia pageIDs
let docs = wtf.fetch(['Whistling', 2983], { follow_redirects: false })

// article from german wikivoyage
wtf.fetch('Toronto', { lang: 'de', wiki: 'wikivoyage' }).then((doc) => {
  console.log(doc.sentences()[0].text()) // 'Toronto ist die Hauptstadt der Provinz Ontario'
})

you may also pass the wikipedia page id as parameter instead of the page title:

let doc = await wtf.fetch(64646, 'de')

the fetch method follows redirects.

API plugin

wtf.getCategoryPages(title, [options])

retrieves all pages and sub-categories belonging to a given category:

wtf.extend(require('wtf-plugin-api'))
let result = await wtf.getCategoryPages('Category:Politicians_from_Paris')
/*
{
  [
    {"pageid":52502362,"ns":0,"title":"William Abitbol"},
    {"pageid":50101413,"ns":0,"title":"Marie-Joseph Charles des Acres de L'Aigle"}
    ...
    {"pageid":62721979,"ns":14,"title":"Category:Councillors of Paris"},
    {"pageid":856891,"ns":14,"title":"Category:Mayors of Paris"}
  ]
}
*/

wtf.random([options])

fetches a random wikipedia article, from a given language or domain

wtf.extend(require('wtf-plugin-api'))
wtf.random().then((doc) => {
  console.log(doc.title(), doc.categories())
  //'Whistling'  ['Oral communication', 'Vocal skills']
})

see wtf-plugin-api

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

Tutorials

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

Plugins

these add all sorts of new functionality:

wtf.extend(require('wtf-plugin-classify'))
await wtf.fetch('Toronto Raptors').classify()
// 'Organization/SportsTeam'

wtf.extend(require('wtf-plugin-summary'))
await wtf.fetch('Pulp Fiction').summary()
// 'a 1994 American crime film'

wtf.extend(require('wtf-plugin-person'))
await wtf.fetch('David Bowie').birthDate()
// {year:1947, date:8, month:1}

wtf.extend(require('wtf-plugin-i18n'))
await wtf.fetch('Ziggy Stardust', 'fr').infobox().json()
// {nom:{text:"Ziggy Stardust"}, oeuvre:{text:"The Rise and Fall of Ziggy Stardust"}}
Plugin
classifyperson/place/thing
summaryshort description text
personbirth/death information
apifetch more data from the API
i18nimproves multilingual template coverage
wtf-mlbfetch baseball data
wtf-nhlfetch hockey data
nsfwflag sexual/graphic/adult articles
imageadditional methods for .images()
htmloutput html
wikitextoutput wikitext
markdownoutput markdown
latexoutput latex
<div align="right"> <a href="https://observablehq.com/@spencermountain/wtf-wikipedia-plugins">plugin docs</a> </div> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

Good practice:

The wikipedia api is pretty welcoming though recommends three things, if you're going to hit it heavily -

wtf
  .fetch(['Royal Cinema', 'Aldous Huxley'], {
    lang: 'en',
    'Api-User-Agent': 'spencermountain@gmail.com',
  })
  .then((docList) => {
    let links = docList.map((doc) => doc.links())
    console.log(links)
  })

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

Full API

Section

Paragraph

Sentence

Image

Template

Infobox

List

Reference

Table

<div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

Configuration

Adding new methods:

you can add new methods to any class of the library, with wtf.extend()

wtf.extend((models) => {
  // throw this method in there...
  models.Doc.prototype.isPerson = function () {
    return this.categories().find((cat) => cat.match(/people/))
  }
})

await wtf.fetch('Stephen Harper').isPerson()

Adding new templates:

does your wiki use a {{foo}} template? Add a custom parser for it:

wtf.extend((models, templates) => {
  // create a custom parser function
  templates.foo = (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return 'new-text'
  }

  // array-syntax allows easy-labeling of parameters
  templates.foo = ['a', 'b', 'c']

  // number-syntax for returning by param # '{{name|zero|one|two}}'
  templates.baz = 0

  // replace the template with a string '{{asterisk}}' -> '*'
  templates.asterisk = '*'
})

by default, if there's no parser for a template, it will be just ignored and generate an empty string. However, it's possible to configure a fallback parser function to handle these templates:

wtf('some {{weird_template}} here', {
  templateFallbackFn: (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return '[unsupported template]' // or return null to ignore this template
  },
})

you can determine which templates are understood to be 'infoboxes' with the 3rd parameter:

wtf.extend((models, templates, infoboxes) => {
  Object.assign(infoboxes, { person: true, place: true, thing: true })
})
<div align="right"> <a href="https://observablehq.com/@spencermountain/wtf-wikipedia-plugins">plugin docs</a> </div> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

Notes:

3rd-party wikis

by default, a public API is provided by a installed mediawiki application. This means that most wikis have an open api, even if they don't realize it. Some wikis may turn this feature off.

It can usually be found by visiting http://mywiki.com/api.php

to fetch pages from a 3rd-party wiki:

wtf.fetch('Kermit', { domain: 'muppet.fandom.com' }).then((doc) => {
  console.log(doc.text())
})

some wikis will change the path of their API, from ./api.php to elsewhere. If your api has a different path, you can set it like so:

wtf.fetch('2016-06-04_-_J.Fernandes_@_FIL,_Lisbon', { domain: 'www.mixesdb.com', path: 'db/api.php' }).then((doc) => {
  console.log(doc.template('player').json())
})

for image-urls to work properly, the wiki should also have Special:Redirect enabled. Some wikis, (like wikia) have intentionally disabled this.

i18n and multi-language:

wikitext is (amazingly) used across all languages, wikis, and even in right-to-left languages. This parser actually does an okay job at it too.

Wikipedia I18n langauge information for Redirects, Infoboxes, Categories, and Images are included in the library, with pretty-decent coverage.

To improve coverage of i18n templates, use wtf-plugin-i18n

Please make a PR if you see something missing for your language.

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

Builds:

this library ships seperate client-side and server-side builds, to preserve filesize.

the browser version uses fetch() and the server version uses require('https').

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/>

Performance:

It is not the fastest parser, and is very unlikely to beat a single-pass parser in C or Java.

Using dumpster-dive, this library can parse a full english wikipedia in around 4 hours on a macbook.

That's about 100 pages/second, per thread.

<!-- spacer --> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221862-17ceb980-ffb8-11e9-87d4-7b30b6488f16.png"/> <div align="center"> <img height="50px" src="https://user-images.githubusercontent.com/399657/68221824-09809d80-ffb8-11e9-9ef0-6ed3574b0ce8.png"/> </div>

See also:

Other alternative javascript parsers:

and many more!

MIT