LitBank

LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities. Each annotation layer is described in more detail in the publications cited in the sections below.

LitBank currently contains annotations for entities, events, entity coreference, and quotation attribution in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens.

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" href="http://purl.org/dc/dcmitype/Dataset" property="dct:title" rel="dct:type">LitBank</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

Entity annotations

The entity annotation layer of LitBank covers six of the ACE 2005 categories in text: people, facilities, locations, geo-political entities, organizations and vehicles.

The targets of annotation here include both named entities (e.g., Tom Sawyer) and common entities (the boy). These entities can be nested, as in the following:

<img src="img/nested_structure.png" alt="drawing" width="300"/>
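As a rough illustration of what nesting means in practice, the sketch below represents mentions as token-offset spans and checks whether one span lies inside another. The `Mention` type, the `PER` label string, and the example phrase are assumptions made for illustration; they are not taken from the LitBank files.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)
    label: str  # entity category; "PER" is an assumed abbreviation for people


def contains(outer: Mention, inner: Mention) -> bool:
    """Return True if `inner` lies within `outer` and is a different span."""
    same_span = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same_span


def nested_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """List every (outer, inner) pair of nested mentions."""
    return [(o, i) for o in mentions for i in mentions if contains(o, i)]


# Illustrative phrase (hypothetical, not drawn from the corpus).
tokens = ["the", "boy", "'s", "father"]
mentions = [
    Mention(0, 3, "PER"),  # "the boy 's father"
    Mention(0, 1, "PER"),  # "the boy", nested inside the mention above
]

pairs = nested_pairs(mentions)
for outer, inner in pairs:
    print(" ".join(tokens[outer.start:outer.end + 1]), "contains",
          " ".join(tokens[inner.start:inner.end + 1]))
```

Inclusive end offsets are a choice made here for readability; a loader for the actual data should follow whatever offset convention the files use.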

For more, see: David Bamman, Sejal Popat and Sheng Shen, "An Annotated Dataset of Literary Entities," NAACL 2019.

Event annotations

The event layer in LitBank identifies events with asserted realis (depicted as actually taking place, with specific participants at a specific time) -- as opposed to events with other epistemic modalities (hypotheticals, future events, extradiegetic summaries by the narrator).

| Text | Events | Source |
|---|---|---|
| My father’s eyes had closed upon the light of this world six months, when mine opened on it. | {closed, opened} | Dickens, David Copperfield |
| Call me Ishmael. | {} | Melville, Moby Dick |
| His sister was a tall, strong girl, and she walked rapidly and resolutely, as if she knew exactly where she was going and what she was going to do next. | {walked} | Cather, O Pioneers |

For more, see: Matt Sims, Jong Ho Park and David Bamman, "Literary Event Detection," ACL 2019.

Coreference annotations

The coreference layer in LitBank covers the six ACE entity categories outlined above (people, facilities, locations, geo-political entities, organizations and vehicles). While the entity tagging above only covers proper noun phrases (Tom Sawyer) and common noun phrases (the boy), the coreference annotations also cover personal pronouns (he).

We annotate three different categories of coreference phenomena -- coreference of identity (which links a mention in text to a discourse entity); copula (which links an attribute mention to another mention); and apposition (which links an appositive expression to another mention).

| Phenomenon | Example | Source |
|---|---|---|
| Coreference | One may as well begin with [Helen]<sub>x</sub>'s letters to [[her]<sub>x</sub> sister]<sub>y</sub> | Forster, Howards End |
| Copula | <img src="img/copula.png" alt="drawing" width="200"/> | Melville, Bartleby, the Scrivener |
| Apposition | <img src="img/apposition.png" alt="drawing" width="350"/> | Conrad, Heart of Darkness |
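To make the identity-coreference example concrete, the following sketch groups the bracketed mentions from the Forster sentence into chains keyed by entity. The tokenization, the `(start, end, entity_id)` tuple layout, and the `build_chains` helper are assumptions for illustration only and do not reflect the on-disk format of the coreference files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Tokenization assumed for illustration.
tokens = ["One", "may", "as", "well", "begin", "with", "Helen", "'s",
          "letters", "to", "her", "sister"]

# (first token, last token, entity id), with inclusive offsets.
mentions: List[Tuple[int, int, str]] = [
    (6, 6, "x"),    # Helen
    (10, 10, "x"),  # her
    (10, 11, "y"),  # her sister (contains the nested mention "her")
]


def build_chains(tokens: List[str],
                 mentions: List[Tuple[int, int, str]]) -> Dict[str, List[str]]:
    """Group mention strings by the entity they refer to."""
    chains = defaultdict(list)
    for start, end, entity_id in mentions:
        chains[entity_id].append(" ".join(tokens[start:end + 1]))
    return dict(chains)


chains = build_chains(tokens, mentions)
print(chains)  # {'x': ['Helen', 'her'], 'y': ['her sister']}
```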

Annotations largely follow OntoNotes guidelines (though singleton mentions are annotated here), with several departures for literary style (including the distinction between generic/specific mentions, near-identity and the revelation of identity). For more on the coreference criteria used in these annotations, see David Bamman, Olivia Lewke and Anya Mansoor (2020), "An Annotated Dataset of Coreference in English Literature", LREC.

Quotation annotations

The quotation layer in LitBank identifies all instances of direct speech in the text and attributes each one to its speaker.

| Quote | Speaker | Source |
|---|---|---|
| — Come up , Kinch ! Come up , you fearful jesuit ! | Buck_Mulligan-0 | Joyce, Ulysses |
| ‘ Oh dear ! Oh dear ! I shall be late ! ’ | The_White_Rabbit-4 | Carroll, Alice in Wonderland |
| “ Do n't put your feet up there , Huckleberry ; ” | Miss_Watson-26 | Twain, Huckleberry Finn |

This layer captures dialogue set off by quotation marks and other typographical markers (such as dashes in Joyce's Ulysses). It excludes strings wrapped in quotation marks that do not constitute dialogue, such as scare quotes (for jargon, neologisms or irony), titles of works of art, and the mention of a term (as distinct from its use).

Speaker labels are identical to those in the coref/ section of LitBank, to enable linking the quotation and coreference layers.
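Because the speaker labels are shared between the two layers, a quotation can be joined to its speaker's coreference chain with a simple dictionary lookup. The sketch below assumes both layers have already been loaded into memory; the record layout, the `link_quotes` helper, and the mention strings in `chains` are hypothetical placeholders, not the actual file formats or annotations.

```python
from typing import Dict, List, Tuple

# One quotation from the table above, as a (text, speaker label) pair.
quotes: List[Tuple[str, str]] = [
    ("“ Do n't put your feet up there , Huckleberry ; ”", "Miss_Watson-26"),
]

# Hypothetical coreference chains keyed by the same labels.
# The mention strings are invented for illustration.
chains: Dict[str, List[str]] = {
    "Miss_Watson-26": ["Miss Watson", "she"],
}


def link_quotes(quotes: List[Tuple[str, str]],
                chains: Dict[str, List[str]]) -> List[dict]:
    """Attach to each quote the mentions that share its speaker label."""
    linked = []
    for text, speaker in quotes:
        linked.append({
            "quote": text,
            "speaker": speaker,
            # An unknown label yields an empty list rather than an error.
            "mentions": chains.get(speaker, []),
        })
    return linked


linked = link_quotes(quotes, chains)
print(linked[0]["speaker"], linked[0]["mentions"])
```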

For more on the quotation annotations, see this paper.

Tagger

A trained tagger (which can be used to tag entities and events in new text) can be found in the tagger/ directory. (A model for coreference resolution is coming shortly.)

Corpus

The corpus is drawn from the public domain texts on Project Gutenberg and consists of individual works of fiction (both novels and short stories) spanning a mix of high literary style (e.g., Edith Wharton's Age of Innocence, James Joyce's Ulysses) and popular pulp fiction (e.g., H. Rider Haggard's King Solomon's Mines, Horatio Alger's Ragged Dick). We select approximately 2,000 words from each of 100 texts; the total annotated dataset contains 210,532 tokens.

| Gutenberg ID | Date | Author | Title |
|---|---|---|---|
| 514 | 1868 | Alcott, Louisa May | Little Women |
| 18581 | 1904 | Alger, Horatio, Jr. | Adrift in New York: Tom and Florence Braving the World |
| 5348 | 1868 | Alger, Horatio, Jr. | Ragged Dick, Or, Street Life in New York with the Boot-Blacks |
| 158 | 1815 | Austen, Jane | Emma |
| 105 | 1818 | Austen, Jane | Persuasion |
| 1342 | 1813 | Austen, Jane | Pride and Prejudice |
| 1206 | 1914 | Bower, B. M. | The Flying U Ranch |
| 969 | 1848 | Brontë, Anne | The Tenant of Wildfell Hall |
| 1260 | 1847 | Brontë, Charlotte | Jane Eyre: An Autobiography |
| 768 | 1847 | Brontë, Emily | Wuthering Heights |
| 2095 | 1853 | Brown, William Wells | Clotelle: A Tale of the Southern States |
| 113 | 1911 | Burnett, Frances Hodgson | The Secret Garden |
| 6053 | 1778 | Burney, Fanny | Evelina, Or, the History of a Young Lady's Entrance into the World |
| 62 | 1912 | Burroughs, Edgar Rice | A Princess of Mars |
| 78 | 1912 | Burroughs, Edgar Rice | Tarzan of the Apes |
| 2084 | 1903 | Butler, Samuel | The Way of All Flesh |
| 11 | 1865 | Carroll, Lewis | Alice's Adventures in Wonderland |
| 24 | 1913 | Cather, Willa | O Pioneers! |
| 44 | 1915 | Cather, Willa | The Song of the Lark |
| 472 | 1900 | Chesnutt, Charles W. (Charles Waddell) | The House Behind the Cedars |
| 1695 | 1908 | Chesterton, G. K. (Gilbert Keith) | The Man Who Was Thursday: A Nightmare |
| 160 | 1899 | Chopin, Kate | The Awakening, and Selected Short Stories |
| 1155 | 1922 | Christie, Agatha | The Secret Adversary |
| 155 | 1868 | Collins, Wilkie | The Moonstone |
| 219 | 1899 | Conrad, Joseph | Heart of Darkness |
| 974 | 1907 | Conrad, Joseph | The Secret Agent: A Simple Tale |
| 940 | 1826 | Cooper, James Fenimore | The Last of the Mohicans; A narrative of 1757 |
| 73 | 1895 | Crane, Stephen | The Red Badge of Courage: An Episode of the American Civil War |
| 876 | 1861 | Davis, Rebecca Harding | Life in the Iron-Mills; Or, The Korl Woman |
| 521 | 1719 | Defoe, Daniel | The Life and Adventures of Robinson Crusoe |
| 1023 | 1852 | Dickens, Charles | Bleak House |
| 766 | 1849 | Dickens, Charles | David Copperfield |
| 1400 | 1861 | Dickens, Charles | Great Expectations |
| 730 | 1838 | Dickens, Charles | Oliver Twist |
| 1661 | 1892 | Doyle, Arthur Conan | The Adventures of Sherlock Holmes |
| 2852 | 1902 | Doyle, Arthur Conan | The Hound of the Baskervilles |
| 233 | 1900 | Dreiser, Theodore | Sister Carrie: A Novel |
| 15265 | 1911 | Du Bois, W. E. B. (William Edward Burghardt) | The Quest of the Silver Fleece: A Novel |
| 145 | 1871 | Eliot, George | Middlemarch |
| 550 | 1861 | Eliot, George | Silas Marner |
| 12677 | 1914 | Ferber, Edna | Personality Plus: Some Experiences of Emma McChesney and Her Son, Jock |
| 6593 | 1749 | Fielding, Henry | History of Tom Jones, a Foundling |
| 9830 | 1922 | Fitzgerald, F. Scott (Francis Scott) | The Beautiful and Damned |
| 805 | 1920 | Fitzgerald, F. Scott (Francis Scott) | This Side of Paradise |
| 2775 | 1915 | Ford, Ford Madox | The Good Soldier |
| 2641 | 1908 | Forster, E. M. (Edward Morgan) | A Room with a View |
| 2891 | 1910 | Forster, E. M. (Edward Morgan) | Howards End |
| 4276 | 1855 | Gaskell, Elizabeth Cleghorn | North and South |
| 32 | 1915 | Gilman, Charlotte Perkins | Herland |
| 502 | 1913 | Grey, Zane | Desert Gold |
| 3457 | 1919 | Grey, Zane | The Man of the Forest |
| 711 | 1887 | Haggard, H. Rider (Henry Rider) | Allan Quatermain |
| 2166 | 1885 | Haggard, H. Rider (Henry Rider) | King Solomon's Mines |
| 27 | 1874 | Hardy, Thomas | Far from the Madding Crowd |
| 110 | 1891 | Hardy, Thomas | Tess of the d'Urbervilles: A Pure Woman |
| 77 | 1851 | Hawthorne, Nathaniel | The House of the Seven Gables |
| 33 | 1850 | Hawthorne, Nathaniel | The Scarlet Letter |
| 95 | 1894 | Hope, Anthony | The Prisoner of Zenda |
| 41 | 1820 | Irving, Washington | The Legend of Sleepy Hollow |
| 208 | 1879 | James, Henry | Daisy Miller: A Study |
| 432 | 1903 | James, Henry | The Ambassadors |
| 209 | 1898 | James, Henry | The Turn of the Screw |
| 367 | 1896 | Jewett, Sarah Orne | The Country of the Pointed Firs |
| 2807 | 1899 | Johnston, Mary | To Have and to Hold |
| 4217 | 1916 | Joyce, James | A Portrait of the Artist as a Young Man |
| 2814 | 1914 | Joyce, James | Dubliners |
| 4300 | 1922 | Joyce, James | Ulysses |
| 217 | 1913 | Lawrence, D. H. (David Herbert) | Sons and Lovers |
| 543 | 1920 | Lewis, Sinclair | Main Street |
| 215 | 1903 | London, Jack | The Call of the Wild |
| 351 | 1915 | Maugham, W. Somerset (William Somerset) | Of Human Bondage |
| 11231 | 1853 | Melville, Herman | Bartleby, the Scrivener: A Story of Wall-Street |
| 2489 | 1851 | Melville, Herman | Moby Dick; Or, The Whale |
| 45 | 1908 | Montgomery, L. M. (Lucy Maud) | Anne of Green Gables |
| 41286 | 1866 | Oliphant, Mrs. (Margaret) | Miss Marjoribanks |
| 60 | 1905 | Orczy, Emmuska Orczy, Baroness | The Scarlet Pimpernel |
| 932 | 1839 | Poe, Edgar Allan | The Fall of the House of Usher |
| 1064 | 1842 | Poe, Edgar Allan | The Masque of the Red Death |
| 4051 | 1915 | Praed, Campbell, Mrs. | Lady Bridget in the Never-Never Land: a story of Australian life |
| 3268 | 1794 | Radcliffe, Ann Ward | The Mysteries of Udolpho |
| 434 | 1908 | Rinehart, Mary Roberts | The Circular Staircase |
| 171 | 1791 | Rowson, Mrs. | Charlotte Temple |
| 271 | 1877 | Sewell, Anna | Black Beauty |
| 84 | 1823 | Shelley, Mary Wollstonecraft | Frankenstein; Or, The Modern Prometheus |
| 120 | 1883 | Stevenson, Robert Louis | Treasure Island |
| 345 | 1897 | Stoker, Bram | Dracula |
| 829 | 1726 | Swift, Jonathan | Gulliver's Travels into Several Remote Nations of the World |
| 8867 | 1918 | Tarkington, Booth | The Magnificent Ambersons |
| 599 | 1848 | Thackeray, William Makepeace | Vanity Fair |
| 76 | 1884 | Twain, Mark | Adventures of Huckleberry Finn |
| 74 | 1876 | Twain, Mark | The Adventures of Tom Sawyer |
| 1327 | 1898 | Von Arnim, Elizabeth | Elizabeth and Her German Garden |
| 238 | 1915 | Webster, Jean | Dear Enemy |
| 5230 | 1897 | Wells, H. G. (Herbert George) | The Invisible Man: A Grotesque Romance |
| 36 | 1897 | Wells, H. G. (Herbert George) | The War of the Worlds |
| 541 | 1920 | Wharton, Edith | The Age of Innocence |
| 174 | 1890 | Wilde, Oscar | The Picture of Dorian Gray |
| 2005 | 1919 | Wodehouse, P. G. (Pelham Grenville) | Piccadilly Jim |
| 16357 | 1788 | Wollstonecraft, Mary | Mary: A Fiction |
| 1245 | 1919 | Woolf, Virginia | Night and Day |

Format

The entity data is formatted in the original brat standoff annotation format and in the tab-separated layered format of https://github.com/meizhiju/layered-bilstm-crf.
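For readers unfamiliar with brat standoff, the sketch below reads text-bound annotation lines (`ID<TAB>LABEL START END<TAB>TEXT`) from an `.ann` string. It assumes the standard brat convention of character offsets into the companion `.txt` file and skips discontinuous spans; the sample text, the sample annotations, and the `PER` label are made up for illustration and are not copied from the LitBank files.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    ann_id: str  # e.g. "T1"
    label: str   # entity category
    start: int   # character offset of the first character
    end: int     # character offset one past the last character
    text: str    # surface string covered by the span


def parse_brat(ann: str) -> List[TextBound]:
    """Parse contiguous text-bound ("T") annotations from brat standoff."""
    parsed = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes, etc. are ignored here
        ann_id, span, text = line.split("\t", 2)
        if ";" in span:
            continue  # discontinuous spans are out of scope for this sketch
        label, start, end = span.split()
        parsed.append(TextBound(ann_id, label, int(start), int(end), text))
    return parsed


# Hypothetical sample, not taken from the corpus.
sample_txt = "Tom Sawyer saw the boy."
sample_ann = "T1\tPER 0 10\tTom Sawyer\nT2\tPER 15 22\tthe boy\n"

mentions = parse_brat(sample_ann)
for m in mentions:
    # Each span should reproduce its surface text when sliced from the document.
    print(m.ann_id, m.label, repr(sample_txt[m.start:m.end]))
```

Checking each span against the source text, as the loop does, is a cheap way to catch offset mismatches before training on the data.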

Commons

We welcome contributions to LitBank in the form of annotations for new texts in the public domain and suggestions for new texts to include; please contact dbamman@berkeley.edu to get involved. For more information:

Acknowledgments

This work was supported by an Amazon Research Award ("Natural Language Processing for Literary Texts").