# Attributed Question Answering
We propose Attributed Question Answering (QA) as a key first step in the development of attributed LLMs. There are multiple possible motivations for studying Attributed QA:
- It is perhaps the simplest possible information-seeking application, and as such it is relatively straightforward to evaluate Attributed QA systems.
- In spite of its simplicity, models and experiments for Attributed QA are likely to be highly informative to the general goal of building attributed LLMs.
- Attributed QA is an interesting task that has advantages over existing approaches to the evaluation of QA systems.
In Attributed QA, the input is a question, and the output is an (answer, attribution) pair, where `answer` is an answer string and `attribution` is a pointer into a fixed underlying corpus. The attribution should give supporting evidence for the answer; for example, it should satisfy the conditions of AIS.
For example, given the question "what is the population of st petersburg fl", possible outputs include:
Answer | Attribution |
---|---|
244,769 | According to the 2010 census, the city contained 244,769 people, making St. Petersburg the largest city in Pinellas County, and 129,401 households. The population density was 3,964.4 per square mile (1530.7/km2). [Title: St. Petersburg, Florida Section: Demographics, 2010 Census] |
263,768 | St. Petersburg, Florida is the fifth largest city in Florida with a population of 263,768 as of 2017. The city is home to 74 completed high rises (as of 2018), and the most notable are the One St. Petersburg, Priatek Plaza and Signature Place skyscrapers. [Title: List of tallest buildings in St. Petersburg, Florida] |
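To make the task interface concrete, here is a minimal Python sketch (not part of the release; the class and function names are hypothetical) of the input/output contract described above:

```python
# Hypothetical sketch of the Attributed QA interface: question in,
# (answer, attribution) pair out. Names here are illustrative only.
from dataclasses import dataclass


@dataclass
class AttributedAnswer:
    answer: str       # an answer string, e.g. "244,769"
    attribution: str  # identifier of a passage in the fixed Wikipedia corpus


def answer_question(question: str) -> AttributedAnswer:
    """An Attributed QA system maps a question to an (answer, attribution) pair."""
    raise NotImplementedError  # system-specific (retrieve-then-read, post-hoc attribution, etc.)
```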
We include both data and code to support research into Attributed QA in this release. If you use this in your work, please cite our paper.

    @misc{https://doi.org/10.48550/arxiv.2212.08037,
      doi = {10.48550/ARXIV.2212.08037},
      url = {https://arxiv.org/abs/2212.08037},
      author = {Bohnet, Bernd and Tran, Vinh Q. and Verga, Pat and Aharoni, Roee and Andor, Daniel and Soares, Livio Baldini and Ciaramita, Massimiliano and Eisenstein, Jacob and Ganchev, Kuzman and Herzig, Jonathan and Hui, Kai and Kwiatkowski, Tom and Ma, Ji and Ni, Jianmo and Saralegui, Lierni Sestorain and Schuster, Tal and Cohen, William W. and Collins, Michael and Das, Dipanjan and Metzler, Donald and Petrov, Slav and Webster, Kellie},
      title = {Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models},
      publisher = {arXiv},
      year = {2022},
      copyright = {Creative Commons Attribution 4.0 International}
    }
## Data
### Attribution Corpus
For comparability between systems, our paper standardizes the collection of allowable attributions to a provided scrape of Wikipedia, taken on 2021-10-13 and processed with Pyserini. Access to this data is via Google Cloud. You will need to unzip the downloaded file and note the location of the unzipped files, for use as the `--wikipedia_glob` argument in the evaluation script.
An example passage from the corpus:

« St. Elsewhere » « St. Elsewhere, Episodes, "Their Town" » In a somewhat change-of-pace episode, Drs. Craig and Novino, Ellen Craig, and Lizzie Westphall visit Donald and Tommy Westphall (Lizzie's father and brother, respectively), who appear to be enjoying the quiet life in small town New Hampshire. The episode features Dr. Westphall occasionally breaking the fourth wall and speaking directly to the viewer, a la the "Stage Manager" character in Our Town (the episode title and its location are nods to the Thornton Wilder play).
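As an illustration, here is a hedged Python sketch of iterating over the unzipped corpus files matched by `--wikipedia_glob`. The file layout is an assumption (JSON-lines records with `id` and `contents` fields, a layout Pyserini commonly uses); adjust it to the actual layout of the release.

```python
# Hedged sketch: walk the files matched by --wikipedia_glob and yield passages.
# ASSUMPTION: each file is JSON-lines with "id" and "contents" fields (a common
# Pyserini layout); the real release may differ.
import glob
import json


def iter_passages(wikipedia_glob):
    for path in sorted(glob.glob(wikipedia_glob)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["id"], record["contents"]


# Example (path is illustrative only):
# for passage_id, text in iter_passages("wikipedia/*.jsonl"):
#     ...
```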
### System Evaluation
Human and automatic evaluation of system output is provided in `ratings.csv`:
#### Dataset Statistics

Statistic | Value |
---|---|
Dataset size | 67.7MB |
Number of instances | 83,030 (3610 examples x 23 systems) |
Number of fields | 8, described below |
Human labels | 23,000 (1000 examples x 23 systems) |
Automatic labels | 83,030 (3610 examples x 23 systems) |
#### Dataset Structure

The file has a header row; its columns are as follows:
Field Name | Type | Description | Example |
---|---|---|---|
system_name | string | System identifier from Tables 1, 2, and 3 of the paper. | Post-4 |
question | string | A question from the development set of OpenNQ | who played hyde in league of extraordinary gentlemen |
answer | string | The system-generated answer span | Jason Flemyng |
attribution | string | An identifier from the Attribution Corpus | http://en.wikipedia.org/wiki/Jason_Flemyng#Jason_Flemyng#Television_and_film_work#2 |
passage | string | The corresponding passage from the Attribution Corpus formatted by evaluation.py | Title: Jason Flemyng Section: Television and film work In the early 2000s he featured in two big-budget Hollywood films which were adaptations of Alan Moore comic books; as John Netley in 2001's From Hell, with Johnny Depp, and 2003's The League of Extraordinary Gentlemen, with Sean Connery, in which Flemyng played Dr. Henry Jekyll and Edward Hyde. The latter film was a disappointment, but Flemyng commented that: ""It was a bit of a nightmare... the film cost a fortune and didn't make back the money it was meant to... But I still get a huge kick out of doing films like that and From Hell. Any day you walk onto a set and Sean Connery or Johnny Depp or Brad Pitt is there has to be a good day. |
human_rating | Y / N | The attribution decision from human rating, according to 5-way annotation | Y |
auto_ais | Y / N | The attribution decision of AutoAIS: Y for attributable (nli_score greater than 0.5), otherwise N. | Y |
nli_score | float | The entailment score of the AutoAIS model | 0.9814687 |
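As an illustration of how these fields might be used, here is a hedged pandas sketch (not part of the release) that compares AutoAIS decisions to human ratings per system; it assumes rows without a human label leave the human_rating field blank, so that it is read as NaN:

```python
# Sketch: per-system comparison of human ratings and AutoAIS decisions in ratings.csv.
# ASSUMPTION: rows lacking a human label have an empty human_rating field (read as NaN).
import pandas as pd

ratings = pd.read_csv("ratings.csv")

# Keep only the rows that carry a human judgment (1000 examples per system).
rated = ratings.dropna(subset=["human_rating"])

# Fraction of attributions judged attributable (Y) by humans, per system.
human_ais = rated.groupby("system_name")["human_rating"].apply(lambda s: (s == "Y").mean())

# How often the AutoAIS decision agrees with the human decision, per system.
agreement = (rated["human_rating"] == rated["auto_ais"]).groupby(rated["system_name"]).mean()

print(pd.DataFrame({"human_ais": human_ais, "autoais_agreement": agreement}))
```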
#### Languages
This release is in English.
## Evaluation Script
Automatic evaluation of Attributed QA performs AutoAIS and SQuAD EM scoring over an input predictions .csv (provided via the `--predictions_file` argument). The file has a header row; its columns are:

- `question`: a question from NQ Open (questions which do not match will be discarded from evaluation)
- `answer`: an output string
- `attribution`: an index from the Attribution Corpus (indexes which do not match will be discarded from evaluation)

For example:

`who played hyde in league of extraordinary gentlemen,Jason Flemyng,http://en.wikipedia.org/wiki/Jason_Flemyng#Jason_Flemyng#Television_and_film_work#2`
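As an illustration, here is a short Python sketch that writes a predictions file in this format (the output file name is arbitrary):

```python
# Sketch: write a predictions CSV with the three columns expected by --predictions_file.
import csv

predictions = [
    {
        "question": "who played hyde in league of extraordinary gentlemen",
        "answer": "Jason Flemyng",
        "attribution": "http://en.wikipedia.org/wiki/Jason_Flemyng#Jason_Flemyng#Television_and_film_work#2",
    },
]

with open("predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer", "attribution"])
    writer.writeheader()
    writer.writerows(predictions)
```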
The results in the paper are for short-answer-seeking queries in Natural Questions. `evaluation.py` supports analysis of the validation set of Open-NQ, for which we use the `natural_questions_open:1.0.0` version from `tensorflow_datasets`.
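For reference, this question set can be loaded as in the following sketch (assuming `tensorflow_datasets` is installed):

```python
# Sketch: load the Open-NQ validation split used for evaluation.
import tensorflow_datasets as tfds

ds = tfds.load("natural_questions_open:1.0.0", split="validation")
for example in ds.take(3):
    print(example["question"].numpy().decode("utf-8"))
```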
The output of `evaluation.py` is two files, specified by the arguments:

- `--scores_file`: a short file with evaluation scores to report. To compare to our paper, see especially AutoAIS and SQuAD (em).
- `--ais_output_file`: a table with rows of question and answer (strings from `--predictions_file`) and passage (the retrieved passage string from the Attribution Corpus, formatted for human assessment of AIS). An autoais column gives the automatic judgment of attribution, Y or N.
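Putting the pieces together, a run might look like the following sketch. The flags are the ones described above, but the exact flag syntax and the file paths are assumptions; check the script's usage for the authoritative interface.

```python
# Sketch: invoke the evaluation script and print the reported scores.
# ASSUMPTIONS: flag syntax ("--flag=value") and file paths are illustrative only.
import subprocess

subprocess.run(
    [
        "python", "evaluation.py",
        "--predictions_file=predictions.csv",
        "--wikipedia_glob=wikipedia/*",
        "--scores_file=scores.txt",
        "--ais_output_file=ais_output.csv",
    ],
    check=True,
)

with open("scores.txt") as f:
    print(f.read())  # includes the AutoAIS and SQuAD (em) numbers to report
```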