Home

Awesome

whtranscripts — White House Transcript Fetcher/Parser

whtranscripts helps you fetch and parse transcripts from the American Presidency Project's press-briefing and presidential-news-conference transcripts.

Installation

whtranscripts is a Python library. To install it, run:

pip install whtranscripts

Downloading Transcripts

To download the HTML of all news-conference transcripts:

mkdir ~/Downloads/conference-transcripts
python -m "whtranscripts.download" -t conference --dest ~/Downloads/conference-transcripts/

For press-briefings:

mkdir ~/Downloads/another-dir
python -m "whtranscripts.download" -t briefing --dest ~/Downloads/another-dir/

You can also limit downloads to a particular year-range, e.g., from 2001 through 2008:

python -m "whtranscripts.download" -t conference --dest ~/Downloads/conference-transcripts/ --start 2001 --end 2008

Parsing Transcripts

You can load single transcripts from a file, URL, or the HTML itself. From a file:

import whtranscripts
transcript = whtranscripts.Conference.from_path("test/pages/conferences/99975.html")

Alternatively, for a briefing:

import whtranscripts
transcript = whtranscripts.Briefing.from_path("test/pages/briefings/47646.html")

From a URL:

import whtranscripts
url = "http://www.presidency.ucsb.edu/ws/index.php?pid=99975"
transcript = whtranscripts.Conference.from_url(url)

Directly from American Presidency Project HTML:

import whtranscripts
import requests
url = "http://www.presidency.ucsb.edu/ws/index.php?pid=99975"
html = requests.get(url).content
transcript = whtranscripts.Conference(html)

You can also load multiple at once, from a directory:

import whtranscripts
transcripts = whtranscripts.Conference.from_dir("test/pages/conferences")

Note: The files you want to parse from directory must end in .html

Analyzing Transcripts

Each Conference and Briefing has the following attributes:

Each Passage object has the following attributes:

Each Passage object also has the following methods:

Exporting Transcripts

You can export transcripts as CSVs, using the TranscriptSet class:

import whtranscripts
urls = whtranscripts.download.get_urls("conference", 2013, 2013)
transcripts = map(whtranscripts.conference.Conference.from_url, urls)
t_set = whtranscripts.TranscriptSet(transcripts)
t_set.to_csv(sys.stdout)