Reddit Scraper
A Python Reddit scraper built on PRAW and Pushshift.
This script allows you to:
- Download a list of posts;
- Download a list of subreddits;
- Make arbitrary API calls to Pushshift to build more refined datasets.
The usage should be self-explanatory. The only thing you need to know is that you will need an API key from Reddit: copy it into `reddit_config.sample.py`, then rename the file to `reddit_config.py`.
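For reference, here is a minimal sketch of what the renamed `reddit_config.py` might contain, assuming it holds the usual PRAW "script app" credentials; the actual variable names expected by the script are in `reddit_config.sample.py` and may differ:

```python
# reddit_config.py -- hypothetical sketch; check reddit_config.sample.py
# for the actual variable names the script expects.

# Credentials from https://www.reddit.com/prefs/apps (create a "script" app).
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"
USER_AGENT = "reddit-scraper/1.0 by your-username"
```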
```
usage: reddit-scraper.py [-h]
                         (--posts POSTS_FILE | --subs SUBS_FILE | --config CONFIG_FILE)
                         [--start START_DATE] [--end END_DATE]
                         --output OUTPUT_FOLDER [--blacklist BLACKLIST_FILE]
                         [--workers NUM_WORKERS]

Scrapes subreddits and puts their content in a plain text file. Use with
--posts to download posts, --subs to download subreddits, and --config to
make custom Pushshift API calls.

optional arguments:
  -h, --help            show this help message and exit
  --posts POSTS_FILE    A file containing the list of posts to scrape, one
                        per line.
  --subs SUBS_FILE      A file containing the list of subreddits to scrape,
                        one per line.
  --config CONFIG_FILE  A file containing the arguments for the Pushshift
                        APIs.
  --start START_DATE    The date to start parsing from, in YYYY-MM-DD format.
  --end END_DATE        The final date of the parsing, in YYYY-MM-DD format.
  --output OUTPUT_FOLDER
                        The output folder.
  --blacklist BLACKLIST_FILE
                        A file containing the lines to skip.
  --workers NUM_WORKERS
                        Number of parallel scraper workers.
```
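To give an idea of what happens behind the `--config` mode, here is a minimal sketch of a raw Pushshift submission search. The endpoint and parameters (`subreddit`, `after`, `before`, `size`) are standard Pushshift; how the script maps `CONFIG_FILE` onto such a call is an assumption:

```python
# Minimal sketch of a raw Pushshift submission search; the script's
# --config mode presumably builds a request like this from CONFIG_FILE.
from datetime import datetime, timezone

import requests

def to_epoch(date_str: str) -> int:
    """Convert a YYYY-MM-DD date (as used by --start/--end) to a Unix timestamp."""
    return int(datetime.strptime(date_str, "%Y-%m-%d")
               .replace(tzinfo=timezone.utc).timestamp())

params = {
    "subreddit": "python",            # example filter; any Pushshift parameter works
    "after": to_epoch("2019-01-01"),  # --start
    "before": to_epoch("2019-01-02"), # --end
    "size": 100,
}
resp = requests.get("https://api.pushshift.io/reddit/search/submission/",
                    params=params, timeout=30)
for post in resp.json()["data"]:
    print(post["id"], post.get("title", ""))
```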
Examples:

- Download all posts in the subreddits specified in `subreddits.txt`, from January 1, 2015 to December 31, 2016, using 8 parallel processes, save them in `scraped/`, and ignore the lines defined in `blacklist.txt`:

  ```
  python reddit-scraper.py --subs subreddits.txt --output scraped --start 2015-01-01 --end 2016-12-31 --workers 8 --blacklist blacklist.txt
  ```

- Download the posts specified in `posts.txt` and save them in `scraped/`:

  ```
  python reddit-scraper.py --posts posts.txt --output scraped
  ```

- Use the Pushshift API to look for posts on Reddit, using the parameters provided in `config.default.txt`, from January 1, 2019 to January 2, 2019, using 8 parallel processes, and save them in `scraped/`:

  ```
  python reddit-scraper.py --config config.default.txt --output scraped --start 2019-01-01 --end 2019-01-02 --workers 8
  ```
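The `--workers` flag suggests that the scraper fans the work out over a process pool. A rough sketch of that pattern, assuming one task per subreddit; the actual partitioning inside `reddit-scraper.py` may differ:

```python
# Rough sketch of the --workers fan-out, assuming a multiprocessing.Pool
# with one task per subreddit; the real script's partitioning may differ.
from multiprocessing import Pool

def scrape_subreddit(name: str) -> str:
    # Placeholder for the real per-subreddit scraping logic.
    return f"scraped {name}"

if __name__ == "__main__":
    with open("subreddits.txt") as f:
        subs = [line.strip() for line in f if line.strip()]
    with Pool(processes=8) as pool:  # --workers 8
        for result in pool.imap_unordered(scrape_subreddit, subs):
            print(result)
```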