Awesome
Update: We are currently working on a new implementation of discogs-xml2db
.
The new aproach is significantly faster and can be found in the speedup directory.
What is it?
This is a python program for importing the discogs data dumps found at http://www.discogs.com/data/ into PostgreSQL, CouchDB, or MongoDB database.
MySQL or other databases are not supported at the moment, but you are welcome to submit a patch.
discogs-xml2db works with Python 2.7 and 3.6.
It may work with 2.6 or 3.x, but probably not (I don't know, I didn't test).
It definitely doesn't work with 2.5.
discogs-xml2db makes use of the following modules
(some are standard in 2.7, some you'll need to pip install
):
- xml.sax - for handling the source XML files
- argparse - for parsing the command line arguments
- json - to save files in JSON format or talk to some back-ends
- couchdb - if you use the CouchDB back-end (probably not)
- pymongo - if you use the MongoDB back-end (best option)
- psycopg2 - for the PostgreSQL back-end (not your best option)
Options for discogsparser.py
- Input:
-d
/--date
parses all three files (artists, labels, masters, releases) for a given monthly dump:discogsparser.py -d 20111101
will look fordiscogs_20111101_artists.xml
,discogs_20111101_labels.xml
,discogs_20111101_masters.xml
, anddiscogs_20111101_releases.xml
in the current directory;
- Input: parse only specific file(s):
discogsparser.py /tmp/discogs_20111101_artists.xml /tmp/discogs_20111101_releases.xml
- will only parse the artist and release dumps from the/tmp
directory;
- Input:
-i
/--ignore-unknown-tags
: ignores new fields that may appear in the XML as the dump format evolvesdiscogsparser.py -i
- will display the unknown tags at the end of parsing each file, e.g.:Encountered some unknown Release tags: [u'data_quality', u'videos', u'video', u'identifiers', u'identifier']
- Output:
-o json
dumps records as JSON to the console:discogsparser.py -o json discogs_20111101_artists.xml
;
- Output:
-n 20
only process a given number of recordsdiscogsparser.py -n 20 -o json -d 20111101
will dump 20 records from each of the three monthly dumps as JSON to the console;
- Output:
-o
/--output
for output format (aka exporter) and-p
/--params
for parameters to the specific exporter-o json
, no-p
: dumps JSON to the console;-o pgsql -p "connection string"
: exports into a PostgreSQL database. See The psycopg2 module content for connection string documentation.-o couch -p "couch URI"
: exports to a CouchDB server running on localhost on port 5984 into a database nameddiscogs
;-o mongo -p "mongodb://localhost/discogs"
: connects, withuser
andpass
, to a MongoDB server running on localhost, and into a database nameddiscogs
. See Standard Connection String Format in the MongoDB docs.-o mongo -p "file:///path/to/dir/"
: outputs each of the Artists, Labels, Masters, Releases into a separate JSON file into the specified directory,/path/to/dir/
in this case, one line for each. Pass--ignoreblanks
tomongoimport
in case extra new-lines are added; you probably also want--upsert --upseftFields id
.
- Output:
-q
/--quality
- imports only items with the specified data_quality. Takes in a comma-separated list of values for multiple entries. Valid values: 'Needs Vote', 'Complete And Correct', 'Correct', 'Needs Minor Changes', 'Needs Major Changes', 'Entirely Incorrect', 'Entirely Incorrect Edit'.discogsparser.py -q 'Complete And Correct,Correct,Needs Minor Changes'
Examples
python discogsparser.py -n 200 -o couch --params http://127.0.0.1:5984/discogs -d 20111101
python discogsparser.py -o mongo -p mongodb://localhost,remote1/discogs discogs_20111101_artists.xml discogs_20111101_releases.xml
python discogsparser.py -o pgsql -p "host=remote1 dbname=discogs user=postgres password=s3cret" discogs_20111101_artists.xml
python discogsparser.py -o pgsql -p "host=remote1 dbname=discogs user=postgres password=s3cret" -d 20140501
How do I use it?
Start by downloading the data dumps (you can use get_latest_dump.sh
to get the latest dumps).
Steps to import the data-dumps into PostgreSQL:
- Unzip the dumps to the source directory:
gunzip discogs_20140501_*.xml.gz
- Login as database administrator user if not already, i.e:
sudo su - postgres
- Create discogs user and empty discogs database
createuser discogs; createdb -U discogs discogs
- Exit from administrator account
- Import the database schema:
psql -U discogs -d discogs -f create_tables.sql
- The XML data dumps often contain control characters and do not have root tags.
To fix this run
python fix-xml.py release
, where release is the release date of the dump, for example20100201
. - Import the data with
python discogsparser.py -o pgsql -p "dbname=discogs user=discogs" -d release
, where release is the release date of the dump, for example20100201
, this will take some time, for example takes 15 hours on my linux server with SSD - Run additional Sql fixes (such as removing duplicate rows):
psql -U discogs discogs -f fix_db.sql
- Create Database indexes:
psql -U discogs discogs -f create_indexes.sql
To import data into MongoDB you have two choices: direct import or dumping the records to JSON and then using mongoimport
.
The latter is considerably faster, particularly for the initial import.
To import directly into MongoDB, specify a mongodb://
scheme,
but be aware that the process is not overly quick.
You might find yourself running the initial import for days.
The JSON dump method is considerably faster, yet in either case you could take advantage of an option to import only the records that have changed from the previous import.
The mongo parser will store MD5 hashes of all records it parsed and it can re-use
these hashes on subsequent imports, provided you keep the .md5
files.
To perform a direct import:
discogsparser.py -i -o mongo -p "mongodb://localhost/discogs?uniq=md5" -d 20111101
The JSON dump route requires that you specify a file://
scheme and a location
where the intermediate files are to be stored
(you'll need space - these files are about the same size as the original XMLs):
$ discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101
# this results in 2 files creates for each class, e.g. an artists.json file and an artists.md5 file
$ mongoimport -d discogs -c artists --ignoreBlanks artists.json
$ mongoimport -d discogs -c labels --ignoreBlanks labels.json
$ mongoimport -d discogs -c masters --ignoreBlanks masters.json
$ mongoimport -d discogs -c releases --ignoreBlanks releases.json
# use the mongo command to connect to the database and create the indexes you need, the ids at a minimum
# but you'll probably want l_name as well
$ mongo discogs
> db.artists.ensureIndex({id:1}, {background:true,unique:true})
> db.artists.ensureIndex({l_name:1}, {background:true})
> db.releases.ensureIndex({id:1}, {background:true,unique:true})
> db.releases.ensureIndex({l_title:1, l_artist:1}, {background:true, unique:true})
# etc
# now import the next month using --upsert:
$ mongoimport -d discogs -c artists --ignoreBlanks --upsert --upsertFields 'id' artists.json
To give you an idea of sizes, the November 10th, 2011 file sizes are: artists (2,149,473 records) - XML: 417MB, JSON: 554M; labels (275,065) - XML: 56MB, JSON: 70M; releases (2,779,084) - XML: 7.5GB, JSON: 6.1GB.
The December 1st XMLs are a bit bigger, e.g. 422MB for artists.
However, if you imported the November XMLs with ?uniq=md5
, the December JSON files are:
artists - 18MB (48,852 records changed from November),
labels - 3.2MB (11,997 records),
releases - 256MB (98,287).
Credits
Original project: discogs-sql-importer
Some sort of changelog
- v0.80 - now importing masters
- v0.70 - Ids as primary identifiers for artists, labels, and releases (not FKs though).
MongoDB can now do diff imports when using the
?uniq=md5
option. - v0.60 - support for CouchDB and MongoDB
- v0.50 - command line parameters controlling various import options
- v0.15 - Original import of discogs-sql-importer