Home

Update: We are currently working on a new implementation of discogs-xml2db. The new approach is significantly faster and can be found in the speedup directory.


What is it?

This is a Python program for importing the discogs data dumps found at http://www.discogs.com/data/ into a PostgreSQL, CouchDB, or MongoDB database.

MySQL and other databases are not supported at the moment, but you are welcome to submit a patch.

discogs-xml2db works with Python 2.7 and 3.6.
It may work with 2.6 or other 3.x releases, but this is untested. It definitely does not work with 2.5.

discogs-xml2db makes use of the following modules (some are standard in 2.7, some you'll need to pip install):

Options for discogsparser.py

Examples

python discogsparser.py -n 200 -o couch --params http://127.0.0.1:5984/discogs -d 20111101
python discogsparser.py -o mongo -p mongodb://localhost,remote1/discogs discogs_20111101_artists.xml discogs_20111101_releases.xml
python discogsparser.py -o pgsql -p "host=remote1 dbname=discogs user=postgres password=s3cret" discogs_20111101_artists.xml
python discogsparser.py -o pgsql -p "host=remote1 dbname=discogs user=postgres password=s3cret" -d 20140501

How do I use it?

Start by downloading the data dumps (you can use get_latest_dump.sh to get the latest dumps).
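
If you would rather fetch one month's dumps by hand, something along these lines should work (a sketch only; the exact file layout under http://www.discogs.com/data/ may differ):

# hypothetical manual download of the 20140501 dumps; adjust the URL to match the data page
for f in artists labels masters releases; do
    wget "http://www.discogs.com/data/discogs_20140501_${f}.xml.gz"
done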

Steps to import the data-dumps into PostgreSQL:

  1. Decompress the dumps in the source directory: gunzip discogs_20140501_*.xml.gz
  2. Log in as the database administrator user if you are not already, e.g.: sudo su - postgres
  3. Create the discogs user and an empty discogs database: createuser discogs; createdb -U discogs discogs
  4. Exit from the administrator account
  5. Import the database schema: psql -U discogs -d discogs -f create_tables.sql
  6. The XML data dumps often contain control characters and do not have root tags. To fix this, run python fix-xml.py release, where release is the release date of the dump, for example 20100201.
  7. Import the data with python discogsparser.py -o pgsql -p "dbname=discogs user=discogs" -d release, where release is the release date of the dump, for example 20100201. This will take some time; the full import takes about 15 hours on my Linux server with an SSD.
  8. Run additional SQL fixes (such as removing duplicate rows): psql -U discogs discogs -f fix_db.sql
  9. Create database indexes: psql -U discogs discogs -f create_indexes.sql
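
Taken together, the steps above amount to roughly the following (a sketch assuming a local PostgreSQL server, the 20140501 dumps sitting in the source directory, and passwordless access for the discogs role):

# condensed version of steps 1-9 for the 20140501 dump
gunzip discogs_20140501_*.xml.gz
sudo -u postgres createuser discogs            # steps 2-4 without switching shells
sudo -u postgres createdb -O discogs discogs   # -O makes discogs the owner of the new database
psql -U discogs -d discogs -f create_tables.sql
python fix-xml.py 20140501
python discogsparser.py -o pgsql -p "dbname=discogs user=discogs" -d 20140501
psql -U discogs discogs -f fix_db.sql
psql -U discogs discogs -f create_indexes.sql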

To import data into MongoDB you have two choices: direct import or dumping the records to JSON and then using mongoimport. The latter is considerably faster, particularly for the initial import.

To import directly into MongoDB, specify a mongodb:// scheme, but be aware that this is slow: the initial import can run for days.

In either case you can take advantage of an option to import only the records that have changed since the previous import.

The mongo parser stores MD5 hashes of all records it has parsed and can re-use them on subsequent imports, provided you keep the .md5 files.
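
For example, a month-over-month run of the JSON route might look like this (a sketch; the dates and the /tmp/discogs location are placeholders, and the individual commands are explained below):

# first month: full dump to JSON, which also writes an .md5 file per class
python discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101
mongoimport -d discogs -c artists --ignoreBlanks /tmp/discogs/artists.json

# next month: with last month's .md5 files still in /tmp/discogs, only changed records are written
python discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111201
mongoimport -d discogs -c artists --ignoreBlanks --upsert --upsertFields 'id' /tmp/discogs/artists.json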

To perform a direct import:

discogsparser.py -i -o mongo -p "mongodb://localhost/discogs?uniq=md5" -d 20111101

The JSON dump route requires that you specify a file:// scheme and a location where the intermediate files are to be stored (you'll need space - these files are about the same size as the original XMLs):

$ discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101
# this results in two files created for each class, e.g. an artists.json file and an artists.md5 file

$ mongoimport -d discogs -c artists --ignoreBlanks artists.json
$ mongoimport -d discogs -c labels --ignoreBlanks labels.json
$ mongoimport -d discogs -c masters --ignoreBlanks masters.json
$ mongoimport -d discogs -c releases --ignoreBlanks releases.json

# use the mongo command to connect to the database and create the indexes you need, the ids at a minimum
# but you'll probably want l_name as well
$ mongo discogs
> db.artists.ensureIndex({id:1}, {background:true,unique:true})
> db.artists.ensureIndex({l_name:1}, {background:true})
> db.releases.ensureIndex({id:1}, {background:true,unique:true})
> db.releases.ensureIndex({l_title:1, l_artist:1}, {background:true, unique:true})
# etc

# now import the next month using --upsert:
$ mongoimport -d discogs -c artists --ignoreBlanks --upsert --upsertFields 'id' artists.json

To give you an idea of sizes, the November 1st, 2011 file sizes are: artists (2,149,473 records) - XML: 417MB, JSON: 554MB; labels (275,065) - XML: 56MB, JSON: 70MB; releases (2,779,084) - XML: 7.5GB, JSON: 6.1GB.

The December 1st XMLs are a bit bigger, e.g. 422MB for artists. However, if you imported the November XMLs with ?uniq=md5, the December JSON files are: artists - 18MB (48,852 records changed from November), labels - 3.2MB (11,997 records), releases - 256MB (98,287).

Credits

Original project: discogs-sql-importer

Some sort of changelog