This is an easy-to-use Python reader for the enriched WebNLG data.

How to run

python data/webnlg/reader.py [--version x.x]

--version choices: 1.4 | 1.5 (default)
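The command-line interface above can be sketched with argparse. This is a minimal illustration, not the actual contents of reader.py: the function name build_parser and the help text are assumptions; only the --version flag, its two choices, and its default come from the usage line above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Read the enriched WebNLG data."
    )
    # Only 1.4 and 1.5 are accepted; 1.5 is used when the flag is omitted.
    parser.add_argument(
        "--version",
        choices=["1.4", "1.5"],
        default="1.5",
        help="WebNLG dataset version to read",
    )
    return parser


# Parsing an explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--version", "1.4"])
print(args.version)
```

Keeping the versions as strings avoids float comparisons and lets argparse reject any other value with a clear error message.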

After running the reader, the file structure looks like this:

.
├── data
│   └── webnlg
│       ├── reader.py
│       ├── utils.py
│       ├── raw/
│       ├── test.json
│       ├── train.json
│       └── valid.json
└── README.md
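Once the three JSON files exist, they can be loaded with the standard library. The sketch below assumes only the file names and directory shown in the tree above; the helper name load_splits is hypothetical, and nothing is assumed about the structure of the JSON content itself.

```python
import json
from pathlib import Path


def load_splits(data_dir="data/webnlg", splits=("train", "valid", "test")):
    """Load each <split>.json file from data_dir into a dict keyed by split name."""
    data = {}
    for split in splits:
        path = Path(data_dir) / f"{split}.json"
        # The files are read as UTF-8; json.load returns whatever
        # top-level object the reader wrote (its schema is not assumed here).
        with path.open(encoding="utf-8") as f:
            data[split] = json.load(f)
    return data
```

For example, `load_splits()["train"]` returns the parsed content of data/webnlg/train.json when run from the repository root.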

Contributions

  1. Decomposed the WebNLG dataset from document-level into sentence-level
  2. Created an easy-to-use Python reader for the WebNLG dataset v1.5, runnable as of 2019-SEP-20. (Debugged and adapted from the reader in chimera's repo.)
  3. Manually fixed spaCy's sentence tokenization
  4. Deleted parts of sentences for which no corresponding triple exists
  5. Deleted irrelevant triples manually
  6. Manually fixed all wrong templates (e.g. template.replace('AEGNT-1', 'AGENT-1')), which makes the data convenient for template-based models
  7. Replaced - with _ in template placeholder names (e.g. AGENT-1 becomes AGENT_1), which simplifies tokenization
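The two template fixes in items 6 and 7 can be illustrated as follows. This is a sketch, not the code used to produce the data: the function name normalize_template is hypothetical, the typo table contains only the one example given above (the actual fixes were done manually), and the PATIENT and BRIDGE placeholder names are an assumption beyond the AGENT example in the text.

```python
import re

# Known placeholder typos; only the example from the list above is included.
TYPO_FIXES = {"AEGNT-1": "AGENT-1"}

# Placeholder names followed by a hyphen and an index, e.g. AGENT-1.
# PATIENT and BRIDGE are assumed placeholder names, not confirmed by the text.
PLACEHOLDER = re.compile(r"\b(AGENT|PATIENT|BRIDGE)-(\d+)\b")


def normalize_template(template):
    """Fix known placeholder typos, then rewrite NAME-n as NAME_n."""
    for wrong, right in TYPO_FIXES.items():
        template = template.replace(wrong, right)
    # Keep the name and the index, swapping only the separator.
    return PLACEHOLDER.sub(r"\1_\2", template)


print(normalize_template("AEGNT-1 was born in PATIENT-1 ."))
```

With an underscore, a tokenizer that splits on hyphens is less likely to break a placeholder such as AGENT_1 into several tokens.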

Overview of dataset

Todo