Awesome
This is an easy-to-use Python reader for the enriched WebNLG data.
How to run
python data/webnlg/reader.py [--version x.x]
`--version` choices: 1.4 | 1.5 (default)
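As a rough illustration of how such a flag is typically declared, here is a minimal `argparse` sketch. It is not the code in `reader.py`: the function name `build_parser` and the help text are hypothetical, and it assumes the "(default)" marker above refers to 1.5.

```python
import argparse


def build_parser():
    """Build a parser with a --version flag (hypothetical helper, not from reader.py)."""
    parser = argparse.ArgumentParser(
        description="Read the enriched WebNLG data (illustrative sketch)."
    )
    parser.add_argument(
        "--version",
        choices=["1.4", "1.5"],  # the two choices listed in this README
        default="1.5",           # assumption: "(default)" refers to 1.5
        help="WebNLG dataset version to read",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["--version", "1.4"])
print(args.version)
```

Passing any value outside the listed choices makes `argparse` exit with a usage error, which is usually the desired behavior for a fixed set of dataset versions.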
The resulting file structure looks like this:
.
├── data
│   └── webnlg
│       ├── reader.py
│       ├── utils.py
│       ├── raw/
│       ├── test.json
│       ├── train.json
│       └── valid.json
└── README.md
Contributions
- Decomposed the WebNLG dataset from document level to sentence level.
- Created an easy-to-use Python reader for WebNLG dataset v1.5, runnable as of 2019-SEP-20. (Debugged and adapted from the reader in chimera's repo.)
- Manually fixed spaCy's sentence tokenization.
- Deleted parts of sentences where no corresponding triple exists.
- Manually deleted irrelevant triples.
- Manually fixed all wrong templates (e.g. `template.replace('AEGNT-1', 'AGENT-1')`), which makes the data convenient for template-based models.
- Carefully replaced `-` with `_` in template names, such as `AGENT-1` to `AGENT_1`. This makes tokenization easier.
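The two template fixes above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: `TYPO_FIXES`, `PLACEHOLDER`, and `normalize_template` are hypothetical names, and the set of placeholder prefixes (AGENT, PATIENT, BRIDGE) is assumed from the ner vocabulary listed in this README.

```python
import re

# Hypothetical typo table; only the AEGNT example comes from this README.
TYPO_FIXES = {"AEGNT": "AGENT"}

# Match placeholders such as AGENT-1 or PATIENT-3 (prefixes are an assumption).
PLACEHOLDER = re.compile(r"\b(AGENT|PATIENT|BRIDGE)-(\d+)\b")


def normalize_template(template: str) -> str:
    """Fix known placeholder typos, then turn NAME-N into NAME_N."""
    for wrong, right in TYPO_FIXES.items():
        template = template.replace(wrong, right)
    # Only hyphens inside placeholders are replaced; ordinary words are untouched.
    return PLACEHOLDER.sub(r"\1_\2", template)


print(normalize_template("AEGNT-1 is a well-known PATIENT-2 ."))
```

Restricting the substitution to the placeholder pattern, instead of replacing every hyphen, keeps hyphenated words in the sentence text intact.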
Overview of dataset
- Dataset sizes: train 24526, valid 3019, test 6622
- Vocab of entities: 3227
- Vocab of ner: 12 (`['agent_1', 'bridge_1', 'bridge_2', 'bridge_3', 'bridge_4', 'patient_1', 'patient_2', 'patient_3', 'patient_4', 'patient_5', 'patient_6', 'patient_7']`)
- Vocab of relations: 726
- Vocab of txt: 6671
- Vocab of tgt: 1897
- Len(tgt): avg 11.5, max 42
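Statistics of this kind can be computed along the following lines. This is only a sketch on toy data: the function name `target_stats` is hypothetical, whitespace tokenization is an assumption, and the layout of the generated JSON files is not assumed here.

```python
from collections import Counter


def target_stats(targets):
    """Return vocabulary size, average length, and max length of target strings.

    Assumes each target is already tokenized and joined by single spaces.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    lengths = [len(t.split()) for t in targets]
    vocab = Counter(tok for t in targets for tok in t.split())
    return {
        "vocab_size": len(vocab),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Toy targets, invented for illustration only.
toy_targets = [
    "AGENT_1 was born in PATIENT_1 .",
    "AGENT_1 is in PATIENT_1 .",
]
print(target_stats(toy_targets))
```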
Todo
- "was selected by NASA" is a relation that spans several words; it should be merged into a single token in the triple.
- "(workedAt," is a relation that contains punctuation; it should be cleaned.
- There are still several hundred dirty, unaligned (stripleset, template) pairs. Align them by tracking the `self.cnt_dirty_data` variable when running `reader.py`.
- Fix spelling errors such as 'discrimina-tive training'.
- Fix alignment errors, which can be located with `grep -nriF '<sentence ID="3"/>' '<sentence ID="2"/>'`
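The first two items could be approached with a small cleaning step like the one below. It is a sketch of one possible fix, not something `reader.py` currently does: `clean_relation` is a hypothetical name, and joining the words with underscores is an assumed convention.

```python
import re


def clean_relation(relation: str) -> str:
    """Strip surrounding punctuation and merge a multi-word relation into one token."""
    # Remove non-word characters at the start and end, e.g. "(" and ",".
    stripped = re.sub(r"^\W+|\W+$", "", relation)
    # Join the remaining words with underscores (assumed convention).
    return "_".join(stripped.split())


print(clean_relation("was selected by NASA"))
print(clean_relation("(workedAt,"))
```

Only leading and trailing punctuation is removed, so any punctuation inside a relation name would still need separate handling.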