Neural-Wikipedian

This repository contains the code and the datasets of the work that has been submitted as a research paper to the Journal of Web Semantics. The work focuses on how an adaptation of the encoder-decoder framework can be used to generate textual summaries for Semantic Web triples.

For a detailed description of the work presented in this repository, please refer to the preprint version of the submitted paper at: https://arxiv.org/abs/1711.00155.

Datasets

In order to train our proposed models, we built two datasets of aligned knowledge base triples with text.

In a Unix shell environment, run sh download_datasets.sh to download both datasets and uncompress them into their corresponding folders (i.e. D1 and D2). Each dataset folder consists of three different sub-folders.

Inspect-Dataset.ipynb is an IPython Notebook, written in Python, that allows easier inspection of the above aligned datasets. The notebook also provides detailed information regarding the structure of the intermediate parts in D1/data/ and D2/data/ and the functionality of the supporting files in D1/utils/ and D2/utils/.

The table below presents the distribution of the 10 most common predicates and entities in our two datasets. The upper half of the table (dbo: and dbr: identifiers) corresponds to D1 and the lower half (wikidata: identifiers) to D2.

<table> <thead> <tr> <td><b>Predicates In Triples</b></td> <td align="center">%</td> <td><b>Entities In Triples</b></td> <td align="center">%</td> <td><b>Entities In Summaries</b></td> <td align="center">%</td> </tr> </thead> <tr> <td><tt>dbo:birthDate</tt></td> <td align="center">12.43</td> <td><tt>dbr:United_States</tt></td> <td align="center">0.49</td> <td><tt>dbr:United_States</tt></td> <td align="center">2.82</td> </tr> <tr> <td><tt>dbo:birthPlace</tt></td> <td align="center">10.67</td> <td><tt>dbr:England</tt></td> <td align="center">0.19</td> <td><tt>dbr:Actor</tt></td> <td align="center">2.14</td> </tr> <tr> <td><tt>dbo:careerStation</tt></td> <td align="center">5.47</td> <td><tt>dbr:United_Kingdom</tt></td> <td align="center">0.14</td> <td><tt>dbr:Association_football</tt></td> <td align="center">1.02</td> </tr> <tr> <td><tt>dbo:deathDate</tt></td> <td align="center">5.11</td> <td><tt>dbr:France</tt></td> <td align="center">0.14</td> <td><tt>dbr:Politician</tt></td> <td align="center">0.97</td> </tr> <tr> <td><tt>dbo:occupation</tt></td> <td align="center">5.06</td> <td><tt>dbr:Canada</tt></td> <td align="center">0.12</td> <td><tt>dbr:Singing</tt></td> <td align="center">0.90</td> </tr> <tr> <td><tt>dbo:team</tt></td> <td align="center">4.18</td> <td><tt>dbr:India</tt></td> <td align="center">0.11</td> <td><tt>dbr:United_Kingdom</tt></td> <td align="center">0.59</td> </tr> <tr> <td><tt>dbo:deathPlace</tt></td> <td align="center">3.51</td> <td><tt>dbr:Actor</tt></td> <td align="center">0.10</td> <td><tt>dbr:England</tt></td> <td align="center">0.58</td> </tr> <tr> <td><tt>dbo:genre</tt></td> <td align="center">3.22</td> <td><tt>dbr:Italy</tt></td> <td align="center">0.10</td> <td><tt>dbr:Writer</tt></td> <td align="center">0.53</td> </tr> <tr> <td><tt>dbo:associatedBand</tt></td> <td align="center">2.85</td> <td><tt>dbr:London</tt></td> <td align="center">0.10</td> <td><tt>dbr:Canada</tt></td> <td align="center">0.50</td> </tr> <tr> 
<td><tt>dbp:associatedMusicalArtist</tt></td> <td align="center">2.85</td> <td><tt>dbr:Japan</tt></td> <td align="center">0.09</td> <td><tt>dbr:France</tt></td> <td align="center">0.49</td> </tr> <tr> <td colspan="6"></td> </tr> <thead> <tr> <td><b>Predicates In Triples</b></td> <td align="center">%</td> <td><b>Entities In Triples</b></td> <td align="center">%</td> <td><b>Entities In Summaries</b></td> <td align="center">%</td> </tr> </thead> <tr> <td><tt>wikidata:P569</tt><br/> (date of birth)</td> <td align="center">14.15</td> <td><tt>wikidata:Q5</tt><br/> (human)</td> <td align="center">3.96</td> <td><tt>wikidata:Q30</tt><br/> (United States of America)</td> <td align="center">3.20</td> </tr> <tr> <td><tt>wikidata:P106</tt><br/> (occupation)</td> <td align="center">11.63</td> <td><tt>wikidata:Q6581097</tt><br/> (male)</td> <td align="center">3.27</td> <td><tt>wikidata:Q33999</tt><br/> (actor)</td> <td align="center">1.56</td> </tr> <tr> <td><tt>wikidata:P31</tt><br/> (instance of)</td> <td align="center">8.29</td> <td><tt>wikidata:Q30</tt><br/> (United States of America)</td> <td align="center">1.13</td> <td><tt>wikidata:Q82955</tt><br/> (politician)</td> <td align="center">1.02</td> </tr> <tr> <td><tt>wikidata:P21</tt><br/> (sex or gender)</td> <td align="center">7.92</td> <td><tt>wikidata:Q6581072</tt><br/> (female)</td> <td align="center">0.70</td> <td><tt>wikidata:Q21</tt><br/> (England)</td> <td align="center">0.87</td> </tr> <tr> <td><tt>wikidata:P570</tt><br/> (date of death)</td> <td align="center">7.58</td> <td><tt>wikidata:Q145</tt><br/> (United Kingdom)</td> <td align="center">0.44</td> <td><tt>wikidata:Q145</tt><br/> (United Kingdom)</td> <td align="center">0.85</td> </tr> <tr> <td><tt>wikidata:P27</tt><br/> (country of citizenship)</td> <td align="center">6.75</td> <td><tt>wikidata:Q82955</tt><br/> (politician)</td> <td align="center">0.42</td> <td><tt>wikidata:Q27939</tt><br/> (singing)</td> <td align="center">0.79</td> </tr> <tr> 
<td><tt>wikidata:P735</tt><br/> (given name)</td> <td align="center">6.53</td> <td><tt>wikidata:Q1860</tt><br/> (English)</td> <td align="center">0.39</td> <td><tt>wikidata:Q36180</tt><br/> (writer)</td> <td align="center">0.71</td> </tr> <tr> <td><tt>wikidata:P19</tt><br/> (place of birth)</td> <td align="center">5.20</td> <td><tt>wikidata:Q33999</tt><br/> (actor)</td> <td align="center">0.36</td> <td><tt>wikidata:Q2736</tt><br/> (association football)</td> <td align="center">0.68</td> </tr> <tr> <td><tt>wikidata:P54</tt><br/> (member of sports team)</td> <td align="center">2.64</td> <td><tt>wikidata:Q36180</tt><br/> (writer)</td> <td align="center">0.24</td> <td><tt>wikidata:Q183</tt><br/> (Germany)</td> <td align="center">0.61</td> </tr> <tr> <td><tt>wikidata:P69</tt><br/> (educated at)</td> <td align="center">2.58</td> <td><tt>wikidata:Q177220</tt><br/> (singer)</td> <td align="center">0.20</td> <td><tt>wikidata:Q16</tt><br/> (Canada)</td> <td align="center">0.58</td> </tr> </table>
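As an illustration of how a distribution like the one in the table could be computed, the following is a minimal sketch. It assumes that each percentage is the share of an item among all occurrences in its column. The function name top_shares and the toy triples are hypothetical and are not taken from D1 or D2.

```python
from collections import Counter


def top_shares(items, k=10):
    """Return the k most frequent items with their share (in %) of all items."""
    counts = Counter(items)
    total = sum(counts.values())
    return [(item, round(100.0 * n / total, 2)) for item, n in counts.most_common(k)]


# Hypothetical toy triples (subject, predicate, object); not real dataset content.
triples = [
    ("dbr:A", "dbo:birthDate", '"1950-01-01"'),
    ("dbr:A", "dbo:birthPlace", "dbr:England"),
    ("dbr:B", "dbo:birthDate", '"1961-05-12"'),
    ("dbr:B", "dbo:occupation", "dbr:Actor"),
]

# Share of each predicate among all predicate occurrences.
predicate_shares = top_shares(p for _, p, _ in triples)
print(predicate_shares)
```

The same helper can be applied to the objects of the triples, or to the entities found in the summaries, to obtain the other columns.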

Our Systems

The Systems directory contains all the code needed to train our models and to generate summaries for the sets of triples that are located in the validation and test sets of our datasets. It contains our two models in two separate sub-folders (i.e. Triples2GRU and Triples2LSTM). The neural network models are implemented using the Torch package. We conducted our experiments on a single Titan X (Pascal) GPU. Please make sure that Torch, the torch-hdf5 package and the NVIDIA CUDA drivers are installed on your machine before executing any of the .lua files in these directories.

The generated summaries will be saved as HDF5 files in the directory of the pre-trained model. Our trained models use CUDA Tensors. Consequently, the NVIDIA CUDA drivers along with the cutorch and cunn Lua packages should be installed on your machine. The two Lua packages can be installed by running:

luarocks install cutorch
luarocks install cunn

For all the parameters of the above files that can be altered, please consult the comment sections of the corresponding file.

KenLM

The KenLM directory contains all the code required to train an n-gram Kneser-Ney language model. The code is based on the KenLM Language Model Toolkit. The binary files that reside in the ./kenlm/build/ directory have been compiled using Boost on a machine running Ubuntu 16.04 (x86_64 Linux 4.4.0-98-generic). If you wish to experiment with this baseline on a different OS, you need to download and compile the original package according to the instructions at https://kheafield.com/code/kenlm/.

The following Python packages should also be installed on your machine: (i) numpy, (ii) pandas, and (iii) kenlm. The latter can be installed by running: pip install https://github.com/kpu/kenlm/archive/master.zip (i.e. https://github.com/kpu/kenlm).
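Before running the baseline, it may help to check that these packages are visible to the Python interpreter in use. The snippet below is a small sketch of such a check. The helper missing_packages is hypothetical and is not part of this repository. It only tests whether a package can be located, not whether its version is suitable.

```python
import importlib.util

# Package names as listed in this README.
REQUIRED = ["numpy", "pandas", "kenlm"]


def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```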

In the default scenario, the model trains on D1 and samples summaries for the sets of triples that have been allocated to the test set. In case you wish to run the files (i.e. train.sh, sample.py and process-templates.py) in a different setup, you can alter them following the guidelines in each file's comment sections.
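To give an intuition for what training an n-gram model and then sampling from it involves, here is a toy sketch in pure Python. It is not the KenLM pipeline used in this directory: it uses unsmoothed bigram counts rather than Kneser-Ney smoothing, and the corpus, the SUBJECT placeholder and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def train_bigrams(sentences):
    """Count, for every token, which tokens follow it in the training text."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers


def sample(followers, rng, max_len=20):
    """Sample one sequence, drawing each token given only the previous one."""
    out, prev = [], BOS
    for _ in range(max_len):
        choices = followers[prev]
        prev = rng.choices(list(choices), weights=list(choices.values()))[0]
        if prev == EOS:
            break
        out.append(prev)
    return " ".join(out)


# Hypothetical training sentences; not taken from D1 or D2.
corpus = [
    "SUBJECT is an English actor",
    "SUBJECT is a Canadian writer",
    "SUBJECT was an English politician",
]
model = train_bigrams(corpus)
summary = sample(model, random.Random(0))
print(summary)
```

Every pair of adjacent tokens in a sampled sequence was seen in the training text, which is why an unsmoothed model like this one cannot produce unseen word combinations. Smoothing methods such as Kneser-Ney address that limitation.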

License

This project is licensed under the terms of the Apache 2.0 License.