Open Practical

[Chris Dyer, Phil Blunsom, Yannis Assael, Brendan Shillingford, Yishu Miao, Jan Buys]

The first three practicals covered a variety of introductory topics in Deep Learning for NLP. For the remainder of the term, you can select from among the following projects and explore them according to your interests. Treat the following project descriptions as suggested starting points, but feel free to take any of these ideas in a different direction if you feel inspired to. The practical sessions offer you an opportunity to pursue such ideas while making use of the support and expertise of your practical demonstrators.

Microsoft Azure Sponsorship

<img src="https://rawgit.com/oxford-cs-deepnlp-2017/practical-open/master/doc/azure.svg" width="50%" />

We would like to thank Microsoft Azure Sponsorship for their generous donation of GPU computational resources for this course, as well as Michael Thomas, Andrew Webber and Lee Stott for their invaluable help. For more information on setting up your GPU VM, see the Azure Setup Guide.

Conditional language modeling

Conditioning on the topic set and generating text

This is the inverse of the problem considered in Practical 2/3: rather than conditioning on a talk and predicting its labels, you should condition on an embedding of the label set and generate a TED talk. Because the TED dataset is small, the (partial) Wikipedia dumps could be considered as an alternative source of training data.

This can be summarized by the very simple stochastic program:

   x = embed(topic set) # map a set of topics into a (learned) vector representation

   y ~ RNNLM( x ) # feed the topic set into the RNN so it can use this information

The parameters of the RNN and the embedding model can be trained jointly to maximize the log likelihood of a set of training pairs (topic set, y).
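Written out, the training objective might look as follows. This is only a sketch of the standard maximum-likelihood formulation; the symbols are assumptions introduced here: s is a topic set, y = (y_1, ..., y_{|y|}) is the token sequence of the corresponding talk, D is the training set, and θ collects the parameters of both the embedding model and the RNN.

```latex
\hat{\theta} = \arg\max_{\theta}
  \sum_{(s,\, y) \in \mathcal{D}} \; \sum_{t=1}^{|y|}
  \log p_{\theta}\bigl(y_t \mid y_{<t},\ \mathrm{embed}_{\theta}(s)\bigr)
```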

You will need to design a function to create an embedding of the set of topics you are going to condition on. Options here are using an "RNN encoder" to read the list, an additive model, a convolutional architecture, or something else you come up with.
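As one possible starting point, the additive option can be sketched in a few lines of plain Python. The helper names `make_embedding_table` and `embed_topic_set` are hypothetical, the vectors are randomly initialised rather than learned, and averaging (rather than summing) is an assumption made here so that the scale of x does not depend on the number of topics.

```python
import random


def make_embedding_table(topics, dim, seed=0):
    """Create one randomly initialised vector per topic (hypothetical helper).

    In a real model these vectors would be trainable parameters.
    """
    rng = random.Random(seed)
    return {t: [rng.gauss(0.0, 0.1) for _ in range(dim)] for t in sorted(topics)}


def embed_topic_set(topic_set, table):
    """Additive model: average the embeddings of the topics in the set.

    Averaging is order-invariant, so the same set always maps to the
    same vector regardless of the order in which topics are listed.
    """
    known = [table[t] for t in topic_set if t in table]
    if not known:
        raise ValueError("no known topics in the set")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


table = make_embedding_table({"technology", "entertainment", "design"}, dim=4)
x = embed_topic_set(["technology", "design"], table)
```

The resulting vector x could then be used, for example, as the initial hidden state of the RNN or concatenated to every input token embedding. An RNN or convolutional encoder would replace `embed_topic_set` but keep the same interface.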

Questions

Conditioning on the talk and generating a summary

Summarisation is an important application in NLP. Sequence-to-sequence models are an appealing approach to generating summaries: an encoder "reads" the contents of the talk, and a decoder attends to the resulting representations and generates an output summary.
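To make the "attends to" step concrete, here is a minimal sketch of a single attention step in plain Python. It uses dot-product scoring purely for brevity, which is an assumption and not necessarily the scoring function you should use; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """One attention step: score each encoder state against the decoder
    query, normalise the scores, and return the weighted average."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Three toy encoder states and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```

The context vector is what the decoder would combine with its own state when predicting the next summary token.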

Each TED talk in our dataset comes with a short summary (located in the <description> XML tag). Design and implement a model for learning to generate summaries. Evaluate it using the ROUGE metric.
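For quick sanity checks during development, a simplified ROUGE-N can be computed as clipped n-gram overlap between a generated summary and the reference. The sketch below assumes whitespace tokenisation and a single reference, and omits stemming and ROUGE-L; it is not a replacement for an established ROUGE implementation when reporting results. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: precision, recall and F1 of clipped n-gram overlap."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the minimum of both
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


unigram = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
bigram = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
```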

There are several challenges in this task:

Machine translation (MT)

Machine translation has been one of the notable successes of deep learning in NLP. The TED corpus has been translated into many languages by a volunteer effort. TED also now holds regional talks in many different languages, many of which are translated into English. As a result, we have a lot of "in domain" training data from which to learn how to translate TED talks.

In this task, you should implement a sequence-to-sequence translation model for German to English translation, as described by Bahdanau et al. (2015). You can obtain preprocessed (tokenised, filtered for length, lowercased, and split into training/validation/test sets) parallel German-English data (if you wish, you can copy this data somewhere here). Note: the validation/test sets contain OOV words relative to the vocabulary in the training set, in both the source and target languages. You will need to deal with this (the simplest thing is to replace infrequent words with an UNK token).
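The UNK replacement mentioned above might be sketched as follows. The threshold `min_count=2`, the `<unk>` spelling and the toy sentences are assumptions chosen for illustration. Note that the replacement is applied to the training data as well, so that the model actually sees UNK during training.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown-word token


def build_vocab(sentences, min_count=2):
    """Keep only the tokens seen at least min_count times in the training data."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, c in counts.items() if c >= min_count}


def replace_oov(sentence, vocab):
    """Replace every token outside the vocabulary with UNK."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())


train = ["das ist gut", "das ist schlecht", "das haus ist klein"]
vocab = build_vocab(train, min_count=2)

train_unk = [replace_oov(s, vocab) for s in train]
test_unk = replace_oov("das auto ist gut", vocab)
```

Build one vocabulary per language, always from the training split only.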

For decoding, you should implement two decoding algorithms: a greedy left-to-right decoder, and one that samples translations in proportion to their probability p(translation | input). Evaluate the output of your translation model using multi-bleu.perl.
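The two decoders differ only in how the next token is chosen from the model's distribution, as the following sketch illustrates. Here `toy_next_token_probs` is a hypothetical stand-in for a trained decoder that returns p(next token | prefix, input) as a dictionary; the token names and the `</s>` end marker are likewise assumptions.

```python
import random

EOS = "</s>"  # assumed end-of-sentence marker


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a trained decoder conditioned on the input."""
    if len(prefix) >= 3:
        return {EOS: 1.0}
    return {"hello": 0.6, "world": 0.3, EOS: 0.1}


def greedy_decode(next_token_probs, max_len=20):
    """Pick the single most probable token at every step."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        prefix.append(token)
    return prefix


def sample_decode(next_token_probs, rng, max_len=20):
    """Draw each token from the model distribution (ancestral sampling), so
    complete translations are produced in proportion to their probability."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        if token == EOS:
            break
        prefix.append(token)
    return prefix


greedy = greedy_decode(toy_next_token_probs)
samples = [sample_decode(toy_next_token_probs, random.Random(seed)) for seed in range(5)]
```

Greedy decoding is deterministic, whereas repeated calls to `sample_decode` with different seeds give different translations.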

Questions

Speaking rate modeling

How long does it take to read a document? Different speakers speak at different base rates; and different words take more or less time to pronounce.

In this task, you will use the timing data from the TED XML transcripts to build a model to predict how long different speakers will take to produce words. Since each speaker will have a different base rate, we will control for that by observing the speaking rate both in the training set and test set.

r = per-character speaking rate of the speaker (observed in this case)

c_t = context vector at time t (can be computed from token embeddings, a bidirectional RNN or a ConvNet)

d_t = predict_duration(c_t, r) (in milliseconds? or log milliseconds? Should it be positive?)

loss(d_t, observed duration at time t) (What loss function is used here? What does the error distribution look like? Is it symmetric? Skewed?)
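Before building a neural model, it can help to have a trivial baseline and a candidate loss to compare against. The sketch below assumes the timing data has already been extracted into hypothetical (text, duration in milliseconds) pairs for one speaker; it estimates r, predicts durations as r times the number of characters, and scores predictions with squared error in log space. The log-space loss is just one possible answer to the questions above, not a prescribed choice.

```python
import math


def per_char_rate(segments):
    """Speaker base rate r: total milliseconds divided by total characters."""
    total_ms = sum(ms for _, ms in segments)
    total_chars = sum(len(text) for text, _ in segments)
    return total_ms / total_chars


def baseline_duration(text, r):
    """Simplest predict_duration: every character takes r milliseconds."""
    return r * len(text)


def log_squared_error(pred_ms, true_ms):
    """Squared error in log space: predicting twice or half the true
    duration is penalised equally, and predictions must be positive."""
    return (math.log(pred_ms) - math.log(true_ms)) ** 2


# Hypothetical (text, duration in ms) pairs for one speaker.
segments = [("hello", 300.0), ("world", 350.0), ("ok", 70.0)]
r = per_char_rate(segments)
pred = baseline_duration("hello", r)
loss = log_squared_error(pred, 300.0)
```

A learned model should beat this baseline by using the context vector to capture which words take more or less time than their length suggests.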

In this task, you will need to:

Questions

Propose your own task

If none of the above tasks inspire you, or if you happen to have discovered an interesting source of data to model, feel free to propose and pursue your own research idea.

Some ideas might be: