Awesome NLP-Projects
Natural Language Processing projects, including concepts and scripts about:
- gensim, fastText and tensorflow implementations. See Chinese notes (中文解读)
- doc2vec, word2vec averaging and Smooth Inverse Frequency implementations (a small sketch follows this list)
- Categories and components of dialog system
- tensorflow LSTM (see Chinese notes 1, 中文解读 1, and Chinese notes 2, 中文解读 2) and fastText implementations
- Principle of ELMo, ULMFiT, GPT, BERT, XLNet
- Chinese_word_segmentation
  - HMM Viterbi implementations. See Chinese notes (中文解读)
- Named_Entity_Recognition
  - Brands NER via bi-directional LSTM + CRF, tensorflow implementation. See Chinese notes (中文解读)
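As a pointer to what "word2vec averaging" and "Smooth Inverse Frequency" mean in the sentence-embedding item above, here is a minimal numpy sketch. The toy vocabulary, random vectors, unigram probabilities and the weighting constant `a` are assumptions for illustration, not the repository's actual code, and SIF's common-component removal step is omitted.

```python
import numpy as np

# Toy word vectors and unigram probabilities -- assumed for illustration only.
rng = np.random.default_rng(0)
word_vectors = {w: rng.standard_normal(50) for w in ["the", "cat", "sat", "on", "mat"]}
word_prob = {"the": 0.05, "cat": 0.001, "sat": 0.002, "on": 0.02, "mat": 0.001}

def average_embedding(tokens):
    """word2vec averaging: the sentence vector is the mean of its word vectors."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def sif_embedding(tokens, a=1e-3):
    """Smooth Inverse Frequency: weight each word vector by a / (a + p(w)).
    The common-component removal step of the original SIF paper is omitted here."""
    weighted = [(a / (a + word_prob[t])) * word_vectors[t] for t in tokens]
    return np.mean(weighted, axis=0)

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(average_embedding(sentence).shape, sif_embedding(sentence).shape)  # (50,) (50,)
```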
Concepts
1. Attention
- Attention == weighted averages
- The attention review 1 and review 2 summarize attention mechanisms into several types (a small scoring sketch follows this list):
  - Additive vs Multiplicative attention
  - Self attention
  - Soft vs Hard attention
  - Global vs Local attention
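To make "attention == weighted averages" concrete, the sketch below computes soft attention with the two scoring functions named above, additive and multiplicative (dot-product). The dimensions and random matrices are assumptions for illustration, not any particular model's parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                   # hidden size (assumed)
query = rng.standard_normal(d)          # e.g., a decoder state
keys = rng.standard_normal((5, d))      # e.g., 5 encoder states (values == keys here)

# Multiplicative (dot-product) attention: score_i = q . k_i
mult_scores = keys @ query

# Additive (Bahdanau-style) attention: score_i = v . tanh(q W1 + k_i W2)
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
add_scores = np.tanh(query @ W1 + keys @ W2) @ v

# Soft attention: the context vector is a weighted average of the values.
for name, scores in [("multiplicative", mult_scores), ("additive", add_scores)]:
    weights = softmax(scores)
    context = weights @ keys
    print(name, weights.round(2), context.shape)
```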
2. CNNs, RNNs and Transformer
- Parallelization [1]
  - RNNs
    - Why not good?
      - The last step's output is the input of the current step
    - Solutions
      - Simple Recurrent Units (SRU)
        - Perform parallelization on each hidden state neuron independently
      - Sliced RNNs
        - Separate sequences into windows, use RNNs within each window, and use another RNN on top of the windows
        - Same idea as CNNs
  - CNNs
    - Why good?
      - Parallelization across different windows of one filter
      - Parallelization across different filters
- Long-range dependency [1]
  - CNNs
    - Why not good?
      - A single convolution can only capture window-range dependency
    - Solutions
      - Dilated CNNs
      - Deep CNNs: N * [Convolution + skip-connection]
        - For example, with window size = 3 and stride = 1, the second convolution covers 5 words (i.e., 1-2-3, 2-3-4, 3-4-5); see the receptive-field sketch at the end of this section
  - Transformer > RNNs > CNNs
- Position [1]
  - CNNs
    - Why not good?
      - Convolution preserves relative-order information, but max-pooling discards it
    - Solutions
      - Discard max-pooling and use deep CNNs with skip-connections instead
      - Add position embeddings, just like in ConvS2S (see the position-embedding sketch at the end of this section)
  - Transformer (self-attention)
    - Why not good?
      - In self-attention, one word attends to the other words and generates the summarization vector without relative position information
- Semantic feature extraction [2]
  - Transformer > CNNs == RNNs
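A back-of-the-envelope sketch of the long-range-dependency point above: with stride 1, stacking plain convolutions grows the receptive field linearly, while dilation grows it much faster. The kernel sizes and layer counts below are assumptions for illustration.

```python
# Receptive field of stacked 1-D convolutions over a token sequence (stride 1).

def stacked_receptive_field(kernel_size, num_layers):
    """Plain deep CNN: each extra layer adds (kernel_size - 1) positions."""
    return 1 + num_layers * (kernel_size - 1)

def dilated_receptive_field(kernel_size, dilations):
    """Dilated CNN: each layer adds (kernel_size - 1) * dilation positions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# The example from the list above: window size 3, two stacked layers -> 5 words.
print(stacked_receptive_field(3, 2))             # 5
# With exponentially growing dilations the same kernel reaches much further.
print(dilated_receptive_field(3, [1, 2, 4, 8]))  # 31
```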
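Since both ConvS2S-style CNNs and the Transformer recover word order through position embeddings, here is a hedged sketch of the sinusoidal variant from the Transformer paper; the sequence length and model dimension are toy values chosen for illustration.

```python
import numpy as np

def sinusoidal_position_embedding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Position embeddings are simply added to the token embeddings, so that
# convolution or self-attention layers can recover order information.
seq_len, d_model = 10, 16                            # assumed toy sizes
token_embeddings = np.random.default_rng(0).standard_normal((seq_len, d_model))
inputs = token_embeddings + sinusoidal_position_embedding(seq_len, d_model)
print(inputs.shape)                                  # (10, 16)
```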
3. Pattern of DL in NLP models [3]
- Data
  - Preprocess
    - Sub-word segmentation to avoid OOV and reduce vocabulary size (see the BPE sketch after this list)
  - Pre-training (e.g., ELMo, BERT)
  - Multi-task learning
  - Transfer learning, ref_1, ref_2
    - Use a source task/domain S to improve a target task/domain T
    - If S has zero/one/few instances, we call it zero-shot, one-shot, or few-shot learning, respectively
- Model
  - Encoder
    - CNNs, RNNs, Transformer
  - Structure
    - Sequential, Tree, Graph
- Learning (change the loss definition)
  - Adversarial learning
  - Reinforcement learning
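Sub-word segmentation is commonly done with byte-pair encoding (BPE); the sketch below shows the merge-learning loop on a toy corpus, following Sennrich et al.'s algorithm. The word counts and the merge budget are assumptions for illustration, not tied to any project in this repository.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a space-separated symbol vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy word counts (assumed); '</w>' marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
num_merges = 10  # the merge budget is a hyperparameter
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
print(vocab)  # frequent character sequences (e.g., 'est</w>') become single sub-words
```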
References
- [1] Review
- [2] Why self-attention? A targeted evaluation of neural machine translation architectures
- [3] ACL 2019 oral
Awesome public APIs
Awesome packages
Chinese
English
- spaCy
- gensim
- Install tensorflow with one line: conda install tensorflow-gpu