Welcome to SYNLP!
The following are some of our representative research papers.
Note: The Language column in the tables below indicates the languages on which the models are evaluated in the corresponding papers. It does not mean a model will not work on other languages.
Word Embedding and Pre-trained LM
Model Recommendation: DSG provides 200-dimensional word embeddings for around 8M Chinese words. ZEN 2.0 provides large pre-trained language models for Arabic and Chinese (the large version uses 24 self-attention layers with 1024-dimensional hidden vectors); the models are trained on large corpora and enhance text modeling through n-grams. ChiMed-GPT is a Chinese medical large language model (LLM) built by continually training Ziya-v2 on Chinese medical data, with pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) all performed comprehensively.
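Pre-trained word embeddings such as DSG's are typically distributed in word2vec text format (one word per line, followed by its vector). The sketch below, using a hypothetical 4-dimensional toy table rather than the real 200-dimensional DSG vectors, shows how such a file is parsed and how cosine similarity between words is computed:

```python
import math

# Toy embedding table in word2vec text format: "word v1 v2 ... vn".
# Real DSG embeddings are 200-dimensional and cover ~8M Chinese words;
# these 4-dimensional vectors are made up purely for illustration.
EMBEDDING_TEXT = """\
北京 0.1 0.3 0.2 0.4
上海 0.1 0.2 0.2 0.5
香蕉 0.9 0.1 0.7 0.0
"""

def load_embeddings(text):
    """Parse word2vec-style text into a {word: vector} dict."""
    table = {}
    for line in text.strip().splitlines():
        parts = line.split()
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

emb = load_embeddings(EMBEDDING_TEXT)
# 北京 (Beijing) should be closer to 上海 (Shanghai) than to 香蕉 (banana).
sim_city = cosine(emb["北京"], emb["上海"])
sim_fruit = cosine(emb["北京"], emb["香蕉"])
```

With real embeddings, one would read the released file from disk the same way (or load it with a library such as gensim).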
Chinese Word Segmentation and POS Tagging
Model Recommendation: WMSeg and McASP contain easy-to-use models for CWS and joint CWS and POS tagging based on BERT and ZEN. Models trained on different datasets are available for download.
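To illustrate what the CWS task produces, here is a toy forward-maximum-matching segmenter over a tiny hypothetical lexicon. This is not the WMSeg/McASP approach (those are BERT/ZEN-based neural models); it only shows the input/output form of the task, i.e., splitting unspaced Chinese text into words:

```python
# Toy forward-maximum-matching (FMM) segmenter. NOT the neural method
# used by WMSeg/McASP; just a minimal illustration of the CWS task.
LEXICON = {"北京", "大学", "北京大学", "生", "学生"}  # hypothetical toy lexicon
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def fmm_segment(text, lexicon=LEXICON):
    """Greedily match the longest lexicon word at each position;
    fall back to a single character when nothing matches."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

segments = fmm_segment("北京大学生")  # → ['北京大学', '生']
```

The greedy output here ("北京大学" + "生" rather than "北京" + "大学生") is a classic example of the ambiguity that neural segmenters like WMSeg are designed to resolve from context.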
Parsing
Name | Paper | Code | Language |
---|---|---|---|
SAPar | Improving Constituency Parsing with Span Attention | link | Arabic, Chinese, English |
DMPar | Enhancing Structure-aware Encoder with Extremely Limited Data for Graph-based Dependency Parsing | link | English |
NeST-CCG | Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks | link | English |
Model Recommendation: SAPar provides constituency parsers (based on BERT, XLNet, and ZEN) for Arabic, Chinese, and English; DMPar provides code for dependency parsing; NeST-CCG offers BERT-based models for English CCG supertagging. All three repositories provide pre-trained models and are easy to use.
Semantic Role Labeling
Name | Paper | Code | Language |
---|---|---|---|
SRL-MM | Syntax-driven Approach for Semantic Role Labeling | link | English |
Named Entity Recognition
Name | Paper | Code | Language |
---|---|---|---|
SANER | Named Entity Recognition for Social Media Texts with Semantic Augmentation | link | Chinese, English |
AESINER | Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information | link | Chinese, English |
BioKMNER | Improving biomedical named entity recognition with syntactic information | link | English |
Model Recommendation: SANER uses pre-trained language models and word embeddings for text modeling, where the semantics of similar words are used to enhance text understanding. Pre-trained models are available for download and are easy to use.
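The core idea of semantic augmentation can be sketched in a few lines: a noisy social-media token's vector is enhanced by mixing in the embeddings of its most similar known words. This toy sketch (made-up 3-dimensional vectors, not SANER's actual architecture) shows the principle:

```python
import math

# Toy semantic augmentation in the spirit of SANER (not its actual
# implementation): enhance a noisy token's vector with the mean of
# its top-k nearest neighbors in embedding space.
EMB = {
    "gooood": [0.9, 0.1, 0.0],   # noisy social-media spelling
    "good":   [1.0, 0.0, 0.1],
    "great":  [0.9, 0.0, 0.2],
    "table":  [0.0, 1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def augment(word, k=2):
    """Average the word vector with the mean of its top-k neighbors."""
    neighbors = sorted((w for w in EMB if w != word),
                       key=lambda w: cosine(EMB[word], EMB[w]),
                       reverse=True)[:k]
    dim = len(EMB[word])
    mean = [sum(EMB[w][i] for w in neighbors) / k for i in range(dim)]
    return [(a + b) / 2 for a, b in zip(EMB[word], mean)]

vec = augment("gooood")  # pulled toward "good" and "great", away from "table"
```

In the real model, the augmented representation is fed into the sequence labeler alongside the contextual encoding, which is what helps with out-of-vocabulary social-media spellings.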
Coreference Resolution
Name | Paper | Code | Language |
---|---|---|---|
Pronoun-Coref-KG | Knowledge-aware Pronoun Coreference Resolution | link | English |
Pronoun-Coref | Incorporating Context and External Knowledge for Pronoun Coreference Resolution | link | English |
Visual_PCR | What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues | link | English |
Model Recommendation: Pronoun-Coref uses GloVe and ELMo embeddings for text modeling. The model is lightweight and easy to use.
Aspect-level Sentiment Analysis
Model Recommendation: DGSA provides an end-to-end solution (the model is based on BERT) for aspect-level sentiment analysis, which can be directly used to process raw text.
Relation Extraction
Model Recommendation: RE-AGCN provides BERT-based models for relation extraction, where the model leverages the auto-parsed dependency tree of the input text to gain a better understanding of the text.
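The way a dependency tree feeds into such a model can be sketched with a single graph-convolution step: each token's vector is updated with the average of its own vector and its syntactic neighbors' vectors. This is a minimal illustration in the spirit of graph convolution over parses, not RE-AGCN's actual attentive architecture; the sentence, arcs, and features are made up:

```python
# One mean-aggregation graph-convolution step over a dependency tree.
# Sentence: "She founded Acme", with parser arcs founded->She and
# founded->Acme; toy 2-dimensional token features.
TOKENS = ["She", "founded", "Acme"]
EDGES = [(1, 0), (1, 2)]  # (head, dependent) pairs from an auto-parser

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token features

def gcn_step(h, edges):
    """Update each node with the degree-normalized sum of itself
    and its neighbors (adjacency with self-loops, arcs undirected)."""
    n, dim = len(h), len(h[0])
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    out = []
    for i in range(n):
        deg = sum(adj[i])
        out.append([sum(adj[i][j] * h[j][d] for j in range(n)) / deg
                    for d in range(dim)])
    return out

H1 = gcn_step(H, EDGES)
```

After this step, the representation of "founded" has absorbed information from both of its arguments, which is exactly the kind of syntax-aware context that helps classify the relation between entity pairs.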
Domain Adaptation
Name | Paper | Code | Language |
---|---|---|---|
T-DNA | Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation | link | English |
SDG4DA | Reinforced Training Data Selection for Domain Adaptation | link | English |
DPM4DA | Domain Adaptation for Disease Phrase Matching with Adversarial Networks | -- | English |
TD4DA | Entropy-based Training Data Selection for Domain Adaptation | -- | Chinese, English |
GM4DA | Using a goodness measurement for domain adaptation: A case study on Chinese word segmentation | -- | Chinese |
Model Recommendation: T-DNA is a Transformer-based language model for domain adaptation and is easy to use.
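A first step behind n-gram-based domain adaptation is mining frequent n-grams from unlabeled in-domain text so they can be treated as extra representation units. The sketch below shows that mining step only (with a made-up toy corpus); it is an illustration in the spirit of T-DNA, not its actual pipeline:

```python
from collections import Counter

# Mine frequent n-grams from an unlabeled domain corpus. The corpus
# below is a made-up toy example.
corpus = [
    "the spectral norm of the weight matrix",
    "the spectral radius bounds the norm",
    "weight matrix initialization",
]

def frequent_ngrams(sentences, n=2, min_count=2):
    """Return the set of n-grams occurring at least min_count times."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return {ng for ng, c in counts.items() if c >= min_count}

ngrams = frequent_ngrams(corpus)
# → {('the', 'spectral'), ('weight', 'matrix')}
```

In the full approach, such domain n-grams are then given representations that augment the pre-trained language model's subword encoding of low-resource domain text.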
Medical NER
Name | Paper | Code | Language |
---|---|---|---|
HET-MC | Summarizing Medical Conversations via Identifying Important Utterances | link | Chinese |
BioKMNER | Improving biomedical named entity recognition with syntactic information | link | English |
Radiology Report Generation
Name | Paper | Code | Language |
---|---|---|---|
R2GenRL | Reinforced Cross-modal Alignment for Radiology Report Generation | link | English |
R2GenCMN | Cross-modal Memory Networks for Radiology Report Generation | link | English |
R2Gen | Generating Radiology Reports via Memory-driven Transformer | link | English |
🔥RRG-Review | A Systematic Review of Deep Learning-based Research on Radiology Report Generation | -- | English |
Language Resource
Name | Paper | Code | Language |
---|---|---|---|
ChiMed | ChiMed: A Chinese Medical Corpus for Question Answering | link | Chinese |
ChiMST | ChiMST: A Chinese Medical Corpus for Word Segmentation and Medical Term Recognition | link | Chinese |
Chinese CCGBank | Chinese CCGBank Construction from Tsinghua Chinese Treebank | -- | Chinese |
HNZ | The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi | link | Chinese |