Awesome

Welcome to SYNLP!

The following are some of our representative research papers.

Note: The Language column in the following tables lists the languages on which the models are evaluated in the paper. It does not mean the model will not work on other languages.

Word Embedding and Pre-trained LM

Name	Paper	Code	Language
DSG	Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings	link	Chinese
ZEN 1.0	ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations	link	Chinese
ZEN 2.0	ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders	link	Arabic, Chinese
T-DNA	Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation	link	English
🔥 ChiMed-GPT	ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences	link	Chinese, English

Model Recommendation: DSG provides 200-dimensional word embeddings for around 8M Chinese words. ZEN 2.0 provides large pre-trained language models for Arabic and Chinese (the large version uses 24 self-attention layers with 1024-dimensional hidden vectors). The models are trained on large corpora and enhance text modeling with n-grams. ChiMed-GPT is a Chinese medical large language model (LLM) built by continuing to train Ziya-v2 on Chinese medical data, covering the full regime of pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF).
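As a rough illustration of how static word embeddings such as DSG might be consumed, the sketch below parses vectors in the common word2vec plain-text layout (a header line with vocabulary size and dimension, then one word per line followed by its values) and compares two words by cosine similarity. The file layout, the function names, and the toy 3-dimensional vectors are assumptions for illustration only; check the DSG repository for the actual release format.

```python
import io
import math


def load_text_embeddings(handle):
    """Parse word2vec-style text: 'vocab_size dim' header, then 'word v1 ... vd'.

    The layout is an assumption; the real DSG files may differ.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-in for an embedding file (real DSG vectors are 200-dimensional).
toy_file = io.StringIO(
    "3 3\n"
    "北京 1.0 0.0 0.0\n"
    "上海 0.8 0.6 0.0\n"
    "苹果 0.0 0.0 1.0\n"
)
vectors, dim = load_text_embeddings(toy_file)
similarity = cosine(vectors["北京"], vectors["上海"])
```

For a vocabulary of around 8M words, a plain dictionary of Python lists would be memory-hungry; an array-backed loader would likely be a better fit in practice.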

Chinese Word Segmentation and POS Tagging

Name	Paper	Code	Language
WMSeg	Improving Chinese Word Segmentation with Wordhood Memory Networks	link	Chinese
TwASP	Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge	link	Chinese
McASP	Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams	link	Chinese
GCASeg	Federated Chinese Word Segmentation with Global Character Associations	link	Chinese

Model Recommendation: WMSeg and McASP contain easy-to-use models for Chinese word segmentation (CWS) and for joint CWS and POS tagging, based on BERT and ZEN. Models trained on different datasets are available for download.

Parsing

Name	Paper	Code	Language
SAPar	Improving Constituency Parsing with Span Attention	link	Arabic, Chinese, English
DMPar	Enhancing Structure-aware Encoder with Extremely Limited Data for Graph-based Dependency Parsing	link	English
NeST-CCG	Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks	link	English

Model Recommendation: SAPar provides constituency parsers (based on BERT, XLNet, and ZEN) for Arabic, Chinese, and English; DMPar provides code for dependency parsing; NeST-CCG offers BERT-based models for English CCG supertagging. The repositories provide pre-trained models and are easy to use.

Semantic Role Labeling

Name	Paper	Code	Language
SRL-MM	Syntax-driven Approach for Semantic Role Labeling	link	English

Named Entity Recognition

Name	Paper	Code	Language
SANER	Named Entity Recognition for Social Media Texts with Semantic Augmentation	link	Chinese, English
AESINER	Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information	link	Chinese, English
BioKMNER	Improving biomedical named entity recognition with syntactic information	link	English

Model Recommendation: SANER uses pre-trained language models and word embeddings for text modeling, and draws on the semantics of similar words to enhance text understanding. Pre-trained models are available for download and are easy to use.

Coreference Resolution

Name	Paper	Code	Language
Pronoun-Coref-KG	Knowledge-aware Pronoun Coreference Resolution	link	English
Pronoun-Coref	Incorporating Context and External Knowledge for Pronoun Coreference Resolution	link	English
Visual_PCR	What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues	link	English

Model Recommendation: Pronoun-Coref uses GloVe and ELMo embeddings for text modeling. The model is lightweight and easy to use.

Aspect-level Sentiment Analysis

Name	Paper	Code	Language
ASA-TGCN	Aspect-based Sentiment Analysis with Type-aware Graph Convolutional Networks and Layer Ensemble	link	English
ASA-WD	Enhancing Aspect-level Sentiment Analysis with Word Dependencies	link	English
ASA-CLD	Complementary Learning of Aspect Terms for Aspect-based Sentiment Analysis	link	English
DGSA	Joint Aspect Extraction and Sentiment Analysis with Directional Graph Convolutional Networks	link	English
ASA-TM	Improving Federated Learning for Aspect-based Sentiment Analysis via Topic Memories	link	English

Model Recommendation: DGSA provides an end-to-end, BERT-based solution for aspect-level sentiment analysis that can be applied directly to raw text.

Relation Extraction

Name	Paper	Code	Language
RE-AGCN	Dependency-driven Relation Extraction with Attentive Graph Convolutional Networks	link	English
RE-TAMM	Relation Extraction with Type-aware Map Memories of Word Dependencies	link	English
RE-DMP	Improving Relation Extraction through Syntax-induced Pre-training with Dependency Masking	link	English
RE-NGCN	Relation Extraction with Word Graphs from N-grams	link	English
RE-AMT	Enhancing Relation Extraction via Adversarial Multi-task Learning	link	English

Model Recommendation: RE-AGCN provides BERT-based models for relation extraction that leverage the auto-parsed dependency tree of the input text to better understand it.
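To illustrate how a dependency tree can feed a graph convolutional network in general, the sketch below converts a list of head indices into a symmetric adjacency matrix with self-loops, a common input for GCN layers. The head-index encoding, the function name, and the hand-written example parse are assumptions for illustration and are not taken from the RE-AGCN code.

```python
def dependency_adjacency(heads):
    """Build an undirected adjacency matrix with self-loops.

    heads[i] is the 1-based index of token i's head; 0 marks the root.
    This encoding is an assumption chosen for the example.
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1  # self-loop so each token keeps its own features
    for child, head in enumerate(heads):
        if head == 0:
            continue  # the root has no incoming dependency edge
        adj[child][head - 1] = 1
        adj[head - 1][child] = 1
    return adj


# Hypothetical parse of "She founded Acme": "founded" is the root,
# and both "She" and "Acme" attach to it.
tokens = ["She", "founded", "Acme"]
heads = [2, 0, 2]
adj = dependency_adjacency(heads)
```

An attentive variant would replace the 0/1 entries with learned weights over the connected token pairs; the matrix here only encodes which pairs are connected.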

Domain Adaptation

Name	Paper	Code	Language
T-DNA	Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation	link	English
SDG4DA	Reinforced Training Data Selection for Domain Adaptation	link	English
DPM4DA	Domain Adaptation for Disease Phrase Matching with Adversarial Networks	--	English
TD4DA	Entropy-based Training Data Selection for Domain Adaptation	--	Chinese, English
GM4DA	Using a goodness measurement for domain adaptation: A case study on Chinese word segmentation	--	Chinese

Model Recommendation: T-DNA is an easy-to-use Transformer-based language model for domain adaptation.

Medical NER

Name	Paper	Code	Language
HET-MC	Summarizing Medical Conversations via Identifying Important Utterances	link	Chinese
BioKMNER	Improving biomedical named entity recognition with syntactic information	link	English

Radiology Report Generation

Name	Paper	Code	Language
R2GenRL	Reinforced Cross-modal Alignment for Radiology Report Generation	link	English
R2GenCMN	Cross-modal Memory Networks for Radiology Report Generation	link	English
R2Gen	Generating Radiology Reports via Memory-driven Transformer	link	English
🔥 RRG-Review	A Systematic Review of Deep Learning-based Research on Radiology Report Generation	--	English

Language Resource

Name	Paper	Code	Language
ChiMed	ChiMed: A Chinese Medical Corpus for Question Answering	link	Chinese
ChiMST	ChiMST: A Chinese Medical Corpus for Word Segmentation and Medical Term Recognition	link	Chinese
Chinese CCGBank	Chinese CCGBank Construction from Tsinghua Chinese Treebank	--	Chinese
HNZ	The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi	link	Chinese