Welcome to SYNLP!

The following are some of our representative research papers.

Note: The Language column in the following tables lists the languages on which the models are evaluated in the paper. It does not mean the model will not work on other languages.

Word Embedding and Pre-trained LM

Name | Paper | Code | Language
DSG | Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings | link | Chinese
ZEN 1.0 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | link | Chinese
ZEN 2.0 | ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders | link | Arabic, Chinese
T-DNA | Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation | link | English
🔥 ChiMed-GPT | ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences | link | Chinese, English

Model Recommendation: DSG provides 200-dimensional word embeddings for around 8M Chinese words. ZEN 2.0 provides large pre-trained language models for Arabic and Chinese (the large version uses 24 self-attention layers with 1024-dimensional hidden vectors). The models are trained on large corpora and enhance text modeling with n-grams. ChiMed-GPT is a Chinese medical large language model (LLM) built by continually training Ziya-v2 on Chinese medical data, covering pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF).
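To illustrate what "enhancing text modeling through n-grams" can mean at the input level, here is a minimal sketch that finds every character n-gram in a sentence that also appears in a lexicon. It is not taken from the ZEN code base; the function name `match_ngrams`, the toy lexicon, and the `max_n` limit are assumptions made for illustration only.

```python
def match_ngrams(chars, lexicon, max_n=4):
    """Return (start, length, ngram) for every span of `chars` found in `lexicon`.

    Hypothetical helper for illustration; not the ZEN implementation.
    """
    matches = []
    for start in range(len(chars)):
        for n in range(2, max_n + 1):
            end = start + n
            if end > len(chars):
                break  # span would run past the end of the sentence
            ngram = "".join(chars[start:end])
            if ngram in lexicon:
                matches.append((start, n, ngram))
    return matches


# Toy input: a character sequence and a hand-written n-gram lexicon (assumed).
sentence = list("粤港澳大湾区")
lexicon = {"粤港澳", "大湾区", "湾区"}
matches = match_ngrams(sentence, lexicon)
for start, length, ngram in matches:
    print(start, length, ngram)
```

An n-gram enhanced encoder would then combine representations of the matched n-grams with those of the characters they cover; how that combination is done is specific to each model and is described in the papers above.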

Chinese Word Segmentation and POS Tagging

Name | Paper | Code | Language
WMSeg | Improving Chinese Word Segmentation with Wordhood Memory Networks | link | Chinese
TwASP | Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge | link | Chinese
McASP | Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams | link | Chinese
GCASeg | Federated Chinese Word Segmentation with Global Character Associations | link | Chinese

Model Recommendation: WMSeg and McASP contain easy-to-use models for CWS and for joint CWS and POS tagging, based on BERT and ZEN. Models trained on different datasets are available for download.

Parsing

Name | Paper | Code | Language
SAPar | Improving Constituency Parsing with Span Attention | link | Arabic, Chinese, English
DMPar | Enhancing Structure-aware Encoder with Extremely Limited Data for Graph-based Dependency Parsing | link | English
NeST-CCG | Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks | link | English

Model Recommendation: SAPar provides constituency parsers (based on BERT, XLNet, and ZEN) for Arabic, Chinese, and English; DMPar provides code for dependency parsing; NeST-CCG offers BERT-based models for English CCG supertagging. Both repositories provide pre-trained models and are easy to use.

Semantic Role Labeling

Name | Paper | Code | Language
SRL-MM | Syntax-driven Approach for Semantic Role Labeling | link | English

Named Entity Recognition

Name | Paper | Code | Language
SANER | Named Entity Recognition for Social Media Texts with Semantic Augmentation | link | Chinese, English
AESINER | Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information | link | Chinese, English
BioKMNER | Improving biomedical named entity recognition with syntactic information | link | English

Model Recommendation: SANER uses pre-trained language models and word embeddings for text modeling, and draws on the semantics of similar words to enhance text understanding. Pre-trained models are available for download and are easy to use.

Coreference Resolution

Name | Paper | Code | Language
Pronoun-Coref-KG | Knowledge-aware Pronoun Coreference Resolution | link | English
Pronoun-Coref | Incorporating Context and External Knowledge for Pronoun Coreference Resolution | link | English
Visual_PCR | What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues | link | English

Model Recommendation: Pronoun-Coref uses GloVe and ELMo embeddings for text modeling. The model is lightweight and easy to use.

Aspect-level Sentiment Analysis

Name | Paper | Code | Language
ASA-TGCN | Aspect-based Sentiment Analysis with Type-aware Graph Convolutional Networks and Layer Ensemble | link | English
ASA-WD | Enhancing Aspect-level Sentiment Analysis with Word Dependencies | link | English
ASA-CLD | Complementary Learning of Aspect Terms for Aspect-based Sentiment Analysis | link | English
DGSA | Joint Aspect Extraction and Sentiment Analysis with Directional Graph Convolutional Networks | link | English
ASA-TM | Improving Federated Learning for Aspect-based Sentiment Analysis via Topic Memories | link | English

Model Recommendation: DGSA provides an end-to-end solution (the model is based on BERT) for aspect-level sentiment analysis, which can be used directly to process raw text.

Relation Extraction

Name | Paper | Code | Language
RE-AGCN | Dependency-driven Relation Extraction with Attentive Graph Convolutional Networks | link | English
RE-TAMM | Relation Extraction with Type-aware Map Memories of Word Dependencies | link | English
RE-DMP | Improving Relation Extraction through Syntax-induced Pre-training with Dependency Masking | link | English
RE-NGCN | Relation Extraction with Word Graphs from N-grams | link | English
RE-AMT | Enhancing Relation Extraction via Adversarial Multi-task Learning | link | English

Model Recommendation: RE-AGCN provides BERT-based models for relation extraction, which leverage the auto-parsed dependency tree of the input text to better understand the text.
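As a rough illustration of how a dependency tree can be fed to a graph convolutional network, the sketch below turns a list of head indices into a symmetric adjacency matrix with self-loops. This is a generic construction, not code from the RE-AGCN repository; the function name `dependency_adjacency`, the example sentence, and the hand-written heads (standing in for the output of a dependency parser) are assumptions.

```python
def dependency_adjacency(heads):
    """Build an undirected adjacency matrix from dependency heads.

    heads[i] is the 1-based index of the head of token i, or 0 for the root.
    Hypothetical helper for illustration; the actual models may build
    their graphs differently.
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for i, head in enumerate(heads):
        adj[i][i] = 1  # self-loop so each token keeps its own features
        if head > 0:
            adj[i][head - 1] = 1  # edge from dependent to head
            adj[head - 1][i] = 1  # and back, making the graph undirected
    return adj


# Toy example: heads are written by hand, not produced by a real parser.
tokens = ["She", "founded", "Acme"]
heads = [2, 0, 2]  # "founded" is the root; the other two tokens attach to it
adj = dependency_adjacency(heads)
for token, row in zip(tokens, adj):
    print(token, row)
```

A graph convolution layer would then aggregate, for each token, the hidden vectors of the tokens marked 1 in its row; attentive variants weight those neighbors instead of treating them equally.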

Domain Adaptation

Name | Paper | Code | Language
T-DNA | Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation | link | English
SDG4DA | Reinforced Training Data Selection for Domain Adaptation | link | English
DPM4DA | Domain Adaptation for Disease Phrase Matching with Adversarial Networks | -- | English
TD4DA | Entropy-based Training Data Selection for Domain Adaptation | -- | Chinese, English
GM4DA | Using a goodness measurement for domain adaptation: A case study on Chinese word segmentation | -- | Chinese

Model Recommendation: T-DNA is a Transformer-based language model for domain adaptation and is easy to use.

Medical NER

Name | Paper | Code | Language
HET-MC | Summarizing Medical Conversations via Identifying Important Utterances | link | Chinese
BioKMNER | Improving biomedical named entity recognition with syntactic information | link | English

Radiology Report Generation

Name | Paper | Code | Language
R2GenRL | Reinforced Cross-modal Alignment for Radiology Report Generation | link | English
R2GenCMN | Cross-modal Memory Networks for Radiology Report Generation | link | English
R2Gen | Generating Radiology Reports via Memory-driven Transformer | link | English
🔥 RRG-Review | A Systematic Review of Deep Learning-based Research on Radiology Report Generation | -- | English

Language Resource

Name | Paper | Code | Language
ChiMed | ChiMed: A Chinese Medical Corpus for Question Answering | link | Chinese
ChiMST | ChiMST: A Chinese Medical Corpus for Word Segmentation and Medical Term Recognition | link | Chinese
Chinese CCGBank | Chinese CCGBank Construction from Tsinghua Chinese Treebank | -- | Chinese
HNZ | The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi | link | Chinese