Home

Awesome

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 论文的中文翻译

本资源完整的翻译了论文,并且给出了论文中所有引用资料的网络连接,方便对 BERT 感兴趣的朋友们进一步研究 BERT。

  1. 原文 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,这是BERT在2018年11月发布的版本,与2019年5月版本v2有稍许不同。
  2. 以下内容是部分预览内容,完整内容查看本资源中的 Bidirectional_Encoder_Representations_Transformers翻译.md
  3. BERT论文翻译 PDF版下载
  4. 转载请注明出处,商用请联系译者 袁宵 wangzichaochaochao@gmail.com
  5. 未来将继续翻译和解析深度学习相关论文,特别是 NLP 方向的论文。
  6. 如果你喜欢我的工作,请点亮右上角星星,谢谢 :smiley:

手机扫码阅读:


BERT:预训练的深度双向 Transformer 语言模型

Jacob Devlin;Ming-Wei Chang;Kenton Lee;Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com

图 1:预训练模型结构的不同。BERT 使用双向 Transformer。OpenAI GPT 使用 从左到右的 Transformer。ELMo 使用独立训练的从左到右和从右到左的 LSTM 的连接来为下游任务生成特征。其中,只有 BERT 表示在所有层中同时受到左右语境的制约。

图 2:BERT 的输入表示。输入嵌入是标记嵌入(词嵌入)、句子嵌入和位置嵌入的总和。

摘要

我们提出了一种新的称为 BERT 的语言表示模型,BERT 代表来自 Transformer 的双向编码器表示(Bidirectional Encoder Representations from Transformers)。不同于最近的语言表示模型(Peters et al., 2018Radford et al., 2018), BERT 旨在通过联合调节所有层中的左右上下文来预训练深度双向表示。因此,只需要一个额外的输出层,就可以对预训练的 BERT 表示进行微调,从而为广泛的任务(比如回答问题和语言推断任务)创建最先进的模型,而无需对特定于任务进行大量模型结构的修改。

BERT 的概念很简单,但实验效果很强大。它刷新了 11 个 NLP 任务的当前最优结果,包括将 GLUE 基准提升至 80.4%(7.6% 的绝对改进)、将 MultiNLI 的准确率提高到 86.7%(5.6% 的绝对改进),以及将 SQuAD v1.1 的问答测试 F1 得分提高至 93.2 分(提高 1.5 分)——比人类表现还高出 2 分。

1. 介绍

语言模型预训练可以显著提高许多自然语言处理任务的效果(Dai and Le, 2015Peters et al., 2018Radford et al., 2018Howard and Ruder, 2018)。这些任务包括句子级任务,如自然语言推理(Bow-man et al., 2015Williams et al., 2018)和释义(Dolan and Brockett, 2005),目的是通过对句子的整体分析来预测句子之间的关系,以及标记级任务,如命名实体识别(Tjong Kim Sang and De Meulder, 2003)和 SQuAD 问答(Rajpurkar et al., 2016),模型需要在标记级生成细粒度的输出。

现有的两种方法可以将预训练好的语言模型表示应用到下游任务中:基于特征的和微调。基于特征的方法,如 ELMo (Peters et al., 2018),使用特定于任务的模型结构,其中包含预训练的表示作为附加特特征。微调方法,如生成预训练 Transformer (OpenAI GPT) (Radford et al., 2018)模型,然后引入最小的特定于任务的参数,并通过简单地微调预训练模型的参数对下游任务进行训练。在之前的工作中,两种方法在预训练任务中都具有相同的目标函数,即使用单向的语言模型来学习通用的语言表达。

我们认为,当前的技术严重地限制了预训练表示的效果,特别是对于微调方法。主要的局限性是标准语言模型是单向的,这就限制了可以在预训练期间可以使用的模型结构的选择。例如,在 OpenAI GPT 中,作者使用了从左到右的模型结构,其中每个标记只能关注 Transformer 的自注意层中该标记前面的标记(Williams et al., 2018)。这些限制对于句子级别的任务来说是次优的(还可以接受),但当把基于微调的方法用来处理标记级别的任务(如 SQuAD 问答)时可能会造成不良的影响(Rajpurkar et al., 2016),因为在标记级别的任务下,从两个方向分析上下文是至关重要的。

在本文中,我们通过提出 BERT 改进了基于微调的方法:来自 Transformer 的双向编码器表示。受完形填空任务的启发,BERT 通过提出一个新的预训练任务来解决前面提到的单向约束:“遮蔽语言模型”(MLM masked language model)(Tay-lor, 1953)。遮蔽语言模型从输入中随机遮蔽一些标记,目的是仅根据被遮蔽标记的上下文来预测它对应的原始词汇的 id。与从左到右的语言模型预训练不同,MLM 目标允许表示融合左右上下文,这允许我们预训练一个深层双向 Transformer。除了遮蔽语言模型之外,我们还提出了一个联合预训练文本对来进行“下一个句子预测”的任务。

本文的贡献如下:

....

参考文献

所有参考文献按论文各小节中引用顺序排列,多次引用会多次出现在下面的列表中。

Abstract 摘要中的参考文献

BERT 文中简写原始标论文标题其它
Peters et al., 2018Deep contextualized word representationsELMo
Radford et al., 2018Improving Language Understanding with Unsupervised LearningOpenAI GPT

1. Introduction 介绍中的参考文献

BERT 文中简写原始标论文标题其它
Peters et al., 2018Deep contextualized word representationsELMo
Radford et al., 2018Improving Language Understanding with Unsupervised LearningOpenAI GPT
Dai and Le, 2015Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087AndrewMDai and Quoc V Le. 2015
Howard and Ruder, 2018Universal Language Model Fine-tuning for Text ClassificationULMFiT;Jeremy Howard and Sebastian Ruder.
Bow-man et al., 2015A large annotated corpus for learning natural language inferenceSamuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning.
Williams et al., 2018A Broad-Coverage Challenge Corpus for Sentence Understanding through InferenceAdina Williams, Nikita Nangia, and Samuel R Bowman.
Dolan and Brockett, 2005Automatically constructing a corpus of sentential paraphrasesWilliam B Dolan and Chris Brockett. 2005.
Tjong Kim Sang and De Meulder, 2003Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity RecognitionErik F Tjong Kim Sang and Fien De Meulder. 2003.
Rajpurkar et al., 2016SQuAD: 100,000+ Questions for Machine Comprehension of TextSQuAD
Taylor, 1953"Cloze Procedure": A New Tool For Measuring ReadabilityWilson L Taylor. 1953.

2. Related Work 相关工作中的参考文献

BERT 文中简写原始标论文标题其它
Brown et al., 1992Class-based n-gram models of natural languagePeter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992.
Ando and Zhang, 2005A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled DataRie Kubota Ando and Tong Zhang. 2005.
Blitzer et al., 2006Domain adaptation with structural correspondence learningJohn Blitzer, Ryan McDonald, and Fernando Pereira.2006.
Collobert and Weston, 2008A Unified Architecture for Natural Language ProcessingRonan Collobert and Jason Weston. 2008.
Mikolov et al., 2013Distributed Representations of Words and Phrases and their CompositionalityCBOW Model;Skip-gram Model
Pennington et al., 2014GloVe: Global Vectors for Word RepresentationGloVe
Turian et al., 2010Word Representations: A Simple and General Method for Semi-Supervised LearningJoseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.
Kiros et al., 2015Skip-Thought VectorsSkip-Thought Vectors
Logeswaran and Lee, 2018An efficient framework for learning sentence representationsLajanugen Logeswaran and Honglak Lee. 2018.
Le and Mikolov, 2014Distributed Representations of Sentences and DocumentsQuoc Le and Tomas Mikolov. 2014.
Peters et al., 2017Semi-supervised sequence tagging with bidirectional language modelsMatthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017.
Peters et al., 2018Deep contextualized word representationsELMo
Rajpurkar et al., 2016SQuAD: 100,000+ Questions for Machine Comprehension of TextSQuAD
Socher et al., 2013Deeply Moving: Deep Learning for Sentiment AnalysisSST-2
Tjong Kim Sang and De Meulder, 2003Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity RecognitionErik F Tjong Kim Sang and Fien De Meulder. 2003.
Dai and Le, 2015Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087AndrewMDai and Quoc V Le. 2015
Howard and Ruder, 2018Universal Language Model Fine-tuning for Text ClassificationULMFiT;Jeremy Howard and Sebastian Ruder.
Radford et al., 2018Improving Language Understanding with Unsupervised LearningOpenAI GPT
Wang et al.(2018)GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language UnderstandingGLUE
Con-neau et al., 2017Supervised Learning of Universal Sentence Representations from Natural Language Inference DataAlexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017.
McCann et al., 2017Learned in Translation: Contextualized Word VectorsBryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
Deng et al.ImageNet: A large-scale hierarchical image databaseJ. Deng,W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. 2009.
Yosinski et al., 2014How transferable are features in deep neural networks?Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014.

3. BERT 中的参考文献

BERT 文中简写原始标论文标题其它
Vaswani et al. (2017)Attention Is All You NeedTransformer
Wu et al., 2016Google's Neural Machine Translation System: Bridging the Gap between Human and Machine TranslationWordPiece
Taylor, 1953"Cloze Procedure": A New Tool For Measuring ReadabilityWilson L Taylor. 1953.
Vincent et al., 2008Extracting and composing robust features with denoising autoencodersdenoising auto-encoders
Zhu et al., 2015Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading BooksBooksCorpus (800M words)
Chelba et al., 2013One Billion Word Benchmark for Measuring Progress in Statistical Language ModelingBillion Word Benchmark corpus
Hendrycks and Gimpel, 2016Gaussian Error Linear Units (GELUs)GELUs

4. Experiments 实验中的参考文献

BERT 文中简写原始标论文标题其它
Wang et al.(2018)GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language UnderstandingGLUE
Williams et al., 2018A Broad-Coverage Challenge Corpus for Sentence Understanding through InferenceMNLI
Chen et al., 2018First Quora Dataset Release: Question PairsQQP
Rajpurkar et al., 2016SQuAD: 100,000+ Questions for Machine Comprehension of TextQNLI
Socher et al., 2013Deeply Moving: Deep Learning for Sentiment AnalysisSST-2
Warstadt et al., 2018The Corpus of Linguistic AcceptabilityCoLA
Cer et al., 2017SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused EvaluationSTS-B
Dolan and Brockett, 2005Automatically constructing a corpus of sentential paraphrasesMRPC
Bentivogli et al., 2009The fifth pascal recognizing textual entailment challengeRTE
Levesque et al., 2011The winograd schema challenge. In Aaai spring symposium: Logical formalizations of commonsense reasoning, volume 46, page 47.WNLI
Rajpurkar et al., 2016SQuAD: 100,000+ Questions for Machine Comprehension of TextSQuAD
Joshi et al., 2017TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading ComprehensionTriviaQA
Clark et al., 2018Semi-Supervised Sequence Modeling with Cross-View Training
Zellers et al., 2018SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense InferenceSWAG

5. Ablation Studies 消融研究中的参考文献

BERT 文中简写原始标论文标题其它
Vaswani et al. (2017)Attention Is All You NeedTransformer
Al-Rfou et al., 2018Character-Level Language Modeling with Deeper Self-Attention