Research Documents Curation with NLP
Contributors: Bowen Chen, Ao Luo, Yihan Luo, Ho Kwan Henry Zhang
Description
An Applied Finance Project from UCLA Anderson that uses natural language processing techniques to classify and summarize quantitative finance research papers. Following the technique proposed by Meng's team, we implemented a research paper classification engine that uses a weakly supervised NLP technique to build a document classifier. The original paper can be found here
Required Packages
The following packages ship with the Anaconda distribution (glob and os are part of the Python standard library)
- numpy
- pandas
- seaborn
- glob
- os
The following packages are used to complete NLP-specific tasks
- gensim - used to train the word embedding
The following packages are used to retrieve and transform the research documents
- arxiv - used to perform batch downloads for research papers in pdf format from arxiv
- pdf2txt - used to transform pdf files to txt files
The main algorithm is implemented in Python 3. The exception is pdf2txt, which runs under Python 2
Roadmap
- Download 1000+ papers in pdf format from arxiv (Completed)
- Transform 1000 papers from pdf format to txt format (Completed)
- Train a word embedding using gensim (Completed)
- Generate a large set of pseudo documents (Completed)
Specifications
Downloading papers
We downloaded 2028 quantitative finance papers from arXiv using the API it provides. The batch download took around 2 hours.
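As a rough illustration of what a paged batch request looks like, the sketch below builds query URLs for the public arXiv API endpoint. It is not the project's download script, which used the arxiv package. The function name `build_query_urls`, the category `q-fin.PM`, and the page size of 100 are assumptions chosen for the example.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_query_urls(category="q-fin.PM", total=2028, page_size=100):
    """Return one query URL per page of results for a single arXiv category.

    The category and page size are hypothetical defaults for illustration.
    """
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": f"cat:{category}",
            "start": start,
            # the last page may be shorter than page_size
            "max_results": min(page_size, total - start),
        }
        urls.append(f"{ARXIV_API}?{urlencode(params)}")
    return urls
```

Each URL returns an Atom feed whose entries carry links to the pdf files. Requesting the pages one at a time with a pause in between keeps the load on the API low.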
Transform papers
We converted the 2028 papers to txt format using pdf2txt. 1007 papers were valid. The txt files, stored under a single directory, can then be read in as a list of strings. This step could only be completed in Python 2
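Reading the converted files back in could look roughly like the sketch below, which uses the glob and os packages listed earlier. The function name `load_documents` is hypothetical, and treating whitespace-only files as invalid is an assumption. The project's actual validity check is not described here.

```python
import glob
import os


def load_documents(directory):
    """Read every .txt file in `directory` into a list of strings, sorted by filename."""
    documents = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        # errors="ignore" skips bytes that the pdf conversion left undecodable
        with open(path, encoding="utf-8", errors="ignore") as handle:
            text = handle.read()
        if text.strip():  # assumption: a file with no text counts as invalid
            documents.append(text)
    return documents
```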
Training Word Embedding
We trained the word embedding with the gensim topic modeling package in Python 3. The embedding was trained on the full text of the 1007 documents, with 50 classes (chosen empirically). Before feeding the data into the training algorithm, we preprocessed it in 4 ways
- remove white spaces
- remove words with length less than 2
- remove punctuations
- remove numbers
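The four steps above can be sketched with the standard re module. This is a minimal illustration, not the project's own preprocessing code. The function name `preprocess` is hypothetical, and replacing every non-letter character with a space (which also drops non-ASCII letters) is an assumption made to keep the example short.

```python
import re


def preprocess(text):
    """Tokenize a document following the four cleaning steps listed above."""
    # replace punctuation and digits with spaces (assumption: ASCII letters only)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # split() with no arguments also collapses runs of white space
    tokens = text.split()
    # drop words with length less than 2
    return [tok for tok in tokens if len(tok) >= 2]
```

For example, `preprocess("The  S&P 500 rose 2.5%, a record!")` returns `["The", "rose", "record"]`. Stopwords such as "The" are kept and no stemming is applied.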
We did not remove stopwords, because this word embedding matrix will be used to generate pseudo documents, which will definitely require the presence of these words.
We also did not perform word stemming. Stemming does not work particularly well in finance, since most finance keywords are already derivative words (see equity, trading, fixed income etc.). Performing stemming would distort the original meaning of these words. Taking equity as an example, stemming it gives equal, which destroys the original meaning of our keyword.