We have run our scripts on the GPU server 13.90.81.78. The data is present on the same server.
Raw data for cases and stock price changes can be found at the following path:
# Main Directory
/data/WorkData/firmEmbeddings/
# Case Data is inside the directory
CaseData/
# Stock Data is inside the directory
StockData/
The processed and joined data can be found at the following path:
# Main Directory
/data/WorkData/firmEmbeddings/Models/
# Random Forest for Stock Prediction Data
StockPredictionUsingRandomForest/
# Neural Network for Stock Prediction Data
StockPredictionUsingNeuralNetwork/Data_stock_all/
# Neural Network for Firm Embeddings Data
FirmEmbeddings/
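Before running anything, it can help to confirm that the directories listed above exist on the machine you are logged in to. The following is a minimal sketch, not part of the repository: the paths are copied from the listing above, while the names `EXPECTED_DIRS` and `missing_dirs` are made up for illustration.

```python
from pathlib import Path

# Directory layout copied from the listing above.
RAW_ROOT = Path("/data/WorkData/firmEmbeddings")
MODELS_ROOT = RAW_ROOT / "Models"

EXPECTED_DIRS = [
    RAW_ROOT / "CaseData",
    RAW_ROOT / "StockData",
    MODELS_ROOT / "StockPredictionUsingRandomForest",
    MODELS_ROOT / "StockPredictionUsingNeuralNetwork" / "Data_stock_all",
    MODELS_ROOT / "FirmEmbeddings",
]


def missing_dirs(dirs=EXPECTED_DIRS):
    """Return the expected directories that do not exist on this machine."""
    return [d for d in dirs if not d.is_dir()]


if __name__ == "__main__":
    # Prints nothing when every directory is present.
    for d in missing_dirs():
        print(f"missing: {d}")
```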
To install the packages needed to run all the scripts, execute the following commands:
chmod 755 requirements.sh
sh requirements.sh
Open the Python shell and run the following commands to download punkt:
python3
>>> import nltk
>>> nltk.download('punkt')
Scripts to process the raw case data:
These files are in the CaseData folder. The data generated by these scripts is used together with the stock data
to create the final data for training the models, and can be found in the /data/WorkData/firmEmbeddings/CaseData/
folder on the server. Run the files in the following order:
1. filterCases.ipynb - Filters cases from the sentences folder to keep only those in categories 6 and 7. It uses
bb2topic.pkl, bb2genis.pkl and caseid_date.csv, and generates a new folder Filtered_1 and the files
filtered.pkl and casedata.pkl. Filtered_1 contains all cases belonging to categories 6 and 7.
2. ngramdataGenerate.ipynb - Filters the bigram pickle files to keep only cases in categories 6 and 7. It uses
casedata.pkl and [20180208]build_vocab_lemma_pos/phrased/ and creates a new folder PickleFiles, which contains
all cases belonging to categories 6 and 7.
3. bigram.ipynb - Creates the final ngramdata.pkl. The code uses id2gram.pkl, casedata.pkl, df-tf.pkl
and the files from the PickleFiles folder to generate the data.
4. doc2vec.py - Runs the doc2vec algorithm on the text of the filtered cases in Filtered_1 and generates
doc2vec_2.model.
5. modeltodata.ipynb - Uses casedata.pkl and doc2vec_2.model. It maps the model vectors to the case
metadata and creates a visualization of the document vectors. The code produces the following files: docvector.pkl,
traindocvector.pkl, testdocvector.pkl, validationdocvector.pkl.
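The five steps above can also be driven from a single script so that the order is never mixed up. This is only a sketch under assumptions: it assumes the notebooks can run unattended through `jupyter nbconvert --execute` and that it is launched from inside the CaseData folder. The names `CASE_DATA_STEPS`, `command_for` and `run_all` are hypothetical and do not exist in the repository.

```python
import subprocess

# Step order copied from the numbered list above.
CASE_DATA_STEPS = [
    "filterCases.ipynb",
    "ngramdataGenerate.ipynb",
    "bigram.ipynb",
    "doc2vec.py",
    "modeltodata.ipynb",
]


def command_for(step):
    """Build the command line for one step (assumed invocation)."""
    if step.endswith(".ipynb"):
        # Execute the notebook in place, without opening the browser UI.
        return ["jupyter", "nbconvert", "--to", "notebook",
                "--execute", "--inplace", step]
    return ["python3", step]


def run_all(steps=CASE_DATA_STEPS, dry_run=True):
    """Run the steps in order; check=True stops at the first failure."""
    for step in steps:
        cmd = command_for(step)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_all(dry_run=True)
```

With `dry_run=True` nothing is executed, which makes it safe to check the order first and then switch to `run_all(dry_run=False)`.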
Script to process the raw stock data:
Run the script filterCompanies.py, located in the StockData folder, to process the stock data:
python3 filterCompanies.py
Scripts to join the two data sets:
These files are in the JoiningDataPrep folder.
1. StockAndCaseDataJoined - Joins the case and stock data. This script uses stockData07to13_logdiff_5_0.1.csv
and the following docvector files: traindocvector.pkl, testdocvector.pkl, validationdocvector.pkl.
It produces the following files: training_data_CaseCompanyStockChange.pkl,
testing_data_CaseCompanyStockChange.pkl, validation_data_CaseCompanyStockChange.pkl.
2. ProcessJoinedDataForNN.ipynb - Processes the data for the final run and creates val_data_final.pkl,
train_data_final.pkl, test_data_final.pkl.
3. Finaldata_stockPred.ipynb - Produces the final stock prediction data for all cases and for categories
6 and 7.
4. Finaldata_firmEmbed.ipynb - Produces the final firm embeddings data for all cases and for categories 6 and 7.
It uses Company_meta.pkl.
5. RankCompany.ipynb - Creates Company_meta_rank.pkl.
After running all these scripts, the data for all the models will have been copied to the respective
paths mentioned above.
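To illustrate what joining the case and stock data means, here is a toy inner join in plain Python. It is not the logic of the joining script: the field names (`case_id`, `company`, `docvector`, `stock_change`), the sample rows and the choice of `company` as the only join key are all assumptions made for the example, and the real join may also depend on dates.

```python
# Toy records with hypothetical field names; the real pickles may differ.
case_rows = [
    {"case_id": "c1", "company": "ACME", "docvector": [0.1, 0.2]},
    {"case_id": "c2", "company": "GLOBEX", "docvector": [0.3, 0.4]},
    {"case_id": "c3", "company": "UNLISTED", "docvector": [0.5, 0.6]},
]
stock_rows = [
    {"company": "ACME", "stock_change": 0.05},
    {"company": "GLOBEX", "stock_change": -0.02},
]


def join_case_and_stock(cases, stocks, key="company"):
    """Inner join: keep only the cases whose key has a stock record."""
    stock_by_key = {row[key]: row for row in stocks}
    joined = []
    for case in cases:
        stock = stock_by_key.get(case[key])
        if stock is not None:
            # Merge the two records into one training row.
            joined.append({**case, **stock})
    return joined


if __name__ == "__main__":
    for row in join_case_and_stock(case_rows, stock_rows):
        print(row["case_id"], row["company"], row["stock_change"])
```

Cases without a matching stock record (here `c3`) are dropped, which is the usual behavior of an inner join.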
Script to generate the models for stock prediction and firm embeddings:
# Change file permissions to run the script
chmod 755 RunAllmodels.sh
# Run the following command to execute the script
sh RunAllmodels.sh
This script runs three scripts. Their locations in the GitHub repository are:
1. RunRandomForest.py is present in the directory Random_Forest/
2. FirmEmbeddingsModel.py is present in the directory FirmEmbeddings/
3. NeuralNetworkRun_3layers.py is present in the directory StockPrediction/
The script RunRandomForest.py generates the Random Forest model and plots the
actual vs. predicted change in stock price.
NeuralNetworkRun_3layers.py saves its predictions on the test data in
predictions.txt, in the same path as the data. The notebook
StockPrediction/ScatterPlotPredictedvsActual.ipynb uses predictions.txt together with actual.txt
(located in the same path) to plot the actual vs. predicted
stock price change. The notebook refers to these files by absolute path,
so it can be run from anywhere on the GPU server.
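Before plotting, it may be worth checking that predictions.txt and actual.txt line up. The sketch below is an illustration, not code from the notebook: it assumes each file holds one numeric value per line, which should be verified against the real files, and the function names are invented for the example.

```python
from pathlib import Path


def read_values(path):
    """Read one float per non-empty line (assumed file layout)."""
    text = Path(path).read_text()
    return [float(line) for line in text.splitlines() if line.strip()]


def paired_values(pred_path, actual_path):
    """Return (actual, predicted) pairs, refusing mismatched lengths."""
    predicted = read_values(pred_path)
    actual = read_values(actual_path)
    if len(predicted) != len(actual):
        raise ValueError(
            f"{len(predicted)} predictions vs {len(actual)} actual values"
        )
    return list(zip(actual, predicted))


def mean_absolute_error(pairs):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in pairs) / len(pairs)
```

The resulting pairs can be passed directly to a scatter plot, with the actual values on one axis and the predicted values on the other.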
FirmEmbeddingsModel.py saves the firm embeddings matrix
in the same path as the data. This matrix is used by
FirmEmbeddings/VisualizeFirmsEmbeddings.ipynb to visualize the embeddings. This notebook contains
the t-SNE plots for category 6, category 7 and the combined cases. It also visualizes the embeddings
against the industries of the firms, the ranking of the firms and the states in which they are located. The
notebook also contains the cosine similarity plots for the two categories, Finance
and Manufacturing.
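For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their lengths. The sketch below shows the calculation in plain Python on made-up three-dimensional vectors. The firm names and values are hypothetical, and the notebook may well compute this with a numerical library instead.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); undefined for zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy embeddings for hypothetical firms, not taken from the real matrix.
embeddings = {
    "firm_a": [1.0, 0.0, 0.0],
    "firm_b": [2.0, 0.0, 0.0],
    "firm_c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    # Same direction gives 1.0, orthogonal vectors give 0.0.
    print(cosine_similarity(embeddings["firm_a"], embeddings["firm_b"]))
    print(cosine_similarity(embeddings["firm_a"], embeddings["firm_c"]))
```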