Overview and Instructions for Result Replication
Competition page: World Bank Pover-T Tests: Predicting Poverty
Project Tree
```
world-bank-pover-t-tests-solution
├── Background and Submission Overview.md
├── data
│   └── get_data.sh
├── README.md
├── requirements.txt
├── src
│   ├── bayesian-opts-res
│   │   └── bayesian-opt-test-preds
│   ├── Data Processor Original Dataset.ipynb
│   ├── Full Bayesian Model Training and Predictions.ipynb
│   └── modules
│       ├── __init__.py
│       ├── training_models.py
│       ├── training_optimizers.py
│       └── training_utils.py
└── submission

6 directories, 10 files
```
Data
Raw dataset
Place the training data inside the `data/` directory of the project. This can also be done automatically (assuming you're in the root directory) by running:

```
$ cd data/
$ bash get_data.sh
```

The files below should be present inside the `data/` directory in order to proceed to the next step of generating the transformed dataset for training.
```
├── A_hhold_test.csv
├── A_hhold_train.csv
├── A_indiv_test.csv
├── A_indiv_train.csv
├── B_hhold_test.csv
├── B_hhold_train.csv
├── B_indiv_test.csv
├── B_indiv_train.csv
├── C_hhold_test.csv
├── C_hhold_train.csv
├── C_indiv_test.csv
└── C_indiv_train.csv
```
Transforming raw data for training
Assuming you're in the root directory, navigate inside the `src/` directory and open the `Data Processor Original Dataset.ipynb` notebook. The notebook applies the following transformations to the `hhold` and `indiv` datasets for each country.
Process to generate `indiv_cat` (a sketch follows this list):
- Take only the categorical features
- One-hot-encode the features
- Summarize the encoded features to represent a household using:
  - `mean`
  - `median`
  - `all`
  - `any`
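Below is a minimal sketch of this aggregation, assuming the raw `indiv` files carry a household identifier column named `id` (the notebook's actual code may differ):

```python
import pandas as pd

# Hypothetical sketch of the indiv_cat step; real column handling may differ.
indiv = pd.read_csv('data/A_indiv_train.csv')

# Keep only the categorical (object-typed) features and one-hot-encode them.
cat_cols = indiv.select_dtypes(include=['object']).columns
encoded = pd.get_dummies(indiv[cat_cols])

# Collapse individuals into one row per household using mean/median/all/any.
grouped = encoded.groupby(indiv['id'])
indiv_cat = pd.concat([grouped.mean().add_suffix('_mean'),
                       grouped.median().add_suffix('_median'),
                       grouped.all().add_suffix('_all'),
                       grouped.any().add_suffix('_any')], axis=1)
```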
Process to generate `hhold-transformed` (a sketch follows this list):
- Take numeric and categorical data
- For numeric features, transform the data using:
  - MinMax scaler: `mx_`
  - Standard scaler: `sc_`
- For categorical features, encode the data:
  - Use label encoding
  - Use the label-encoded data to perform one-hot encoding
  - Retain the label encoding
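A minimal sketch of the `hhold` transformation, assuming the household files use an `id` index and that object-typed columns are the categorical ones (the notebook's exact feature handling may differ). In practice the scalers and encoders would be fit on the training file and then applied to the matching test file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Hypothetical sketch; column selection and naming in the notebook may differ.
hhold = pd.read_csv('data/A_hhold_train.csv').set_index('id')

num_cols = hhold.select_dtypes(include=['number']).columns
cat_cols = hhold.select_dtypes(include=['object']).columns

# Numeric features: keep a MinMax-scaled (mx_) and a standardized (sc_) copy.
mx = pd.DataFrame(MinMaxScaler().fit_transform(hhold[num_cols]),
                  index=hhold.index, columns=['mx_' + c for c in num_cols])
sc = pd.DataFrame(StandardScaler().fit_transform(hhold[num_cols]),
                  index=hhold.index, columns=['sc_' + c for c in num_cols])

# Categorical features: label-encode, one-hot-encode from the labels,
# and retain the label-encoded columns alongside the dummies.
labels = hhold[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))
onehot = pd.get_dummies(labels.astype(str))

hhold_transformed = pd.concat([mx, sc, labels, onehot], axis=1)
```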
The above process generates these additional files inside the `data/` directory; these are used by the models.

```
├── A-hhold-transformed-test.csv
├── A-hhold-transformed-train.csv
├── B-hhold-transformed-test.csv
├── B-hhold-transformed-train.csv
├── C-hhold-transformed-test.csv
├── C-hhold-transformed-train.csv
├── indiv_cat_train.hdf
└── indiv_cat_test.hdf
```
Model
For each country, the model is a blend of meta-predictions from 20 variations of 5 base models. The following base models are used:
- Logistic Regression with L1 regularization
- Neural Network (3 hidden layers)
- Random Forest
- LightGBM
- XGBoost
Each variation is produced by performing Bayesian optimization over a base model's parameter ranges. The optimization maximizes the prediction score on an optimization fold, and that fold is allowed to vary randomly; this yields a more robust model mixture and prevents the overfitting that is likely when only a single optimization fold is used. A sketch of the tuning step follows.
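As an illustration, here is a hypothetical sketch of tuning one base model against an optimization fold with the `bayes_opt` package; the data is synthetic, the parameter ranges are made up, and the call style follows the bayesian-optimization 0.6.x API pinned in `requirements.txt` (newer releases differ):

```python
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one country's training data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_opt, y_train, y_opt = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

def rf_score(n_estimators, max_depth):
    """Score one random-forest configuration on the optimization fold."""
    model = RandomForestClassifier(n_estimators=int(n_estimators),
                                   max_depth=int(max_depth), random_state=0)
    model.fit(X_train, y_train)
    # BayesianOptimization maximizes, so return the negated log loss.
    return -log_loss(y_opt, model.predict_proba(X_opt)[:, 1])

bo = BayesianOptimization(rf_score,
                          {'n_estimators': (50, 500), 'max_depth': (3, 12)})
bo.maximize(init_points=5, n_iter=25)
print(bo.res['max'])  # best score and parameters found
```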
The top 20 meta-models with the highest optimization-fold scores are included in the blending model. The blending model is trained by minimizing the log loss of the out-of-fold (OOF) predictions against the actual values; the optimization variables are the weights each meta-model contributes to the final prediction (see the sketch below).
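A hypothetical sketch of that blending step, with random stand-ins for the OOF prediction matrix and labels; the weights are constrained to be non-negative and to sum to one:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

# Stand-ins: oof_preds is an (n_samples, n_models) matrix of OOF probabilities
# from the top meta-models, y_true the actual labels. Real values come from
# the training run.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=500)
oof_preds = np.clip(y_true[:, None] * 0.6 + rng.rand(500, 20) * 0.4,
                    1e-6, 1 - 1e-6)

def blend_loss(weights):
    """Log loss of the weighted average of the meta-model predictions."""
    return log_loss(y_true, oof_preds.dot(weights))

n_models = oof_preds.shape[1]
result = minimize(blend_loss,
                  x0=np.full(n_models, 1.0 / n_models),  # equal-weight start
                  bounds=[(0.0, 1.0)] * n_models,
                  constraints=[{'type': 'eq',
                                'fun': lambda w: w.sum() - 1.0}],
                  method='SLSQP')
blend_weights = result.x  # weight of each meta-model in the final prediction
```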
Dependencies
- Python version 2.7.12

This project depends on the following Python modules:
- Standard library:
  - os
  - datetime
  - glob
  - cPickle
  - time
  - warnings
  - hashlib
  - contextlib
- Packages:
  - numpy==1.14.0
  - pandas==0.20.2
  - joblib==0.11
  - bayesian-optimization==0.6.0
  - scikit-learn==0.19.0
  - xgboost==0.7
  - lightgbm==2.1.0
  - scipy==1.0.0
  - matplotlib==2.0.0
  - tqdm==4.11.2
Install the needed modules by running the command below in the project root directory:
```
$ pip install -r requirements.txt
```
Replicating Results
Assuming you're in the root directory, navigate inside the `src/` directory and open the `Full Bayesian Model Training and Predictions.ipynb` notebook. Run all cells; this will take a while to complete.

The submission file will be generated and stored in the `submission/` directory in the project root.

Logs from the model training can be accessed in the `output.logs` file.
Other details
Please check `Background and Submission Overview.md` for more details.