Awesome
Binary Classification with Apache Spark / HDFS
<img src="img/logo.png" style="width: 5px;"/> ↖data source
The goal of the competition is to predict which parts will fail quality control
My goal is to utilize the hadoop ecosystem to handle a large dataset and establish a pipeline for machine learning
munge
:
- Aggregate columns using RDD transformations
- Create a column that indicates which of those column aggregations are outliers.
fit_predict
:
- Model data with Spark Machine Learning package
- Predict on test data
munge_fit_predict
:
- Run this as is to use the toy data set example