Awesome
<img src='https://s3.amazonaws.com/drivendata-public-assets/logo-white-blue.png' width='600'> <br><br>
Sustainable Industry: Rinse Over Run
Goal of the Competition
Efficient cleaning of production equipment is vital in the Food & Beverage industry. Strict industry cleaning standards apply to the presence of particles, bacteria, allergens, and other potentially dangerous materials. At the same time, the execution of cleaning processes requires substantial resources in the form of time and cleaning supplies (e.g. water, caustic soda, acid, etc.).
Given these concerns, the cleaning stations inspect the turbidity—product traces or suspended solids in the effluent—remaining during the final rinse. In this way, turbidity serves as an important indicator of the efficiency of a cleaning process. Depending on the expected level of turbidity, the cleaning station operator can either extend the final rinse (to eliminate remaining turbidity) or shorten it (saving time and water consumption).
The goal of this competition was to predict turbidity in the last rinsing phase in order to help minimize the use of water, energy and time, while ensuring high cleaning standards.
What's in this Repository
This repository contains code from winning competitors in the Sustainable Industry: Rinse Over Run DrivenData challenge. Code for all winning solutions are open source under the MIT License.
Winning code for other DrivenData competitions is available in the competition-winners repository.
Winning Submissions
Place | Team or User | Public Score | Private Score | Summary of Model |
---|---|---|---|---|
1 | Fatima Yamaha | 0.260717 | 0.265770 | We extracted aggregates and descriptive statistics from the timeseries of each process. These features were fed to roughly 25 sci-kit learn regressors to create out-of-sample predictions for training and test set. These out-of-sample predictions were then fed, along with the original features to three gradient boosting algorithms: CatBoost, eXtreme Gradient Boosting (XGB) and Light Gradient Boosting Machines (LGBM). Finally, the average from the predictions generated by the gradient boosters was calculated and used as submission. The final percentage was gained by taking the average of previous submissions. |
2 | Contiamo | 0.276745 | 0.271246 | We truncated phases in the training data in order to match the phase distribution of the test set, and we split the training data into a training set and a validation set. We transformed time series data into features, so that the resulting structured data could be fed to standard machine-learning algorithms. We decided to model the log of the target variable, as it was better distributed. We then tried a variety of algorithms, settling on Gradient Boosted Trees based on its good initial performance. We tried to fine-tune the model's hyper-parameters, but there was little improvement. |
3 | arindam43 | 0.271618 | 0.274366 | I built four separate models to reflect the ends of the four different stages at which we were asked to make predictions: pre rinse, caustic, intermediate rinse, and acid. Each feature was generated at a phase-level or lower and “unmelted” to produce the final features at a process-level. Mid-cleaning predictions were simulated by excluding the appropriate predictors from each model. I chose pure LightGBM so that the relationships between the predictors and the response would be easy to understand. I used Shapley Additive Explanations (SHAP). The summary plot provided an easily interpretable view of variable importance, and dependence plots clearly illustrated the effect of individual predictors (or pairs of predictors if interactions were highlighted) on each prediction. |
4 | Vinayak | 0.275821 | 0.274659 | I used features based on summary statistics on the provided features/phase, global features (without considering the phase), final rinse features, two-way interaction features, and features created using the response variable. I used a single LightGBM model with the above mentioned features because it is easy and fast to train, requires little preprocessing, and handles null values. This was followed by quantile regression. |
5 | riccardonizzolo | 0.278003 | 0.274708 | I trained a two-layer stacked model of LightGBM. The model was trained on the maximum available granularity (i.e. timeseries level) and then aggregated with a weighted median to obtain predictions with process granularity. |
6 | turbidity | 0.285579 | 0.276411 | First I divided the problem into several problems: I trained a separate model for each expected recipe mask (1111, 1001, 1100) and for each number of available phases (1, 2, 3, 4). I used a linear combination of Keras and LightGBM models. I used multiple aggregate features (like length of each phase, sum/mean turbidity in each phase, mean tank levels, mean/max pressure, etc.). I used target mean features that were used for scaling the target values. For computing target mean I've used weighted median and segmentation of data by expected recipe mask and object id. For local validation I used a time-based split. The final submission is a blend of two solutions generated from two different random seeds. |
7 | neurocomputing | 0.271899 | 0.276467 | I build a separate simple gradient boost model for each of the each of the 4 cases: process_id with phases = 1, process_id with phases <=2, process_id with phases <=3, process_id with phases <=4. To predict the final_rinse_total_turbidity_liter in the test data, for process_id in the test data the maximum number of phases are determined and the prediction was made with the corresponding phase-model. |
8 | guillaume.thomas | 0.272459 | 0.277252 | For each process and for each process step, consider a subprocess with data only up to this step for the training. Associate a weight accordingly to the test distribution (0.1 for process ending at pre_rinse, 0.3 otherwise). Generate for each subprocess simple time-domain aggregations. Focus on ensemble methods (especially LightGBM). Reduce weights of high targets which penalized the evaluation metrics. |
9 | mlearn | 0.273551 | 0.282050 | I followed feature extraction by fitting a conventional supervised learning model. The idea was to maximise the chance of interpretability by looking for relationships to simple features. I used data augmentation to increase the size of the training set. Each base process led to four training processes with truncation to each phase. Weights were used to match the competition setup. I used weights to both handle the data augmentation and also to convert the MAPE target into a an MAE problem. I fit a large gradient-boosted model with LightGBM. |
10 | Natalia.Pavlovskaia | 0.283697 | 0.284268 | I use LSTM for processing the data of the time series. LSTM is known as a good feature extractor for sequential data, and it can be applied to the sequences with different lengths. I take the output of LSTM together with the process' metadata and send them in a small neural network with 4 dense layers. The last layer's output is a single number because we have a regression task here. I optimize directly the competition metric. In order to prevent overfitting I augment the initial data to enlarge the training dataset. Finally, to get my best result I use test time augmentation. This approach allows us to have a single model instead of a separate model for each process. Because we can process time series with different lengths. And we don't create any SARIMA-like model to make a predict for each process separately. |
11 | prabod_cse | 0.282626 | 0.288003 | I employed a gradient boosting tree model (LightGBM) with a 10-fold stacking startegy. The model was trained on aggregated statistics of the last n records of a process (n =10, 50, 200, 500). Also, phase-based statistics and time-based statistics were used. |
12 | The BI sharps | 0.289039 | 0.288530 | Our approach starts by taking every phase of every process and aggregating the data on metrics such as Max, Mean, Min etc. We engineer some features to interpret the boolean features and keep track of phase using dummies. We gave each phase a record as this allowed us to deal with processes with different numbers of phases in the same way. This data is sent along 2 separate flows. Flow 1: group the data by object_id, remove outliers based on the target variable, and create an XGboost Regressor for each object_id. This gives us a up to 4 predictions for each process_id (one for each phase in the process) and we take the minimum of these as our prediction from Flow 1 (as this performed best for the MAPE metric). Flow 2: We cluster similar object_ids based on the median, skew, and kurtosis of the target variable. This gives us much larger training sets as small sets were an issue for Flow 1. We then group the data by cluster & remove outliers based on the target variable. Then we create an XGboost Regressor for each cluster this again gives us a up to 4 predictions for each process_id (one for each phase in the process) and we take the minimum of these as our prediction from Flow 2. We then take a blend of the predictions from the two flows as our prediction. |
13 | AliGebily | 0.288105 | 0.291349 | We predict turbidity of a process using multiple xgboost models, where each model predicts turbidity based on data of a single phase or combinations of phases for that process. This approach is building optimized models based on testing data structure and evaluation metric. |
14 | RnD_team | 0.285486 | 0.294241 | The approach used consisted in summarizing the measurements of each process by stage, dividing each process recorded into different groups according to the phases provided in the test set and training one random forest for each group. Each process was divided in 5 different phases and it recorded different measurements of the cleaning status every 2 seconds. We summarize each of the measurement by phase, obtaining 1 or 2 statistics of each measure. Additionally to this,we calculated the average of the target by object excluding outliers. Since each process had recorded different phases we subset the test data according of the observed phases in 6 different groups and we created 6 different training set for each group. For each group we trained and tune independently 6 random forests, most of the variables kept in each forest were similar among each other. For each process in the test we used the random forest that corresponded to the observed phases of the process. |
15 | adilism | 0.280523 | 0.294761 | The solution consists of four GBM models for each of the four phases. The approach was the following: Generate train sets for each of the phases using the original train set. Generate statistical features (such as min, median, max, standard deviation) for each time series and features derived from the metadata. Estimate a LightGBM] model with five-fold cross-validation. Bag the predictions using different seeds to create the folds. |
Additional solution details can be found at reports/Winner Documentation.[pdf,md,txt,docx]
inside the directory for each submission.
Report winners
This competition included a second stage in which the top 15 competition winners were invited to submit presentations of their work. The top three reports were:
Place | Team or User | Place in Prediction Competition |
---|---|---|
1st | Contiamo | 2nd |
2nd | arindam43 | 3rd |
3rd | riccardonizzolo | 5th |
These and other submitted reports can be found at reports/Stage 2 Report.pdf
inside the directory for each submission.
Benchmark Blog Post: "Benchmark - Sustainable Industry: Rinse Over Run"