<img src='https://s3.amazonaws.com/drivendata-public-assets/logo-white-blue.png' width='600'> <br><br>

Sustainable Industry: Rinse Over Run

Goal of the Competition

Efficient cleaning of production equipment is vital in the Food & Beverage industry. Strict industry cleaning standards apply to the presence of particles, bacteria, allergens, and other potentially dangerous materials. At the same time, the execution of cleaning processes requires substantial resources in the form of time and cleaning supplies (e.g. water, caustic soda, and acid).

Given these concerns, the cleaning stations inspect the turbidity—product traces or suspended solids in the effluent—remaining during the final rinse. In this way, turbidity serves as an important indicator of the efficiency of a cleaning process. Depending on the expected level of turbidity, the cleaning station operator can either extend the final rinse (to eliminate remaining turbidity) or shorten it (saving time and water consumption).

The goal of this competition was to predict turbidity in the last rinsing phase in order to help minimize the use of water, energy and time, while ensuring high cleaning standards.

What's in this Repository

This repository contains code from winning competitors in the Sustainable Industry: Rinse Over Run DrivenData challenge. Code for all winning solutions is open source under the MIT License.

Winning code for other DrivenData competitions is available in the competition-winners repository.

Winning Submissions

| Place | Team or User | Public Score | Private Score | Summary of Model |
| --- | --- | --- | --- | --- |
| 1 | Fatima Yamaha | 0.260717 | 0.265770 | We extracted aggregates and descriptive statistics from the time series of each process. These features were fed to roughly 25 scikit-learn regressors to create out-of-sample predictions for the training and test sets. These out-of-sample predictions were then fed, along with the original features, to three gradient boosting algorithms: CatBoost, eXtreme Gradient Boosting (XGB), and Light Gradient Boosting Machines (LGBM). Finally, the predictions generated by the gradient boosters were averaged and used as the submission. The final percentage of improvement was gained by averaging previous submissions. |
| 2 | Contiamo | 0.276745 | 0.271246 | We truncated phases in the training data in order to match the phase distribution of the test set, and we split the training data into a training set and a validation set. We transformed time series data into features, so that the resulting structured data could be fed to standard machine-learning algorithms. We decided to model the log of the target variable, as it was better distributed. We then tried a variety of algorithms, settling on Gradient Boosted Trees based on its good initial performance. We tried to fine-tune the model's hyper-parameters, but there was little improvement. |
| 3 | arindam43 | 0.271618 | 0.274366 | I built four separate models to reflect the ends of the four different stages at which we were asked to make predictions: pre rinse, caustic, intermediate rinse, and acid. Each feature was generated at a phase level or lower and “unmelted” to produce the final features at a process level. Mid-cleaning predictions were simulated by excluding the appropriate predictors from each model. I chose pure LightGBM so that the relationships between the predictors and the response would be easy to understand. I used Shapley Additive Explanations (SHAP). The summary plot provided an easily interpretable view of variable importance, and dependence plots clearly illustrated the effect of individual predictors (or pairs of predictors if interactions were highlighted) on each prediction. |
| 4 | Vinayak | 0.275821 | 0.274659 | I used features based on summary statistics on the provided features/phase, global features (without considering the phase), final rinse features, two-way interaction features, and features created using the response variable. I used a single LightGBM model with the above-mentioned features because it is easy and fast to train, requires little preprocessing, and handles null values. This was followed by quantile regression. |
| 5 | riccardonizzolo | 0.278003 | 0.274708 | I trained a two-layer stacked model of LightGBM. The model was trained on the maximum available granularity (i.e. time series level) and then aggregated with a weighted median to obtain predictions with process granularity. |
| 6 | turbidity | 0.285579 | 0.276411 | First I divided the problem into several subproblems: I trained a separate model for each expected recipe mask (1111, 1001, 1100) and for each number of available phases (1, 2, 3, 4). I used a linear combination of Keras and LightGBM models. I used multiple aggregate features (like length of each phase, sum/mean turbidity in each phase, mean tank levels, mean/max pressure, etc.). I used target mean features to scale the target values. To compute the target mean I used a weighted median and segmentation of the data by expected recipe mask and object id. For local validation I used a time-based split. The final submission is a blend of two solutions generated from two different random seeds. |
| 7 | neurocomputing | 0.271899 | 0.276467 | I built a separate simple gradient boosting model for each of the 4 cases: process_id with phases = 1, process_id with phases <=2, process_id with phases <=3, process_id with phases <=4. To predict the final_rinse_total_turbidity_liter in the test data, the maximum number of phases is determined for each process_id in the test data, and the prediction is made with the corresponding phase model. |
| 8 | guillaume.thomas | 0.272459 | 0.277252 | For each process and for each process step, consider a subprocess with data only up to this step for the training. Associate a weight according to the test distribution (0.1 for a process ending at pre_rinse, 0.3 otherwise). Generate simple time-domain aggregations for each subprocess. Focus on ensemble methods (especially LightGBM). Reduce the weights of high targets, which penalized the evaluation metric. |
| 9 | mlearn | 0.273551 | 0.282050 | I followed feature extraction by fitting a conventional supervised learning model. The idea was to maximise the chance of interpretability by looking for relationships to simple features. I used data augmentation to increase the size of the training set. Each base process led to four training processes with truncation to each phase. Weights were used to match the competition setup. I used weights both to handle the data augmentation and to convert the MAPE target into an MAE problem. I fit a large gradient-boosted model with LightGBM. |
| 10 | Natalia.Pavlovskaia | 0.283697 | 0.284268 | I use an LSTM to process the time series data. LSTM is known as a good feature extractor for sequential data, and it can be applied to sequences with different lengths. I take the output of the LSTM together with the process metadata and feed them into a small neural network with 4 dense layers. The last layer's output is a single number because this is a regression task. I directly optimize the competition metric. To prevent overfitting, I augment the initial data to enlarge the training dataset. Finally, to get my best result I use test-time augmentation. Because the LSTM can process time series with different lengths, this approach allows us to have a single model instead of a separate model for each process, and we don't need to create a SARIMA-like model to make a prediction for each process separately. |
| 11 | prabod_cse | 0.282626 | 0.288003 | I employed a gradient boosting tree model (LightGBM) with a 10-fold stacking strategy. The model was trained on aggregated statistics of the last n records of a process (n = 10, 50, 200, 500). Phase-based statistics and time-based statistics were also used. |
| 12 | The BI sharps | 0.289039 | 0.288530 | Our approach starts by taking every phase of every process and aggregating the data on metrics such as max, mean, and min. We engineer some features to interpret the boolean features and keep track of the phase using dummies. We gave each phase a record, as this allowed us to deal with processes with different numbers of phases in the same way. This data is sent along 2 separate flows. Flow 1: we group the data by object_id, remove outliers based on the target variable, and create an XGBoost regressor for each object_id. This gives us up to 4 predictions for each process_id (one for each phase in the process), and we take the minimum of these as our prediction from Flow 1 (as this performed best for the MAPE metric). Flow 2: we cluster similar object_ids based on the median, skew, and kurtosis of the target variable. This gives us much larger training sets, as small sets were an issue for Flow 1. We then group the data by cluster and remove outliers based on the target variable. Then we create an XGBoost regressor for each cluster. This again gives us up to 4 predictions for each process_id (one for each phase in the process), and we take the minimum of these as our prediction from Flow 2. We then take a blend of the predictions from the two flows as our prediction. |
| 13 | AliGebily | 0.288105 | 0.291349 | We predict the turbidity of a process using multiple XGBoost models, where each model predicts turbidity based on data from a single phase or a combination of phases for that process. This approach builds models optimized for the structure of the test data and the evaluation metric. |
| 14 | RnD_team | 0.285486 | 0.294241 | The approach consisted of summarizing the measurements of each process by stage, dividing the recorded processes into groups according to the phases provided in the test set, and training one random forest for each group. Each process was divided into 5 different phases, and measurements of the cleaning status were recorded every 2 seconds. We summarized each measurement by phase, obtaining 1 or 2 statistics per measurement. In addition, we calculated the average of the target by object, excluding outliers. Since different phases had been recorded for each process, we split the test data into 6 groups according to the observed phases and created a training set for each group. We trained and tuned the 6 random forests independently, one per group; most of the variables kept in each forest were similar. For each process in the test set we used the random forest that corresponded to the observed phases of the process. |
| 15 | adilism | 0.280523 | 0.294761 | The solution consists of four GBM models, one for each of the four phases. The approach was the following: Generate train sets for each of the phases using the original train set. Generate statistical features (such as min, median, max, standard deviation) for each time series and features derived from the metadata. Estimate a LightGBM model with five-fold cross-validation. Bag the predictions using different seeds to create the folds. |

Additional solution details can be found at reports/Winner Documentation.[pdf,md,txt,docx] inside the directory for each submission.
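
Several of the summaries above share a pattern: aggregate each process's time series per phase, pivot the result to one row per process, truncate to the first few phases to mimic mid-cleaning predictions, and weight samples so that an absolute-error loss behaves like MAPE. The sketch below illustrates that pattern only; it is not taken from any winning submission. The phase names come from the summaries, but the column names (`process_id`, `phase`, `turbidity`), the chosen statistics, and both helper functions are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Phase order as described in the summaries; the column names used below
# are hypothetical and chosen for illustration only.
PHASES = ["pre_rinse", "caustic", "intermediate_rinse", "acid"]


def phase_features(ts: pd.DataFrame, n_phases: int) -> pd.DataFrame:
    """Aggregate each process per phase, keeping only the first
    `n_phases` phases, and pivot to one row per process."""
    kept = ts[ts["phase"].isin(PHASES[:n_phases])]
    agg = (
        kept.groupby(["process_id", "phase"])["turbidity"]
        .agg(["mean", "max", "count"])
        .unstack("phase")
    )
    # Flatten the (statistic, phase) column MultiIndex into plain names.
    agg.columns = [f"{phase}_{stat}" for stat, phase in agg.columns]
    return agg


def mape_weights(y: np.ndarray) -> np.ndarray:
    """Per-sample weights w_i = 1 / |y_i|. The mean of w_i * |y_i - pred_i|
    is then exactly the mean absolute percentage error."""
    return 1.0 / np.abs(y)


# Tiny made-up time series: two processes, two phases each.
ts = pd.DataFrame({
    "process_id": [1, 1, 1, 1, 2, 2],
    "phase": ["pre_rinse", "pre_rinse", "caustic", "caustic",
              "pre_rinse", "caustic"],
    "turbidity": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
})

full = phase_features(ts, n_phases=2)       # both phases observed
truncated = phase_features(ts, n_phases=1)  # as if cleaning stopped early

# Made-up targets and predictions to check the weighting identity.
y = np.array([100.0, 200.0])
pred = np.array([110.0, 180.0])
w = mape_weights(y)
weighted_abs_error = np.mean(w * np.abs(y - pred))
mape = np.mean(np.abs(y - pred) / np.abs(y))
```

Calling `phase_features` once per truncation depth yields several training rows per base process, which is one way to read the data augmentation described by some competitors. The weights from `mape_weights` could be passed as sample weights to a regressor trained with an absolute-error objective, assuming no target is zero.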

Report winners

This competition included a second stage in which the top 15 competition winners were invited to submit presentations of their work. The top three reports were:

| Place | Team or User | Place in Prediction Competition |
| --- | --- | --- |
| 1st | Contiamo | 2nd |
| 2nd | arindam43 | 3rd |
| 3rd | riccardonizzolo | 5th |

These and other submitted reports can be found at reports/Stage 2 Report.pdf inside the directory for each submission.

"Meet the Winners" Blog Post

Benchmark Blog Post: "Benchmark - Sustainable Industry: Rinse Over Run"