
SECOM_class_imbalance

Approaches for the class imbalance problem (in semiconductor manufacturing process line data)

Data Description

The SECOM dataset in the UCI Machine Learning Repository is semiconductor manufacturing data with 1567 records, 590 anonymized features, and 104 fails. The process yield has a simple pass/fail response (encoded -1/1).

The dataset has the following characteristics:

two-class problem
a class imbalance of roughly 14:1, passes to fails
large number of features -- 590
missing data
features/columns which do not have sufficient information
4% of the columns/features have more than 50% of their records missing
some columns have constant values
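
The missing-data and constant-column checks above can be sketched with pandas roughly as follows. This is a minimal illustration, not the analysis code itself: `profile_features`, the 0.5 threshold default, and the toy frame are hypothetical stand-ins for the real 590-feature table.

```python
from __future__ import division, print_function

import numpy as np
import pandas as pd


def profile_features(X, missing_threshold=0.5):
    """Return (mostly_missing, constant) column names for a feature frame.

    Hypothetical helper for illustration; the threshold mirrors the
    "more than 50% missing" criterion in the list above.
    """
    # Fraction of missing (NaN) records in each column.
    missing_frac = X.isnull().mean()
    mostly_missing = missing_frac[missing_frac > missing_threshold].index.tolist()

    # A column is treated as constant if it holds at most one
    # distinct non-missing value.
    n_unique = X.nunique(dropna=True)
    constant = n_unique[n_unique <= 1].index.tolist()
    return mostly_missing, constant


# Tiny made-up frame standing in for the SECOM features.
toy = pd.DataFrame({
    "f0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f1": [np.nan, np.nan, np.nan, np.nan, 6.0, 7.0],  # 4 of 6 missing
    "f2": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],              # constant
    "f3": [0.1, np.nan, 0.3, 0.4, 0.5, 0.6],           # 1 of 6 missing
})

mostly_missing, constant = profile_features(toy)
print("mostly missing:", mostly_missing)
print("constant:", constant)
```

Columns flagged by either check carry little information and are natural candidates to drop before any modelling.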

Objective

The SECOM dataset presents us with two problems: (i) working with skewed data and (ii) feature selection. The main focus of this analysis will be the class imbalance issue and the ability to successfully predict fails. Strategies used in fraud detection, anomaly detection, and rare disease diagnosis will be useful here. A secondary objective will be feature reduction. (In some of the literature pertaining to the SECOM dataset, this was the primary goal <a href="#ref1">[1]</a>.) A streamlined feature set can not only lead to better prediction accuracy and data understanding but also save manufacturing resources.
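
To see why predicting fails, rather than overall accuracy, is the right target, consider a trivial classifier that always predicts "pass". The sketch below uses only the counts from the data description (1567 records, 104 fails) and assumes that 1 marks a fail and -1 a pass, following the pass/fail, -1/1 ordering above. Such a classifier scores about 93% accuracy while detecting no fails at all.

```python
from __future__ import division, print_function

N_RECORDS = 1567             # from the data description
N_FAILS = 104                # from the data description
N_PASS = N_RECORDS - N_FAILS

FAIL, PASS = 1, -1           # assumed encoding: 1 = fail, -1 = pass

y_true = [FAIL] * N_FAILS + [PASS] * N_PASS
y_pred = [PASS] * N_RECORDS  # trivial baseline: always predict "pass"

# Confusion-matrix counts with "fail" as the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == FAIL and p == FAIL)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == FAIL and p == PASS)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == PASS and p == PASS)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == PASS and p == FAIL)

accuracy = (tp + tn) / N_RECORDS
fail_recall = tp / (tp + fn)  # share of true fails that were caught

print("pass:fail ratio = {:.1f}:1".format(N_PASS / N_FAILS))
print("accuracy        = {:.3f}".format(accuracy))
print("recall on fails = {:.3f}".format(fail_recall))
```

For this reason, metrics computed on the fail class (such as recall and precision) are more informative here than plain accuracy.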

Software

Python 2.7
scikit-learn packages for algorithms
pandas for data wrangling
Matplotlib and Seaborn for plotting and visualization

Methods

We will look at some of the approaches that deal with class imbalance. These are either cost-sensitive learning approaches or sampling-based approaches. We will also be working with feature selection methods. This is a list of the methods we use: