
Data Preprocessing

"Preparation is everything."~David Robinson

<b>It's commonly said that data scientists spend 80% of their time preprocessing, cleaning, and manipulating data, and only 20% of their time analyzing it. The time spent cleaning is vital, since analyzing dirty data can lead you to inaccurate conclusions: without properly preprocessed, cleaned data, the results of any data analysis or machine learning model can be misleading. In this repo, you will learn how to identify, diagnose, and treat a variety of data preprocessing and data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!</b>

Preprocessing is performed through these steps:

1. Split data into dependent and independent variables

2. Columns Processing

3. Deal with Categorical data

4. Data Cleaning

5. Feature Scaling [Normalization]

6. Additional features

---

1. Split Data Into Dependent And Independent Variables


For example, the height of a plant depends on the amount of water and the amount of fertilizer it receives: height is the dependent variable, while water and fertilizer are the independent variables. The same split applies to the customer data used below:


independent = df[ ['State', 'Profession', 'Age', 'Monthly_income'] ]
dependent = df[ ['Purshased_Any_Item'] ] 
# another way to split data
X = df.drop(columns=['Name','Purshased_Any_Item'])
Y = df['Purshased_Any_Item']
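To make the split concrete, here is a minimal, runnable sketch that builds a small hypothetical dataframe with the same column names (the values are invented for illustration) and performs the split:

import pandas as pd

# hypothetical toy data using the same column names as above
df = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Omar'],
    'State': ['Cairo', 'Giza', 'Alex'],
    'Profession': ['Engineer', 'Doctor', 'Teacher'],
    'Age': [25, 32, 41],
    'Monthly_income': [5000, 7000, 6500],
    'Purshased_Any_Item': [1, 0, 1],
})

# X holds the independent variables (features), Y the dependent variable (target)
X = df.drop(columns=['Name', 'Purshased_Any_Item'])
Y = df['Purshased_Any_Item']
print(X.shape, Y.shape)  # (3, 4) (3,)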

---

2. Columns Processing

# display number of rows and columns of the data set
df.shape

# display the number of non-null values and the data type of each column
df.info()
# from this step you can tell whether a column has many nulls relative to the total number of rows

# print count of nulls for each column and percentage of them
missing_data = pd.DataFrame({'total_missing': df.isnull().sum(), 'perc_missing': (df.isnull().mean())*100})
missing_data

# Statistical description of numerical variables
df.describe()
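As a possible next step (not part of the original snippet), the missing_data table above can be used to drop columns whose missing percentage is too high; the 50% threshold here is only an illustrative assumption:

# assumption: drop any column where more than 50% of the values are missing
threshold = 50
cols_to_drop = missing_data[missing_data['perc_missing'] > threshold].index
df = df.drop(columns=cols_to_drop)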

---

3. Deal With Categorical Data
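This section is left empty in the original; as a minimal sketch, one common approach is one-hot encoding with pandas, using the categorical column names assumed from the earlier example:

# one-hot encode the assumed categorical columns; drop_first avoids a redundant dummy column
X = pd.get_dummies(X, columns=['State', 'Profession'], drop_first=True)
print(X.head())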

---

4. Data Cleaning

Data cleaning means fixing bad data in your data set.<br> Bad data could be:

a- Null values

❱ Drop rows that contain empty cells

# does not affect the original dataframe (returns a new dataframe)
newdf = df.dropna()

# affects the original dataframe
df.dropna(inplace=True)

❱ Fill empty cells with values

# fill all empty cells in the dataframe with the value 130 (in the original dataframe)
df.fillna(130, inplace=True)

# fill all empty cells in a specific column with the value 130
df["Quantity"] = df["Quantity"].fillna(130)

# replacing using the mean (median and mode work the same way) in one column
column_mean = df['Quantity'].mean()  # make sure the Quantity column has a numeric data type
df["Quantity"] = df["Quantity"].fillna(column_mean)

# replacing using the mean in all numeric columns at once
columns_mean = df.mean(numeric_only=True)
df.fillna(columns_mean, inplace=True)
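Since the comments above also mention the median and the mode, here is a short sketch of those two options (column names follow the earlier examples):

# the median is more robust to outliers than the mean
df["Quantity"] = df["Quantity"].fillna(df["Quantity"].median())

# the mode (most frequent value) is the usual choice for categorical columns
df["State"] = df["State"].fillna(df["State"].mode()[0])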

❱ Check this Notebook

b- Wrong format

❱ Convert all cells in a column into the same format

import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

❱ Remove rows where the conversion failed (empty cells in the Date column)

df.dropna(subset=['Date'], inplace = True)
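One note worth adding (not in the original snippet): pd.to_datetime raises an error by default when it meets a value it cannot parse; passing errors='coerce' turns such values into NaT instead, so the dropna call above can then remove those rows:

# unparsable values become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df.dropna(subset=['Date'], inplace=True)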

c- Wrong data

❱ Replacing Values
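The original leaves this bullet without code; a minimal sketch, assuming a hypothetical rule that Age must lie between 18 and 90:

# replace a single wrong value by its row label (7 is just an example label)
df.loc[7, 'Age'] = 45

# or cap every out-of-range value instead of fixing rows one by one
df.loc[df['Age'] > 90, 'Age'] = 90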

❱ Remove rows
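Alternatively, rows holding values outside the valid range can simply be dropped (same hypothetical Age rule):

# keep only rows where Age is inside the assumed valid range
df = df[(df['Age'] >= 18) & (df['Age'] <= 90)]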

d- Duplicates
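No code is given here in the original; detecting and removing duplicate rows in pandas is one line each:

# boolean Series: True for every row that duplicates an earlier one
print(df.duplicated().sum())

# remove the duplicate rows from the original dataframe
df.drop_duplicates(inplace=True)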

e- Handling unwanted features
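Again no code in the original; a minimal sketch that removes columns carrying no useful information for the analysis (the column name is an assumption taken from the earlier example):

# identifier-like columns such as Name rarely help a model, so drop them
df = df.drop(columns=['Name'])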

---

5. Feature Scaling [Normalization]
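This section is also empty in the original; as a sketch, two common options are min-max normalization (values squeezed into [0, 1]) and standardization (zero mean, unit variance), shown here with scikit-learn on assumed numeric columns:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_cols = ['Age', 'Monthly_income']  # assumed numeric feature columns

# min-max normalization: (x - min) / (max - min)
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# or standardization: (x - mean) / std
# df[num_cols] = StandardScaler().fit_transform(df[num_cols])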

---

6. Additional features

---