Awesome

Data_Preprocessing

"Preparation is everything."~David Robinson

It's commonly said that data scientists spend 80% of their time preprocessing, cleaning and manipulating data and only 20% of their time analyzing it.The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions. Data cleaning is an essential task in data science. Without properly preprocessed, cleaned data, the results of any data analysis or machine learning model could be inaccurate. In this repo, you will learn how to identify, diagnose, and treat a variety of data preprocessing & data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!

Preprocessing is performed through those steps :

1. Split data into dependent and independent variables

2. Columns Processing

3. Deal with Categorical data

4. Data Cleaning

5. Feature Scaling [Normalization]

6. Additional features

separator2

1. Split Data Into Dependent And Independent Variables

The hight of the plant depends on the amount of water & amount of fertilizer so :

dependent variable (does not depend on column which all values are unique) : the hight of the plant

independent varibales (Responsible for determining the dependent variable) : amount of water, amount of fertilizer

dependent variable : Purshased_Any_Item

independent variables : State, Profission, age, monthly_income

Name does not determine the dependent variable (as all values in this column are unique). So it is not an independent variables. So it should be removed

independent = df[ ['State', 'Profession', 'Age', 'Monthly_income'] ]
dependent = df[ ['Purshased_Any_Item'] ]

# another way to split data
X = df.drop(columns=['Name','Purshased_Any_Item'])
Y = df['Purshased_Any_Item']

separator2

2. Columns Processing

Drop unwanted columns :

Columns that does not affect the prediction (get them from the first step). 
newdf = df.drop(columns=['Name'])
# or df.drop(columns=['Name'], inplace=True)

Analysis Columns

# display number of rows and columns of the data set
df.shape

# display number of non null values for each column. 
df.info()
# from this step i can determine if the column has a lot of nulls or not comparing with number of original rows

# print count of nulls for each column and percentage of them
missing_data = pd.DataFrame({'total_missing': df.isnull().sum(), 'perc_missing': (df.isnull().mean())*100})
missing_data

# Statistical description of numerical variables
df.describe()

Deal with columns after get the info

If column has very large average of missing data Drop it.

If the average of missing data is small then fill the empty cells [Done in 3rd step 3.Data cleaning]

Check this Notebook

Some statistics on columns

Get the mean of column
 column_mean = df['col'].mean()
Get the mean of all columns in Datafram
columns_mean = df.means(axis=0)
columns_mean
Validate columns data types

What are Data Types?

Check those Notebooks

separator2

3. Deal With Categorical Data

What are Data Types?
Replace all nan with most frequant category (mode)
most_freq_items = df['col'].mode()
most_freq = df['col'].mode()[0]
df['col'].fillna(most_freq)
Check this Notebook

separator2

4. Data Cleaning

Data cleaning means fixing bad data in your data set. Bad data could be:

a. Null values
b. Wrong format
c. Wrong data
d. Duplicates
e. Handling unwanted features

a- Null values

If you want to assign some missing values as null ? assume that some cells have '?', 'UNDEFINED' i want to make them read as Null values
df = pd.read_csv('data.csv', na_values=['?','UNDEFINED'])
Deal with empty cells:

❱ Calculate total number of rows that has missing values then calc its mean. ❱ If the average of missing rows is very small so we can Drop those rows else we can Fill the empty cells with value ❱ Remove rows that contain empty cells. ❱ Replace all empty cells with values 
❱ Drop rows that contain empty cells
# not affect the original dataframe
newdf = df.dropna()

# affect the original dataframe
df.dropna(inplcae=True)
❱ Fill empty cells with values
# fill all empty cells in dataframe with value 130 (in the original dataframe)
df.fillna(130, inplace=True)

# fill all empty cells in specific column with value 130 (in the original dataframe)
df["Quantity"].fillna(130, inplace=True)

# Replacing using mean, median, mode in one column
column_mean = df['Quantity'].mean() # make sure that Quantity is column is int data type
df["Quantity"].fillna(column_mean, inplace=True)

# Replacing using mean, median, mode in all columns
columns_mean = df.mean(axis=0) # make sure that Quantity is column is int data type
df.fillna(columns_mean, inplace=True)
❱ Check this Notebook

b- Wrong fromat

Cells with data of wrong format can make it difficult, or even impossible, to analyze data. For example this record "20201226" in date column is wrong format data that should be 2020-12-26

Wayes to fix wrong format data :

❱ convert all cells in the columns into the same format ❱ remove the rows that has wrong format

❱ convert all cells in the columns into the same format
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
❱ Replace empty cells with values

The result from the converting the column to datatime gave us a NaT value, which can be handled as a NULL value, and we can remove the row by using the dropna() method.
df.dropna(subset=['Date'], inplace = True)

c- Wrong data

Wrong data does not have to be empty cells or wrong format, it can just be wrong,

like if someone registered "199" instead of "1.99".

If you have a data set for courses in the college. You have class duration is 2 or 3 hours. While you check the data set you find out that there is a classes have duration 30 hours!

Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.

Wayes to fix wrong format data: ❱ Replacing Values ❱ Removing Rows

❱ Replacing Values
One way to fix wrong values is to replace them with something suitable else.
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
For small data sets you might be able to replace the wrong data one by one, but not for big data sets. To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal values, and replace any values that are outside of the boundaries.
# set all values that are greater than 120 to 120 in Duration column
df.loc[df['Duration'] > 120,'Duration'] = 120 
# the same function using loop 
for x in df.index:
 if df.loc[x, "Duration"] > 120:
 df.loc[x, "Duration"] = 120
❱ Remove rows
Another way of handling wrong data is to remove the rows that contains wrong data.
This way you do not have to find out what to replace them with, and there is a good chance you do not need them to do your analyses.
for x in df.index:
 if df.loc[x, "Duration"] > 120:
 df.drop(x, inplace = True)

d- Duplicates

Discovering Duplicates

print(df.duplicated()) # Returns True for every row that is a duplicate, othwerwise False

Removing Duplicates
```
df.drop_duplicates(inplace = True)
```

e. Handling unwanted features

Words that starts with the symbol '@', e.g., @AnnaMedaris, are removed.
Hashtag symbols are removed, e.g., #depression is converted to depression.
df['name'] = df['name'].str.strip('@')
df['titles'] = df['titles'].str.strip('#')
# this function make the data type of column string
Delete any URL

separator2

5. Feature Scaling [Normalization]

Will be available soon..

separator2

6. Additional features

Every word is converted to lowercase

separator2

Awesome

Data_Preprocessing

Preprocessing is performed through those steps :

1. Split data into dependent and independent variables

2. Columns Processing

3. Deal with Categorical data

4. Data Cleaning

5. Feature Scaling [Normalization]

6. Additional features

1. Split Data Into Dependent And Independent Variables

The hight of the plant depends on the amount of water & amount of fertilizer so :

2. Columns Processing

Drop unwanted columns :

Columns that does not affect the prediction (get them from the first step). <br>

Analysis Columns

Deal with columns after get the info

If column has very large average of missing data Drop it.

If the average of missing data is small then fill the empty cells [Done in 3rd step 3.Data cleaning]

Check this Notebook

Some statistics on columns

Get the mean of column

Get the mean of all columns in Datafram

Validate columns data types

What are Data Types?

Check those Notebooks

3. Deal With Categorical Data

What are Data Types?

Replace all nan with most frequant category (mode)

Check this Notebook

4. Data Cleaning

a. Null values

b. Wrong format

c. Wrong data

d. Duplicates

e. Handling unwanted features

a- Null values

❱ Drop rows that contain empty cells

❱ Fill empty cells with values

❱ Check this Notebook

b- Wrong fromat

❱ convert all cells in the columns into the same format

❱ Replace empty cells with values

c- Wrong data

❱ Replacing Values

❱ Remove rows

d- Duplicates

e. Handling unwanted features

5. Feature Scaling [Normalization]

Will be available soon..

6. Additional features

Every word is converted to lowercase