6 Must-Learn Data Pre-Processing Algorithms in Machine Learning and Artificial Intelligence (With Google Colab Notebooks)

In this post, we discuss data preprocessing techniques used in Machine Learning and Deep Learning, along with their implementation in Python.

Table of Contents

  1. What is data pre-processing?
  2. Why do we need data pre-processing in Machine Learning and Artificial Intelligence?
  3. Data Preprocessing Techniques Based on Data Type
  4. Impact of Data Pre-processing on Accuracy of Model
  5. Google Colab Notebook and Github Codes

What is Data Pre-processing?

Data Pre-Processing is an important and necessary step in any Machine Learning or Deep Learning project. It is about removing unwanted information and transforming the data so that insights can be extracted from it. Data preprocessing covers the following operations, which are performed on the data before building a Machine Learning or Deep Learning model:
  1. Handling Missing Values in the Dataset
  2. Standardizing Data (a statistical approach)
  3. Data Reduction
  4. Data Integration
  5. Data Scaling

Why do we need data pre-processing?

We can't imagine a machine learning project without a data preprocessing step. According to techjury, 2.5 quintillion bytes (about 2.5 million terabytes) of data are generated every day by humans around the world. Most of this data is unstructured and raw, and Machine Learning algorithms are unable to learn much from raw, unprocessed data.

Data Pre-processing Techniques in Machine Learning

In Machine Learning, data mostly comes in one of three types:

  1. Numerical 
  2. Categorical
  3. Mixed data ( Categorical + Numerical)

Types of Numerical Data Pre-processing Techniques
  1. Data Standardization
  2. Data Normalization
  3. Binarizing data
  4. Handling Improper values in data
  5. Robust Scaling

Types of Categorical Data Pre-processing
  1. Label Encoding
  2. One-Hot Encoding

Requirements for Code

  1. Sample Dataset Download
  2. Google Colab Notebook
  3. Sklearn Library 
  4. Python Programming Language

Data Standardization in Python using sklearn

Data Standardization is a statistical approach widely used for preprocessing in Machine Learning and Deep Learning projects. It rescales each feature so that it has a mean of 0 and a standard deviation of 1.

                            Output = (Input - Mean) / Standard Deviation

Data Standardization technique in Python
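As a minimal sketch of the formula above, we can use sklearn's `StandardScaler`. The small array `X` is made-up sample data (an assumption for illustration, not the post's dataset); in practice you would load your own features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data: two numeric features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

# StandardScaler subtracts each column's mean and divides by its standard deviation.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std)
print("column means:", X_std.mean(axis=0))  # close to 0 for every column
print("column stds:", X_std.std(axis=0))    # close to 1 for every column
```

After the transform, both columns are on the same scale, even though the raw values differed by two orders of magnitude.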

Data Normalization

Data Normalization is a technique that rescales data into a fixed range, typically [0, 1] or [-1, 1]. For the [0, 1] range, the formula is:

                   Output = (Input - Minimum Value) / (Maximum Value - Minimum Value)

Data Normalization Technique in Python
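A short sketch of normalization with sklearn's `MinMaxScaler` could look like the following. The array `X` is made-up sample data (an assumption), and the `feature_range=(-1, 1)` call shows one way to get the second range mentioned above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data: two numeric features on different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

# Default feature_range is (0, 1): each column's minimum maps to 0, its maximum to 1.
X_01 = MinMaxScaler().fit_transform(X)

# The same scaler can target (-1, 1) instead.
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_01)
print(X_pm1)
```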

Binarizing Data

All data values are replaced with either 0 or 1 based on a threshold value. For example, if the threshold is set to 0.3, values less than or equal to 0.3 are replaced with 0 and values greater than 0.3 are replaced with 1.

Binarizing data technique in Python
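Here is a minimal sketch using sklearn's `Binarizer` with the 0.3 threshold from the example. The input values are made up for illustration. Note that in sklearn a value exactly equal to the threshold becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical sample data with values around the 0.3 threshold.
X = np.array([[0.1, 0.3, 0.5],
              [0.9, 0.2, 0.31]])

# Values strictly greater than the threshold become 1, everything else becomes 0.
binarizer = Binarizer(threshold=0.3)
X_bin = binarizer.fit_transform(X)

print(X_bin)
```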

Handling Missing Values

Some datasets contain null (missing) values. We have to either remove those values or replace them with a reasonable substitute. If we remove the entire row, we may lose important information, so it is often better to replace the null value with a plausible one, such as the mean of its column.

Replacing Null values with mean of the column
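One way to sketch mean replacement is with sklearn's `SimpleImputer`. The array below is made-up sample data with two missing entries (an assumption for illustration); `strategy="mean"` fills each gap with the mean of the non-missing values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical sample data with missing values marked as np.nan.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 30.0]])

# Replace each np.nan with the mean of its column (computed from non-missing values).
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Other strategies such as `"median"` or `"most_frequent"` are available on the same class if the mean is not a good fit for your data.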

Categorical Data

Categorical data contains text labels rather than numbers, but it can still carry a lot of information for a model to learn from. There are two sub-types:
  1. Binary Labels
  2. Multi-Class Labels

Categorical Data Pre-Processing ( Binary Labels)

In binary label encoding, one class is labeled 1 and the other is labeled 0. Python implementation of binary label encoding:
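A minimal sketch with sklearn's `LabelEncoder` is shown here. The `labels` list is made-up sample data (an assumption). `LabelEncoder` sorts the classes, so which class gets 0 and which gets 1 depends on alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column with two text labels.
labels = ["yes", "no", "no", "yes", "no"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted labels; a label's position is its integer code.
print(list(encoder.classes_))
print(encoded)
```

The mapping can be reversed with `encoder.inverse_transform(encoded)` when you need the original text back.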

Categorical Data Pre-Processing ( Multi-Class Labels)

In multi-class label encoding, each distinct category is assigned an integer, starting from 0 and ending at the number of distinct categories minus 1.
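The sketch below applies `LabelEncoder` to a column with three categories, and also shows `OneHotEncoder`, the other categorical technique listed earlier in this post. The `colors` array is made-up sample data (an assumption).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical multi-class column with three distinct categories.
colors = np.array(["red", "green", "blue", "green", "red"])

# Label encoding: one integer per category, assigned in sorted order.
label_encoder = LabelEncoder()
color_ids = label_encoder.fit_transform(colors)
print(color_ids)

# One-hot encoding: one 0/1 column per category.
# OneHotEncoder expects a 2D input, so the column is reshaped to (n_samples, 1).
one_hot_encoder = OneHotEncoder()
color_one_hot = one_hot_encoder.fit_transform(colors.reshape(-1, 1)).toarray()
print(color_one_hot)
```

Label encoding implies an order between categories (0 < 1 < 2), which may mislead some models. One-hot encoding avoids that at the cost of extra columns.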

Robust Scaling Technique

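Robust scaling reduces the influence of outliers by centering each feature on its median and scaling it by its interquartile range (IQR), instead of using the mean and standard deviation. It does not remove the outliers, but they no longer dominate the scaling. This makes it a useful alternative to plain Normalization or Standardization when the data contains outliers.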

Python Implementation of Robust Scaling
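A minimal sketch with sklearn's `RobustScaler` follows. The single-column array is made-up sample data with one obvious outlier (an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical sample data: one feature with an outlier (100.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler subtracts the median and divides by the interquartile range (IQR).
scaler = RobustScaler()
X_robust = scaler.fit_transform(X)

print("median used for centering:", scaler.center_)
print("IQR used for scaling:", scaler.scale_)
print(X_robust)
```

The outlier is still present after scaling, but the typical values keep a sensible spread because the median and IQR are barely affected by it.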

Impact of Preprocessing on Machine Learning Algorithm

Benefits of Data Preprocessing
  1. It lets us extract more information from the data.
  2. It cleans the data, for example by replacing missing values.
  3. It can improve the performance of the model.
  4. It supports data reduction.
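
To see how preprocessing can affect a model, here is a rough sketch that trains the same classifier with and without standardization. The choices here are assumptions for illustration and are not taken from the post's notebook: the Wine dataset bundled with sklearn, a k-nearest neighbors classifier, and a 70/30 split. Distance-based models like this one tend to be sensitive to feature scale, but the size of the effect depends on the dataset and the model.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example dataset: features with very different scales.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Model trained on raw, unscaled features.
raw_model = KNeighborsClassifier()
raw_model.fit(X_train, y_train)
raw_accuracy = raw_model.score(X_test, y_test)

# Same model, but with standardization fitted on the training data only.
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_model.fit(X_train, y_train)
scaled_accuracy = scaled_model.score(X_test, y_test)

print("accuracy without preprocessing:", raw_accuracy)
print("accuracy with standardization:", scaled_accuracy)
```

Putting the scaler inside a pipeline keeps the test data out of the scaling statistics, which gives a fairer comparison.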

Source Codes for this Topic

Google Colab Notebook can be accessed by clicking here.
GitHub code is available at githubcode

Comment below if you have any questions about the code or anything else.
Thank you for your time!! 🤩🤩