Practical 1

Data Preprocessing in Python using Scikit Learn

A significant step in the data mining process is data preprocessing: converting raw data into an understandable format. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain many errors. Data preprocessing prepares raw data for further analysis.

What is data pre-processing?


Data preprocessing is an important step in the data mining process. Data can come in many forms: structured tables, images, audio files, videos, etc. Machines do not understand free text, image, or video data as it is; they understand 1s and 0s. Real-world data is rarely perfect: it is often incomplete, inconsistent (with outliers and noisy values), and unstructured. Preprocessing organizes, scales, cleans (removes outliers), and standardizes the raw data, simplifying it so it can be fed to a machine learning algorithm.

Data preprocessing is a technique used to improve the quality of data before mining is applied, so that the mining produces high-quality results. Preprocessing can substantially improve both the quality of the patterns mined and the time required for the actual mining.

Scikit Learn:

Scikit-learn's sklearn.preprocessing package provides many different preprocessing functions, including standardization, normalization, encoding of categorical features, and imputation of missing values.
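As a quick taste of these utilities, here is a minimal sketch of imputing missing values. Note that SimpleImputer lives in sklearn.impute in recent scikit-learn versions (it was moved out of sklearn.preprocessing); the data values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with a missing value (np.nan) in the first column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
# Column 0's observed mean is (1 + 7) / 2 = 4, so the NaN becomes 4.0.
```

Other strategies such as "median" or "most_frequent" can be substituted depending on the data.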

Scikit-learn is a Python library that includes a wide variety of supervised and unsupervised learning algorithms. It is built on top of several popular Python libraries for data and math, and NumPy arrays and pandas DataFrames can be passed directly to its ML algorithms. It uses the libraries below:
  • NumPy: For any matrix work, especially math operations
  • SciPy: Scientific and technical computing
  • Matplotlib: Data visualization
  • IPython: Interactive console for Python
  • SymPy: Symbolic mathematics
  • Pandas: Data handling, manipulation, and analysis




Standardization:

Standardization of datasets is a very common technique, and many algorithms depend on it. Algorithms such as SVMs with an RBF kernel, or Ridge and Lasso regression, assume that all features are centered around zero and have comparable scale. Scaling is also important for distance-based algorithms like Nearest Neighbors, so that no single large-valued feature dominates the distance computation.
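A minimal sketch of standardization with scikit-learn's StandardScaler, using made-up data: each feature (column) is rescaled to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on different scales.
X = np.array([[1.0, -10.0],
              [2.0, 0.0],
              [3.0, 10.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std
# After scaling, each column has mean 0 and standard deviation 1.
```

In a real pipeline the scaler is fit on the training set only, and the same fitted scaler is used to transform the test set.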


Normalization:

Normalization is the process of scaling individual samples to have unit norm.
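Unlike standardization, which works per feature (column), normalization rescales each sample (row) so its vector norm is 1. A minimal sketch with scikit-learn's Normalizer, on made-up data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Scale each row to unit L2 (Euclidean) norm.
normalizer = Normalizer(norm="l2")
X_norm = normalizer.transform(X)
# The row [3, 4] has norm 5, so it becomes [0.6, 0.8].
```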


Encoding Categorical features:




