Practical Insights on Data Science by Arpit Gothi: Practical 5

Data Pre-processing and text analytics using Orange

What is text analytics?

Text analytics is the automated process of translating large volumes of unstructured text into quantitative data in order to uncover insights, trends, and patterns.
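As a rough illustration of what "translating text into quantitative data" can mean, the sketch below builds a simple term-count table (a bag of words) using only the Python standard library. The two sample sentences and the helper name term_counts are made up for this example; Orange's text add-on provides its own, richer tools for this step.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample documents, invented for this illustration
docs = [
    "Orange makes data mining visual.",
    "Data mining turns raw data into insight.",
]

counts = [term_counts(d) for d in docs]

# The vocabulary is every distinct term seen in any document
vocabulary = sorted(set().union(*counts))

# One row per document, one column per term: the text is now numeric
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is a numeric description of one document, which is the kind of input that downstream analysis needs.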

Preprocessing is a key step in data science. Orange provides several preprocessors for it; four of them are described here.

1. Discretization

class Orange.preprocess.Discretize

Discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers.

Discretization replaces continuous features with the corresponding categorical features:

code:

import Orange

# Load the iris dataset that ships with Orange
store = Orange.data.Table("iris.tab")

# Configure a discretizer that splits each feature into 3 equal-frequency bins
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_store = disc(store)

print("Original dataset:")
for e in store[:3]:
    print(e)

print("Discretized dataset:")
for e in d_store[:3]:
    print(e)
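To make the idea behind equal-frequency discretization more concrete, here is a minimal plain-Python sketch. It is not Orange's implementation: the function names, the sample values, and the choice of midpoints as thresholds are all assumptions made for illustration, and the sketch assumes the values are distinct.

```python
def equal_freq_cut_points(values, n):
    """Return n - 1 thresholds that split the values into n groups of roughly equal size."""
    ordered = sorted(values)
    size = len(ordered)
    cuts = []
    for k in range(1, n):
        idx = k * size // n
        # Place the threshold halfway between the two neighbouring values
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def discretize(value, cuts):
    """Return the index of the interval that the value falls into."""
    return sum(value >= c for c in cuts)


# Hypothetical continuous measurements, invented for this illustration
values = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]

cuts = equal_freq_cut_points(values, 3)
bins = [discretize(v, cuts) for v in values]

print("Cut points:", cuts)
print("Bin index per value:", bins)
```

With nine values and three bins, each bin receives three values, which is what "equal frequency" refers to.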

2. Continuization

class Orange.preprocess.Continuize

Given a data table, return a new table in which the discrete attributes are replaced with continuous ones or removed.

Binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending on the argument zero_based.
Multinomial variables are treated according to the argument multinomial_treatment.
Discrete attributes with only one possible value are removed.

code:

import Orange
titanic = Orange.data.Table("titanic")
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
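The rules above can be sketched in plain Python. This is only an illustration of indicator encoding, not Orange's code: the function continuize, the column=value naming, and the three sample rows are assumptions, and the sketch always creates one indicator per value, whereas Orange chooses the treatment through multinomial_treatment.

```python
def continuize(rows, zero_based=True):
    """Replace each discrete column with numeric indicator columns.

    rows is a list of dicts that map a column name to a category string.
    """
    low = 0.0 if zero_based else -1.0
    columns = list(rows[0])
    # Collect the distinct values seen in each column
    values = {c: sorted({r[c] for r in rows}) for c in columns}

    result = []
    for r in rows:
        new_row = {}
        for c in columns:
            if len(values[c]) < 2:
                continue  # a column with a single value carries no information
            for v in values[c]:
                new_row[f"{c}={v}"] = 1.0 if r[c] == v else low
        result.append(new_row)
    return result


# Hypothetical rows, invented for this illustration
rows = [
    {"status": "first", "sex": "female", "ship": "titanic"},
    {"status": "crew", "sex": "male", "ship": "titanic"},
    {"status": "third", "sex": "male", "ship": "titanic"},
]

for row in continuize(rows):
    print(row)
```

The "ship" column disappears from the output because it has only one possible value, and passing zero_based=False would use -1.0 instead of 0.0 for the "off" indicators.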

3. Normalization

class Orange.preprocess.Normalize(zero_based=True, norm_type=Normalize.NormalizeBySD, transform_class=False, center=True, normalize_datetime=False)

Construct a preprocessor for normalization of features. Given a data table, the preprocessor returns a new table in which the continuous attributes are normalized.

code:
from Orange.data import Table
from Orange.preprocess import Normalize
data = Table("iris.tab")
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(data)
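For intuition, the two normalization types named in the signature can be sketched in plain Python. This is an assumption-laden illustration rather than Orange's code: the function names are hypothetical, the population standard deviation is an arbitrary choice, and the input is assumed not to be constant.

```python
from statistics import mean, pstdev


def normalize_by_span(values, zero_based=True):
    """Rescale values to [0, 1], or to [-1, 1] when zero_based is False."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumed non-zero, i.e. the values are not all equal
    if zero_based:
        return [(v - lo) / span for v in values]
    return [2 * (v - lo) / span - 1 for v in values]


def normalize_by_sd(values):
    """Center values on their mean and divide by the standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # population SD, chosen here only for illustration
    return [(v - mu) / sd for v in values]


# Hypothetical feature column, invented for this illustration
values = [2.0, 4.0, 6.0, 8.0]

print(normalize_by_span(values))
print(normalize_by_span(values, zero_based=False))
print(normalize_by_sd(values))
```

Span normalization fixes the range of the feature, while normalization by standard deviation fixes its mean and spread, which matters when features are measured on very different scales.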

4. Randomization

class Orange.preprocess.Randomize(rand_type=Randomize.RandomizeClasses, rand_seed=None)

Construct a preprocessor for randomization of classes, attributes and/or metas. Given a data table, the preprocessor returns a new table in which the data is shuffled.

code:
from Orange.data import Table
from Orange.preprocess import Randomize
data = Table("iris")
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(data)
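What shuffling the classes means can be shown with a small standard-library sketch. It is an illustration only, not Orange's implementation: the function randomize_classes, the (features, label) row layout, and the sample rows are assumptions made for this example.

```python
import random


def randomize_classes(rows, seed=None):
    """Return new rows whose class labels are shuffled among the rows.

    Each row is a (features, label) pair; the features stay where they are.
    """
    rng = random.Random(seed)
    labels = [label for _, label in rows]
    rng.shuffle(labels)
    return [(features, label) for (features, _), label in zip(rows, labels)]


# Hypothetical rows, invented for this illustration
rows = [
    ((5.1, 3.5), "setosa"),
    ((7.0, 3.2), "versicolor"),
    ((6.3, 3.3), "virginica"),
    ((4.9, 3.0), "setosa"),
]

for features, label in randomize_classes(rows, seed=42):
    print(features, label)
```

The set of labels is unchanged, but their link to the features is broken. One common use of such a table is as a baseline: a model evaluated on it should do no better than chance.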

How to work with Orange in Python and vice-versa?

Orange is an open-source data visualization and analysis platform, where data mining is done through visual programming or Python scripting. The tool has components for machine learning, add-ons for bioinformatics and text mining, and it is packed with data analytics features. Orange is also available as a Python library.

The Python examples for Orange are given above.

To use Orange from Python, we just need to import the Orange package with import Orange, as those examples do.