Practical Insights on Data Science by Arpit Gothi: Practical 4

Visual programming with Orange tool

Orange is an open-source data visualization, machine learning, and data mining toolkit. It features a visual programming front end for rapid exploratory data analysis and interactive data visualization.

Orange is a component-based visual programming software package for data visualization, machine learning, data mining, and data analysis. Orange components are called widgets and they range from simple data visualization, subset selection, and preprocessing, to empirical evaluation of learning algorithms and predictive modeling.

To split data in Orange, we use the Data Sampler widget. Its inputs and outputs are listed below.

Inputs

Data: input dataset

Outputs

Data Sample: sampled data instances

Remaining Data: out-of-sample data

Sampling Methods:

Fixed proportion of data returns a selected percentage of the entire data (e.g. 70% of all the data)
Fixed sample size returns a selected number of data instances with a chance to set Sample with replacement, which always samples from the entire dataset (does not subtract instances already in the subset). With replacement, you can generate more instances than available in the input dataset.
Cross Validation partitions data instances into the specified number of complementary subsets. Following a typical validation schema, all subsets except the one selected by the user are output as Data Sample, and the selected subset goes to Remaining Data. (Note: In older versions, the outputs were swapped. If the widget is loaded from an older workflow, it switches to compatibility mode.)
Bootstrap draws instances with replacement until the sample is as large as the input dataset; instances that are never drawn go to Remaining Data.
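The difference between sampling with replacement and Bootstrap is easier to see in code. The following is a minimal sketch in plain Python, not Orange's own implementation. It assumes Bootstrap follows the standard procedure (draw as many instances as the input has, with replacement, and treat the never-drawn instances as the remainder). The function names and the seed are hypothetical.

```python
import random


def sample_with_replacement(n_instances, sample_size, seed=0):
    """Draw sample_size row indices; every draw sees the whole dataset again."""
    rng = random.Random(seed)
    return [rng.randrange(n_instances) for _ in range(sample_size)]


def bootstrap(n_instances, seed=0):
    """Assumed standard bootstrap: n draws with replacement.

    Returns (sample, out_of_bag), where out_of_bag holds the indices
    that were never drawn.
    """
    sample = sample_with_replacement(n_instances, n_instances, seed)
    drawn = set(sample)
    out_of_bag = [i for i in range(n_instances) if i not in drawn]
    return sample, out_of_bag


# With replacement, the sample may be larger than the input dataset.
big = sample_with_replacement(n_instances=10, sample_size=25)
boot, oob = bootstrap(n_instances=10)
print(len(big), len(boot), len(oob))
```

Because indices can repeat, the bootstrap sample usually covers only part of the dataset, which is why some instances are left over.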

Fixed Sample Size:

First, let's see how the Data Sampler works. We will use the heart_disease data, loaded with the File widget. The data contains 303 instances. We sampled it with the Data Sampler widget, choosing a fixed sample size of 5 instances for simplicity. The sampled data can be inspected in the first Data Table widget (Data Table (in-sample)). The second Data Table (Data Table (out-of-sample)) shows the remaining 298 instances that were not in the sample. To output the out-of-sample data, double-click the connection between the widgets and rewire the output to Remaining Data –> Data.
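The same split can be reproduced outside the canvas to check the counts. This is a rough sketch in plain Python that works on row indices only; it does not load the actual data, and the seed is an arbitrary assumption.

```python
import random

N_INSTANCES = 303  # number of instances reported for the heart disease data
SAMPLE_SIZE = 5    # fixed sample size chosen in the Data Sampler

rng = random.Random(42)  # arbitrary seed, used only for repeatability

# Sample without replacement: each row can be picked at most once.
in_sample = sorted(rng.sample(range(N_INSTANCES), SAMPLE_SIZE))

# Everything that was not picked is the out-of-sample (remaining) data.
chosen = set(in_sample)
out_of_sample = [i for i in range(N_INSTANCES) if i not in chosen]

print(len(in_sample), len(out_of_sample))  # 5 298
```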

Fixed proportion of data:

Now we will use the Data Sampler to split the Iris data into training and testing parts. We loaded the iris data with the File widget. In Data Sampler, we split the data with Fixed proportion of data, keeping 70% of the data instances in the sample. Then we connected both outputs to the Test & Score widget: Data Sample –> Data and Remaining Data –> Test Data. Finally, we added Logistic Regression as the learner. With Test on test data selected in Test & Score, this trains logistic regression on the Data input and evaluates the results on the Test Data.
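The wiring above amounts to a holdout evaluation: fit on the 70% sample, score on the remaining 30%. Here is a minimal sketch of that pattern in plain Python. The rows are synthetic toy data (150 rows, the size of the standard Iris data), and a majority-class baseline is swapped in for Logistic Regression to keep the example dependency-free; all function names are hypothetical.

```python
import random
from collections import Counter


def split_fixed_proportion(rows, proportion, seed=0):
    """Shuffle the rows and cut them into (sample, remaining)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * proportion)
    return shuffled[:cut], shuffled[cut:]


def fit_majority(train):
    """Stand-in learner: always predicts the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(prediction, test):
    """Share of test rows whose label equals the constant prediction."""
    return sum(label == prediction for _, label in test) / len(test)


# Synthetic (row id, label) pairs, NOT the real Iris measurements.
rows = [(i, "a" if i % 3 else "b") for i in range(150)]

train, test = split_fixed_proportion(rows, 0.70)  # Data Sample / Remaining Data
model = fit_majority(train)                       # fit on the Data input
print(len(train), len(test), accuracy(model, test))
```

The key point is that the learner never sees the test rows during fitting, which matches the Data and Test Data inputs of Test & Score.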

Cross Validation:

Now we will use the Data Sampler to split the data into training and testing parts with cross validation. We use the Housing dataset, loaded with the File widget. In Data Sampler, we split the data with Cross Validation, setting the number of subsets to 10. We then connected Data Sampler –> Test & Score. Finally, we added Logistic Regression as the learner and connected Logistic Regression –> Test & Score.
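The cross-validation split described for the Data Sampler (all subsets except the selected one go to Data Sample, the selected one goes to Remaining Data) can be sketched as follows. This is an illustration in plain Python on row indices, assuming a toy dataset of 100 rows; the function name, the fold assignment by striding, and the seed are assumptions, not Orange's implementation.

```python
import random


def cross_validation_split(n_instances, n_folds, selected_fold, seed=0):
    """Partition row indices into n_folds subsets.

    Returns (sample, remaining): the selected fold is the remaining data,
    all other folds together form the sample.
    """
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    # Striding over the shuffled indices yields folds of near-equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    remaining = folds[selected_fold]
    sample = [i for k, fold in enumerate(folds) if k != selected_fold for i in fold]
    return sample, remaining


sample, remaining = cross_validation_split(n_instances=100, n_folds=10, selected_fold=0)
print(len(sample), len(remaining))  # 90 10
```

Changing selected_fold from 0 through 9 produces the ten complementary train/test pairs of a 10-fold scheme.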