A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g. whether a coin flip comes up heads or tails), each leaf node represents a class label (the decision taken after computing all features), and each branch represents a conjunction of feature values that leads to a class label. The paths from root to leaf represent classification rules. The diagram below illustrates the basic flow of a decision tree for decision making, with the labels Rain (Yes) and No Rain (No).
Decision Tree for Rain Forecasting
A decision tree is one of the predictive modelling approaches used in statistics, data mining and machine learning.
Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions. They are one of the most widely used and practical methods for supervised learning: a non-parametric method that can be used for both classification and regression tasks.
Tree models where the target variable can take a discrete set of values are called classification trees. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Classification And Regression Tree (CART) is the general term covering both.
Throughout this post, I will try to explain the concepts using examples.
Data Format
Data comes in records of the form:
(x, Y) = (x1, x2, x3, ..., xk, Y)
The dependent variable, Y, is the target variable that we are trying to understand, classify or generalize. The vector x is composed of the features, x1, x2, x3 etc., that are used for that task.
Example
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]
# Header = ["Color", "diameter", "Label"]
# The last column is the label.
# The first two columns are features.
Approach to Making a Decision Tree
While building the decision tree, at each node of the tree we ask a different type of question. Based on the question asked, we calculate the information gain corresponding to it.
Information Gain
Information gain is used to decide which feature to split on at each step in building the tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest child nodes. A commonly used measure of purity is called information. For each node of the tree, the information value measures how much information a feature gives us about the class. The split with the highest information gain is taken first, and the process continues until all child nodes are pure or the information gain is 0.
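For intuition, here is a small worked sketch of the calculation on the toy fruit dataset above, for the split "Is diameter >= 3?". The weighted-impurity form assumed here matches the info_gain helper used later in this post.
# Hedged worked example: information gain of "Is diameter >= 3?"
# on the toy dataset (2 Apple, 2 Grape, 1 Lemon).
parent_impurity = 0.64            # Gini of the full dataset: 1 - (0.4**2 + 0.4**2 + 0.2**2)
true_impurity, true_n = 4 / 9, 3  # diameter >= 3: 2 Apple + 1 Lemon
false_impurity, false_n = 0.0, 2  # diameter < 3: 2 Grape (pure)
n = true_n + false_n
gain = (parent_impurity
        - (true_n / n) * true_impurity
        - (false_n / n) * false_impurity)
print(round(gain, 3))  # ~0.373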
Asking Questions
class Question:
    """A Question is used to partition a dataset.

    This class just records a 'column number' (e.g., 0 for Color) and a
    'column value' (e.g., Green). The 'match' method is used to compare
    the feature value in an example to the feature value stored in the
    question. See the demo below.
    """

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        # Compare the feature value in an example to the
        # feature value in this question.
        val = example[self.column]
        if is_numeric(val):
            return val >= self.value
        else:
            return val == self.value

    def __repr__(self):
        # This is just a helper method to print
        # the question in a readable format.
        condition = "=="
        if is_numeric(self.value):
            condition = ">="
        return "Is %s %s %s?" % (
            header[self.column], condition, str(self.value))
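The Question class uses an is_numeric helper and a header list of column names that are not defined in this post. A minimal sketch of what they could look like (these exact definitions are my assumption, with lowercase names chosen so the printed questions match the demos below):
# Assumed helpers, consistent with how Question uses them above.
header = ["color", "diameter", "label"]  # column names for the toy dataset

def is_numeric(value):
    # Test if a value is numeric (int or float).
    return isinstance(value, (int, float))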
Let's try creating some questions and look at their outputs.
Question(1, 3) ## Is diameter >= 3?
Question(0, "Green") ## Is color == Green?
Now we will partition the dataset based on the question asked. At each step, the data is divided into two sets.
def partition(rows, question):
    """Partitions a dataset.

    For each row in the dataset, check if it matches the question. If
    so, add it to 'true rows', otherwise, add it to 'false rows'.
    """
    true_rows, false_rows = [], []
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows
# Let's partition the training data based on whether rows are Red.
true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# This will contain all the 'Red' rows.
true_rows ## [['Red', 1, 'Grape'], ['Red', 1, 'Grape']]
false_rows ## [['Green', 3, 'Apple'], ['Yellow', 3, 'Apple'], ['Yellow', 3, 'Lemon']]
Algorithms for constructing decision trees usually work top-down, choosing at each step the variable that best splits the set of items. Different algorithms use different metrics for measuring "best".
Gini Impurity
First, let's understand the meaning of Pure and Impure.
Pure
Pure means that, in a selected sample of the dataset, all data belongs to the same class.
Impure
Impure means that the data is a mixture of different classes.
Definition of Gini Impurity
Gini Impurity is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set.
If our dataset is pure, then the likelihood of incorrect classification is 0. If our sample is a mixture of different classes, then the likelihood of incorrect classification will be high.
Calculating Gini Impurity.
def gini(rows):
    """Calculate the Gini Impurity for a list of rows.

    There are a few different ways to do this, I thought this one was
    the most concise. See:
    https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
    """
    counts = class_counts(rows)
    impurity = 1
    for lbl in counts:
        prob_of_lbl = counts[lbl] / float(len(rows))
        impurity -= prob_of_lbl**2
    return impurity
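gini relies on a class_counts helper that is not shown in this post. A minimal sketch of the assumed behaviour (count how many rows of each label appear in the dataset, with the label in the last column):
def class_counts(rows):
    # Count the number of each type of example in a dataset.
    # Assumes the label is the last column of each row.
    counts = {}
    for row in rows:
        label = row[-1]
        counts[label] = counts.get(label, 0) + 1
    return counts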
Example
# Demo 1:
# Let's look at some examples to understand how Gini Impurity works.
#
# First, we'll look at a dataset with no mixing.
no_mixing = [['Apple'],
['Apple']]
# this will return 0
gini(no_mixing) ## output=0
## Demo 2:
# Now, we'll look at a dataset with a 50:50 apples:oranges ratio
some_mixing = [['Apple'],
['Orange']]
# this will return 0.5 - meaning, there's a 50% chance of misclassifying
# a random example we draw from the dataset.
gini(some_mixing) ##output=0.5
## Demo 3:
# Now, we'll look at a dataset with many different labels
lots_of_mixing = [['Apple'],
['Orange'],
['Grape'],
['Grapefruit'],
['Blueberry']]
# This will return 0.8
gini(lots_of_mixing) ##output=0.8
#######
Steps for Making a Decision Tree
1. Get the list of rows (the dataset) to be considered for making the decision tree (recursively, at each node).
2. Calculate the uncertainty of our dataset, i.e. its Gini impurity (how mixed up the data is).
3. Generate the list of all questions which need to be asked at that node.
4. Partition the rows into True rows and False rows based on each question asked.
5. Calculate the information gain based on the Gini impurity and the partition of the data from the previous step.
6. Keep track of the highest information gain found for each question asked.
7. Update the best question based on the information gain (higher information gain wins).
8. Divide the node on the best question. Repeat from step 1 until we reach pure nodes (leaf nodes).
Code for Above Steps
def find_best_split(rows):
    """Find the best question to ask by iterating over every feature / value
    and calculating the information gain."""
    best_gain = 0  # keep track of the best information gain
    best_question = None  # keep track of the feature / value that produced it
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1  # number of columns

    for col in range(n_features):  # for each feature

        values = set([row[col] for row in rows])  # unique values in the column

        for val in values:  # for each value

            question = Question(col, val)

            # try splitting the dataset
            true_rows, false_rows = partition(rows, question)

            # Skip this split if it doesn't divide the
            # dataset.
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            # Calculate the information gain from this split
            gain = info_gain(true_rows, false_rows, current_uncertainty)

            # You actually can use '>' instead of '>=' here
            # but I wanted the tree to look a certain way for our
            # toy dataset.
            if gain >= best_gain:
                best_gain, best_question = gain, question

    return best_gain, best_question
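find_best_split calls an info_gain function that is not defined in this post. A minimal sketch consistent with the steps described above (the uncertainty of the starting node minus the weighted impurity of the two child nodes) could look like this:
def info_gain(left, right, current_uncertainty):
    # Information gain: the uncertainty of the starting node, minus the
    # weighted impurity of the two child nodes.
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)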
#######
# Demo:
# Find the best question to ask first for our toy dataset.
best_gain, best_question = find_best_split(training_data)
best_question
## output - Is diameter >= 3?
Now we build the decision tree by applying the steps discussed above recursively at each node.
def build_tree(rows):
    """Builds the tree.

    Rules of recursion: 1) Believe that it works. 2) Start by checking
    for the base case (no further information gain). 3) Prepare for
    giant stack traces.
    """

    # Try partitioning the dataset on each of the unique attribute values,
    # calculate the information gain,
    # and return the question that produces the highest gain.
    gain, question = find_best_split(rows)

    # Base case: no further info gain.
    # Since we can ask no further questions,
    # we'll return a leaf.
    if gain == 0:
        return Leaf(rows)

    # If we reach here, we have found a useful feature / value
    # to partition on.
    true_rows, false_rows = partition(rows, question)

    # Recursively build the true branch.
    true_branch = build_tree(true_rows)

    # Recursively build the false branch.
    false_branch = build_tree(false_rows)

    # Return a Question node.
    # This records the best feature / value to ask at this point,
    # as well as the branches to follow
    # depending on the answer.
    return Decision_Node(question, true_branch, false_branch)
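build_tree returns Leaf and Decision_Node objects, and the demo below uses a print_tree function; none of these are defined in this post. A minimal sketch consistent with the output shown below (assumed definitions, built on the helpers above):
class Leaf:
    """A Leaf node classifies data.

    It holds a dictionary of class -> count for the rows
    that reach this leaf.
    """
    def __init__(self, rows):
        self.predictions = class_counts(rows)


class Decision_Node:
    """A Decision Node asks a question and holds the two child branches."""
    def __init__(self, question, true_branch, false_branch):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch


def print_tree(node, spacing=""):
    """Print the tree recursively, indenting each level."""
    if isinstance(node, Leaf):
        print(spacing + "Predict", node.predictions)
        return
    print(spacing + str(node.question))
    print(spacing + '--> True:')
    print_tree(node.true_branch, spacing + "  ")
    print(spacing + '--> False:')
    print_tree(node.false_branch, spacing + "  ")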
Building Decision Tree
Let’s build decision tree based on training data.
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]
# Header = ["Color", "diameter", "Label"]
# The last column is the label.
# The first two columns are features.
my_tree = build_tree(training_data)
print_tree(my_tree)
Output
Is diameter >= 3?
--> True:
  Is color == Yellow?
  --> True:
    Predict {'Lemon': 1, 'Apple': 1}
  --> False:
    Predict {'Apple': 1}
--> False:
  Predict {'Grape': 2}
From the above output we can see that at each step the data is divided into True and False rows. This process repeats until we reach a leaf node, where the information gain is 0 and further splitting is not possible because the nodes are pure.
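The post stops at printing the tree, but to use it for prediction you would walk a new example down the tree. A minimal sketch of such a classify helper (a hypothetical addition, not part of the original code):
def classify(row, node):
    # Walk an example down the tree until we hit a leaf,
    # then return that leaf's class counts.
    if isinstance(node, Leaf):
        return node.predictions
    if node.question.match(row):
        return classify(row, node.true_branch)
    return classify(row, node.false_branch)

classify(['Green', 3], my_tree)  ## {'Apple': 1}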
Advantage of Decision Tree
Easy to use and understand.
Can handle both categorical and numerical data.
Resistant to outliers, and hence require little data preprocessing.
Disadvantage of Decision Tree
Prone to overfitting.
Require some kind of measurement as to how well they are doing.
Need to be careful with parameter tuning.
Can create biased learned trees if some classes dominate.
How to avoid overfitting the Decision tree model
Overfitting is one of the major problems for every model in machine learning. If a model is overfitted, it will generalize poorly to new samples. To keep a decision tree from overfitting, we remove the branches that make use of features with low importance. This method is called pruning or post-pruning. This way we reduce the complexity of the tree, and hence improve predictive accuracy by reducing overfitting.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set. There are two major pruning techniques.
Minimum Error: The tree is pruned back to the point where the cross-validated error is a minimum.
Smallest Tree: The tree is pruned back slightly further than the minimum error. Technically the pruning creates a decision tree with cross-validation error within 1 standard error of the minimum error.
Early Stop or Pre-pruning
An alternative method to prevent overfitting is to try and stop the tree-building process early, before it produces leaves with very small samples. This heuristic is known as early stopping but is also sometimes known as pre-pruning decision trees.
At each stage of splitting the tree, we check the cross-validation error. If the error does not decrease significantly enough then we stop. Early stopping may underfit by stopping too early. The current split may be of little benefit, but having made it, subsequent splits more significantly reduce the error.
Early stopping and pruning can be used together, separately, or not at all. Post pruning decision trees is more mathematically rigorous, finding a tree at least as good as early stopping. Early stopping is a quick fix heuristic. If used together with pruning, early stopping may save time. After all, why build a tree only to prune it back again?
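For reference, here is a rough sketch of how both ideas look in scikit-learn (assuming scikit-learn is available; this is separate from the hand-rolled code in this post): pre-pruning via parameters such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning with ccp_alpha.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning (early stopping): limit tree growth up front.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute candidate cost-complexity alphas, then keep the
# alpha whose pruned tree scores best on a held-out validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
    .fit(X_train, y_train)
    .score(X_val, y_val),
)
post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)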
Decision Tree in Real Life
1. Selecting a flight to travel
Suppose you need to select a flight for your next trip. How do we go about it? We first check whether the flight is available on that day or not. If it is not available, we look for another date; if it is available, we then look at, say, the duration of the flight. If we want only direct flights, we next check whether the price of that flight is within our pre-defined budget. If it is too expensive, we look at other flights; otherwise, we book it!
2. Handling late night cravings
[Image: a decision tree for handling late-night cravings. Source: Google]
There are many more applications of decision trees in real life; you can look up further examples on your own.
Source: Towards Data Science - Prince Yadav