Decision trees

15 important questions on Decision trees

Why are decision trees a popular classification technique?

Performs well across a wide range of situations
Does not require much effort from the analyst
Easily understood by consumers

At least when the trees are not too large

Can be used for both:

Classification, called classification trees
Prediction, called regression trees

What is the main process in decision trees?

Separate records into subgroups by creating splits on predictors

How is the tree constructed in induction?

Top-down, in a recursive divide-and-conquer manner. At the start, all the training instances are at the root of the tree. Instances are then partitioned recursively based on selected attributes to obtain homogeneous subgroups.
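A minimal sketch of this top-down recursion, assuming nominal attributes, multi-way splits and the Gini index as the impurity measure. The names `build_tree` and `gini`, the dict-based tree layout and the toy data are illustrative assumptions, not part of the course material:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """Recursively partition the records; each row is a dict of nominal values."""
    # Stop when the node is pure or no attributes remain: return the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def weighted_impurity(attr):
        # Group the labels by attribute value, then average the group impurities.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    # Select the attribute whose split gives the most homogeneous subgroups.
    best = min(attributes, key=weighted_impurity)
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Toy training set (made up for illustration).
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

Here `outlook` separates the classes perfectly, so the sketch splits on it once and both children become leaves.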

What are issues that occur with induction?

Determine how to split the records

How to specify the attribute test condition?
How to determine the best split?

Determine when to stop splitting

How can the splitting take place based on nominal attributes?

Multi-way split - use as many partitions as distinct values
Binary split - divides values into two subsets
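The two options can be sketched as follows. The function names and the `car_types` data are assumptions made for illustration. With k distinct values there are 2^(k-1) - 1 possible binary splits:

```python
from itertools import combinations

def multiway_split(values):
    """One partition per distinct value."""
    return [{v} for v in sorted(set(values))]

def binary_splits(values):
    """All ways to divide the distinct values into two non-empty subsets."""
    distinct = sorted(set(values))
    first, rest = distinct[0], distinct[1:]
    splits = []
    # Pin the first value to the left side so mirror-image splits are not repeated.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(distinct) - left
            splits.append((left, right))
    return splits

car_types = ["family", "sports", "luxury", "sports", "family"]
print(multiway_split(car_types))      # three partitions, one per value
print(len(binary_splits(car_types)))  # 3 candidate binary splits
```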

How can the splitting take place based on continuous attributes? (numbers)

Discretization to form an ordinal categorical attribute

Static, discretize once at the beginning
Dynamic, ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles) or clustering

Binary decision (i.e., Xi < v or Xi ≥ v)

Finds the best cut among all possible splits
Can be more compute intensive
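A hedged sketch of both approaches: equal-interval bucketing, and a search for the best binary cut. The function names and the income data are made up for illustration. Taking the midpoints between adjacent distinct values as candidate cuts, and scoring them with the Gini index, are assumptions of this sketch:

```python
from collections import Counter

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bin index 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land just past the last interval, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try every candidate cut v and return (v, weighted Gini) for the best one."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2
        left = [l for x, l in zip(values, labels) if x < v]
        right = [l for x, l in zip(values, labels) if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (v, score)
    return best

incomes = [20, 30, 40, 60, 80, 100]
buys = ["no", "no", "no", "yes", "yes", "yes"]
print(equal_width_bins(incomes, 4))   # [0, 0, 1, 2, 3, 3]
print(best_threshold(incomes, buys))  # (50.0, 0.0)
```

The threshold search evaluates every candidate cut against all records, which is why the binary decision is the more compute-intensive option.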

How can the impurity of a node be measured?

Gini Index
Entropy measure

What is information gain?

Information gain is used to determine which feature/attribute provides the most information about the class. It is the reduction in impurity (typically entropy) achieved by a split, so records are split on the attribute test that maximizes it.
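As an illustration, information gain can be computed as the entropy of the parent node minus the weighted entropy of the child nodes. The function names and the toy labels are assumptions of this sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the record-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
split_a = [["yes", "yes"], ["no", "no"]]  # separates the classes perfectly
split_b = [["yes", "no"], ["yes", "no"]]  # children are as mixed as the parent

print(information_gain(parent, split_a))  # 1.0
print(information_gain(parent, split_b))  # 0.0
```

The attribute behind `split_a` would be preferred because it yields the larger gain.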

What is calculated with the Gini index?

The Gini index measures the probability that a randomly chosen record in a node would be wrongly classified if it were labeled at random according to the class distribution in that node.
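A small sketch of the calculation, using the standard formula of 1 minus the sum of squared class proportions. The function name `gini_index` is an illustrative choice:

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a node: 1 - sum over classes of p(class)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["a", "a", "a", "a"]))  # 0.0, a pure node
print(gini_index(["a", "a", "b", "b"]))  # 0.5, an evenly mixed two-class node
```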

What is calculated with the Entropy measure?

Entropy measures the degree of uncertainty, impurity or disorder in a node. Tree induction aims to reduce the level of entropy from the root to the leaf nodes.
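A small sketch of the calculation, using the standard formula of minus the sum of p * log2(p) over the classes. The function name `entropy` is an illustrative choice:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a node: -sum over classes of p * log2(p)."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # 0.0, no uncertainty in a pure node
print(entropy(["a", "a", "b", "b"]))  # 1.0, maximum for two classes
```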

How is the combined impurity calculated?

The combined impurity created by a split is a weighted average of the impurity measures.

Weighted by the number of records in each split.
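A worked example of the weighted average, assuming the Gini index as the impurity measure and describing each child node by its class counts. The function names and the counts are made up for illustration:

```python
def gini_from_counts(counts):
    """Gini index of a node given its class counts, e.g. [1, 3]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def combined_impurity(children_counts):
    """Average the child impurities, weighting each by its share of the records."""
    total = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / total * gini_from_counts(counts)
               for counts in children_counts)

# A split that sends 4 records (all class A) left and 4 records (1 A, 3 B) right.
children = [[4, 0], [1, 3]]
print(combined_impurity(children))  # 0.5 * 0.0 + 0.5 * 0.375 = 0.1875
```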

What are the stopping criteria for tree induction?

Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute values
Early termination
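The three criteria could be checked as in the sketch below. The `max_depth` and `min_records` thresholds are hypothetical examples of early termination, and "similar attribute values" is simplified here to "identical attribute values":

```python
def should_stop(rows, labels, depth, max_depth=5, min_records=2):
    """Return True when a node should not be expanded any further."""
    # 1. All records belong to the same class.
    if len(set(labels)) == 1:
        return True
    # 2. All records have the same attribute values, so no split can separate them.
    if all(row == rows[0] for row in rows):
        return True
    # 3. Early termination: the tree is deep enough or the node is too small.
    return depth >= max_depth or len(labels) < min_records

print(should_stop([{"a": 1}, {"a": 2}], ["x", "x"], depth=0))  # True, pure node
print(should_stop([{"a": 1}, {"a": 2}], ["x", "y"], depth=0))  # False, keep splitting
```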

How can we address overfitting?

Pre-pruning
Post-pruning
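Pre-pruning stops the tree from growing early, for example with a depth limit. Post-pruning grows the full tree and then cuts it back. The notes do not name a specific method, so the sketch below uses reduced-error pruning as one possible illustration: working bottom-up, a subtree is replaced by its majority-class leaf whenever that does not increase the errors on a separate validation set. The dict-based tree layout, the `majority` field and the data are assumptions of this sketch:

```python
def predict(tree, row):
    """Follow the branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        # Attribute values not seen in training fall back to the majority class.
        tree = tree["children"].get(row[tree["attribute"]], tree["majority"])
    return tree

def errors(tree, rows, labels):
    """Number of misclassified records."""
    return sum(predict(tree, r) != l for r, l in zip(rows, labels))

def post_prune(tree, rows, labels):
    """Prune bottom-up using the validation records that reach each node."""
    if not isinstance(tree, dict):
        return tree
    attr = tree["attribute"]
    pruned_children = {}
    for value, child in tree["children"].items():
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        pruned_children[value] = post_prune(
            child, [r for r, _ in subset], [l for _, l in subset])
    candidate = {**tree, "children": pruned_children}
    leaf = tree["majority"]
    # Keep the simpler leaf unless the subtree is strictly better on validation data.
    if errors(leaf, rows, labels) <= errors(candidate, rows, labels):
        return leaf
    return candidate

full_tree = {
    "attribute": "outlook", "majority": "play",
    "children": {
        "sunny": "play",
        "rainy": {"attribute": "windy", "majority": "stay",
                  "children": {"yes": "stay", "no": "play"}},
    },
}
val_rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
val_labels = ["play", "play", "stay", "stay"]
pruned = post_prune(full_tree, val_rows, val_labels)
```

In this toy case the `windy` subtree misclassifies one validation record while its majority leaf misclassifies none, so the subtree collapses to the leaf `stay` and the root split is kept.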

What are the advantages of decision trees?

Easy to understand
Easy to generate rules

What are the disadvantages of decision trees?

May suffer from overfitting
Classifies by rectangular partitioning, so it does not handle correlated features very well
Can become quite large, so pruning is necessary
Does not handle streaming data easily, although a few successful techniques exist

The questions on this page originate from the summary of the following study material:

Business Intelligence & Data Management
