Decision Tree Induction

6 important questions on Decision Tree Induction

Given: suppose (according to slide 8) we have an information source that generates 4 symbols with probabilities p1 = 1/2, p2 = 1/4, p3 = 1/8, p4 = 1/8, respectively.
Using the (‘directly decodable’) code s1 ~ 0, s2 ~ 10,
s3 ~ 110, s4 ~ 111,
            a) calculate the expected code length per symbol (e.g. via the self-information),
            b) compare the result of a) with the value of the entropy, and
            c) conclude about what you have discovered.

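a. I(s1) = -log2(1/2) = 1; I(s2) = -log2(1/4) = 2; I(s3) = I(s4) = -log2(1/8) = 3. Each code word is exactly as long as the self-information of its symbol (1, 2, 3 and 3 bits), so the expected code length is 1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75 bits per symbol.

b. Info(s) = 1/2 I(s1) + 1/4 I(s2) + 1/8 I(s3) + 1/8 I(s4) = 1.75 bits, which equals the expected code length from a).
c. “Expected information” = “weighted sum (average) of self-information”
• “Weights” are the probabilities!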
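
The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary names `probs` and `code` and the two helper functions are illustrative choices, not part of the course material.

```python
from math import log2

# Symbol probabilities and code words from the exercise.
probs = {"s1": 1/2, "s2": 1/4, "s3": 1/8, "s4": 1/8}
code = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}

def expected_code_length(probs, code):
    """Average number of code bits per symbol, weighted by probability."""
    return sum(p * len(code[s]) for s, p in probs.items())

def entropy(probs):
    """Expected self-information in bits: sum of -p * log2(p)."""
    return sum(-p * log2(p) for p in probs.values() if p > 0)

avg_len = expected_code_length(probs, code)
h = entropy(probs)
print(avg_len, h)  # both are 1.75 bits per symbol
```

Both values coincide here because every probability is a power of 1/2, so each code word can be exactly as long as the self-information of its symbol.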

What is H and what is the maximum value of H?

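H is the entropy: the expected information of a probability distribution.
H(0.5; 0.5) = 1: for two outcomes the maximum value is 1 bit, reached when both outcomes are equally likely (why? because that is when the uncertainty is largest).
Note that H is a function of a probability distribution!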

Which means H can also be applied to conditional probabilities (more later)
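
To see where the maximum lies, H can be evaluated for a few two-outcome distributions. The function name `H` follows the notation of the notes; the chosen probabilities are arbitrary illustration values.

```python
from math import log2

def H(*probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Entropy of a two-outcome distribution (p, 1 - p) for several p.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(H(p, 1 - p), 3))

print(H(0.5, 0.5))  # 1.0, the maximum for two outcomes
print(H(1.0, 0.0))  # 0.0, a certain outcome carries no information
```

Because `H` only needs a list of probabilities, the same function works unchanged when those probabilities are conditional ones.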

When is the information gain Gain(A) highest, and will that attribute be chosen?

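The information gain of an attribute A is the entropy before the split minus the average conditional entropy after splitting on A. If the average conditional entropy is zero, the information gain is highest, and that attribute A will be chosen.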
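
A small sketch of this criterion, assuming the usual definition of information gain as the entropy of the class labels minus the average conditional entropy given attribute A. The toy data and the names `perfect` and `useless` are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of the class distribution in a list of labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of labels minus the average conditional entropy given the attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weight each branch's entropy by the fraction of examples it receives.
    avg_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - avg_cond

labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]  # hypothetical attribute: every branch is pure
useless = ["a", "b", "a", "b"]  # hypothetical attribute: every branch stays 50/50

print(information_gain(perfect, labels))  # 1.0: conditional entropy is zero
print(information_gain(useless, labels))  # 0.0: the split tells us nothing
```

The attribute with zero average conditional entropy reaches the largest possible gain, which is the full entropy of the labels, so it would be selected for the split.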

What is a fundamental problem of all learning algorithms?

Overfitting: if the set of possible hypotheses is (too) large, meaningless 'regularity' may easily be found. In the case of DTs, overfitting takes place if the tree becomes too refined.

How would you counteract overfitting?

Early stopping:
  • Use a training set to learn/induce a (hypothesis) tree
  • Use a separate validation set to select the ‘best performing’ DT:
      • e.g., measure performance on the validation set while expanding the DT
      • stop expansion of the DT if performance starts to decrease (!)
  • Use a separate test set to estimate the ‘true performance’ of the DT selected
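
The stopping rule can be sketched as follows. This is a simplified illustration, not the procedure from the slides: it assumes the validation performance has already been recorded after each expansion step, and the function name `early_stopping_step` and the accuracy values are hypothetical.

```python
def early_stopping_step(validation_scores):
    """Return the last expansion step before validation performance first decreases."""
    best = 0
    for step in range(1, len(validation_scores)):
        if validation_scores[step] < validation_scores[step - 1]:
            break  # performance started to decrease: stop expanding
        best = step
    return best

# Hypothetical validation accuracies measured after each expansion of the tree.
val_acc = [0.60, 0.72, 0.80, 0.83, 0.81, 0.79]

stop = early_stopping_step(val_acc)
print(stop, val_acc[stop])  # step 3, validation accuracy 0.83
```

The tree from the selected step would then be scored once on the separate test set, which played no part in either growing or selecting it.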

Why is there a need for three sets (training set, validation set and test set)?

Hands on session 2
