Decision Tree Induction
6 important questions on Decision Tree Induction
Given: suppose (according to slide 8) we have an information source that generates 4 symbols with probabilities, respectively, p1 = 1/2, p2 = 1/4, p3 = 1/8, p4 = 1/8.
Using the (‘directly decodable’) code s1 ~ 0, s2 ~ 10, s3 ~ 110, s4 ~ 111,
a) calculate the expected code length per symbol (note that each code length here equals the symbol's self-information),
b) compare the result of a) with the value of the entropy, and
c) conclude about what you have discovered.
a. Expected code length = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits per symbol.
b. Info(S) = 1/2·I(s1) + 1/4·I(s2) + 1/8·I(s3) + 1/8·I(s4) = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits, i.e. the entropy equals the expected code length.
c. “expected information” = “weighted sum (average) of self-information”
•“Weights” are the probabilities!
•Because every pi here is a power of 1/2, each code length matches the self-information exactly, so the code achieves the entropy: it is an optimal code for this source.
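To make the arithmetic explicit, here is a minimal Python sketch (an addition, not from the slides) that reproduces a) and b) for this source:

import math

# Symbol probabilities and the lengths of the prefix code
# s1 ~ 0, s2 ~ 10, s3 ~ 110, s4 ~ 111.
probs = [1/2, 1/4, 1/8, 1/8]
code_lengths = [1, 2, 3, 3]

# a) expected code length: probability-weighted average of the code lengths
expected_length = sum(p * l for p, l in zip(probs, code_lengths))

# b) entropy: probability-weighted average of the self-information -log2(p)
entropy = sum(-p * math.log2(p) for p in probs)

print(expected_length)  # 1.75
print(entropy)          # 1.75 -> the expected code length equals the entropy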
What is H and what is its maximum value?
H is the entropy: the expected information of a probability distribution, H = −Σ pi·log2 pi.
Note that H is a function of a probability distribution!
Which means H can also be applied to conditional probabilities (more later).
For n possible outcomes the maximum is H = log2 n, reached when all outcomes are equally probable.
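A quick numerical check (illustrative, not from the slides) of the range of H for n = 4 outcomes:

import math

def H(probs):
    # Entropy of a discrete distribution, in bits.
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(H([1/2, 1/4, 1/8, 1/8]))  # 1.75
print(H([1/4, 1/4, 1/4, 1/4]))  # 2.0 -> the maximum, log2(4), for the uniform distribution
print(H([1.0, 0.0, 0.0, 0.0]))  # 0.0 -> a certain outcome carries no information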
When does an attribute A have the highest information gain, and will that attribute be chosen?
The attribute whose split yields the largest reduction in entropy has the highest information gain, and yes, that attribute is chosen as the split attribute at the current node (see the sketch below).
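A minimal Python sketch of this selection rule, using made-up attribute names and toy data (the helper functions below are illustrative, not the course's code):

import math

def entropy(labels):
    # Entropy of the class-label distribution in a node, in bits.
    n = len(labels)
    return sum(-(labels.count(c) / n) * math.log2(labels.count(c) / n)
               for c in set(labels))

def information_gain(rows, labels, attribute):
    # Parent entropy minus the weighted entropy of the children
    # obtained by splitting on `attribute` (a key into each row dict).
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted_children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted_children

rows = [{'outlook': 'sunny', 'windy': False},
        {'outlook': 'sunny', 'windy': True},
        {'outlook': 'rain',  'windy': False},
        {'outlook': 'rain',  'windy': True}]
labels = ['no', 'no', 'yes', 'yes']

print(information_gain(rows, labels, 'outlook'))  # 1.0 -> highest gain, chosen for the split
print(information_gain(rows, labels, 'windy'))    # 0.0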
What is a fundamental problem of all learning algorithms?
Overfitting: the learned hypothesis fits the training data (including its noise) so closely that it generalises poorly to unseen data.
How would you counteract overfitting?
Use a training set to learn/induce a (hypothesis) tree
Use a separate validation set to select the ‘best performing’ DT:
•e.g., measure performance on validation set while expanding the DT
•stop expansion of the DT if performance starts to decrease (!)
•Use a separate test set to estimate the ‘true performance’ of the DT selected (see the sketch below)…
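A rough Python sketch of this train/validation/test recipe, assuming scikit-learn and a synthetic dataset (controlling tree size via max_depth stands in for ‘expanding the DT’ step by step):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Three disjoint sets: train (60%), validation (20%), test (20%).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_depth, best_val = None, -1.0
for depth in range(1, 21):                      # grow ('expand') deeper and deeper trees
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)                  # learn/induce on the training set only
    val_acc = tree.score(X_val, y_val)          # select the best tree on the validation set
    if val_acc > best_val:
        best_depth, best_val = depth, val_acc

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print('chosen depth:', best_depth)
print('test accuracy:', final.score(X_test, y_test))  # estimate of the 'true performance'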
Why is there a need for three sets, namely a training set, a validation set and a test set?
Each set plays a different role: the training set is used to induce the tree, the validation set is used to choose between candidate trees (so its performance estimate becomes optimistically biased by that choice), and the test set is kept untouched so that it can still give an unbiased estimate of the true performance.