Dataset complexity

15 important questions on Dataset complexity

The complexity of your problem consists mainly of two parts:

- Inherent complexity: do you expect a complex decision boundary?
- Representativeness: is your dataset representative enough of your problem?

What do you check to see if your problem is high dimensional?

Check whether the number of features (variables) is significantly larger than the number of samples

What can we say about the dimensionality when you have more features than samples?

A linear classifier can then perfectly separate your training data: one feature per sample is already sufficient to fit every label. Hence the problem is high dimensional = complex
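The claim above can be checked with a quick sketch: with more features than samples, a linear classifier separates even random labels perfectly. The synthetic data and `LinearSVC` settings here are illustrative assumptions.

```python
# Sketch: p > n implies (almost surely) linear separability of random labels.
# Data and classifier settings are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))   # 30 samples, 100 features (p > n)
y = rng.integers(0, 2, 30)       # random labels with no real structure

# Large C ~ hard-margin: the classifier tries to fit the training set exactly.
clf = LinearSVC(C=1e6, max_iter=100_000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Perfect training accuracy on structureless labels is exactly why a high feature-to-sample ratio signals complexity: the training fit tells you almost nothing about generalization.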

How do we measure complexity?

Learning curve

What do learning curves tell us?

Complex classifiers perform well when you have a sufficient number of training objects.
When only a small number of training objects is available, you overfit (overtrain).
Use a simple classifier when you don't have many training examples.
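A learning curve like this can be computed with scikit-learn's `learning_curve` helper; the dataset and classifier chosen here are illustrative assumptions.

```python
# Sketch: evaluate a classifier at increasing training-set sizes.
# Dataset (digits) and classifier (RBF SVM) are illustrative choices.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  test={te:.3f}")
```

The gap between training and test score at small `n` is the overfitting region; as `n` grows the test score rises and the gap closes.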

Is there something more general to quantify the complexity? Learning curves are specific per dataset?

The bias-variance decomposition: mean squared error = squared bias + variance (+ irreducible noise)
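The decomposition can be estimated empirically by refitting a model on many resampled training sets; the true function, noise level, and polynomial degrees below are illustrative assumptions.

```python
# Sketch: estimate bias^2 and variance of polynomial fits by resampling.
# True function, noise level, and degrees are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0, 1, 50)

def experiment(degree, n_repeats=200, n_samples=30, noise=0.3):
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_samples)
        y = true_f(x) + rng.normal(0, noise, n_samples)
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for d in (1, 3, 9):
    b2, var = experiment(d)
    print(f"degree={d}  bias^2={b2:.3f}  variance={var:.3f}")
```

A rigid degree-1 model shows high bias and low variance; a flexible degree-9 model shows the opposite, matching the two definitions below.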

What does a high bias mean?

Classifier favors a specific solution and will consistently produce this solution

What does high variance mean?

Classifier is more flexible, thus the solution boundary may differ

What is L2 regularization?

Adding the sum of squared weights as a penalty to the loss function. Used if we want small changes in the input to not influence the output too much (it keeps the weights small)

What does L2 do to the weights?

Encourages weight values to decay towards 0, unless supported by the data. Also known as parameter shrinkage

What is Ridge regression?

The combination of L2 and linear regression
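The shrinkage effect of Ridge regression can be seen by comparing its weights with plain least squares; the synthetic data and `alpha` value below are illustrative assumptions.

```python
# Sketch: Ridge (L2-penalized linear regression) shrinks the weights
# compared with plain least squares. Data and alpha are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))            # more features than samples
w_true = np.zeros(50)
w_true[:5] = 1.0
y = X @ w_true + rng.normal(0, 0.1, 20)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)       # alpha scales the L2 penalty

# The L2 penalty decays the weight vector toward 0 (parameter shrinkage).
print("||w_ols|| =", np.linalg.norm(ols.coef_))
print("||w_ridge|| =", np.linalg.norm(ridge.coef_))
```

Increasing `alpha` shrinks the weights further; `alpha=0` recovers ordinary linear regression.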

What does L1 do to the weights?

Encourages weights to become exactly 0 = sparse model. Especially useful if the number of features is much larger than the number of samples
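The sparsity effect can be demonstrated with scikit-learn's `Lasso` (L1-penalized linear regression); the synthetic data and `alpha` value are illustrative assumptions.

```python
# Sketch: L1 regularization (Lasso) drives most weights exactly to zero.
# Synthetic data and alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))           # far more features than samples
w_true = np.zeros(200)
w_true[:3] = 2.0                         # only 3 features truly matter
y = X @ w_true + rng.normal(0, 0.1, 40)

lasso = Lasso(alpha=0.1).fit(X, y)

# Most of the 200 weights are exactly zero; only a few survive.
print("non-zero weights:", np.sum(lasso.coef_ != 0))
```

Unlike the L2 penalty, which only shrinks weights toward zero, the L1 penalty sets many of them to exactly zero, effectively performing feature selection.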

Which hyperparameters to optimize in SVM?

- Kernel type
- Kernel parameters
- Slack

How to tune hyperparameters?

- Grid search
- Randomized search
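Both search strategies are available in scikit-learn; a grid search over the three SVM hyperparameters listed above might look like this (the grid values and dataset are illustrative assumptions).

```python
# Sketch: grid search over SVM kernel type, kernel parameter, and slack.
# Parameter-grid values and dataset are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "kernel": ["linear", "rbf"],   # kernel type
    "gamma": [0.01, 0.1, 1.0],     # kernel parameter (used by rbf)
    "C": [0.1, 1.0, 10.0],         # slack penalty
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` has the same interface but samples a fixed number of parameter settings instead of trying every combination, which scales better to large grids.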

What are the issues with hyperparameter optimization?

- No guarantees (best option may not be in samples)
- Computationally expensive
- Randomness
- Overfitting
