K-Nearest Neighbors & Performance measures
38 important questions on K-Nearest Neighbors & Performance measures
What is the idea of k-Nearest Neighbors?
Find the k records in the training data that are most similar to the new record, and use these neighbors (i.e., these k records) to classify it.
Assign the new record to the predominant class among these neighbors.
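A minimal sketch of this idea in Python, assuming standardized numeric predictors; the function name and the toy data are made up for illustration:

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training records."""
    # Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training records
    nearest = np.argsort(distances)[:k]
    # Predominant class among these k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two predictors, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([1.2, 1.9])))  # -> "A"
```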
What are the different parts of kNN?
- Choosing the number of neighbors, i.e., value k
- Computing classification (for a categorical outcome) or prediction (for a numerical outcome)
What is triangle inequality?
A requirement for a proper distance measure: for any three records a, b, and c, the distance satisfies d(a, c) ≤ d(a, b) + d(b, c), i.e., the direct distance between two records can never exceed the distance via a third record.
What is the Euclidean Distance?
The straight-line distance between two records x and y with p predictors: d(x, y) = sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_p - y_p)^2).
What is meant when it is said that the Euclidean Distance is highly scale dependent?
Predictors measured on larger scales (e.g., income in dollars versus age in years) dominate the distance, so predictors should be standardized (e.g., converted to z-scores) before distances are computed.
What is a good option when there are many outliers with the Euclidean Distance?
Using the Manhattan distance: because the Euclidean distance squares the differences, it is very sensitive to outliers, whereas the Manhattan distance uses absolute differences and is more robust.
What is the difference between the Manhattan distance and the Euclidean distance?
The Manhattan distance sums the absolute differences across predictors, |x_1 - y_1| + ... + |x_p - y_p|, while the Euclidean distance sums the squared differences and takes the square root.
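A short sketch contrasting the two distances and showing the scale effect; the numbers are made up for illustration:

```python
import numpy as np

def euclidean(x, y):
    # Square root of the sum of squared differences
    return np.sqrt(((x - y) ** 2).sum())

def manhattan(x, y):
    # Sum of absolute differences
    return np.abs(x - y).sum()

# Age in years vs. income in dollars: income dominates the raw distance
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])
print(euclidean(a, b))  # ~1000.6, driven almost entirely by income
print(manhattan(a, b))  # 1035.0

# Standardizing to z-scores puts both predictors on comparable scales
X = np.array([[25.0, 50_000.0], [60.0, 51_000.0], [40.0, 30_000.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(Z[0], Z[1]))  # both predictors now contribute
```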
What is the problem when k is too low?
The classification becomes very sensitive to noise: with k = 1, a single mislabeled or atypical training record can determine the prediction (overfitting).
What is the problem when k is too high?
The neighborhood becomes so broad that the local structure in the data is smoothed away, and the method misses exactly the patterns it is meant to capture.
What is the problem when k is the number of records in the training dataset?
Every new record then has the same neighbors (all training records), so kNN degenerates into the naive rule: assign every record to the majority class of the training data.
How to choose the value for k?
Try a range of values for k, classify the records in a validation set with each, and pick the k that gives the best validation performance (e.g., the lowest classification error).
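A sketch of this procedure with scikit-learn, assuming a simple train/validation split; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Score each candidate k on the validation set and keep the best one
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
          for k in range(1, 21)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```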
What is the validation set?
A portion of the data, held out from training, that is used to assess performance and tune parameters (such as k) during model building; it is separate from the test set used for the final evaluation.
What is the difference in the approach for kNN with numerical outcomes versus categorical outcomes?
For a categorical outcome, the new record is assigned the majority class among its k neighbors; for a numerical outcome, it is assigned the (possibly distance-weighted) average of the neighbors' outcome values.
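For the numerical case, a minimal sketch of neighbor averaging (hypothetical toy data):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a numerical outcome as the average of the k nearest neighbors."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([100.0, 110.0, 120.0, 400.0])
print(knn_predict(X_train, y_train, np.array([2.5])))  # mean of 100, 110, 120 = 110.0
```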
What are the advantages of kNN?
- Simplicity of the method
- Lack of parametric assumptions
In what situations does kNN work especially well?
- When there is a large enough training set present
- When each class is characterized by multiple combinations of predictor values
What are the shortcomings of kNN?
- Computing the nearest neighbors can be time consuming
- For every record to be predicted, we compute its distance from the entire set of training records only at the time of prediction ("lazy learner")
- The number of records required in the training set to qualify as "large" increases exponentially with the number of predictors (the curse of dimensionality)
What are possible solutions for the first shortcoming?
- Reduce the time taken to compute distances by working in fewer dimensions, generated using dimension reduction techniques
- Speed up the identification of the nearest neighbors by using specialized data structures (e.g., search trees)
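A sketch of the second idea with scikit-learn, which can index the training records in a k-d tree or ball tree so that neighbor lookups avoid scanning every record:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))

# Index the training records once; queries then avoid a full linear scan
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X_train)
distances, neighbor_ids = index.kneighbors(rng.normal(size=(1, 5)))
print(neighbor_ids)  # indices of the 5 nearest training records
```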
What is the biggest evaluation issue?
Error on the training data is not a good indicator of performance on future data, because the model was fitted to exactly those records; evaluation must use data the model has not seen.
What are the three steps for classification?
- Split data into train and test sets
- Build a model on the training set
- Evaluate it on the test set
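These three steps, sketched with scikit-learn on synthetic data (illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=1)

# 1. Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# 2. Build a model on the training set
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# 3. Evaluate it on the test set
print(model.score(X_test, y_test))  # accuracy on unseen records
```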
What is parameter tuning?
Some training schemes operate in two stages:
- build the basic structure
- optimize parameter settings
The test data cannot be used for parameter tuning. Proper procedure uses three sets:
- training data
- validation data
- test data
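A sketch of the three-set procedure, splitting twice with scikit-learn (the split sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=2)

# Carve off the test set first; it is used only once, for the final evaluation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
# Split the remainder into training data (fitting) and validation data (tuning)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=2)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```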
Why is a validation set created?
So that parameters can be tuned and models compared without touching the test data; this keeps the test-set error an unbiased estimate of performance on new data.
What is supervised learning?
Learning a model from records whose outcome values (labels) are known, so that the model can predict the outcome for new records where it is unknown.
What are the three main types of outcomes?
- Predicted numerical value, e.g., house price
- Predicted class membership, e.g., cancer or not
- Probability of class membership, e.g., Naive Bayes
How do we measure accuracy when generating numeric predictions?
Compare the predicted values with the actual values on the validation set and summarize the prediction errors e_i = y_i - ŷ_i with an aggregate measure.
Which measures of prediction accuracy are commonly used?
- Mean absolute error: MAE = (1/n) Σ |e_i|
- Mean error: ME = (1/n) Σ e_i (signed, so it reveals systematic over- or underprediction)
- Mean percentage error: MPE = (100/n) Σ e_i / y_i
- Mean absolute percentage error: MAPE = (100/n) Σ |e_i / y_i|
- Root mean squared error: RMSE = sqrt((1/n) Σ e_i^2)
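A sketch computing these measures with NumPy (toy arrays):

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])
e = y_true - y_pred  # prediction errors

mae = np.mean(np.abs(e))                  # mean absolute error
me = np.mean(e)                           # mean error (signed; reveals bias)
mpe = 100 * np.mean(e / y_true)           # mean percentage error
mape = 100 * np.mean(np.abs(e / y_true))  # mean absolute percentage error
rmse = np.sqrt(np.mean(e ** 2))           # root mean squared error
print(mae, me, mpe, mape, rmse)
```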
What is a lift chart?
The goal is to search for a subset of records that gives the highest cumulative predicted values.
It compares the model's predictive performance to a baseline model that has no predictive power (random selection).
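A sketch of how a cumulative lift (gains) curve can be computed, assuming predicted probabilities and 0/1 actual outcomes (toy data):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])  # actual responses
y_prob = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

# Sort records from highest to lowest predicted probability
order = np.argsort(-y_prob)
cumulative_hits = np.cumsum(y_true[order])                 # the lift (gains) curve
baseline = np.arange(1, len(y_true) + 1) * y_true.mean()   # no-model diagonal
print(cumulative_hits)  # model: e.g., 2 of the 4 responders in the top 3 records
print(baseline)         # random selection expects 0.4 responders per record
```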
What is the lift factor?
The ratio between the results obtained with the model and the results obtained without it; e.g., if the top decile selected by the model contains three times as many responders as a randomly chosen decile, the lift factor is 3.
What is a natural criterion for judging the performance of a classifier?
The misclassification rate: the proportion of records it assigns to the wrong class.
What is a confusion/classification matrix?
A table whose rows and columns correspond to the predicted and true (actual) classes; each cell counts the records with that combination, so correct classifications lie on the diagonal.
What is the overall success rate / accuracy?
The number of correct classifications divided by the total number of classifications; for two classes: (TP + TN) / (TP + TN + FP + FN).
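A sketch with scikit-learn (hypothetical labels); note that the row/column convention varies by tool, and scikit-learn puts the actual classes in the rows:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # [[3 1], [1 3]]
print(accuracy_score(y_true, y_pred))    # 6 correct / 8 total = 0.75
```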
What are ROC curves?
They characterize the trade-off between positive hits and false alarms.
An ROC curve plots the true positive rate against the false positive rate for the different possible thresholds of a diagnostic test.
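A sketch with scikit-learn, assuming predicted probabilities for the positive class (toy data):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # one point per threshold
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_prob))  # area under the ROC curve
```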
What are the limitations of accuracy?
It assumes equal costs for all types of misclassification error, and with imbalanced classes it is misleading: always predicting the majority class already yields high accuracy.
How is this solved? --> With a cost matrix
Each type of error is assigned its own cost (e.g., a false negative in cancer screening is costlier than a false positive), and the classifier is judged by its total or expected cost rather than by raw accuracy.
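A sketch of scoring with a cost matrix; the costs here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# cost[actual][predicted]: missing a positive (1 -> 0) is five times worse here
cost = np.array([[0, 1],
                 [5, 0]])
cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
print((cm * cost).sum())  # 1 false positive * 1 + 1 false negative * 5 = 6
```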
What is multi class prediction?
Classification with more than two possible classes; the confusion matrix then has one row and one column per class, and accuracy is still the sum of the diagonal cells divided by the total.
What is the Kappa statistic?
A measure of agreement between predicted and actual classes that corrects for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the accuracy a chance classifier with the same class distribution would achieve.
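A sketch computing kappa both manually and with scikit-learn's cohen_kappa_score (toy labels):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()
p_o = np.trace(cm) / n                                 # observed accuracy
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # agreement expected by chance
print((p_o - p_e) / (1 - p_e))            # 0.5
print(cohen_kappa_score(y_true, y_pred))  # same value via scikit-learn
```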
What does precision measure?
The fraction of records predicted to be relevant (positive) that actually are: precision = TP / (TP + FP).
Why is determining recall difficult?
Recall = TP / (TP + FN) requires knowing the total number of relevant items, including those the model missed; in many settings (e.g., retrieving relevant documents from a large collection) that total is unknown.
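When the true labels are fully known, both measures are straightforward to compute; a sketch with scikit-learn (hypothetical labels):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
```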
What is the solution when determining recall is too difficult?
- Sample across the dataset and perform relevance judgement on these items
- Apply different models to the same dataset and then use the aggregate of relevant items as the total relevant set
Why would 170 not be a suitable value for k?
- Because it is not an odd number and we might get ties
- Because it will assign all records to the majority class of the training data
- Because it is too high and we might not capture the local data structure