BIBA - Naïve Bayes - Choosing K
7 important questions on BIBA - Naïve Bayes - Choosing K
What happens when you choose a value for k that is too low, too high, or equal to the number of records in the training dataset?
- k is too low:
- The model may fit the noise in the dataset
- k is too high:
- The method loses its ability to capture the local structure in the dataset, one of its main advantages
- k is equal to the number of records in the training dataset:
- Every record is assigned to the majority class in the training data
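The three cases above can be illustrated with a minimal pure-Python sketch. The function name knn_predict, the toy records, and the use of Euclidean distance are assumptions made for illustration; they do not come from the study material.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    # train is a list of (features, label) pairs; Euclidean distance is assumed
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: a "yes" cluster containing one noisy "no" record,
# and a separate "no" cluster far away.
TRAIN = [
    ((1.0, 1.0), "yes"), ((1.2, 0.9), "yes"), ((0.9, 1.1), "yes"),
    ((1.05, 1.0), "no"),  # noisy record inside the "yes" cluster
    ((5.0, 5.0), "no"), ((5.2, 4.9), "no"), ((4.8, 5.1), "no"), ((5.1, 5.2), "no"),
]
QUERY = (1.06, 1.0)

print(knn_predict(TRAIN, QUERY, 1))           # k too low: follows the noisy record
print(knn_predict(TRAIN, QUERY, 3))           # moderate k: follows the local cluster
print(knn_predict(TRAIN, QUERY, len(TRAIN)))  # k = n: always the overall majority class
```

With k equal to the number of training records, the query point no longer matters: every prediction is the overall majority class of the training data.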
How do you balance the choice of k, how does a complex and irregular data structure affect it, and what are typical values of k?
- Balanced choice depends on the nature of the data
- The more complex and irregular the structure of the data, the lower the optimum value of k
- Typically:
- Values of k fall in the range 1 to 20
- Use odd numbers to avoid ties
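Why odd values help can be seen in a small sketch of the voting step. The helper name vote and the label lists are hypothetical; note that an odd k only rules out ties when there are exactly two classes.

```python
from collections import Counter

def vote(labels):
    """Return (winner, is_tie) for the class labels of the k nearest neighbors."""
    counts = Counter(labels).most_common()
    # A tie occurs when the two most frequent classes have the same count
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

print(vote(["yes", "no", "yes", "no"]))  # k = 4: two votes each, so a tie
print(vote(["yes", "no", "yes"]))        # k = 3: two votes against one, no tie
```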
How do you make a validation dataset?
- Validation dataset:
- Take a subset of the training dataset
- Use it for selecting the model
- Predict the class for the records in validation
- Use different values of k, e.g., equal to 3, 4, 5, etc.
- Choose the k that minimizes the validation error
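The procedure above can be sketched in pure Python as follows. The function names, the toy records, the candidate values of k, and the use of Euclidean distance are all assumptions for illustration, not part of the study material.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training records nearest to `query`."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def validation_error(train, validation, k):
    """Fraction of validation records that receive an incorrect class."""
    mistakes = sum(1 for x, y in validation if knn_predict(train, x, k) != y)
    return mistakes / len(validation)

# Hypothetical records: the validation set is a held-out subset of the labeled data
train = [
    ((1.0, 1.0), "yes"), ((1.5, 1.2), "yes"), ((0.8, 1.4), "yes"), ((2.0, 1.0), "yes"),
    ((4.0, 4.2), "no"), ((4.5, 3.8), "no"), ((3.9, 4.6), "no"),
    ((1.3, 1.1), "no"),  # noisy record
]
validation = [((1.2, 1.0), "yes"), ((1.4, 1.3), "yes"), ((4.2, 4.0), "no"), ((4.4, 4.4), "no")]

# Try several values of k and keep the one with the lowest validation error
candidates = [1, 3, 5]
errors = {k: validation_error(train, validation, k) for k in candidates}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

If several values of k share the lowest error, min returns the first of them in the order the candidates were listed.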
How do you calculate the error rate of a validation dataset?
- Percentage of mistakes, i.e., the share of records that were assigned an incorrect class
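As a worked example of this calculation, here is a minimal sketch. The function name error_rate and the two label lists are made up for illustration.

```python
def error_rate(y_true, y_pred):
    """Share of records whose predicted class differs from the actual class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)

# Hypothetical actual and predicted classes for five validation records
y_true = ["yes", "no", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "no", "no"]

print(f"{error_rate(y_true, y_pred):.0%}")  # 2 of 5 records are wrong: 40%
```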
How can you extend the algorithm to predict continuous values instead of categorical values?
- The first step remains unchanged, i.e., determining the neighbors by computing distances
- The second step, i.e., determining the class through majority voting, must be modified
- Determine the prediction by taking the average outcome value of the k-nearest neighbors
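A minimal sketch of this modification: the neighbor search is unchanged, and only the final step switches from voting to averaging. The function name knn_regress, the toy data, and the Euclidean distance are assumptions for illustration.

```python
import math

def knn_regress(train, query, k):
    """Predict a continuous value as the mean outcome of the k nearest neighbors."""
    # Step 1 (unchanged): find the k nearest records by distance
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    # Step 2 (modified): average the outcome values instead of majority voting
    return sum(value for _, value in neighbors) / len(neighbors)

# Hypothetical records: (features, continuous outcome)
train = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 14.0), ((10.0,), 40.0)]

print(knn_regress(train, (2.1,), 3))  # mean of 12.0, 14.0 and 10.0 -> 12.0
```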
What are advantages of K Nearest Neighbors?
- Simplicity of the method
- Lack of parametric assumptions
- Performs surprisingly well, especially when:
- There is a large enough training set present
- Each class is characterized by multiple combinations of predictor values
How do you code a classification in Python?
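- Import the libraries (not shown in the source):
- import pandas as pd
- from sklearn.model_selection import train_test_split
- from sklearn.neighbors import KNeighborsClassifier
- Load data:
- # load data from the source
- DS = pd.read_csv(r'C:\......\File.csv')
- Then, split columns into dependent and independent variables:
- predictors = ['Outlook', 'Temperature', 'Humidity', 'Wind']
- X = pd.get_dummies(DS[predictors])
- y = DS['PlayTennis'].values
- Use the train_test_split method to separate the records into training and testing sets:
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=109)
- Create the model (not defined in the source; a k-NN classifier with k = 3 is assumed here):
- model = KNeighborsClassifier(n_neighbors=3)
- Train the model:
- model.fit(X_train, y_train)
- Predict values for the dependent variable:
- y_pred = model.predict(X_test)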