BIBA - Naïve Bayes - Choosing K

7 important questions on BIBA - Naïve Bayes - Choosing K

What happens when you choose a value for k that is too low, too high, or equal to the number of records in the training dataset?

  • k is too low:
    • may be fitting to the noise in the dataset
  • k is too high:
    • miss out on the method’s ability to capture the local structure in the dataset, one of its main advantages
  • k equals the number of records in the training dataset:
    • all records are assigned to the majority class in the training data
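
The three cases above can be illustrated with a minimal sketch. The function knn_predict and the toy data are hypothetical and written only for illustration; they assume Euclidean distance and simple majority voting.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    # Rank training records by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class "A" lies near 0-3, class "B" near 10-11,
# plus one noisy "B" record (1.4) sitting inside the "A" cluster.
train_X = [(0.0,), (1.0,), (2.0,), (3.0,), (1.4,), (10.0,), (11.0,)]
train_y = ["A", "A", "A", "A", "B", "B", "B"]

low_k = knn_predict(train_X, train_y, (1.5,), k=1)              # follows the noisy record
moderate_k = knn_predict(train_X, train_y, (1.5,), k=3)         # noise is outvoted
full_k = knn_predict(train_X, train_y, (10.5,), k=len(train_X)) # majority class wins

print(low_k, moderate_k, full_k)  # B A A
```

With k=1 the query next to the noisy record is labeled "B". With k equal to the number of training records, even a query in the middle of the "B" cluster is labeled "A", because "A" is the majority class.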

How do you balance the choice of k when the structure of the data is complex and irregular, and what are typical values of k?

  • Balanced choice depends on the nature of the data
  • The more complex and irregular the structure of the data, the lower the optimum value of k
  • Typically:
    • Values of k fall in the range 1 to 20
    • Use odd numbers to avoid ties
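
A small sketch of why odd values of k help. The helper vote and the neighbor labels are hypothetical. Note that an odd k only rules out ties when there are two classes.

```python
from collections import Counter

def vote(labels):
    """Return (winner, is_tie) for a list of neighbor class labels."""
    counts = Counter(labels).most_common()
    # A tie occurs when the two most frequent classes have the same count.
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

# Hypothetical class labels of the neighbors, ordered nearest first.
neighbors = ["Yes", "No", "Yes", "No", "No"]

even_result = vote(neighbors[:4])  # k = 4: two votes each
odd_result = vote(neighbors[:3])   # k = 3: two votes against one

print(even_result)  # ('Yes', True)
print(odd_result)   # ('Yes', False)
```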

How do you make a validation dataset?

  • Validation dataset:
    • Take a subset of the training dataset
    • Use it for selecting the model
  • Predict the class for the records in the validation dataset
  • Use different values of k, e.g., equal to 3, 4, 5, etc.
  • Choose the k that minimizes the validation error
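
The validation procedure can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the synthetic data, the size of the held-out subset, the candidate values of k, and the helper names knn_predict and validation_error are all hypothetical.

```python
from collections import Counter
import math
import random

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def validation_error(train_X, train_y, valid_X, valid_y, k):
    """Fraction of validation records assigned an incorrect class."""
    mistakes = sum(
        knn_predict(train_X, train_y, x, k) != y for x, y in zip(valid_X, valid_y)
    )
    return mistakes / len(valid_y)

# Hypothetical synthetic data: two overlapping clusters in two dimensions.
random.seed(0)
records = [((random.gauss(0, 1), random.gauss(0, 1)), "A") for _ in range(60)]
records += [((random.gauss(2, 1), random.gauss(2, 1)), "B") for _ in range(60)]
random.shuffle(records)

# Hold out a subset of the training records as the validation dataset.
valid, train = records[:30], records[30:]
train_X, train_y = zip(*train)
valid_X, valid_y = zip(*valid)

# Try several values of k and keep the one with the lowest validation error.
candidate_ks = [1, 3, 5, 7, 9, 11]
errors = {k: validation_error(train_X, train_y, valid_X, valid_y, k) for k in candidate_ks}
best_k = min(errors, key=errors.get)

for k in candidate_ks:
    print(f"k={k}: validation error {errors[k]:.3f}")
print("chosen k:", best_k)
```

If several values of k give the same lowest error, this sketch keeps the smallest of them, which is a design choice rather than a rule.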

How do you calculate the error rate of a validation dataset?

Error rate:
  • Percentage of mistakes, i.e., records that were assigned an incorrect class
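
As a worked example of this definition, a minimal sketch. The function name error_rate and the labels are hypothetical.

```python
def error_rate(actual, predicted):
    """Return the percentage of records assigned an incorrect class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mistakes = sum(a != p for a, p in zip(actual, predicted))
    return 100.0 * mistakes / len(actual)

# Hypothetical validation labels and predictions.
actual = ["Yes", "No", "Yes", "Yes", "No"]
predicted = ["Yes", "Yes", "Yes", "No", "No"]

rate = error_rate(actual, predicted)
print(rate)  # 2 mistakes out of 5 records -> 40.0
```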

How can you extend the algorithm to predict continuous values instead of categorical values?

  • The first step, i.e., determining the neighbors by computing distances, remains unchanged
  • The second step, i.e., determining the class through majority voting, must be modified
  • Determine the prediction by taking the average outcome value of the k nearest neighbors
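
A minimal sketch of this modification, assuming Euclidean distance and an unweighted average. The function knn_regress and the data are hypothetical.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict a continuous value as the mean outcome of the k nearest neighbors."""
    # Step 1 (unchanged): find the neighbors by computing distances.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    neighbors = order[:k]
    # Step 2 (modified): average the outcomes instead of taking a majority vote.
    return sum(train_y[i] for i in neighbors) / k

# Hypothetical training data with a numeric outcome.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

prediction = knn_regress(train_X, train_y, (2.2,), k=3)
print(prediction)  # mean of 20.0, 30.0 and 10.0 -> 20.0
```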

What are the advantages of k-nearest neighbors?

  • Simplicity of the method
  • Lack of parametric assumptions
  • Performs surprisingly well, especially when:
    • There is a large enough training set present
    • Each class is characterized by multiple combinations of predictor values

How do you code a classification in Python?

  • Import the libraries:
    • import pandas as pd
    • from sklearn.model_selection import train_test_split
    • from sklearn.neighbors import KNeighborsClassifier
  • Load data from the source:
    • DS = pd.read_csv(r'C:\......\File.csv')
  • Then, split columns into dependent and independent variables:
    • predictors = ['Outlook', 'Temperature', 'Humidity', 'Wind']
    • X = pd.get_dummies(DS[predictors])
    • y = DS['PlayTennis'].values
  • Use train_test_split to separate records into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=109)
  • Create the model (assumed here to be a k-nearest-neighbors classifier; k = 3 is only an illustrative choice):
    • model = KNeighborsClassifier(n_neighbors=3)
  • Train the model:
    • model.fit(X_train, y_train)
  • Predict values for the dependent variable:
    • y_pred = model.predict(X_test)
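
The steps can be put together into one script. This is a sketch under stated assumptions: the CSV file is replaced by a small inline table with the same column names so that the script runs on its own, and the classifier (KNeighborsClassifier with n_neighbors=3) is an assumption, since the steps above do not say which model is used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: a small inline table stands in for the CSV file.
DS = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
             "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                   "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Split columns into independent (X) and dependent (y) variables.
predictors = ["Outlook", "Temperature", "Humidity", "Wind"]
X = pd.get_dummies(DS[predictors], dtype=float)  # one-hot encode the categories
y = DS["PlayTennis"].values

# Separate the records into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=109
)

# Assumed model: k-nearest neighbors with an illustrative k of 3.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict the dependent variable and compute the error rate on the test set.
y_pred = model.predict(X_test)
test_error = (y_pred != y_test).mean()
print("predictions:", list(y_pred))
print("test error rate:", test_error)
```

With only 14 records the test set holds 3 records, so the error rate here is only a rough indication.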
