K-Nearest Neighbors (KNN) Notes

Key Points:

  1. Definition:

    • K-Nearest Neighbors (KNN) is a simple, non-parametric, lazy learning algorithm used for both classification and regression tasks.
  2. Working Principle:

    • KNN classifies a data point based on how its neighbors are classified.

    • The algorithm stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

  3. K-Value (Number of Neighbors):

    • The number of nearest neighbors (K) is a crucial parameter.

    • Choosing an appropriate K is vital: too small a K makes the model sensitive to noise (overfitting), while too large a K smooths over local structure (underfitting).

  4. Distance Metrics:

    • Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.

    • For categorical variables, Hamming distance is often used.
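
A minimal NumPy sketch of these four metrics (the function names here are illustrative, not taken from any library):

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance: sqrt of summed squared differences (Minkowski, p=2)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute differences (Minkowski, p=1)
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # general form; p=1 recovers Manhattan and p=2 recovers Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def hamming(a, b):
    # for categorical vectors: the number of positions where the values differ
    return int(np.sum(a != b))
```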

  5. Classification Process:

    • Calculate the distance between the query instance and all training samples.

    • Select the K nearest neighbors to the query instance.

    • Assign the query instance the most common class among its K nearest neighbors (majority voting).
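
These three steps translate almost line for line into code. A minimal sketch, assuming `X_train` is an (n, d) NumPy array, `y_train` an array of n labels, and Euclidean distance as an arbitrary choice of metric:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    # Step 1: distance from the query to every training sample
    dists = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    # Step 2: indices of the K nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```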

  6. Regression Process:

    • Similar to classification, but instead of voting for the most frequent class, the prediction is the average of the K nearest neighbors' values.
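
Only the last step changes relative to the classification sketch above: the vote becomes a mean (same assumptions about the array shapes):

```python
import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    dists = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # average the responses of the K nearest neighbors instead of voting
    return float(y_train[nearest].mean())
```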
  7. Advantages:

    • Simple to implement and intuitive to understand.

    • No training phase: the model is ready as soon as the data is stored, so it is quick to set up on small datasets.

    • Accuracy tends to improve as more training data becomes available, and KNN extends naturally to multi-class problems (though prediction cost also grows with the data; see Disadvantages below).

  8. Disadvantages:

    • Computationally expensive during prediction, especially with large datasets.

    • Performance depends heavily on the choice of distance metric and the value of K.

    • Sensitive to irrelevant or redundant features.

  9. Applications:

    • Used in recommendation systems, pattern recognition, and image classification.

Theory Notes on KNN

Basic Theory:

  • Non-Parametric Nature:

    • KNN is non-parametric, meaning it assumes no underlying distribution for the data. This makes it flexible, but also prone to overfitting (especially at small K) if not handled carefully.
  • Lazy Learning Algorithm:

    • Unlike eager learning algorithms (e.g., decision trees, neural networks), KNN does not build an explicit model. Instead, it memorizes the training dataset, making it a lazy learning algorithm.

Detailed Steps:

  1. Data Normalization:

    • To ensure fair distance measurement, features should be normalized (e.g., min-max scaling or z-score normalization).
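
Both scalings in a short NumPy sketch (this assumes no feature is constant, so the denominators are nonzero):

```python
import numpy as np

def min_max_scale(X):
    # rescale each feature (column) to the [0, 1] range
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score_scale(X):
    # center each feature at 0 with unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)
```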
  2. Choosing K:

    • The optimal value of K can be determined using cross-validation. Typically, K is chosen as an odd number to avoid ties in binary classification.
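
One way to run that search with scikit-learn; the dataset, the 5 folds, and the odd-K grid up to 21 are all illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# score each odd K by 5-fold cross-validated accuracy and keep the best
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```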
  3. Algorithm Workflow:

    • Step 1: Store the training examples.

    • Step 2: For a new query point, compute the distance to all training points.

    • Step 3: Identify the K nearest neighbors.

    • Step 4: For classification, assign the class label by majority vote. For regression, compute the average of the K nearest neighbors' responses.
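
The same workflow end to end with scikit-learn. This is a sketch: the iris data, the 80/20 split, and K=5 are arbitrary, and the scaler implements the normalization step from above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 2-4 happen lazily inside predict(); fit() only stores the training set
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```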

Practical Considerations:

  • Handling Missing Values:

    • KNN can handle missing values by excluding the affected features during distance computation, or by imputing them from the values of the K nearest neighbors, as sketched below.
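
scikit-learn ships the imputation variant as `KNNImputer`, which ignores missing coordinates when computing distances. A small sketch with made-up data:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],   # this entry gets imputed
              [5.0, 6.0],
              [7.0, 8.0]])

# each NaN is replaced by the feature's mean over the row's 2 nearest neighbors
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```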
  • Feature Selection:

    • It is important to include only relevant features, for both accuracy and efficiency: irrelevant features distort the distance calculations. A simple selection sketch follows.
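
One common approach (among many) is univariate scoring before fitting KNN; the scorer and `k=2` below are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# keep only the 2 features that best separate the classes (ANOVA F-score)
X_reduced = SelectKBest(f_classif, k=2).fit_transform(X, y)
```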
  • Weighted KNN:

    • Instead of treating all K neighbors equally, weight them based on their distance to the query point, giving closer neighbors more influence.
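
A sketch of inverse-distance weighting on top of the earlier classifier; `eps` is an illustrative guard against division by zero when a neighbor coincides with the query. scikit-learn offers the same idea built in via `KNeighborsClassifier(weights="distance")`.

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, query, k=3, eps=1e-9):
    dists = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # each neighbor votes with weight 1/distance, so closer points count more
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)
```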