K-Nearest Neighbors (KNN) Notes

Key Points:

  1. Definition:

    • K-Nearest Neighbors (KNN) is a simple, non-parametric, lazy learning algorithm used for both classification and regression tasks.
  2. Working Principle:

    • KNN classifies a data point based on how its neighbors are classified.

    • The algorithm stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

  3. K-Value (Number of Neighbors):

    • The number of nearest neighbors (K) is a crucial parameter.

    • Choosing an appropriate K is vital: too small a K makes the model sensitive to noise (overfitting), while too large a K smooths over local structure (underfitting).

  4. Distance Metrics:

    • Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.

    • For categorical variables, Hamming distance is often used.
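
A minimal NumPy sketch of these four metrics (the function names here are illustrative, not taken from any library):

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance: sqrt of summed squared differences (Minkowski, p=2)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute differences (Minkowski, p=1)
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # general form; p=1 recovers Manhattan and p=2 recovers Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def hamming(a, b):
    # for categorical vectors: the number of positions where the values differ
    return int(np.sum(a != b))
```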

  5. Classification Process:

    • Calculate the distance between the query instance and all training samples.

    • Select the K nearest neighbors to the query instance.

    • Assign the query instance the most common class among its K nearest neighbors (majority voting).
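
These three steps translate almost line for line into code. A minimal sketch, assuming `X_train` is an (n, d) NumPy array, `y_train` an array of n labels, and Euclidean distance as an arbitrary choice of metric:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    # Step 1: distance from the query to every training sample
    dists = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    # Step 2: indices of the K nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```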

  6. Regression Process:

    • Similar to classification, but instead of voting for the most frequent class, the prediction is the average of the K nearest neighbors' values.
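
Only the last step changes relative to the classification sketch above: the vote becomes a mean (same assumptions about the array shapes):

```python
import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    dists = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # average the responses of the K nearest neighbors instead of voting
    return float(y_train[nearest].mean())
```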
  7. Advantages:

    • Simple to implement and intuitive to understand.

    • No training phase: the model is ready as soon as the data is stored, so it is quick to set up on small datasets.

    • Accuracy tends to improve as more training data becomes available, and KNN extends naturally to multi-class problems (though prediction cost also grows with the data; see Disadvantages below).

  8. Disadvantages:

    • Computationally expensive during prediction, especially with large datasets.

    • Performance depends heavily on the choice of distance metric and the value of K.

    • Sensitive to irrelevant or redundant features.

  9. Applications:

    • Used in recommendation systems, pattern recognition, and image classification.

Theory Notes on KNN

Basic Theory:

  • Non-Parametric Nature:

    • KNN is non-parametric, meaning it assumes no underlying distribution for the data. This makes it flexible, but also prone to overfitting (especially at small K) if not handled carefully.
  • Lazy Learning Algorithm:

    • Unlike eager learning algorithms (e.g., decision trees, neural networks), KNN does not build an explicit model. Instead, it memorizes the training dataset, making it a lazy learning algorithm.

Detailed Steps:

  1. Data Normalization:

    • To ensure fair distance measurement, features should be normalized (e.g., min-max scaling or z-score normalization).
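
Both scalings in a short NumPy sketch (this assumes no feature is constant, so the denominators are nonzero):

```python
import numpy as np

def min_max_scale(X):
    # rescale each feature (column) to the [0, 1] range
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score_scale(X):
    # center each feature at 0 with unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)
```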
  2. Choosing K:

    • The optimal value of K can be determined using cross-validation. Typically, K is chosen as an odd number to avoid ties in binary classification.
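
One way to run that search with scikit-learn; the dataset, the 5 folds, and the odd-K grid up to 21 are all illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# score each odd K by 5-fold cross-validated accuracy and keep the best
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```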
  3. Algorithm Workflow:

    • Step 1: Store the training examples.

    • Step 2: For a new query point, compute the distance to all training points.

    • Step 3: Identify the K nearest neighbors.

    • Step 4: For classification, assign the class label by majority vote. For regression, compute the average of the K nearest neighbors' responses.
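
The same workflow end to end with scikit-learn. This is a sketch: the iris data, the 80/20 split, and K=5 are arbitrary, and the scaler implements the normalization step from above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 2-4 happen lazily inside predict(); fit() only stores the training set
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```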

Practical Considerations:

  • Handling Missing Values:

    • KNN can handle missing values by excluding the affected features during distance computation, or by imputing them from the values of the K nearest neighbors, as sketched below.
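
scikit-learn ships the imputation variant as `KNNImputer`, which ignores missing coordinates when computing distances. A small sketch with made-up data:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],   # this entry gets imputed
              [5.0, 6.0],
              [7.0, 8.0]])

# each NaN is replaced by the feature's mean over the row's 2 nearest neighbors
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```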
  • Feature Selection:

    • It is important to include only relevant features, for both accuracy and efficiency: irrelevant features distort the distance calculations. A simple selection sketch follows.
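
One common approach (among many) is univariate scoring before fitting KNN; the scorer and `k=2` below are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# keep only the 2 features that best separate the classes (ANOVA F-score)
X_reduced = SelectKBest(f_classif, k=2).fit_transform(X, y)
```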
  • Weighted KNN:

    • Instead of treating all K neighbors equally, weight them based on their distance to the query point, giving closer neighbors more influence.
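
A sketch of inverse-distance weighting on top of the earlier classifier; `eps` is an illustrative guard against division by zero when a neighbor coincides with the query. scikit-learn offers the same idea built in via `KNeighborsClassifier(weights="distance")`.

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, query, k=3, eps=1e-9):
    dists = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    # each neighbor votes with weight 1/distance, so closer points count more
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)
```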