
technical · high difficulty

Design and implement a Python class for a k-nearest neighbors (KNN) classifier from scratch, including methods for fitting the model to training data and predicting labels for new data points. Your implementation should handle both classification and regression tasks and allow for different distance metrics (e.g., Euclidean, Manhattan).

final round · 45-60 minutes

How to structure your answer

The ideal answer structure for this question follows a MECE (Mutually Exclusive, Collectively Exhaustive) approach, broken down into key components:

  • Define the KNNClassifier class, initializing with k and distance_metric.
  • Implement the fit method to store the training data (X_train, y_train).
  • Develop the _calculate_distance private method, supporting both Euclidean and Manhattan metrics.
  • Create the predict method: for each test point, calculate distances to all training points, find the k nearest neighbors, and determine the prediction (majority vote for classification, mean for regression).
  • Include error handling for invalid distance metrics or k values.

This ensures a robust and complete solution covering all requirements.

Sample answer

import numpy as np
from collections import Counter

class KNNClassifier:
    def __init__(self, k=3, distance_metric='euclidean'):
        if not isinstance(k, int) or k <= 0:
            raise ValueError("k must be a positive integer.")
        if distance_metric not in ['euclidean', 'manhattan']:
            raise ValueError("Distance metric must be 'euclidean' or 'manhattan'.")
        self.k = k
        self.distance_metric = distance_metric
        self.X_train = None
        self.y_train = None

    def _calculate_distance(self, p1, p2):
        if self.distance_metric == 'euclidean':
            return np.sqrt(np.sum((p1 - p2)**2))
        elif self.distance_metric == 'manhattan':
            return np.sum(np.abs(p1 - p2))

    def fit(self, X_train, y_train):
        self.X_train = np.array(X_train)
        self.y_train = np.array(y_train)
        if self.k > len(self.X_train):
            raise ValueError("k cannot exceed the number of training samples.")

    def predict(self, X_test):
        if self.X_train is None or self.y_train is None:
            raise RuntimeError("Model has not been fitted yet.")
        
        # Determine task type once, not per test point:
        # non-integer numeric targets are treated as regression.
        is_regression = (np.issubdtype(self.y_train.dtype, np.number)
                         and not np.array_equal(self.y_train, self.y_train.astype(int)))

        predictions = []
        for x_test_point in np.array(X_test):
            distances = [self._calculate_distance(x_test_point, x_train_point)
                         for x_train_point in self.X_train]
            k_nearest_indices = np.argsort(distances)[:self.k]
            k_nearest_labels = [self.y_train[i] for i in k_nearest_indices]

            if is_regression:
                # Regression: mean of the k nearest targets
                predictions.append(np.mean(k_nearest_labels))
            else:
                # Classification: majority vote among the k nearest labels
                most_common = Counter(k_nearest_labels).most_common(1)
                predictions.append(most_common[0][0])
        return np.array(predictions)

# Example Usage:
# Classification
# X_cls = np.array([[1,1],[1,2],[2,2],[3,3],[3,4],[4,4]])
# y_cls = np.array([0,0,0,1,1,1])
# knn_cls = KNNClassifier(k=3, distance_metric='euclidean')
# knn_cls.fit(X_cls, y_cls)
# print("Classification Prediction:", knn_cls.predict([[2.5, 2.5]]))

# Regression
# X_reg = np.array([[1],[2],[3],[4],[5]])
# y_reg = np.array([2.1, 3.9, 6.2, 8.1, 10.3])
# knn_reg = KNNClassifier(k=2, distance_metric='manhattan')
# knn_reg.fit(X_reg, y_reg)
# print("Regression Prediction:", knn_reg.predict([[3.5]]))

Key points to mention

  • Choice of `k` and its impact on the bias-variance trade-off.
  • Computational complexity of prediction: O(N·D) per query for distance calculation (N training samples, D features), plus O(N log N) for a full sort — or O(N log k) with a heap-based selection of the k nearest.
  • Handling ties in classification (e.g., random choice, distance-weighted voting).
  • Normalization/standardization of features before applying distance metrics.
  • The 'curse of dimensionality' and how it affects KNN performance.
  • Memory requirements for storing the entire training dataset.
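The neighbor-selection step does not need a full sort: `np.argpartition` finds the k smallest distances in average O(N) time. A small illustrative sketch (the distance values are made up):

```python
import numpy as np

distances = np.array([4.2, 0.5, 3.1, 0.9, 2.0])
k = 2
# argpartition places the k smallest values in the first k positions
# (in unspecified order) without fully sorting the array.
k_nearest = np.argpartition(distances, k)[:k]
# Sort only those k if the neighbors must be in distance order.
k_nearest = k_nearest[np.argsort(distances[k_nearest])]
print(k_nearest)  # [1 3]
```

Mentioning this swap (and, for very large N, tree- or hash-based indexes such as KD-trees) is an easy way to address the complexity point.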

Common mistakes to avoid

  ✗ Forgetting to handle edge cases like `k` being larger than the number of training samples.
  ✗ Incorrectly implementing distance calculations (e.g., forgetting to take the square root for Euclidean).
  ✗ Not considering the impact of unscaled features on distance metrics.
  ✗ Inefficient sorting or neighbor selection, leading to poor performance.
  ✗ Failing to differentiate between classification (mode) and regression (mean) for predictions.
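The unscaled-features mistake is worth demonstrating concretely: when one feature's range dwarfs another's, it dominates every distance. A minimal standardization sketch (the feature values are invented for illustration):

```python
import numpy as np

# Two features on wildly different scales, e.g. height (cm) and salary.
X = np.array([[170.0, 70000.0],
              [160.0, 80000.0],
              [180.0, 60000.0]])

# Standardize to zero mean and unit variance per feature,
# so each feature contributes comparably to the distance.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

print(X_scaled.std(axis=0))  # [1. 1.]
```

In practice the same `mu` and `sigma` computed on the training set must be reused to scale test points, otherwise train and test live in different spaces.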