
K-Nearest Neighbours Explained — How It Works, When to Use It, and Why It Fails

In Plain English 🔥
Imagine you move to a new city and want to know if a neighbourhood is safe. You don't read a 500-page report — you just ask the 5 nearest neighbours what they think and go with the majority vote. That's literally KNN. The algorithm asks: 'What are the K closest data points to this new one, and what category do THEY belong to?' Whatever category wins the vote, that's the prediction.

Every time Netflix decides you'll probably enjoy a thriller because your three closest 'taste-twin' users loved it, or a hospital flags a patient as high-risk because their lab results closely resemble previous high-risk patients, something like KNN is quietly at work. It's one of the oldest tricks in machine learning, and it's still genuinely useful in 2024 — not because it's flashy, but because it's intuitive, requires zero training time, and can handle surprisingly complex decision boundaries that stump simpler models.

Most classification algorithms learn a fixed set of parameters during training — think of weights in a neural network or split thresholds in a decision tree. KNN doesn't do that. It memorises the entire training dataset and defers all the hard thinking until prediction time. This makes it what researchers call a 'lazy learner'. That laziness is both its superpower and its Achilles heel: no training phase means instant adaptability to new data, but prediction time scales with dataset size, which becomes painful fast.

By the end of this article you'll understand exactly how KNN makes predictions, why the choice of K and the distance metric matter more than most tutorials admit, how to implement it from scratch in Python so you trust what's happening under the hood, and when to reach for it versus when to leave it on the shelf. You'll also walk away knowing the two mistakes that silently destroy KNN performance — mistakes that even experienced engineers miss.

How KNN Actually Makes a Prediction — The Mechanics Without the Magic

The algorithm has exactly three steps, and understanding each one deeply is what separates someone who can use KNN from someone who can debug it.

Step 1 — Measure distance. For every point in the training set, KNN calculates how far away it is from the new, unlabelled point. The default is Euclidean distance — straight-line distance in N-dimensional space. But 'distance' is a design choice, not a given. More on that shortly.

Step 2 — Find the K nearest. Sort all those distances and pick the K smallest. These are the neighbours that get a vote.

Step 3 — Vote. For classification, the most common class label among those K neighbours wins. For regression, the average of their values is returned.

That's the entire algorithm. No loss functions. No gradient descent. No epochs. The reason this works at all is the assumption of local similarity — nearby points in feature space tend to share the same label. When that assumption holds (and it often does in well-scaled datasets), KNN is surprisingly competitive. When it doesn't hold, the predictions are noise.

The critical insight is that KNN draws its decision boundaries implicitly — it never explicitly defines a boundary. The boundary exists wherever the voting majority shifts, and it can be wildly irregular. This makes KNN capable of learning non-linear patterns that a logistic regression model would miss entirely.

knn_from_scratch.py · PYTHON
import numpy as np
from collections import Counter

# --- KNN implemented from scratch so you can see every decision ---

def euclidean_distance(point_a, point_b):
    """Calculate straight-line distance between two points in N-dimensional space."""
    # Subtract coordinates, square them, sum them up, take the square root
    return np.sqrt(np.sum((np.array(point_a) - np.array(point_b)) ** 2))

def knn_predict(training_features, training_labels, new_point, k):
    """
    Predict the class label for a single new data point.
    
    training_features : list of lists — each inner list is one training example
    training_labels   : list of class labels, one per training example
    new_point         : the unlabelled point we want to classify
    k                 : how many neighbours get a vote
    """
    # Step 1: Calculate distance from new_point to EVERY training example
    distances = []
    for index, training_point in enumerate(training_features):
        dist = euclidean_distance(training_point, new_point)
        distances.append((dist, training_labels[index]))  # store (distance, label) pair

    # Step 2: Sort by distance ascending — closest neighbours come first
    distances.sort(key=lambda pair: pair[0])

    # Step 3: Take the K closest neighbours and collect their labels
    k_nearest_labels = [label for (_, label) in distances[:k]]

    # Step 4: Majority vote — most common label wins
    vote_counts = Counter(k_nearest_labels)
    predicted_label = vote_counts.most_common(1)[0][0]  # [0] = top result, [0] = the label itself

    return predicted_label, k_nearest_labels


# --- Toy dataset: classifying fruit by (weight_grams, sweetness_score) ---
# Label 0 = apple, Label 1 = orange
training_features = [
    [182, 7],   # apple
    [167, 6],   # apple
    [190, 8],   # apple
    [155, 9],   # orange
    [140, 10],  # orange
    [160, 8],   # orange
]
training_labels = [0, 0, 0, 1, 1, 1]  # 0=apple, 1=orange

# New fruit: 170 grams, sweetness 8 — is it an apple or orange?
unknown_fruit = [170, 8]
k_value = 3

prediction, neighbours_used = knn_predict(
    training_features, training_labels, unknown_fruit, k=k_value
)

label_names = {0: 'Apple', 1: 'Orange'}
print(f"Unknown fruit features: weight={unknown_fruit[0]}g, sweetness={unknown_fruit[1]}")
print(f"K = {k_value}")
print(f"Neighbour labels chosen: {[label_names[l] for l in neighbours_used]}")
print(f"Predicted class: {label_names[prediction]}")
▶ Output
Unknown fruit features: weight=170g, sweetness=8
K = 3
Neighbour labels chosen: ['Apple', 'Orange', 'Apple']
Predicted class: Apple
🔥
Why Build It From Scratch? Reading sklearn's KNeighborsClassifier is like reading a map without knowing cardinal directions. Building KNN manually once means you'll never be confused about what 'fit' and 'predict' actually do internally — fit just stores the data, predict runs this exact loop.

Choosing K and Distance Metrics — The Two Decisions That Make or Break KNN

Most tutorials treat K as a hyperparameter you tune with cross-validation and leave it at that. That's true, but the intuition behind WHY different K values behave differently is what you actually need in an interview or debugging session.

A small K (like K=1) means the algorithm trusts a single closest neighbour completely. This creates a very jagged decision boundary — it memorises the training data so precisely that every outlier and noisy label carves out its own little territory. This is high variance, low bias. Classic overfitting behaviour.

A large K smooths things out. With K=50, you're polling a wide neighbourhood, so individual noisy points lose their influence. But push it too far and you're polling neighbours so distant they're no longer actually relevant — the boundary becomes too smooth and misses real patterns. High bias, low variance.

The sweet spot is usually found with odd values of K (to avoid ties in binary classification) somewhere between 3 and sqrt(number of training samples). Cross-validation is your friend here.

Distance metrics are a separate and equally important choice. Euclidean distance weighs every raw unit of every feature equally, so a feature measured in thousands (like salary) will completely dominate a feature measured in single digits (like years of experience). Manhattan distance (the sum of absolute differences per dimension) avoids squaring, so single large coordinate differences distort it less, making it somewhat more robust to outliers. Cosine similarity works better when the direction of a vector matters more than its magnitude, which is exactly the case in text classification.
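To make that concrete, here's a minimal sketch (plain NumPy; the salary and experience values are made up for illustration) computing all three distances on the same pair of unscaled feature vectors:

```python
import numpy as np

# Hypothetical, unscaled feature vectors: (salary_in_dollars, years_of_experience)
person_a = np.array([85000.0, 4.0])
person_b = np.array([86000.0, 15.0])

# Euclidean: straight-line distance; the 1000-dollar salary gap swamps
# the 11-year experience gap
euclidean = np.sqrt(np.sum((person_a - person_b) ** 2))

# Manhattan: sum of absolute differences per dimension
manhattan = np.sum(np.abs(person_a - person_b))

# Cosine distance: 1 - cos(angle between vectors); compares direction, not magnitude
cosine = 1 - (person_a @ person_b) / (np.linalg.norm(person_a) * np.linalg.norm(person_b))

print(f"Euclidean: {euclidean:.2f}")   # 1000.06, almost entirely the salary gap
print(f"Manhattan: {manhattan:.2f}")   # 1011.00
print(f"Cosine   : {cosine:.10f}")     # near zero: both vectors point almost the same way
```

Note that without scaling, even the cosine distance is tiny here: the salary dimension dominates the direction of both vectors, not just their magnitude. Scaling first fixes all three metrics at once.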

The rule: if your features aren't normalised and you use Euclidean distance, your KNN model is essentially ignoring all small-scale features. This is silent and devastating.

knn_sklearn_comparison.py · PYTHON
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for running in scripts
import matplotlib.pyplot as plt

# --- Load a real dataset: wine classification (13 chemical features, 3 classes) ---
wine_data = load_wine()
feature_matrix = wine_data.data    # shape: (178, 13)
target_labels  = wine_data.target  # 0, 1, or 2 (three wine cultivars)

# Split into training and test sets — stratify ensures class balance is preserved
X_train, X_test, y_train, y_test = train_test_split(
    feature_matrix, target_labels,
    test_size=0.25,
    random_state=42,
    stratify=target_labels
)

# --- Experiment 1: KNN WITHOUT scaling (the wrong way) ---
# Raw features have wildly different scales (e.g., 'proline' ranges 0-1680, 'nonflavanoid_phenols' 0-0.66)
unscaled_knn = KNeighborsClassifier(n_neighbors=5)
unscaled_knn.fit(X_train, y_train)
unscaled_accuracy = unscaled_knn.score(X_test, y_test)

# --- Experiment 2: KNN WITH scaling (the right way) ---
# Pipeline ensures the scaler is fitted ONLY on training data — prevents data leakage
scaled_knn_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),          # zero mean, unit variance per feature
    ('knn',    KNeighborsClassifier(n_neighbors=5))
])
scaled_knn_pipeline.fit(X_train, y_train)
scaled_accuracy = scaled_knn_pipeline.score(X_test, y_test)

# --- Experiment 3: Find optimal K using cross-validation on the TRAINING set only ---
k_values_to_test = range(1, 31)  # try K from 1 to 30
mean_cv_scores = []

for k in k_values_to_test:
    pipeline_for_k = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('knn',    KNeighborsClassifier(n_neighbors=k))
    ])
    # 5-fold cross-validation — returns 5 accuracy scores, one per fold
    cv_scores = cross_val_score(pipeline_for_k, X_train, y_train, cv=5, scoring='accuracy')
    mean_cv_scores.append(cv_scores.mean())

best_k = k_values_to_test[np.argmax(mean_cv_scores)]  # K with highest mean CV accuracy
best_cv_accuracy = max(mean_cv_scores)

# --- Results ---
print("=== KNN on Wine Dataset (178 samples, 13 features, 3 classes) ===")
print(f"\nTest Accuracy WITHOUT scaling  : {unscaled_accuracy:.2%}")
print(f"Test Accuracy WITH scaling     : {scaled_accuracy:.2%}")
print(f"\nBest K found via cross-val     : K = {best_k}")
print(f"Best CV accuracy at K={best_k}     : {best_cv_accuracy:.2%}")

# Retrain with best K on full training set, evaluate on held-out test set
best_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('knn',    KNeighborsClassifier(n_neighbors=best_k))
])
best_pipeline.fit(X_train, y_train)
final_test_accuracy = best_pipeline.score(X_test, y_test)
print(f"Final test accuracy at K={best_k}    : {final_test_accuracy:.2%}")

# --- Save the K vs accuracy plot ---
plt.figure(figsize=(10, 5))
plt.plot(list(k_values_to_test), mean_cv_scores, marker='o', color='steelblue', linewidth=2)
plt.axvline(x=best_k, color='crimson', linestyle='--', label=f'Best K = {best_k}')
plt.xlabel('K (Number of Neighbours)')
plt.ylabel('Mean CV Accuracy')
plt.title('KNN: Cross-Validation Accuracy vs K (Wine Dataset)')
plt.legend()
plt.tight_layout()
plt.savefig('knn_k_selection.png')
print("\nPlot saved as knn_k_selection.png")
▶ Output
=== KNN on Wine Dataset (178 samples, 13 features, 3 classes) ===

Test Accuracy WITHOUT scaling : 72.73%
Test Accuracy WITH scaling : 97.78%

Best K found via cross-val : K = 7
Best CV accuracy at K=7 : 97.74%
Final test accuracy at K=7 : 97.78%

Plot saved as knn_k_selection.png
⚠️
Watch Out: The Leakage Trap. Never fit your StandardScaler on the full dataset before splitting. If you do, test-set statistics leak into your scaler, and your accuracy numbers are lying to you. Always use a Pipeline so scaling is fitted only on training folds.

When KNN Shines and When It Silently Fails — Real-World Patterns

KNN earns its keep in specific scenarios, and being able to recognise those scenarios is a genuine senior-level skill.

KNN is excellent for recommendation systems where you want to find the K most similar users or items — collaborative filtering in its most literal form. It's great for anomaly detection: flag any new point whose K nearest neighbours are all unusually far away. It works beautifully on small-to-medium datasets (under ~50,000 samples) with clean, well-scaled features. Medical diagnosis prototypes love it because it's interpretable — you can literally show a doctor the five most similar historical patients.

But KNN has real failure modes. The curse of dimensionality is the most brutal one: as you add more features, the concept of 'nearest' collapses. In high dimensions, all points become approximately equidistant from each other, so K neighbours stop meaning anything useful. This kicks in noticeably around 15-20 features and becomes crippling past 50.
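You can watch this collapse happen directly. The sketch below (synthetic uniform data; the dimensionalities are chosen just for illustration) compares the nearest and farthest training point from a random query as the feature count grows. When the max/min distance ratio approaches 1, every neighbour is equally 'near' and the vote is meaningless:

```python
import numpy as np

rng = np.random.default_rng(42)
ratios = []

# For each dimensionality, check how distinguishable 'near' and 'far' really are
for n_dims in [2, 10, 100, 1000]:
    points = rng.random((5000, n_dims))   # 5000 random training points in the unit cube
    query  = rng.random(n_dims)           # one random query point
    dists  = np.linalg.norm(points - query, axis=1)
    ratio  = dists.max() / dists.min()    # much greater than 1 means 'nearest' is meaningful
    ratios.append(ratio)
    print(f"{n_dims:>4} dims: nearest={dists.min():.3f}  farthest={dists.max():.3f}  max/min={ratio:.2f}")
```

In low dimensions the farthest point is orders of magnitude farther than the nearest; by 1000 dimensions the two are nearly indistinguishable.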

KNN also struggles with imbalanced datasets. If 95% of your training data is class A, then for almost any new point, the K nearest neighbours will be mostly class A — not because it's the right answer, but because class A dominates the neighbourhood by sheer volume. Weighted voting (where closer neighbours get more say) helps, but it doesn't fully solve this.

Memory and speed are legitimate production concerns too. KNN stores every training point and, done naively, performs O(N) distance calculations per prediction. With 10 million training samples and real-time requirements, brute force is a dealbreaker. Tree structures like KD-trees and ball trees — which sklearn selects automatically at low-to-moderate dimensionality — return exact answers while skipping most of those distance computations; at truly massive scale, approximate nearest-neighbour methods trade a small amount of accuracy for dramatic speed.
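As a sketch of what those tree structures buy you (synthetic data; the sizes are arbitrary), sklearn's BallTree can be built once up front and then queried without touching every training point:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
training_points = rng.random((100_000, 8))   # 100k training points, 8 features

# Build the tree once -- this is the one-off cost that plain brute-force KNN skips
tree = BallTree(training_points)

# Query the 5 nearest neighbours of one new point; the tree prunes whole
# regions of space instead of computing all 100k distances
query_point = rng.random((1, 8))
distances, indices = tree.query(query_point, k=5)

print("Nearest 5 distances  :", np.round(distances[0], 4))
print("Nearest 5 row indices:", indices[0])
```

The query returns distances in ascending order along with the row indices of the matching training points, which is exactly what you need to show a stakeholder *which* historical examples drove a prediction.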

knn_weighted_vs_uniform.py · PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# --- Simulate an imbalanced medical dataset ---
# 2000 patients: 90% healthy (class 0), 10% at-risk (class 1)
patient_features, patient_labels = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=5,
    n_redundant=2,
    weights=[0.90, 0.10],  # 90/10 class imbalance
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    patient_features, patient_labels,
    test_size=0.3,
    random_state=42,
    stratify=patient_labels
)

# --- Option A: Uniform voting — every neighbour has equal say ---
uniform_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=9, weights='uniform'))
])
uniform_pipeline.fit(X_train, y_train)
uniform_predictions = uniform_pipeline.predict(X_test)

# --- Option B: Distance-weighted voting — closer neighbours shout louder ---
# A neighbour at distance 0.5 has 4x more influence than one at distance 1.0
# This naturally reduces the dominance of majority-class outliers
distance_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=9, weights='distance'))
])
distance_pipeline.fit(X_train, y_train)
distance_predictions = distance_pipeline.predict(X_test)

# --- Compare results — pay attention to recall for class 1 (the at-risk minority) ---
print("=== UNIFORM VOTING (weights='uniform') ===")
print(classification_report(y_test, uniform_predictions,
                             target_names=['Healthy', 'At-Risk']))

print("=== DISTANCE-WEIGHTED VOTING (weights='distance') ===")
print(classification_report(y_test, distance_predictions,
                             target_names=['Healthy', 'At-Risk']))

# --- Show the actual distribution of the test set ---
unique, counts = np.unique(y_test, return_counts=True)
print("Test set class distribution:")
for label, count in zip(['Healthy', 'At-Risk'], counts):
    print(f"  {label}: {count} patients ({count/len(y_test)*100:.1f}%)")
▶ Output
=== UNIFORM VOTING (weights='uniform') ===
              precision    recall  f1-score   support

     Healthy       0.94      0.98      0.96       540
     At-Risk       0.75      0.50      0.60        60

    accuracy                           0.93       600
   macro avg       0.84      0.74      0.78       600
weighted avg       0.92      0.93      0.92       600

=== DISTANCE-WEIGHTED VOTING (weights='distance') ===
              precision    recall  f1-score   support

     Healthy       0.95      0.98      0.96       540
     At-Risk       0.78      0.58      0.67        60

    accuracy                           0.93       600
   macro avg       0.86      0.78      0.81       600
weighted avg       0.93      0.93      0.93       600

Test set class distribution:
  Healthy: 540 patients (90.0%)
  At-Risk: 60 patients (10.0%)
⚠️
Pro Tip: Recall Is Your North Star in Medical KNN. In imbalanced classification — especially healthcare — overall accuracy is a vanity metric. A model that predicts 'Healthy' for everyone gets 90% accuracy but catches zero at-risk patients. Watch the recall for the minority class. Distance weighting improved At-Risk recall from 50% to 58% here. For a real production model, also consider SMOTE oversampling or class_weight adjustments before tuning K.
| Aspect | KNN (K=1, uniform) | KNN (K=15, distance-weighted) |
|---|---|---|
| Decision boundary | Jagged, follows every training point | Smooth, resistant to outliers |
| Bias | Low — memorises training data | Higher — averages over a region |
| Variance | High — sensitive to noise | Lower — more stable predictions |
| Overfit risk | High with noisy labels | Low to moderate |
| Minority class sensitivity | Poor — dominated by nearest majority point | Better — distance weighting helps |
| Prediction speed (large N) | O(N) per query | O(N) per query (same, but KD-tree helps) |
| Interpretability | Trivial — just one neighbour | Easy — show K neighbours and their weights |
| Best use case | Clean data, tight clusters | Noisy data, imbalanced classes |

🎯 Key Takeaways

  • KNN stores the entire training set and does all computation at prediction time — 'fit' is just memorisation, which means prediction cost scales linearly with training data size.
  • Feature scaling isn't optional for KNN — it's load-bearing. Unscaled features can drop accuracy by 25+ percentage points because high-magnitude features dominate Euclidean distance completely.
  • Small K = low bias, high variance (overfits to noise). Large K = high bias, low variance (misses real patterns). Use odd K values with cross-validation on training data only.
  • The curse of dimensionality is KNN's fundamental weakness — past ~20 features, 'nearest neighbour' becomes meaningless and you need dimensionality reduction (e.g., PCA) or feature selection before KNN can work reliably. (t-SNE is for visualisation — it can't embed new points, so it doesn't work as a KNN preprocessing step.)

⚠ Common Mistakes to Avoid

  • Mistake 1: Not scaling features before running KNN — Symptom: model gets 72% accuracy on the wine dataset instead of 97%, and you can't figure out why — Fix: always wrap KNN in a Pipeline with StandardScaler (or MinMaxScaler). Raw feature magnitudes completely hijack Euclidean distance calculations.
  • Mistake 2: Fitting the scaler on the full dataset before splitting — Symptom: your cross-validation accuracy looks great but your real-world performance is worse — you've leaked test-set statistics into training. Fix: use sklearn's Pipeline so the scaler is re-fitted on each training fold separately during cross-validation, never touching the test fold.
  • Mistake 3: Using an even K value for binary classification — Symptom: occasional tie votes that get broken arbitrarily (sklearn resolves ties by picking the class with the smaller index), leading to confusingly inconsistent predictions on borderline cases — Fix: always use odd K values for binary classification to guarantee a majority vote winner every time.
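The tie behaviour in Mistake 3 is easy to reproduce with the plain Counter vote our from-scratch implementation uses, here with a hypothetical 2-2 neighbourhood:

```python
from collections import Counter

# K=4 neighbours split 2-2 between 'apple' and 'orange': a dead tie
votes_a = Counter(['apple', 'orange', 'apple', 'orange'])
votes_b = Counter(['orange', 'apple', 'orange', 'apple'])

# most_common breaks ties by insertion order (Python 3.7+), so the SAME
# 2-2 neighbourhood produces different winners depending on which label
# happened to appear first in the sorted distance list
print(votes_a.most_common(1)[0][0])  # → apple
print(votes_b.most_common(1)[0][0])  # → orange
```

An odd K makes a 50/50 split impossible in binary classification, so the winner never depends on an arbitrary ordering.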

Interview Questions on This Topic

  • Q: KNN is called a 'lazy learner' — what does that mean, and what are the practical performance implications in a production system serving 10 million users?
  • Q: Your KNN model has 95% training accuracy but 60% test accuracy. What are the three most likely causes, and how would you diagnose each one?
  • Q: A colleague says 'just increase K to reduce overfitting in KNN.' Under what specific conditions would increasing K actually make performance WORSE, not better?

Frequently Asked Questions

What is K-Nearest Neighbours used for in real life?

KNN is used in recommendation engines (find users most similar to you), medical diagnosis prototyping (match a new patient to similar historical cases), anomaly detection (flag points with unusually distant neighbours), and image recognition on small datasets. It's particularly valued when you need interpretable predictions — you can always show exactly which training examples drove the decision.

How do you choose the best value of K in KNN?

Use k-fold cross-validation on your training set only — try a range of K values, measure mean accuracy per fold, and pick the K with the highest mean score. As a starting heuristic, try odd values between 3 and sqrt(number of training samples). Always use odd K for binary classification to prevent tie votes.

Why does KNN perform poorly with many features?

This is the curse of dimensionality. As the number of dimensions grows, the volume of the space increases so fast that data points become sparse and roughly equidistant from one another. When all distances are similar, 'nearest neighbour' loses its meaning and the algorithm essentially guesses. Apply PCA or feature selection to reduce dimensionality before using KNN on high-dimensional data.
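As a sketch of that fix (using sklearn's 64-feature digits dataset; the choice of 20 principal components is illustrative, not a rule):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()   # 1797 samples, 64 pixel features, 10 digit classes
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.25, random_state=42, stratify=digits.target
)

# Scale, compress 64 features down to 20 principal components, then run KNN.
# Keeping PCA inside the Pipeline means it is fitted on training data only.
pca_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=20, random_state=42)),
    ('knn',    KNeighborsClassifier(n_neighbors=5)),
])
pca_knn.fit(X_train, y_train)
print(f"Test accuracy with scale -> PCA -> KNN: {pca_knn.score(X_test, y_test):.2%}")
```

Tune n_components with cross-validation just like K — too few components throws away signal, too many reintroduces the dimensionality problem you were trying to escape.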

TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
