Junior 5 min · March 06, 2026

SVM — RBF Kernel Margin Collapse from Unscaled Features

Recall dropped 0.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Support Vector Machines find the decision boundary that maximizes the margin between classes
  • Only 'support vectors' — the closest points to the boundary — define the hyperplane
  • Kernel trick maps data to higher dimensions without explicit transformation
  • Soft-margin parameter C controls how much misclassification is tolerated
  • Training scales O(n^2) to O(n^3) — not for big data without subsampling
  • Biggest mistake: using RBF without scaling features first — models converge to one-class predictions
Plain-English First

Imagine you have a table covered in red and blue marbles, and you need to draw a line that separates them. A Support Vector Machine doesn't just draw any line — it finds the line that keeps the most space between itself and the nearest marble on each side. Those nearest marbles are the 'support vectors' — the ones doing all the work. If you could pick up the table and tilt it (that's the kernel trick), marbles that were impossible to separate flat on the table suddenly become separable in 3D.

Support Vector Machines quietly power some of the most reliable classifiers in production today — from spam filters and medical image classifiers to anomaly detection in financial fraud systems. They're not the flashiest algorithm in the ML toolbox, but when your dataset is small-to-medium, high-dimensional, or you need a model that generalises well without mountains of data, SVMs consistently punch above their weight. Understanding them deeply separates engineers who can tune a model from engineers who can reason about why it's failing.

The core problem SVMs solve is deceptively simple: given labelled training data, find the decision boundary that maximises the gap between classes. But the real magic — and the real complexity — lives in how they do it. The kernel trick lets SVMs operate in infinite-dimensional feature spaces without ever computing coordinates in those spaces. The soft-margin formulation handles real-world noise without breaking. And the dual optimisation problem, solved by Sequential Minimal Optimisation, is what makes training on thousands of samples feasible.

By the end of this article you'll understand the primal and dual SVM formulations, know exactly when to reach for an RBF kernel versus a linear one, be able to debug common training failures (class imbalance, feature scale, C vs gamma interaction), and have production-ready Python code you can drop into a real pipeline. You'll also walk into any ML interview knowing the answers to the questions that trip most people up.

SVMs aren't dead — they're still the go-to for tabular data with fewer than 100k samples. Deep learning needs data; SVMs need support vectors. Know the difference.

What is Support Vector Machine?

At its heart, a Support Vector Machine (SVM) is a binary linear classifier that finds the separating hyperplane with the maximum margin. Plain linear classification isn't what makes SVMs special, though. What sets them apart is the combination of three ideas: the max-margin principle, the kernel trick, and the dual optimisation that expresses everything in terms of dot products. Together, these three pillars give SVMs non-linear decision boundaries, good behaviour in high dimensions, and sparse solutions.

Here's a quick example: suppose we have 2D points with labels. A linear SVM finds the line that not only separates the classes but also maximises the distance to the nearest points. Those nearest points are the support vectors — they hold up the decision boundary. If you remove any other point, the line stays exactly the same. This sparsity is why SVMs generalise well and predict fast.

svm_basic_demo.py · PYTHON
from sklearn.svm import SVC
import numpy as np

X = np.array([[1,2], [2,3], [3,1], [6,5], [7,7], [8,6]])
y = np.array([0,0,0,1,1,1])

model = SVC(kernel='linear', C=1.0)
model.fit(X, y)
print('Support vectors:', model.support_vectors_)
print('Number of SVs:', len(model.support_vectors_))
Output
Support vectors: [[2. 3.]
[3. 1.]
[6. 5.]]
Number of SVs: 3
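One way to check the sparsity claim is to drop a training point that is not a support vector and refit. The sketch below reuses the six points from the demo; the variable names (`full`, `reduced`, `non_sv`) are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Indices of training points that are NOT support vectors
non_sv = np.setdiff1d(np.arange(len(X)), full.support_)

# Drop one non-support vector and refit on the remaining points
keep = np.arange(len(X)) != non_sv[0]
reduced = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])

print("w, b (all points):    ", full.coef_.ravel(), full.intercept_)
print("w, b (one point less):", reduced.coef_.ravel(), reduced.intercept_)
```

Up to solver tolerance, the two fits should report the same weights and intercept, because the removed point never constrained the margin.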
Forge Tip:
Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.
Production Insight
Hard margin SVMs (C → ∞) fail on real data because they overfit to outliers.
Always use a soft margin (C between 0.1 and 100) and tune via cross-validation.
The support vector count tells you about your data complexity: many SVs = hard problem.
Monitor support vector count growth during retraining — sudden spikes indicate data distribution shift.
Key Takeaway
The margin is the key to SVM's generalisation.
Support vectors alone define the model — everything else is ignored.
Always use a soft margin (C finite) in production.
Should You Use SVM or Logistic Regression?
If: Data is linearly separable and n_samples < 10k
Use: Both work. SVM may generalise slightly better due to margin maximisation.
If: Data is not linearly separable and n_samples < 50k
Use: SVM with RBF kernel. Logistic regression will underfit.
If: n_samples > 100k
Use: Logistic regression or linear SVM (LinearSVC). Non-linear SVM is too slow.
If: You need calibrated probabilities
Use: Logistic regression gives natural probabilities. SVM requires Platt scaling (extra cost).

The Max-Margin Intuition Behind SVMs

An SVM selects the hyperplane that maximizes the geometric margin to the nearest training points of any class. Imagine drawing a line between two clusters: the line that gives the widest gutter on both sides is the SVM's choice. Why does this matter? Because for margin classifiers a larger margin gives a tighter bound on model capacity (VC dimension), which in turn tends to mean better generalisation on unseen data.

The support vectors are the data points that lie on the margin boundary or, with a soft margin, inside the margin or on the wrong side of it. They're the only points that influence the decision boundary; moving any other point (as long as it stays outside the margin on its own side) changes nothing. This sparsity is what makes SVMs efficient at inference time when the support vector count is low.

But the margin isn't just a pretty picture — it has a direct impact on how your model behaves in production. If your data has outliers (and it always does), a hard margin will contort itself to fit those outliers, making the margin razor-thin. That's why we soften the margin with parameter C: allow some misclassifications in exchange for a wider, more robust boundary.
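The effect of C on the margin can be measured directly, since a linear SVM's margin width is 2 / ||w||. The following is a minimal sketch on synthetic overlapping clusters; the cluster centres, the two C values, and the helper `margin_width` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (centres and spread are arbitrary choices)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.0, random_state=0)

def margin_width(C):
    """Fit a linear SVM and return (margin width, number of support vectors)."""
    model = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(model.coef_)  # geometric margin = 2 / ||w||
    return width, int(model.n_support_.sum())

soft_width, soft_svs = margin_width(C=0.01)
hard_width, hard_svs = margin_width(C=100.0)

print(f"C=0.01 -> margin width {soft_width:.2f}, {soft_svs} support vectors")
print(f"C=100  -> margin width {hard_width:.2f}, {hard_svs} support vectors")
```

The small-C fit accepts more margin violations in exchange for a wider margin; the large-C fit narrows the margin to reduce violations.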

svm_margin_demo.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=50, centers=2, random_state=42)
model = SVC(kernel='linear', C=1e5)  # very large C approximates a hard margin
model.fit(X, y)

# Plot the training points
plt.scatter(X[:,0], X[:,1], c=y, cmap='bwr', edgecolors='k')
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create a mesh over the plotted area to evaluate the decision function
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                     np.linspace(ylim[0], ylim[1], 50))
Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contour(xx, yy, Z, levels=[0], colors='k')
plt.scatter(model.support_vectors_[:,0], model.support_vectors_[:,1],
            s=100, facecolors='none', edgecolors='k', label='support vectors')
plt.legend()
plt.show()
Think of it like a tightrope walker
  • Support vectors are the rope anchors — they define the only stable path.
  • A wider margin means you can wobble and still stay on the rope.
  • Hard margin (C very large) means the walker may never step off the rope at all, which is not realistic for noisy data.
  • Soft margin (reasonable C) lets the walker step off a little for noisy data.
Production Insight
Hard margin is only safe for perfectly separable toy data — never use it in production.
Which Kernel to Try First?
If: Data is linearly separable or has many features (n_features > n_samples)
Use: Linear kernel. Fast to train, interpretable coefficients.
If: Data is not linearly separable and n_samples < 50k
Use: RBF kernel. One kernel hyperparameter (gamma) to tune alongside C.
If: Data has polynomial-like structure (e.g., circles, parabolas)
Use: Polynomial kernel, degree 2 or 3. More hyperparameters to tune.
If: n_samples > 100k
Use: Avoid non-linear SVM. Use linear SVM with SGD or a different classifier.
If: Data has known similarity structure (e.g., text using cosine similarity)
Use: Linear kernel after normalisation, or a custom kernel built from your own similarity function.

The Kernel Trick: Magic Without the Cost

The kernel trick lets you compute dot products in a high-dimensional feature space without ever visiting it. Instead of explicitly mapping data to that space, you use a kernel function that computes the same dot product cheaply. The RBF kernel, for instance, is equivalent to an infinite-dimensional polynomial expansion — but you compute it in O(n_features) time.
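To make this concrete, here is a small numerical check using a kernel whose feature map is short enough to write out. It uses the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)² rather than RBF, because the RBF feature map is infinite-dimensional and cannot be listed explicitly. The function names `phi` and `poly2_kernel` are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (no constant or linear terms)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # dot product taken in the 3D feature space
implicit = poly2_kernel(x, z)       # same quantity, computed in the original 2D space

print(explicit, implicit)
```

Both values agree up to floating-point rounding, yet the kernel version never constructs the 3D vectors.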

This is what makes SVMs powerful: you can learn non-linear decision boundaries without ever paying for the explicit high-dimensional mapping. But there's a catch: the kernel trick only works if you can express the optimisation in terms of dot products, which is why SVMs use the dual formulation. Training a kernel SVM also still costs O(n²) to O(n³) in the number of samples, so it is not as cheap as a linear model.

Not all kernels are created equal. Linear is fastest and RBF is the most flexible. Polynomial is used less often because high degrees can become numerically unstable and it has more parameters to tune. There's also the sigmoid kernel (not recommended: it fails Mercer's condition for many parameter settings) and custom kernels (you can define your own, but the kernel must be positive semi-definite).

kernel_trick_demo.py · PYTHON
import numpy as np
from sklearn.svm import SVC

# 1D data not separable
X = np.linspace(-3, 3, 100).reshape(-1,1)
y = np.where(X.ravel()**2 > 1, 1, 0)

# Map to 2D: (x, x^2) — explicit mapping
phi = np.hstack([X, X**2])
# Train linear SVM on mapped data
model = SVC(kernel='linear', C=1.0)
model.fit(phi, y)

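# For comparison: an RBF kernel on the raw 1D data.
# This is NOT the same mapping as (x, x^2); the explicit map above corresponds
# to a degree-2 polynomial kernel, while RBF uses an infinite-dimensional map.
rbf_model = SVC(kernel='rbf', gamma=1.0)
rbf_model.fit(X, y)

# Both models can learn a boundary near |x| = 1
print('explicit map accuracy:', model.score(phi, y))
print('RBF kernel accuracy:  ', rbf_model.score(X, y))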
Common Kernel Functions
  • Linear: K(x,z) = x·z
  • Polynomial: K(x,z) = (γ x·z + r)^d
  • RBF: K(x,z) = exp(-γ||x-z||²)
  • Sigmoid: K(x,z) = tanh(γ x·z + r)
RBF is the default for a reason: it works well when you tune γ.
Production Insight
RBF kernel's gamma controls the influence radius — too large and you overfit, too small and you underfit.
A good starting gamma is 1/(n_features * X.var()), which is what scikit-learn computes for gamma='scale'.
Polynomial kernels often explode in value — scale gamma and r carefully to avoid numerical instability.
Custom kernels must be positive semi-definite or the solver may fail silently.
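The starting-gamma heuristic above can be computed by hand and compared against scikit-learn's `gamma='scale'` setting. This is a sketch on synthetic data; `gamma_start`, `by_name`, and `by_value` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Heuristic from the text: 1 / (n_features * variance over all entries of X)
gamma_start = 1.0 / (X.shape[1] * X.var())

by_name = SVC(kernel="rbf", gamma="scale").fit(X, y)
by_value = SVC(kernel="rbf", gamma=gamma_start).fit(X, y)

print(f"starting gamma: {gamma_start:.4f}")
print("same decision values:",
      np.allclose(by_name.decision_function(X), by_value.decision_function(X)))
```

Because the value depends on `X.var()`, it changes whenever the feature scale changes, which is one more reason to scale before tuning.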
Key Takeaway
The kernel trick makes non-linear SVMs efficient.
RBF is the safe default — but tune gamma.
Explicit feature mapping is rarely needed when you have a good kernel.

Primal vs Dual Formulation — And Why You Need SMO

The classic SVM objective is a convex optimisation problem: minimize ||w||² subject to constraints that all points lie on the correct side of the margin. That's the primal problem. But the dual problem is where the kernel trick lives — it replaces w·x with Σ α_i y_i K(x_i, x). The α_i are zero for all non-support vectors, making inference sparse.
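Written out for the soft-margin case, the two problems look as follows. This is the standard textbook form, with labels y_i in {-1, +1}; the slack variables ξ_i and the factor ½ are conventions not spelled out in the paragraph above (the ½ does not change the minimiser).

```latex
% Primal (soft margin)
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0

% Dual
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i,x_j)
\quad\text{s.t.}\quad 0\le\alpha_i\le C,\qquad \sum_{i=1}^{n}\alpha_i y_i = 0

% Decision function recovered from the dual solution
f(x) = \sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i,x) + b
```

The data enters the dual only through K(x_i, x_j), which is why swapping the kernel changes the model without changing the solver.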

Sequential Minimal Optimisation (SMO) solves the dual problem by repeatedly picking two α's and optimising them analytically. It's the algorithm behind libsvm and scikit-learn's SVC. Training time with SMO grows between O(n²) and O(n³) in the number of samples, so for large datasets you must use alternative solvers.

Understanding the difference is essential: the primal solution gives you the weight vector w directly. The dual solution gives you the α coefficients and works implicitly with the kernel. In practice, you'll almost always use the dual for non-linear kernels. But if your kernel is linear, you can skip the kernel machinery: LinearSVC (liblinear) works directly with w, and with dual=False it solves the primal problem, which is usually the better choice when n_samples > n_features.

primal_vs_dual.py · PYTHON
from sklearn.svm import LinearSVC, SVC
from sklearn.datasets import make_classification
import time

X, y = make_classification(n_samples=20000, n_features=20, random_state=42)

# Primal linear SVM (LinearSVC uses liblinear — O(n))
start = time.time()
primal = LinearSVC(dual=False, max_iter=10000, tol=1e-4)
primal.fit(X, y)
print(f"Primal (liblinear): {time.time()-start:.2f}s")

# Dual RBF SVM (libsvm SMO — O(n^2) to O(n^3))
start = time.time()
dual = SVC(kernel='rbf', gamma='scale', C=1.0)
dual.fit(X, y)
print(f"Dual (libsvm SMO): {time.time()-start:.2f}s")
Output
Primal (liblinear): 0.23s
Dual (libsvm SMO): 14.67s
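The dual form of the decision function, Σ α_i y_i K(x_i, x) + b, can be rebuilt by hand from a fitted SVC. The sketch below assumes a fixed gamma (0.1, chosen arbitrarily) so the kernel can be recomputed outside the model; `manual` and `K` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

gamma = 0.1  # fixed value so the kernel can be recomputed by hand
model = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i, and only for the support vectors
K = rbf_kernel(X, model.support_vectors_, gamma=gamma)   # shape (n_samples, n_SV)
manual = K @ model.dual_coef_.ravel() + model.intercept_[0]

print("support vectors:", len(model.support_), "of", len(X))
print("max abs difference:", np.abs(manual - model.decision_function(X)).max())
```

Only the support vectors appear in the sum, which is the sparsity the dual formulation exposes.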
Watch out for large n_samples
Non-linear SVMs with SMO become impractical above ~100k samples. Use LinearSVC, SGDClassifier, or approximate kernel methods like Nystroem + LinearSVC.
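A minimal sketch of the Nystroem + LinearSVC route mentioned above, on a synthetic concentric-circles dataset. The gamma value, the number of components, and the dataset are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=2000, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

linear_only = make_pipeline(StandardScaler(),
                            LinearSVC(dual=False, max_iter=10000))

# Nystroem builds an approximate RBF feature map from 100 sampled landmarks,
# after which a fast linear solver can be used
approx_rbf = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    LinearSVC(dual=False, max_iter=10000),
)

linear_acc = linear_only.fit(X_train, y_train).score(X_test, y_test)
approx_acc = approx_rbf.fit(X_train, y_train).score(X_test, y_test)
print(f"LinearSVC alone:      {linear_acc:.3f}")
print(f"Nystroem + LinearSVC: {approx_acc:.3f}")
```

The trade-off is in `n_components`: more landmarks approximate the kernel better but cost more memory and time per sample.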
Production Insight
SMO's quadratic scaling hits hard in production — test with a 10% sample first.
If dual training exceeds 30 minutes on a modern CPU, consider switching to linear SVM or random forest.
The number of support vectors in the dual solution is a direct proxy for model complexity — monitor it in retraining pipelines.
Always set cache_size in SVC to speed up kernel computations — start with 500 MB.
Key Takeaway
Dual formulation enables the kernel trick.
SMO is the workhorse but scales poorly — know when to switch to primal solvers.
Support vector count ≈ model complexity — track it.

Hyperparameter Tuning: C and Gamma Are Not Independent

C controls the penalty for misclassification (small C = softer margin, may underfit; large C = harder margin, may overfit). Gamma controls the influence of a single training example (small gamma = far-reaching, smooth boundary; large gamma = local, wiggly boundary). These two interact: a high gamma with high C will almost certainly overfit, while low gamma with low C underfits.

Grid search on both is essential. Use logarithmic spacing: C ∈ [0.01, 100], gamma ∈ [0.001, 1000]. Also consider class_weight='balanced' when classes are imbalanced — it adjusts C per class.

Don't forget that the optimal C and gamma depend on your feature scale. That's why you must scale before tuning. If you change features, retune. A common mistake is to tune on unscaled data, scale later, and wonder why the performance is different.

svm_grid_search.py · PYTHON
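from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the snippet runs as-is; swap in your own X and y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix
param_grid = {
    'svc__C': [0.01, 0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1, 10]
}

# The scaler sits inside the pipeline, so each CV fold fits it on its own training split
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1_macro', n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV F1: {grid.best_score_:.3f}")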
Faster search with Halving Grid
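HalvingGridSearchCV trains every candidate on a small subset first, then gives the surviving candidates more samples each round (3x more by default, controlled by the factor parameter). Depending on the grid and the data, it can be 5-10x faster than a full grid search with minimal accuracy loss. In scikit-learn it is still experimental, so it needs `from sklearn.experimental import enable_halving_search_cv` before it can be imported.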
Production Insight
Not scaling features before tuning C/gamma wastes compute — you'll get different optimal values on scaled vs unscaled data.
Class imbalance requires higher C for the minority class — use class_weight='balanced'.
C and gamma often follow a trade-off curve: if grid search finds a point on the boundary, expand the grid in that direction.
Use RandomizedSearchCV for high-dimensional parameter spaces — it finds good regions faster than full grid.
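For the halving search mentioned in the tip above, a minimal sketch looks like this. The grid, the synthetic data, and `cv=3` are illustrative assumptions; note the experimental import that scikit-learn requires before `HalvingGridSearchCV` can be imported.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the next import)
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

search = HalvingGridSearchCV(
    pipe, param_grid,
    factor=3,          # keep roughly the best third of candidates each round
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("rounds run: ", search.n_iterations_)
```

Early rounds use few samples, so their scores are noisy; it is worth confirming the winning parameters with a regular cross-validation run before relying on them.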
Key Takeaway
C controls margin softness, gamma controls influence radius.
Always tune both together via grid search (log scale).
Scale features first, then tune — order matters.
Setting Initial C and Gamma Search Ranges
If: Data is high-dimensional (n_features > 100) and scaled
Use: Start gamma around 0.01, C around 1. Narrower gamma range: [0.001, 1].
If: Data has few features (n_features < 10) and scaled
Use: Start gamma around 1, C around 1. Wider gamma range: [0.1, 10].
If: Class imbalance is present
Use: class_weight='balanced', and increase the C upper bound to 1000 for the minority class.

SVM in Production: The Real Pipeline

A production SVM pipeline rarely ends at the classifier. You need feature scaling (StandardScaler), handling missing values, class weights, and a decision threshold calibration. SVMs output decision function values (signed distance from the hyperplane) — these are not probabilities. For probability calibration, use Platt scaling (probability=True in SVC), but it adds a cross-validation step and slows training.
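Threshold tuning on decision function values can be sketched as follows. The data is synthetic, and the alternative threshold of -0.5 is a hypothetical value picked only to show the direction of the effect; in practice it would be chosen on a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for a real problem (about 10% positives)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
clf.fit(X_train, y_train)

scores = clf.decision_function(X_test)       # signed distances, not probabilities

default_pred = (scores > 0).astype(int)      # matches predict() for labels 0/1
lenient_pred = (scores > -0.5).astype(int)   # hypothetical threshold favouring recall

default_recall = recall_score(y_test, default_pred)
lenient_recall = recall_score(y_test, lenient_pred)
print("recall at threshold  0.0:", default_recall)
print("recall at threshold -0.5:", lenient_recall)
```

Lowering the threshold can only add positive predictions, so recall never decreases; the cost shows up as lower precision.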

Also, SVM inference is O(n_support_vectors), so if the support vector count is large, inference latency can be high. For low-latency applications, consider LinearSVC or approximate the kernel.

Beyond modeling, production pipelines need monitoring: watch the distribution of decision function values over time. Drift in those distributions often precedes a drop in accuracy. You also need a retraining strategy — SVMs don't support online learning natively, so you'll need to schedule retraining or use incremental SVM implementations (not in scikit-learn).
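If a linear decision boundary is acceptable, one option inside scikit-learn is `SGDClassifier` with hinge loss, which optimises a linear SVM objective and supports `partial_fit`. This is a swapped-in technique, not an incremental kernel SVM, and the batch layout below is a synthetic stand-in for data arriving over time.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# Two well-separated clusters, delivered in ten mini-batches
X, y = make_blobs(n_samples=1000, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)
batches = np.array_split(np.arange(len(X)), 10)

# loss='hinge' gives a linear SVM objective trained by stochastic gradient descent
model = SGDClassifier(loss="hinge", random_state=0)

for idx in batches:
    # classes must be known from the first call, so pass them explicitly
    model.partial_fit(X[idx], y[idx], classes=np.array([0, 1]))

accuracy = model.score(X, y)
print(f"accuracy after {len(batches)} incremental updates: {accuracy:.3f}")
```

For non-linear problems this can be combined with an approximate kernel feature map, at the cost of fixing that map up front.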

svm_production_pipeline.py · PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd

df = pd.read_csv('fraud_data.csv')
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=10, gamma='scale',
                class_weight='balanced',
                probability=True,  # Platt scaling
                random_state=42))
])

# Fixed random_state keeps the split reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
Why SVMs still beat neural nets on small tabular data
  • SVM's objective is convex — no local minima issues.
  • Support vector sparsity means fast inference for low-SV-count models.
  • Kernel trick is more data-efficient than learning deep representations.
  • Neural nets need more data to learn feature interactions — SVMs encode them via kernel.
Production Insight
Platt scaling (probability=True) is expensive — it adds a 5-fold CV inside the training loop. For large datasets, use probability=False and calibrate with a separate CalibratedClassifierCV.
Monitoring decision function distributions in production can catch feature drift before accuracy drops.
Retraining frequency matters: a trained SVM keeps only its support vectors, so it cannot simply absorb new data. If new data changes the margin significantly, you need to retrain on the full dataset (incremental SVM variants exist, but not in scikit-learn).
Use joblib to serialise the full pipeline for versioned model deployment.
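Serialising the whole pipeline, rather than the SVC alone, keeps the fitted scaler and the classifier together. A minimal sketch, assuming synthetic data and a hypothetical file name:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=10, gamma="scale")),
]).fit(X, y)

# Hypothetical versioned file name; a real deployment would use its own scheme
path = os.path.join(tempfile.mkdtemp(), "svm_pipeline_v1.joblib")
joblib.dump(pipeline, path)

restored = joblib.load(path)
same = np.array_equal(pipeline.predict(X), restored.predict(X))
print("restored pipeline reproduces predictions:", same)
```

The saved file should be loaded with a compatible scikit-learn version, so it helps to record the library version next to the artefact.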
Key Takeaway
Always scale features in the pipeline.
Use class_weight='balanced' for imbalanced data.
Probabilities are extra work — calibrate only when needed.
SVMs are not incremental — retrain from scratch on new data.
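For the separate-calibration route, a sketch with `CalibratedClassifierCV` wrapped around a scaled SVC might look like this. The data is synthetic, and `cv=3` and `method='sigmoid'` are illustrative choices.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# probability=False (the default): the SVC only exposes decision_function
base = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))

# method='sigmoid' fits a Platt-style sigmoid on held-out decision values
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)
print(proba[:3])
```

Keeping calibration outside the SVC makes the number of folds and the calibration method explicit and easy to change.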
Should You Enable probability=True?
If: You need calibrated probabilities and n_samples < 10k
Use: Enable probability=True. The 5-fold CV overhead is acceptable.
If: You need probabilities and n_samples > 10k
Use: Skip probability=True. Use CalibratedClassifierCV on a held-out validation set instead.
If: You only need class labels (not probabilities)
Use: Set probability=False. Use decision_function values for threshold tuning.
● Production incident · POST-MORTEM · severity: high

When the RBF Kernel Predicted Everything as Class 0

Symptom
The model's recall dropped from 0.87 to 0.0 after retraining with new features. All predictions were the majority class.
Assumption
The team assumed the new features were irrelevant. They did not check feature scaling because 'the pipeline already scales'. Turns out, the scaler was fit on old features only.
Root cause
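Two new features had magnitudes 100x higher than the existing ones. Because the scaler was not re-fitted, they reached the RBF kernel unscaled and dominated every distance computation. With gamma at 1/n_features (the older scikit-learn default, now gamma='auto'), the kernel value exp(-γ||x-z||²) was close to zero for almost every pair of points, so every point looked equally far from every other and the classes could not be separated.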
Fix
Re-fit the StandardScaler on the full feature set. Then perform a grid search over C and gamma on a log scale, using gamma = 1 / (n_features * X.var()) (scikit-learn's gamma='scale') as the starting point. Validate with stratified cross-validation.
Key lesson
  • Always re-fit feature scalers when adding new features — even if the pipeline code exists.
  • RBF kernels are sensitive to feature scale: check that all features have roughly unit variance.
  • Plot decision function values to spot when margins collapse: if they all share one sign or have near-zero variance, the model is producing one-class output.
  • Grid search on C and gamma is not optional for RBF — default parameters rarely work in production.
  • Use stratified K-fold to keep class distribution in each fold — imbalanced folds mislead CV scores.
  • Monitor decision function distribution in production — sudden collapse to near-zero variance signals margin failure.
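The mechanism behind this incident can be reproduced at the kernel level with synthetic numbers (not the incident's data). The sketch below builds five unit-scale features plus two features about 100x larger, then compares the average RBF similarity between distinct points with and without re-scaling. The helper `mean_offdiag_rbf` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
old = rng.normal(0.0, 1.0, size=(n, 5))      # five features already at unit scale
new = rng.normal(0.0, 100.0, size=(n, 2))    # two new features, about 100x larger
X = np.hstack([old, new])

def mean_offdiag_rbf(X, gamma):
    """Mean RBF similarity between distinct points."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dists)
    off_diag = ~np.eye(len(X), dtype=bool)
    return K[off_diag].mean()

gamma = 1.0 / X.shape[1]                     # the 1/n_features setting from the incident

# Column-wise standardisation, which is what a re-fitted StandardScaler computes
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

unscaled_sim = mean_offdiag_rbf(X, gamma)
scaled_sim = mean_offdiag_rbf(X_scaled, gamma)
print(f"mean similarity, new features unscaled: {unscaled_sim:.4f}")
print(f"mean similarity, all features scaled:   {scaled_sim:.4f}")
```

When nearly every off-diagonal kernel entry is zero, the kernel matrix carries almost no information about which points are alike, and the classifier has little to work with beyond its bias term.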
Production debug guide
Symptom → root cause → fix in 5 minutes or less · 6 entries
Symptom · 01
Training never converges; loss oscillates or plateaus high
Fix
Check whether the data is scaled. SVMs are sensitive to feature scale and work best with zero-mean, unit-variance inputs, so use StandardScaler. Also increase max_iter or adjust tol.
Symptom · 02
All predictions are the majority class after training
Fix
Feature magnitude imbalance. Scale features, then tune C and gamma. Also check class weights — use 'balanced' if minority class is small.
Symptom · 03
Training takes hours on 10k samples
Fix
SVM is O(n^2) with typical SMO. Use LinearSVC (liblinear) for linear kernel. For non-linear, reduce dataset size via stratified subsampling or switch to SGDClassifier with hinge loss.
Symptom · 04
Training accuracy is high but validation accuracy is low
Fix
Overfitting. Decrease C (softer margin, stronger regularisation) or decrease gamma for RBF. Adding more training data also helps.
Symptom · 05
Probability outputs are poorly calibrated
Fix
Use CalibratedClassifierCV on the decision function instead of setting probability=True inside SVC. Avoid SVC's built-in Platt scaling for very large datasets.
Symptom · 06
Cross-validation scores vary dramatically between folds
Fix
Check for class imbalance in folds. Use StratifiedKFold with shuffle=True. Also examine feature distributions per fold for drift.
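A sketch of the stratified, shuffled cross-validation recommended above, with the scaler kept inside the pipeline so each fold fits it on its own training split. The data is synthetic, and `class_weight='balanced'` is an assumption added for the imbalanced example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (about 15% positives) as a stand-in
X, y = make_classification(n_samples=500, n_features=10, weights=[0.85, 0.15],
                           random_state=0)

# Scaler inside the pipeline: it is re-fitted on the training part of every fold
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", gamma="scale", class_weight="balanced"))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

print("per-fold F1:", scores.round(3))
print(f"mean {scores.mean():.3f}, std {scores.std():.3f}")
```

A large standard deviation across folds with this setup points at the data (small minority class, drift between subsets) rather than at the splitting scheme.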
★ SVM Quick Debug Cheat Sheet
Quick commands to diagnose when your SVM model fails in production.
Model predicts only one class
Immediate action
Check decision_function() values — are they all the same sign?
Commands (run in a Python session where model and X are already loaded)
dec = model.decision_function(X); print(dec.min(), dec.max(), dec.mean())
import numpy as np; print(np.mean(X, axis=0), np.std(X, axis=0))  # raw feature scales: look for columns orders of magnitude larger than the rest
Fix now
Re-scale features and perform grid search on C and gamma using halving grid search to reduce computation.
Training does not converge
Immediate action
Check whether max_iter was reached and whether the fit completed.
Commands (run in a Python session where model, X and y are already loaded)
print(model.n_iter_, model.fit_status_)  # fit_status_: 0 = fitted correctly, 1 = not
model.set_params(tol=1e-3, max_iter=10000); model.fit(X, y)  # adjust stopping criteria, then refit
Fix now
Scale features, then set max_iter to 10000 and tol to 1e-3 (already the SVC default; LinearSVC defaults to 1e-4). If it still fails, check for duplicate or constant features.
Model is too slow to train on 20k samples
Immediate action
Switch to LinearSVC or use SGDClassifier with hinge loss.
Commands
python -c "from sklearn.svm import LinearSVC; model = LinearSVC(dual=False, max_iter=10000); model.fit(X, y)"
python -c "from sklearn.linear_model import SGDClassifier; model = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3); model.fit(X, y)"
Fix now
Reduce dataset size via stratified subsampling (e.g., 10k samples) before training the full SVM. Use RBF on the full set only if essential.
Cross-validation scores differ significantly from test performance
Immediate action
Check for data leakage — did the scaler fit on the entire dataset?
Commands
python -c "from sklearn.model_selection import cross_val_score; print('CV scores:', cross_val_score(model, X, y, cv=5))"
python -c "print('Train acc:', model.score(X_train, y_train), 'Test acc:', model.score(X_test, y_test))"
Fix now
If train >> test, reduce complexity (decrease C or lower gamma) or use more data. Re-run with a proper train/test split and fit the scaler only on the training set.
Kernel Comparison
Kernel | Hyperparameters | Training Speed | When to Use
Linear | C | Fast (O(n)) | Many features or linear separability
RBF | C, gamma | Medium (O(n^2) to O(n^3)) | General purpose, small-to-medium datasets
Polynomial | C, gamma, degree, coef0 | Slow (O(n^3)), especially at high degree | Specific data shapes (e.g., circles)
Sigmoid | C, gamma, coef0 | Medium (O(n^2)) | Not recommended: may violate Mercer's condition

Key takeaways

1. SVMs maximise the margin: support vectors alone define the decision boundary.
2. The kernel trick enables non-linear separation without explicit feature mapping.
3. Always scale features before training: the RBF kernel is especially sensitive.
4. Tune C and gamma together on a log-scale grid: defaults are rarely optimal.
5. For large datasets, drop RBF and use linear SVM or approximate kernels.
6. Monitor support vector count and decision function distribution in production.
7. Probabilities via Platt scaling are expensive: calibrate separately if needed.

Common mistakes to avoid

Forgetting to scale features before fitting SVM

Symptom
Model predicts only one class or training loss never drops. With RBF kernel, unscaled features cause distance domination by large-magnitude features.
Fix
Always apply StandardScaler (or MinMaxScaler) as the first step in a pipeline. Fit on training set only, then transform validation/test.

Using default C and gamma without tuning

Symptom
The model underfits (low training and test accuracy) or overfits (high train, low test). Defaults are rarely optimal for real data.
Fix
Perform grid search with logarithmic spacing: C ∈ [0.01, 100], gamma ∈ [0.001, 1000]. Use StratifiedKFold to handle class imbalance.

Applying SVM directly to large datasets (n > 100k)

Symptom
Training takes hours or never finishes. SMO is O(n^2) to O(n^3) — not designed for big data.
Fix
Use LinearSVC (liblinear) which scales linearly, or approximate the kernel with Nystroem + LinearSVC. For non-linear, subsample the data first.

Ignoring class imbalance

Symptom
High accuracy but low recall on minority class. Default C treats all classes equally — minority points become outliers.
Fix
Set class_weight='balanced' or manually compute class weights. Alternatively, use oversampling (SMOTE) before training.
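The weights that `class_weight='balanced'` applies can be inspected with scikit-learn's `compute_class_weight` helper. The toy label vector below (90 majority, 10 minority) is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90 majority samples and 10 minority samples (toy labels)
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# Same numbers by hand: n_samples / (n_classes * count_per_class)
by_hand = len(y) / (2 * np.bincount(y))

print("balanced weights:", weights)
print("minority / majority ratio:", weights[1] / weights[0])
```

Inside SVC these weights multiply C per class, so errors on the minority class are penalised more heavily than errors on the majority class.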

Using probability=True on very large datasets

Symptom
Training time doubles or triples because Platt scaling adds 5-fold CV inside SVC.
Fix
Set probability=False and use CalibratedClassifierCV on the decision function separately. Or skip probabilities and use decision_function values with threshold tuning.

Not handling missing values before training

Symptom
SVM raises ValueError because NaN values are present. SVM cannot handle missing data natively.
Fix
Impute missing values before scaling. Use SimpleImputer (mean, median) or IterativeImputer in a pipeline. Always impute after splitting to avoid data leakage.
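Putting the imputer inside the pipeline handles both the ordering and the leakage concern in one place. A minimal sketch, assuming synthetic data with about 5% of values knocked out at random:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Knock out about 5% of the values to simulate missing data
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Order matters: impute, then scale, then classify.
# fit() learns the medians and scaling statistics from X_train only.
model = make_pipeline(SimpleImputer(strategy="median"),
                      StandardScaler(),
                      SVC(kernel="rbf", gamma="scale"))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy with imputed inputs: {test_accuracy:.3f}")
```

At prediction time the same pipeline fills missing values with the training medians, so the test set never influences the imputation.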
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the difference between primal and dual formulations of SVM. Why ...
Q02 · SENIOR
How does the kernel trick work, and why is the RBF kernel a default choi...
Q03 · JUNIOR
What is the role of the C hyperparameter in SVM?
Q04 · SENIOR
How would you handle class imbalance when training an SVM?
Q05 · SENIOR
Why does SVM training become slow for large datasets, and what are the a...
Q06 · JUNIOR
What is the difference between hard-margin and soft-margin SVM? When wou...
Q01 of 06 · SENIOR

Explain the difference between primal and dual formulations of SVM. Why does the dual matter?

ANSWER
The primal problem minimises ||w||² subject to constraints. The dual replaces w with a sum over α_i y_i x_i, turning the optimisation into a function of dot products. This dual formulation allows the kernel trick: any dot product can be replaced by a kernel function, enabling non-linear boundaries without explicit feature maps. The dual also reveals that only support vectors (α > 0) define the model, making inference sparse.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What is Support Vector Machine in simple terms?
02. When should I use SVM vs logistic regression?
03. How do I choose between linear and RBF kernel?
04. Why does my SVM always predict the same class?
05. Can SVM handle missing values?
06. What is the difference between SVC and LinearSVC in scikit-learn?