
Introduction to Scikit-Learn — Machine Learning in Python

⚡ Quick Answer
Scikit-Learn is like a Swiss Army knife for machine learning. Just as every tool in the knife follows the same basic shape so you can pick it up and use it without re-learning, every algorithm in scikit-learn follows the same interface: fit() to learn from data, predict() to make predictions, score() to evaluate. You swap algorithms in one line of code.

Scikit-Learn is the most widely used machine learning library in Python — and for good reason. It provides clean, consistent implementations of hundreds of algorithms, from linear regression to random forests, all behind the same simple interface.

Most machine learning tutorials start with theory and work their way to code. This article does the opposite: you'll train a real classifier in the first five minutes, then understand why each step works the way it does.

By the end you'll understand scikit-learn's core design philosophy, know how to evaluate a model properly, and have a working classification pipeline you can apply to any dataset.

The fit/predict Interface — Scikit-Learn's Killer Feature

Every estimator in scikit-learn implements the same two methods: fit(X, y) to train the model, and predict(X) to use it. This consistency means you can swap a LogisticRegression for a RandomForestClassifier in one line without changing anything else. This design decision is what makes scikit-learn so powerful for experimentation.

first_classifier.py · PYTHON
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the classic Iris dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Labels: 0=setosa, 1=versicolor, 2=virginica

# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a K-Nearest Neighbours classifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)  # Learn from training data

# Predict on unseen test data
predictions = classifier.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2%}")

# Swap to a different algorithm — only ONE line changes
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest accuracy: {accuracy_score(y_test, rf.predict(X_test)):.2%}")
▶ Output
Test accuracy: 100.00%
Random Forest accuracy: 100.00%
🔥 Why 100% Accuracy? The Iris dataset is very clean and well-separated. Real datasets won't be this easy. The key lesson here is the consistent fit/predict API — not the accuracy number.

Train/Test Split — Why You Must Never Evaluate on Training Data

Evaluating a model on the same data it trained on is like giving students an exam using the exact questions they studied. Of course they'll score 100%. The model has memorised the training data and tells you nothing about whether it can generalise. Always hold out a test set the model never sees during training.

overfitting_demo.py · PYTHON
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Unlimited depth tree — will memorise every training example
overfitted_tree = DecisionTreeClassifier(max_depth=None)
overfitted_tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, overfitted_tree.predict(X_train))
test_acc  = accuracy_score(y_test,  overfitted_tree.predict(X_test))

print(f"Training accuracy: {train_acc:.2%}")  # Perfect — it memorised
print(f"Test accuracy:     {test_acc:.2%}")   # Lower — it can't generalise
print(f"Overfitting gap:   {train_acc - test_acc:.2%}")
▶ Output
Training accuracy: 100.00%
Test accuracy: 96.67%
Overfitting gap: 3.33%
⚠️ Watch Out: Even a small gap between training and test accuracy signals overfitting. In real-world datasets with noise, this gap is often 10–30%. Always report test accuracy, never training accuracy.
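The usual remedy is to constrain the model. As a quick sketch (not part of the original demo, and the exact numbers will vary), capping max_depth on the same Iris split regularises the tree and typically shrinks the train/test gap:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for depth in [None, 3, 2]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # score() on a classifier returns mean accuracy
    gap = tree.score(X_train, y_train) - tree.score(X_test, y_test)
    print(f"max_depth={depth}: overfitting gap = {gap:.2%}")
```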
| Algorithm Type | Scikit-Learn Class | Best For |
| --- | --- | --- |
| Linear Classification | LogisticRegression | Linearly separable data, interpretable results |
| Tree-based | RandomForestClassifier | Mixed feature types, robust to outliers |
| Nearest Neighbours | KNeighborsClassifier | Small datasets, non-linear boundaries |
| Support Vector | SVC | High-dimensional data, clear margin problems |
| Gradient Boosting | GradientBoostingClassifier | Tabular data, competitions |
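Because every class in the table shares the same fit/predict/score interface, you can benchmark them all in a single loop. A sketch on the same Iris split used above (the hyperparameters here are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=42),
    KNeighborsClassifier(n_neighbors=3),
    SVC(),
    GradientBoostingClassifier(random_state=42),
]

for model in models:
    model.fit(X_train, y_train)
    # Every classifier exposes score() — mean accuracy on the given data
    print(f"{type(model).__name__}: {model.score(X_test, y_test):.2%}")
```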

🎯 Key Takeaways

  • All scikit-learn estimators share the same fit()/predict() interface — swap algorithms in one line
  • Always split into train and test sets before any preprocessing
  • Fit preprocessors (scalers, encoders) on training data only, then transform test data
  • Accuracy is misleading for imbalanced datasets — use F1-score, precision, and recall
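The third takeaway, fitting preprocessors on training data only, takes just a few lines. StandardScaler below stands in for any preprocessor; the pattern is the same for encoders and imputers:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data ONLY
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on test data
```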

⚠ Common Mistakes to Avoid

  • Mistake 1: Fitting the scaler on the entire dataset before splitting — this leaks test data statistics into your preprocessing. Always fit the scaler on training data only, then transform both train and test.
  • Mistake 2: Using accuracy for imbalanced datasets — if 95% of samples are class 0, a model that always predicts 0 gets 95% accuracy. Use precision, recall, and F1-score for imbalanced problems.
  • Mistake 3: Not setting random_state — without a fixed seed, train_test_split gives different splits each run, making results unreproducible. Always set random_state=42 (or any fixed number).
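Mistake 2 is easy to demonstrate with synthetic labels (the 95/5 class split below is made up for illustration). DummyClassifier with strategy="most_frequent" always predicts the majority class, which is exactly the degenerate model described above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic imbalanced labels: 95 samples of class 0, 5 of class 1
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this demo

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, pred):.2%}")              # high, but meaningless
print(f"F1-score: {f1_score(y, pred, zero_division=0):.2f}")   # exposes the failure
```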

Interview Questions on This Topic

  • Q: What is the difference between fit(), transform(), and fit_transform() in scikit-learn?
  • Q: Why should you never fit a scaler on the test set?
  • Q: What metric would you use for a classification problem where only 1% of samples are positive?
Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
