Introduction to Scikit-Learn — Machine Learning in Python
Scikit-Learn is the most widely used machine learning library in Python — and for good reason. It provides clean, consistent implementations of hundreds of algorithms, from linear regression to random forests, all behind the same simple interface.
Most machine learning tutorials start with theory and work their way to code. This article does the opposite: you'll train a real classifier in the first five minutes, then understand why each step works the way it does.
By the end you'll understand scikit-learn's core design philosophy, know how to evaluate a model properly, and have a working classification pipeline you can apply to any dataset.
The fit/predict Interface — Scikit-Learn's Killer Feature
Every estimator in scikit-learn implements the same two methods: fit(X, y) to train the model, and predict(X) to use it. This consistency means you can swap a LogisticRegression for a RandomForestClassifier in one line without changing anything else. This design decision is what makes scikit-learn so powerful for experimentation.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the classic Iris dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Labels: 0=setosa, 1=versicolor, 2=virginica

# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a K-Nearest Neighbours classifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)  # Learn from training data

# Predict on unseen test data
predictions = classifier.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2%}")

# Swap to a different algorithm — only ONE line changes
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest accuracy: {accuracy_score(y_test, rf.predict(X_test)):.2%}")
```
Random Forest accuracy: 100.00%
Train/Test Split — Why You Must Never Evaluate on Training Data
Evaluating a model on the same data it trained on is like giving students an exam using the exact questions they studied. Of course they'll score 100%. The model has memorised the training data and tells you nothing about whether it can generalise. Always hold out a test set the model never sees during training.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unlimited depth tree — will memorise every training example
overfitted_tree = DecisionTreeClassifier(max_depth=None)
overfitted_tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, overfitted_tree.predict(X_train))
test_acc = accuracy_score(y_test, overfitted_tree.predict(X_test))

print(f"Training accuracy: {train_acc:.2%}")  # Perfect — it memorised
print(f"Test accuracy: {test_acc:.2%}")       # Lower — it can't generalise
print(f"Overfitting gap: {train_acc - test_acc:.2%}")
```
Training accuracy: 100.00%
Test accuracy: 96.67%
Overfitting gap: 3.33%
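The usual fix is to constrain the model's capacity. As a minimal sketch on the same Iris split, capping the tree's depth (the `max_depth=3` value here is an illustrative choice, not a tuned one) prevents it from memorising every training example and typically shrinks the train/test gap:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Depth limit stops the tree from carving out a leaf per training example
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, pruned_tree.predict(X_train))
test_acc = accuracy_score(y_test, pruned_tree.predict(X_test))
print(f"Training accuracy: {train_acc:.2%}")
print(f"Test accuracy: {test_acc:.2%}")
print(f"Overfitting gap: {train_acc - test_acc:.2%}")
```

On a dataset as small and clean as Iris the difference is modest; on noisier data, capacity limits like `max_depth`, `min_samples_leaf`, or `min_samples_split` matter far more.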
| Algorithm Type | Scikit-Learn Class | Best For |
|---|---|---|
| Linear Classification | LogisticRegression | Linearly separable data, interpretable results |
| Tree-based | RandomForestClassifier | Mixed feature types, robust to outliers |
| Nearest Neighbours | KNeighborsClassifier | Small datasets, non-linear boundaries |
| Support Vector | SVC | High-dimensional data, clear margin problems |
| Gradient Boosting | GradientBoostingClassifier | Tabular data, competitions |
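Because every class in the table above implements the same `fit`/`predict` interface, you can benchmark them in a single loop. A minimal sketch on the Iris split from earlier (the hyperparameters here are illustrative defaults, not tuned values):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One loop, four algorithms — the shared interface does the rest
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
    "SVC": SVC(),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:25s} {results[name]:.2%}")
```

This pattern scales to any estimator in the table: add an entry to the dictionary and the loop handles it unchanged.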
🎯 Key Takeaways
- All scikit-learn estimators share the same fit()/predict() interface — swap algorithms in one line
- Always split into train and test sets before any preprocessing
- Fit preprocessors (scalers, encoders) on training data only, then transform test data
- Accuracy is misleading for imbalanced datasets — use F1-score, precision, and recall
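The preprocessing rule in the takeaways is easiest to get right with a `Pipeline`, which guarantees the scaler is fitted only on whatever data `fit` receives. A minimal sketch, using `StandardScaler` and `LogisticRegression` as examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Manual version: fit the scaler on training data ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses them — no refitting

# Pipeline version: enforces the same discipline automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)            # scaler sees only training data
print(f"Pipeline accuracy: {pipe.score(X_test, y_test):.2%}")
```

The pipeline also keeps cross-validation honest: each fold refits the scaler on that fold's training portion, so no test statistics ever leak in.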
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Fitting the scaler on the entire dataset before splitting — this leaks test data statistics into your preprocessing. Always fit the scaler on training data only, then transform both train and test.
- ✕ Mistake 2: Using accuracy for imbalanced datasets — if 95% of samples are class 0, a model that always predicts 0 gets 95% accuracy. Use precision, recall, and F1-score for imbalanced problems.
- ✕ Mistake 3: Not setting random_state — without a fixed seed, train_test_split gives different splits each run, making results unreproducible. Always set random_state=42 (or any fixed number).
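Mistake 2 is easy to demonstrate with synthetic labels. In this sketch, a "model" that always predicts the majority class scores 95% accuracy yet finds zero positives, which recall and F1 expose immediately (the labels here are fabricated for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced problem: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_always_zero = np.zeros(100, dtype=int)  # a "model" that predicts 0 every time

print(f"Accuracy:  {accuracy_score(y_true, y_always_zero):.2%}")   # 95.00%, looks great
print(f"Precision: {precision_score(y_true, y_always_zero, zero_division=0):.2%}")
print(f"Recall:    {recall_score(y_true, y_always_zero):.2%}")     # 0.00%, finds no positives
print(f"F1-score:  {f1_score(y_true, y_always_zero):.2%}")         # 0.00%
```

For a single readable summary of all these metrics per class, `sklearn.metrics.classification_report` is the usual tool.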
Interview Questions on This Topic
- Q: What is the difference between fit(), transform(), and fit_transform() in scikit-learn?
- Q: Why should you never fit a scaler on the test set?
- Q: What metric would you use for a classification problem where only 1% of samples are positive?
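The first question above can be answered in a few lines. A minimal sketch with `StandardScaler` on a tiny made-up array: `fit()` learns parameters, `transform()` applies them, and `fit_transform()` combines both on the same data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# fit() learns the parameters (here: per-feature mean and std) and returns self
scaler = StandardScaler().fit(X)

# transform() applies those learned parameters to any data
X_scaled = scaler.transform(X)

# fit_transform() does both in one call — identical result on the same data
X_scaled_2 = StandardScaler().fit_transform(X)

print(np.allclose(X_scaled, X_scaled_2))  # True
print(scaler.mean_)                       # [2.5]
```

The distinction matters on the test set: there you call `transform()` only, so the parameters learned from training data are reused rather than recomputed.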
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.