
Machine Learning Roadmap 2026 – From Complete Beginner to Job-Ready

📍 Part of: ML Basics → Topic 15 of 25
Step-by-step 6-month machine learning roadmap for developers.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Follow a structured 6-month roadmap — course-hopping without projects wastes months and produces fragile knowledge
  • Master 4 core algorithms deeply rather than surveying 20 algorithms superficially
  • Deploy portfolio projects — one deployed API with documentation beats ten completed courses on a resume
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • This roadmap takes you from zero ML knowledge to job-ready in approximately 6 months of consistent study
  • Month 1-2: Python, math foundations, and data manipulation with pandas/numpy
  • Month 3-4: Core ML algorithms — supervised, unsupervised, and model evaluation
  • Month 5-6: Deep learning, MLOps, portfolio projects, and interview preparation
  • Performance insight: 2 hours daily for 6 months equals 360 hours — sufficient for junior ML roles
  • Production insight: hiring managers value deployed projects over certificates — build and ship real models
🚨 START HERE
Learning Environment Setup Cheat Sheet
Immediate setup commands for a 2026-ready ML development environment
🟡 Need to set up a Python ML environment from scratch
Immediate Action: Install Python 3.11+, create a virtual environment, and install the core libraries
Commands
python3 -m venv ml_env && source ml_env/bin/activate
pip install numpy pandas scikit-learn matplotlib jupyter seaborn xgboost lightgbm torch fastapi uvicorn mlflow joblib
Fix Now: Verify the installation: python -c "import sklearn; print(sklearn.__version__)"
🟡 Need GPU access for deep learning without buying hardware
Immediate Action: Use free cloud GPU environments for training
Commands
# Option 1: Google Colab — open colab.research.google.com, enable GPU runtime
# Option 2: Kaggle Notebooks — free 30 GPU hours per week, no setup required
# Option 3: Lightning.ai — free tier with GPU access and VS Code interface
🟡 Need to call an LLM API for a portfolio project
Immediate Action: Install the OpenAI or Anthropic SDK and set your API key as an environment variable — never hardcode keys in source files
Commands
pip install openai anthropic python-dotenv
echo 'OPENAI_API_KEY=your_key_here' >> .env
Fix Now: Verify connectivity: python -c "from openai import OpenAI; client = OpenAI(); print('API connected')"
🟡 Need to version and track ML experiments
Immediate Action: Initialize MLflow tracking in your project directory
Commands
pip install mlflow && mlflow ui
# In your training script: import mlflow; mlflow.autolog()
Fix Now: Open http://localhost:5000 to view experiment runs, metrics, and saved models
Production Incident: Six Months of Random Tutorials, Zero Job Offers
A developer spent 6 months taking ML courses from 12 different platforms but could not pass a single technical interview because they could not explain fundamentals.
Symptom: Applied to 47 ML positions. Received zero callbacks. Resume listed 12 course certificates but no projects, no GitHub portfolio, and no deployed models.
Assumption: Completing many courses would demonstrate competence. The developer believed certificates equaled job-readiness.
Root cause: Course-hopping without building projects left the developer with fragmented knowledge and no practical skills. Interviewers asked about the bias-variance tradeoff, cross-validation strategy, and production deployment — concepts that require hands-on experience, not video lectures. The learning path lacked structure, projects, and depth. In 2026, interviewers are also asking about prompt engineering, retrieval-augmented generation, and responsible AI — topics that rarely appear in generic course catalogs.
Fix:
1. Followed a structured 6-month roadmap with monthly project milestones
2. Built 6 portfolio projects deployed on GitHub with README documentation
3. Practiced ML system design interviews using real-world scenarios
4. Contributed to one open-source ML library for resume differentiation
5. Added one LLM-integrated project to demonstrate awareness of the current production landscape
Key Lesson
  • Certificates without projects are invisible to hiring managers
  • A structured roadmap prevents course-hopping and knowledge fragmentation
  • Deployed projects demonstrate skills that certificates cannot
  • In 2026, knowing when NOT to use an LLM is as important as knowing how to call one
Production Debug Guide
Symptom-to-action mapping for common learning obstacles
  • Stuck on math concepts and cannot progress → Skip the proof, learn the intuition. Use 3Blue1Brown videos for visual understanding. Return to math rigor after you can apply concepts in code. Most production ML engineers never hand-derive a gradient — they understand what the optimizer is doing, not every step of the calculus.
  • Tutorial hell — can follow along but cannot build independently → Stop watching tutorials. Take the last tutorial project, delete the code, and rebuild it from memory. Then modify it with a new dataset or feature. The rebuild step is where real learning happens — passive consumption builds false confidence.
  • Overwhelmed by the number of ML algorithms to learn → Focus on 4 algorithms first: linear regression, logistic regression, random forest, and gradient boosting. These cover 80% of production ML use cases. Everything else — SVMs, k-nearest neighbors, naive Bayes — is supplementary knowledge you pick up when a specific problem demands it.
  • Cannot stay motivated after month 2 → Join a Kaggle competition or find a study group. External accountability and community support sustain motivation better than solo study. Alternatively, pick a dataset tied to a domain you care about — sports, healthcare, finance — and build something personally meaningful.
  • Projects feel too simple to impress employers → Deploy the project with an API, add monitoring, write tests, and document decisions. A simple model with production infrastructure beats a complex model living in a notebook. Add a section to the README explaining what you would do differently with more time — that level of reflection signals engineering maturity.
  • Unsure whether to focus on traditional ML or LLMs in 2026 → Learn both layers. Traditional ML is the foundation — gradient boosting still powers fraud detection, pricing models, and recommendation systems at scale. LLMs are the interface layer — most new products are built on top of APIs like OpenAI, Anthropic, or open-weight models like Llama 3. Your competitive advantage is knowing when each is appropriate.

Machine learning roles require a specific skill progression that most bootcamps and courses fail to structure correctly. Developers waste months on disconnected tutorials without building deployable skills. This roadmap compresses the learning path into 6 months of focused study at 2 hours per day. Each month has concrete objectives, free resources, and a portfolio project. The sequence is designed so every concept builds on the previous one — no gaps, no dead ends. In 2026, the bar for entry-level ML roles has risen: hiring managers expect candidates to demonstrate working code, deployed models, and at least a surface-level understanding of LLM APIs and responsible AI practices. This roadmap accounts for that shift.

Month 1-2: Python, Math Foundations, and Data Manipulation

Months 1 and 2 build the foundation that every subsequent concept depends on. Python fluency is non-negotiable — you need to write clean functions, work with classes, and manipulate data structures without friction. Math foundations cover linear algebra (vectors, matrices, dot products), calculus (derivatives, gradients, chain rule intuition), and probability (distributions, Bayes theorem, conditional probability). Data manipulation means loading, cleaning, transforming, and visualizing datasets. Skip nothing here — gaps in foundations create cascading confusion later. In 2026, add one additional skill to this phase: learn to read and write basic SQL. The majority of production ML pipelines pull training data from SQL databases, not CSV files.

month1_2_foundation.py · PYTHON
# TheCodeForge — Month 1-2 Foundation Checklist
# Verify you can do each of these without looking anything up

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Python: functions, classes, list comprehensions
def compute_feature_stats(data: pd.DataFrame, columns: list) -> dict:
    return {
        col: {
            'mean': data[col].mean(),
            'std': data[col].std(),
            'null_pct': data[col].isnull().mean() * 100
        }
        for col in columns
    }

# Math: vector operations in numpy
weights = np.array([0.5, 0.3, 0.2])
features = np.array([1.0, 2.0, 3.0])
prediction = np.dot(weights, features)  # dot product — this is what linear models do
print(f'Prediction: {prediction}')

# Math: gradient intuition — what a derivative looks like in code
# The gradient of MSE loss w.r.t. weights drives parameter updates
def mse_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    residuals = X @ w - y
    return (2 / len(y)) * X.T @ residuals  # derivative of MSE

# Data manipulation: pandas fluency
df = pd.DataFrame({
    'age': [25, 30, 35, None, 45],
    'income': [50000, 60000, None, 80000, 90000],
    'purchased': [0, 1, 0, 1, 1]
})

# Clean, transform, and analyze in one pipeline
result = (
    df
    .fillna(df.median(numeric_only=True))
    .assign(age_group=lambda x: pd.cut(x['age'], bins=[20, 30, 40, 50]))
    .groupby('age_group')['purchased']
    .mean()
)
print(f'Purchase rate by age group:\n{result}')

# Feature stats across all numeric columns
stats = compute_feature_stats(df, ['age', 'income'])
for col, metrics in stats.items():
    print(f'{col}: mean={metrics["mean"]:.1f}, std={metrics["std"]:.1f}, null%={metrics["null_pct"]:.1f}')

# Visualization: basic exploratory plot
plt.figure(figsize=(8, 4))
plt.scatter(df['age'], df['income'], c=df['purchased'], cmap='coolwarm', s=80)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Purchase Behavior by Age and Income')
plt.colorbar(label='Purchased')
plt.tight_layout()
plt.savefig('scatter_plot.png')
print('Plot saved to scatter_plot.png')
▶ Output
Prediction: 1.7
Purchase rate by age group:
age_group
(20, 30]    0.5
(30, 40]    0.5
(40, 50]    1.0
Name: purchased, dtype: float64
age: mean=33.8, std=8.5, null%=20.0
income: mean=70000.0, std=18257.4, null%=20.0
Plot saved to scatter_plot.png
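One gap in the checklist above: mse_gradient is defined but never run. Here is a minimal sketch of the gradient descent loop it would drive — the synthetic data, learning rate, and step count are illustrative choices, not part of the original checklist:

```python
import numpy as np

def mse_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    residuals = X @ w - y
    return (2 / len(y)) * X.T @ residuals  # derivative of MSE w.r.t. weights

# Synthetic regression problem with a known answer
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([0.5, 0.3, 0.2])
y = X @ true_w  # noiseless targets, so gradient descent can recover true_w

# The loop that every optimizer is a fancier version of
w = np.zeros(3)
learning_rate = 0.1
for step in range(500):
    w -= learning_rate * mse_gradient(X, y, w)  # step against the gradient

print(f'Recovered weights: {np.round(w, 3)}')  # close to [0.5, 0.3, 0.2]
```

Seeing the weights converge to the values you planted is the code-to-concept connection this phase is about: model.fit() is this loop with more machinery.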
Mental Model
Foundation Learning Strategy
Foundations feel slow but determine your ceiling — rushing here creates confusion that compounds for months.
  • Python fluency means writing code, not reading it — close the tutorial and build something
  • Math intuition matters more than proofs at this stage — understand what the dot product represents before you memorize the formula
  • Pandas fluency is the single most important data skill for production ML work
  • If you cannot clean a messy dataset independently, you cannot build a reliable model
  • Learn basic SQL in parallel — most real training data lives in Postgres or BigQuery, not CSV files
📊 Production Insight
80% of production ML time is data cleaning and pipeline maintenance, not model training.
Pandas fluency directly determines your speed on real projects and during technical interviews.
Skipping foundations to jump to algorithms produces fragile knowledge that collapses under interview pressure.
In 2026, engineers who can move fluidly between Python, SQL, and shell commands are hired faster than those who know only notebooks.
🎯 Key Takeaway
Foundations determine your ceiling — do not skip them to chase algorithms.
Pandas fluency is the most important practical skill for day-one ML productivity.
Add SQL to this phase — production data does not come in tidy CSV files.
Foundation Resource Selection
If: Already know Python basics
Use: Skip Python review — focus on numpy, pandas, and SQL immediately
If: No programming background at all
Use: Spend 2 weeks on Python basics before touching any ML concept — CS50P on edX is free and excellent
If: Strong math background (STEM degree)
Use: Skip formal math review — focus on implementing math in numpy to build the code-to-concept connection
If: Weak math background
Use: Watch 3Blue1Brown's linear algebra and calculus series for visual intuition before reading any ML textbook
If: Already know pandas but not SQL
Use: Do the SQLZoo interactive tutorial — 4 to 6 hours covers everything you need for pulling training data
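The SQL skill called out for this phase plugs into pandas in one line. A hedged sketch using Python's stdlib sqlite3 module as a stand-in for a production Postgres or BigQuery source — the customers table and its columns are hypothetical:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a production SQL source
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE customers (id INTEGER, tenure_months INTEGER, churned INTEGER);
    INSERT INTO customers VALUES (1, 5, 1), (2, 40, 0), (3, 8, 1), (4, 60, 0);
""")

# The everyday pattern: SQL does the filtering, pandas receives the result
query = """
    SELECT tenure_months, churned
    FROM customers
    WHERE tenure_months < 50
    ORDER BY tenure_months
"""
df = pd.read_sql(query, conn)
print(df)
conn.close()
```

Pushing filters and joins into the query instead of loading whole tables into pandas is the habit that matters once the data no longer fits in memory.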

Month 3-4: Core ML Algorithms and Model Evaluation

Months 3 and 4 cover the algorithms that power 80% of production ML systems. Start with linear regression and logistic regression — these teach the fundamental concepts of fitting, prediction, loss optimization, and evaluation. Then move to decision trees, random forests, and gradient boosting — these handle the nonlinear, messy, real-world data that linear models cannot. XGBoost and LightGBM are the specific implementations you will encounter in production and on Kaggle. Model evaluation is as important as model training: learn cross-validation, confusion matrices, precision, recall, F1, and ROC-AUC. A model you cannot evaluate is a model you cannot trust. This phase also introduces scikit-learn Pipelines — the right way to bundle preprocessing and modeling steps so your code is reproducible and deployment-ready from day one.

month3_4_core_ml.py · PYTHON
# TheCodeForge — Month 3-4: Core ML Algorithms with Pipeline Pattern
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)

# Generate realistic churn-style dataset
np.random.seed(42)
n_samples = 2000
X = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 5000, n_samples),
    'support_tickets': np.random.poisson(2, n_samples),
    'contract_type': np.random.choice([0, 1, 2], n_samples)
})
y = ((X['tenure_months'] < 12) & (X['monthly_charges'] > 60)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use Pipelines — this is how production code is structured
models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
    ]),
    'Random Forest': Pipeline([
        ('clf', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42))
    ]),
    'Gradient Boosting': Pipeline([
        ('clf', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=42))
    ]),
}

results = {}
for name, pipeline in models.items():
    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)
    proba = pipeline.predict_proba(X_test)[:, 1]
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='f1')
    results[name] = {
        'accuracy': accuracy_score(y_test, preds),
        'f1': f1_score(y_test, preds),
        'auc': roc_auc_score(y_test, proba),
        'cv_f1_mean': cv_scores.mean(),
        'cv_f1_std': cv_scores.std()
    }
    print(f'{name}:')
    print(f'  Test Accuracy : {results[name]["accuracy"]:.2%}')
    print(f'  F1 Score      : {results[name]["f1"]:.2%}')
    print(f'  ROC-AUC       : {results[name]["auc"]:.2%}')
    print(f'  CV F1         : {results[name]["cv_f1_mean"]:.2%} (+/- {results[name]["cv_f1_std"]:.2%})')
    print()

best_model_name = max(results, key=lambda k: results[k]['cv_f1_mean'])
print(f'Best model by CV F1: {best_model_name}')
▶ Output
Logistic Regression:
  Test Accuracy : 85.25%
  F1 Score      : 78.43%
  ROC-AUC       : 89.12%
  CV F1         : 77.89% (+/- 2.14%)

Random Forest:
  Test Accuracy : 91.50%
  F1 Score      : 86.72%
  ROC-AUC       : 95.34%
  CV F1         : 85.94% (+/- 1.87%)

Gradient Boosting:
  Test Accuracy : 93.25%
  F1 Score      : 89.15%
  ROC-AUC       : 97.01%
  CV F1         : 88.67% (+/- 1.52%)

Best model by CV F1: Gradient Boosting
⚠ Model Evaluation Is Not Optional
📊 Production Insight
Gradient boosting wins most tabular data competitions and is the default choice for structured data in production.
Random forests are more forgiving when hyperparameter tuning time is limited — a good default under deadline pressure.
Cross-validation with stratification is mandatory for imbalanced datasets — without it, your fold metrics are statistically unreliable.
Pipelines prevent target leakage during cross-validation — fitting a scaler outside a pipeline leaks test data into training and inflates reported performance.
🎯 Key Takeaway
Master 4 algorithms: logistic regression, random forest, gradient boosting, and XGBoost.
Model evaluation skills are as important as training skills — interviewers test both.
Use sklearn Pipelines from day one — they are the industry standard and prevent subtle data leakage bugs.
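The leakage point deserves a concrete demonstration. This sketch uses feature selection rather than scaling — a more dramatic leak — on pure-noise data where honest accuracy should hover near chance; the dataset shape and k are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure-noise data: no feature actually predicts y, so honest accuracy is ~50%
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

# WRONG: selecting features on the full dataset before CV leaks the test folds
X_leaked = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaked_score = cross_val_score(LogisticRegression(), X_leaked, y, cv=5).mean()

# RIGHT: put selection inside a Pipeline so it is re-fit on each training fold
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression()),
])
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f'Leaked CV accuracy : {leaked_score:.2f}')  # inflated above chance
print(f'Honest CV accuracy : {honest_score:.2f}')  # typically near 0.5
```

On random labels, the leaked setup reports accuracy well above chance — exactly the kind of inflated notebook metric that collapses in production.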

Month 5: Deep Learning and Advanced Topics

Month 5 introduces neural networks and deep learning — and critically, the judgment to know when to use them. Start with a simple feedforward network using PyTorch, then move to convolutional neural networks for image data and Transformer-based models for text. In 2026, this month also covers the LLM API layer: calling OpenAI or Anthropic APIs, building basic RAG (Retrieval-Augmented Generation) pipelines with a vector store, and understanding when fine-tuning is warranted versus when prompt engineering is sufficient. Deep learning is not always the answer — for tabular data, gradient boosting still wins. The skill is knowing which tool the problem demands.

month5_deep_learning.py · PYTHON
# TheCodeForge — Month 5: Deep Learning with PyTorch + LLM API Awareness
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# --- Part 1: Feedforward Neural Network in PyTorch ---
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test)

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

class FeedForwardNet(nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x).squeeze()

model = FeedForwardNet(input_size=20)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(100):
    model.train()
    epoch_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        output = model(X_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step()
    if (epoch + 1) % 25 == 0:
        print(f'Epoch {epoch+1}/100 | Loss: {epoch_loss/len(train_loader):.4f}')

model.eval()
with torch.no_grad():
    predictions = (model(X_test_t) > 0.5).float()
    accuracy = (predictions == y_test_t).float().mean()
    print(f'Neural Network Test Accuracy: {accuracy:.2%}')

# --- Part 2: LLM API Pattern (2026 skill) ---
# This shows the pattern — replace with your actual API key via environment variable
# from openai import OpenAI
# import os
#
# client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
#
# def classify_sentiment(text: str) -> str:
#     response = client.chat.completions.create(
#         model='gpt-4o',
#         messages=[
#             {'role': 'system', 'content': 'Classify sentiment as positive, negative, or neutral.'},
#             {'role': 'user', 'content': text}
#         ],
#         temperature=0  # deterministic for classification tasks
#     )
#     return response.choices[0].message.content
#
# print(classify_sentiment('This product exceeded all my expectations.'))
# Output: positive
▶ Output
Epoch 25/100 | Loss: 0.3821
Epoch 50/100 | Loss: 0.2914
Epoch 75/100 | Loss: 0.2453
Epoch 100/100 | Loss: 0.2201
Neural Network Test Accuracy: 92.50%
💡When Deep Learning Is the Right Choice in 2026
  • Image data: CNNs and Vision Transformers dominate — start with a pretrained EfficientNet or ViT via torchvision
  • Text data: Transformers dominate — use sentence-transformers for embeddings, fine-tune BERT for classification
  • Tabular data: gradient boosting still wins — do not reach for a neural network when XGBoost will do
  • LLM tasks: use API-first before considering fine-tuning — GPT-4o or Claude with a good prompt beats a fine-tuned small model for most NLP tasks
  • Time series: ARIMA and Prophet for simple trends, PatchTST or TimesNet for complex multivariate forecasting
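To make the RAG pattern concrete without any API key, here is a toy retrieve step. The corpus, tokenizer, and bag-of-words "embedding" are illustrative stand-ins — a real pipeline would use learned embeddings (e.g. sentence-transformers) and a vector store such as ChromaDB, but the retrieve-then-prompt shape is the same:

```python
import re
import numpy as np

# Toy corpus — in a real RAG pipeline these would be chunked documents
docs = [
    'Gradient boosting is the default for tabular data.',
    'Transformers dominate modern NLP tasks.',
    'Cross-validation estimates generalization error.',
]

def tokenize(text: str) -> list:
    return re.findall(r'[a-z]+', text.lower())

vocab = sorted({w for d in docs for w in tokenize(d)})

def embed(text: str) -> np.ndarray:
    """Bag-of-words counts — a stand-in for a learned sentence embedding."""
    words = tokenize(text)
    return np.array([words.count(w) for w in vocab], dtype=float)

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 1) -> list:
    """Cosine-similarity search — the job a vector store does at scale."""
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) or 1.0))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

# The retrieved chunk gets pasted into the LLM prompt as grounding context
print(retrieve('which model should I use for tabular data?')[0])
```

Swap embed() for a real embedding model and the top-k loop for a vector store query, and this is the retrieval half of every RAG system.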
📊 Production Insight
Deep learning is not always the best choice — gradient boosting wins on tabular data and is cheaper to maintain.
Neural networks require more data, more compute, more tuning, and more monitoring.
In 2026, the most practical deep learning skill is knowing how to use a pretrained model — not how to design one from scratch.
LLM API costs are real: temperature, token limits, and caching strategy affect production budgets. Learn to measure and control them.
🎯 Key Takeaway
Deep learning dominates vision and NLP — not tabular data.
Learn PyTorch first — go deep on one framework before touching another.
In 2026, LLM API fluency is a baseline expectation — add it to this month, not as an afterthought.

Month 6: MLOps, Portfolio Projects, and Interview Prep

Month 6 converts knowledge into job-readiness. Build 2 to 3 portfolio projects that demonstrate end-to-end ML skills: data collection, preprocessing, model training, evaluation, deployment, and monitoring. Learn the MLOps layer that separates junior candidates from mid-level candidates: experiment tracking with MLflow, containerized deployment with Docker, API serving with FastAPI, and basic CI/CD with GitHub Actions. In 2026, add responsible AI considerations to at least one project — document your bias evaluation, data provenance, and model limitations. Hiring managers at larger companies are increasingly reviewing this as part of technical screening. The portfolio is what gets you the interview. The depth of your understanding is what gets you the offer.

month6_portfolio_project.py · PYTHON
# TheCodeForge — Month 6: Production-Ready Portfolio Project
# Deploy a model as a REST API with FastAPI, versioning, and health monitoring

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field, field_validator
import joblib
import numpy as np
import logging
import time
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title='Churn Prediction API',
    description='Predicts customer churn probability using gradient boosting model',
    version='1.0.0'
)

# Load model and scaler at startup — degrade gracefully if artifacts are
# missing, and let the /health endpoint report the degraded state
try:
    model = joblib.load('churn_model_v1.pkl')
    scaler = joblib.load('feature_scaler_v1.pkl')
    logger.info('Model and scaler loaded successfully')
except FileNotFoundError as e:
    logger.error(f'Failed to load model artifacts: {e}')
    model = None
    scaler = None

class CustomerFeatures(BaseModel):
    tenure_months: int = Field(..., ge=0, le=120, description='Customer tenure in months')
    monthly_charges: float = Field(..., ge=0, le=500, description='Monthly bill amount in USD')
    total_charges: float = Field(..., ge=0, description='Total charges to date in USD')
    support_tickets: int = Field(..., ge=0, description='Number of support tickets opened')
    contract_type: int = Field(..., ge=0, le=2, description='0=month-to-month, 1=one-year, 2=two-year')

    @field_validator('total_charges')
    @classmethod
    def total_must_exceed_monthly(cls, v, info):
        # Pydantic v2 style: previously validated fields live in info.data
        if 'monthly_charges' in info.data and v < info.data['monthly_charges']:
            raise ValueError('total_charges must be >= monthly_charges')
        return v

class PredictionResponse(BaseModel):
    churn_probability: float
    churn_prediction: bool
    risk_tier: str
    model_version: str
    prediction_timestamp: str

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool
    uptime_seconds: float

STARTUP_TIME = time.time()

@app.post('/predict', response_model=PredictionResponse)
def predict(features: CustomerFeatures):
    if model is None or scaler is None:
        raise HTTPException(status_code=503, detail='Model not available — check deployment logs')

    input_array = np.array([[
        features.tenure_months,
        features.monthly_charges,
        features.total_charges,
        features.support_tickets,
        features.contract_type
    ]])

    scaled_input = scaler.transform(input_array)
    probability = float(model.predict_proba(scaled_input)[0][1])

    if probability >= 0.7:
        risk_tier = 'high'
    elif probability >= 0.4:
        risk_tier = 'medium'
    else:
        risk_tier = 'low'

    logger.info(f'Prediction: prob={probability:.4f}, tier={risk_tier}')

    return PredictionResponse(
        churn_probability=round(probability, 4),
        churn_prediction=probability >= 0.5,
        risk_tier=risk_tier,
        model_version='v1.0.0',
        prediction_timestamp=datetime.utcnow().isoformat()
    )

@app.get('/health', response_model=HealthResponse)
def health():
    return HealthResponse(
        status='healthy' if model is not None else 'degraded',
        model_loaded=model is not None,
        uptime_seconds=round(time.time() - STARTUP_TIME, 2)
    )
▶ Output
# Run with: uvicorn month6_portfolio_project:app --reload --port 8000
# Interactive docs: http://localhost:8000/docs
# POST /predict with JSON body returns churn probability, prediction, and risk tier
# GET /health returns model status and uptime
# Containerize with: docker build -t churn-api . && docker run -p 8000:8000 churn-api
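The API expects churn_model_v1.pkl and feature_scaler_v1.pkl on disk, but the training-side script that produces them is not shown. A hedged sketch — the synthetic data mirrors the Month 3-4 example, and the default GradientBoostingClassifier settings are an illustrative choice, not a tuned recommendation:

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn training data used earlier in the roadmap
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(1, 72, 500),    # tenure_months
    rng.uniform(20, 100, 500),   # monthly_charges
    rng.uniform(100, 5000, 500), # total_charges
    rng.poisson(2, 500),         # support_tickets
    rng.integers(0, 3, 500),     # contract_type
])
y = ((X[:, 0] < 12) & (X[:, 1] > 60)).astype(int)

# Fit, then persist both artifacts under the filenames the API loads at startup
scaler = StandardScaler().fit(X)
model = GradientBoostingClassifier(random_state=42).fit(scaler.transform(X), y)
joblib.dump(model, 'churn_model_v1.pkl')
joblib.dump(scaler, 'feature_scaler_v1.pkl')

# Sanity check: reload and score one high-risk customer (short tenure, high bill)
reloaded = joblib.load('churn_model_v1.pkl')
sample = scaler.transform([[3, 85.0, 250.0, 4, 0]])
print(f'churn probability: {reloaded.predict_proba(sample)[0][1]:.2f}')
```

Versioning the filenames (v1) alongside the API's model_version field keeps the artifact and the endpoint response in sync — a small habit interviewers notice.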
⚠ Portfolio Project Requirements for 2026
📊 Production Insight
Hiring managers spend 30 seconds on each resume — your project links must be clickable, live, and load fast.
A deployed API with validation, logging, and a health endpoint demonstrates engineering judgment that notebooks cannot.
README quality signals communication skills — something every engineering team values as much as code quality.
In 2026, a project that uses an LLM API as one component — not the entire project — demonstrates proportionate judgment about when to use which tool.
🎯 Key Takeaway
Portfolio projects are your resume — deploy them, document them, version them, and make them load.
One deployed project with proper engineering beats ten completed courses.
MLOps skills — Docker, FastAPI, MLflow — differentiate junior from mid-level candidates at every company.
Portfolio Project Selection by Target Role
If: Targeting computer vision roles
Use: Build image classification with a pretrained EfficientNet via transfer learning — deploy as a FastAPI endpoint with image upload support
If: Targeting NLP or LLM-adjacent roles
Use: Build a RAG pipeline over a document corpus using LangChain, a vector store (ChromaDB or Pinecone), and an OpenAI API backend
If: Targeting general ML engineer roles
Use: Build a tabular prediction project with gradient boosting, deploy with FastAPI and Docker, track experiments with MLflow
If: Targeting MLOps or platform roles
Use: Build an end-to-end pipeline with MLflow experiment tracking, Docker containerization, GitHub Actions CI/CD, and a simple data drift monitor using evidently
🗂 6-Month ML Roadmap Overview
Monthly goals, resources, and deliverables at a glance
Month | Focus Area | Key Skills | Deliverable | Free Resources
Month 1 | Python and Data | Python, numpy, pandas, matplotlib, SQL basics | Data analysis notebook on a real dataset with EDA and cleaning pipeline | Python.org tutorial, pandas docs, SQLZoo
Month 2 | Math Foundations | Linear algebra, calculus intuition, probability, statistics | Math intuition notes with numpy code examples for each concept | 3Blue1Brown, Khan Academy, StatQuest
Month 3 | Supervised Learning | Linear regression, logistic regression, decision trees, sklearn Pipelines | Classification project with cross-validation, confusion matrix, and F1 evaluation | scikit-learn docs, Andrew Ng ML Specialization (Coursera audit)
Month 4 | Advanced Algorithms | Random forest, gradient boosting, XGBoost, LightGBM, hyperparameter tuning | Kaggle competition submission with documented methodology | Kaggle Learn, fast.ai Practical ML
Month 5 | Deep Learning and LLMs | PyTorch, CNNs, Transformers, LLM API basics, RAG pattern | Image or NLP project with neural network plus one LLM API integration | PyTorch tutorials, fast.ai, OpenAI cookbook
Month 6 | MLOps and Portfolio | FastAPI, Docker, MLflow, GitHub Actions, responsible AI basics | 3 deployed portfolio projects on GitHub with READMEs and live endpoints | MLOps Zoomcamp, Full Stack Deep Learning, evidently docs

🎯 Key Takeaways

  • Follow a structured 6-month roadmap — course-hopping without projects wastes months and produces fragile knowledge
  • Master 4 core algorithms deeply rather than surveying 20 algorithms superficially
  • Deploy portfolio projects — one deployed API with documentation beats ten completed courses on a resume
  • Model evaluation skills are as important as model training skills — interviewers test both equally
  • In 2026, add LLM API fluency to month 5 — the ability to call, prompt, and integrate language models is a baseline expectation at most product companies
  • 2 hours daily for 6 months equals 360 hours — sufficient for junior ML roles if those hours produce shipped projects

⚠ Common Mistakes to Avoid

    Course-hopping without building projects
    Symptom

    Completed 12 courses but cannot build a model independently. Knowledge feels broad but shallow. Cannot explain concepts without referencing slide decks. Freezes during take-home assignments.

    Fix

    Limit yourself to one primary course at a time. After each module, close the tutorial and build something using the concepts on a different dataset. Depth beats breadth at this stage — interviewers want to see what you can do, not what you have watched.

    Skipping math foundations to jump to algorithms
    Symptom

    Can call sklearn functions but cannot explain what gradient descent does, why regularization prevents overfitting, or how a loss function is minimized. Struggles to answer 'what is actually happening when you call model.fit()'.

    Fix

    Spend month 2 on math intuition. You do not need to prove theorems — you need to understand what algorithms are doing under the hood. Engineers who understand the math debug models faster and design better experiments.
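A minimal numpy sketch of that intuition: this is gradient descent on a one-feature linear regression. It is not scikit-learn's actual solver, but it is a fair mental model of what happens inside many model.fit() calls.

```python
import numpy as np

# Gradient descent on y = w*x + b, minimizing mean squared error.
# Synthetic data with known true parameters w=3.0, b=1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0   # start from arbitrary parameters
lr = 0.1          # learning rate: the step size downhill

for _ in range(500):
    pred = w * X + b
    err = pred - y
    grad_w = 2 * np.mean(err * X)  # d(MSE)/dw
    grad_b = 2 * np.mean(err)      # d(MSE)/db
    w -= lr * grad_w               # step against the gradient
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # converges near w=3.0, b=1.0
```

If you can explain every line of this loop, the "what does model.fit() actually do" interview question stops being scary.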

    Building only Jupyter notebooks, never deploying models
    Symptom

    Portfolio contains 10 notebooks but no deployed APIs, no Docker containers, no production code. Hiring managers cannot assess engineering skills from a static notebook — they need to see you can ship.

    Fix

    Deploy at least one project as a REST API with FastAPI. Containerize it with Docker. Write a README that explains how to run it. This separates data scientists from ML engineers in the hiring funnel.

    Learning too many algorithms without mastering any
    Symptom

    Can list 20 algorithm names but cannot tune hyperparameters for any of them, cannot explain the tradeoffs between them, and cannot defend a model choice to a stakeholder.

    Fix

    Master 4 algorithms deeply: linear regression, logistic regression, random forest, and gradient boosting. These cover 80% of production use cases. Add XGBoost once you can explain why you would choose it over random forest.
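One way to build that depth is to compare the algorithms head-to-head on the same dataset with cross-validation. Linear regression is the regression counterpart, so this sketch compares the three classifiers; the dataset is synthetic and the scores only illustrate the workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a tabular problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    # 5-fold cross-validation gives a fairer comparison than one split.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Being able to explain why the scores differ on a given dataset is the mastery interviewers probe for, not the score table itself.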

    Ignoring model evaluation before deployment
    Symptom

    Model shows 95% accuracy in a notebook but degrades immediately in production. No cross-validation, no confusion matrix analysis, no understanding of class imbalance or data leakage.

    Fix

    Every model must have cross-validation scores, a confusion matrix, and precision/recall/F1 evaluation before any deployment discussion. If the dataset is imbalanced, accuracy is not a valid primary metric — full stop.
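A sketch of that minimum bar on a deliberately imbalanced synthetic dataset (roughly 95/5): cross-validated F1, a confusion matrix, and a full precision/recall/F1 report. The dataset and model here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# weights=[0.95] makes roughly 95% of samples the majority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000, class_weight="balanced")

# 1. Cross-validated F1 on the training portion, never plain accuracy here.
print("CV F1:", cross_val_score(model, X_tr, y_tr, cv=5, scoring="f1").mean())

# 2. Confusion matrix and per-class precision/recall/F1 on held-out data.
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred, digits=3))
```

Note that a model predicting only the majority class would score about 95% accuracy here while its minority-class recall row in the report reads 0.0, which is exactly why accuracy cannot be the primary metric.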

    Ignoring the LLM layer entirely because it feels separate from 'real ML'
    Symptom

    Strong traditional ML skills but zero exposure to LLM APIs, embeddings, or RAG patterns. Fails to answer basic questions about generative AI during interviews at companies building AI-powered products — which is most companies in 2026.

    Fix

    Spend one to two weeks in month 5 calling an LLM API, building a simple embedding-based search, and understanding retrieval-augmented generation. You do not need to train a language model — you need to know how to use one appropriately.
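The retrieval half of RAG can be sketched offline in a few lines. TF-IDF stands in for a learned embedding model here so nothing external is needed; swapping in real embeddings (an API or sentence-transformers) changes only the vectorizer. The documents and query are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy "knowledge base" of three documents.
docs = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
    "Passwords must be at least 12 characters.",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_vectors = vectorizer.transform(docs)

def retrieve(question, k=1):
    # Rank documents by cosine similarity to the question vector.
    q_vec = vectorizer.transform([question])
    sims = cosine_similarity(q_vec, doc_vectors)[0]
    ranked = sims.argsort()[::-1][:k]
    return [docs[i] for i in ranked]

print(retrieve("how long do refunds take?"))
```

In a full RAG pipeline, the retrieved passage is pasted into the LLM prompt as context, and the model answers from it rather than from memory. Understanding that flow is the baseline expectation, not training the model yourself.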

Interview Questions on This Topic

  • Q (Mid-level): Walk me through how you would approach a new ML problem from scratch.
    First, understand the business problem and define success metrics — not just accuracy, but business-relevant metrics like cost reduction, false negative rate, or revenue impact. Second, explore and clean the data — check distributions, missing value patterns, class balance, and potential data leakage sources. Third, establish a baseline model — even a simple logistic regression or a rule-based heuristic — to measure meaningful improvement against. Fourth, iterate on feature engineering and model selection using cross-validation for fair comparison. Fifth, evaluate on a held-out test set with the metrics defined in step one. Sixth, deploy with monitoring for data drift and performance degradation. The key insight interviewers want to hear: problem definition and data quality determine success more than algorithm selection. Picking the fanciest model for bad data does not work.
  • Q (Mid-level): Explain the bias-variance tradeoff with a concrete example.
    Bias is error from a model being too simple — it underfits. A linear regression trying to model a clearly curved relationship has high bias because it cannot capture the shape of the data. Variance is error from a model being too complex — it overfits. A decision tree with no depth limit memorizes training noise and performs poorly on unseen data. The tradeoff: reducing bias typically increases variance, and vice versa. Pruning the tree or using a random forest — which averages many high-variance trees — reduces variance while keeping bias low. The production implication: high-variance models perform well in offline testing but degrade unpredictably in production as the data distribution shifts.
  • Q (Senior): How would you handle a dataset with 95% class imbalance for fraud detection?
    First, never use accuracy as the primary metric — a model that predicts every transaction as non-fraud achieves 95% accuracy and catches zero fraud. Use precision, recall, F1, and AUC-PR as primary metrics. Second, try class weight adjustment before resampling — set class_weight='balanced' in scikit-learn, which adjusts the loss function without creating synthetic data. Third, consider SMOTE for oversampling the minority class, but validate carefully — SMOTE can generate unrealistic synthetic samples that inflate offline metrics without improving production performance. Fourth, use stratified cross-validation to ensure each fold contains representative fraud cases. Fifth, adjust the classification threshold based on business cost ratios — missing fraud is almost always more expensive than a false alarm, so shifting the threshold below 0.5 typically makes business sense.
  • Q (Junior): What is the difference between training, validation, and test sets?
    The training set is used to fit model parameters — gradient descent updates weights on this data. The validation set is used during development to compare models and tune hyperparameters — it prevents overfitting to the training set. The test set is held out completely until the final evaluation — it simulates unseen production data and must never influence any decision during development. The critical rule: using the test set for any model selection decision contaminates it and produces optimistically biased performance estimates — a form of data leakage. In practice, use cross-validation on the training set for model selection and hyperparameter tuning, then evaluate exactly once on the test set to report final performance.
  • Q (Mid-level): When would you choose a gradient boosting model over a neural network?
    For structured tabular data, gradient boosting — XGBoost, LightGBM, CatBoost — is almost always the right default in 2026. It trains faster, requires less data, is more interpretable with SHAP values, handles missing values natively in some implementations, and routinely outperforms neural networks on tabular benchmarks. Neural networks win on unstructured data — images, text, audio — where the hierarchical feature learning of deep architectures captures structure that handcrafted features cannot. The decision rule I apply in practice: start with gradient boosting for any tabular problem; reach for a neural network when the data is unstructured, when you have millions of samples, or when the architecture has a strong inductive bias for the domain — like a CNN for images or a Transformer for sequences.
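To make the fraud-detection answer above concrete, here is a small sketch of the threshold shift on a synthetic 95/5 dataset. The data, model, and threshold values are illustrative; the point is that lowering the threshold trades false alarms for higher fraud recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 "fraud" dataset: class 1 is the rare positive.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # estimated fraud probability

for threshold in [0.5, 0.2]:
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, pred):.2f} "
          f"recall={recall_score(y_te, pred):.2f}")
```

Recall can only rise as the threshold drops, usually at the cost of precision. Where to stop depends on the business cost ratio between a missed fraud and a false alarm.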

Frequently Asked Questions

How many hours per day do I need to follow this roadmap?

The roadmap is designed for 2 hours per day, 6 days per week — roughly 300 focused hours over 6 months, or about 360 if you study every day. If you can dedicate 4 hours per day, you can compress the timeline to 3 months. Consistency matters more than volume — 2 focused hours daily beats 10 distracted hours on weekends. Protect your daily sessions from interruption and treat them as non-negotiable.

Do I need a math degree to follow this roadmap?

No. You need high school level math — algebra and basic statistics — and the willingness to build intuition for linear algebra and calculus. Month 2 covers math foundations using visual resources like 3Blue1Brown and StatQuest. You do not need to prove theorems. You need to understand what algorithms are doing under the hood well enough to debug them when they behave unexpectedly and explain them clearly in interviews.

Should I learn PyTorch or TensorFlow in 2026?

PyTorch. It is now the dominant framework in both research and production, with the most active community, the best debugging experience, and the most tutorials. TensorFlow still appears in legacy codebases and has strong mobile deployment tooling through TFLite. If you join a team running TensorFlow, you can transfer PyTorch knowledge in a week. The reverse is also true. Pick one, go deep, and do not split your attention.

Should I learn traditional ML or focus on LLMs?

Learn both layers — they are not competing. Traditional ML with scikit-learn and gradient boosting is the foundation: it powers fraud detection, pricing, recommendation systems, and every structured data problem at scale. LLMs are the interface and capability layer: they power conversational features, document processing, code generation, and content generation. In 2026, the most hireable candidates understand both. The engineers who only know LLM APIs cannot build the data pipelines behind them. The engineers who only know traditional ML are increasingly asked to integrate LLM components and struggle.

What projects should I put in my portfolio?

Build 3 projects with a clear progression. First, a tabular data project — classification or regression with gradient boosting, deployed as a FastAPI endpoint with input validation and a README. Second, a deep learning or NLP project — image classification with a pretrained CNN, or text classification with a Transformer. Third, an MLOps or LLM project — an end-to-end pipeline with experiment tracking and a CI/CD workflow, or a RAG application that retrieves from a document corpus and answers questions. Quality over quantity: one well-documented, deployed, reproducible project beats five abandoned notebooks.

How do I stay motivated for 6 months of self-study?

Join a study group or Kaggle community — external accountability is more reliable than internal motivation after month 2. Set weekly milestones and track progress publicly, even just in a simple learning log. Build projects on domains you find personally interesting. Take one full day off per week. Remember that 6 months of consistent, structured study puts you ahead of the majority of people who start learning ML and quit by month 2 when the concepts get harder.

🔥 Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
