
Image Classification with TensorFlow and Keras — From Pixels to Predictions

📍 Part of: TensorFlow & Keras → Topic 6 of 10
Learn to build a Convolutional Neural Network (CNN) for image classification.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • CNNs are superior to standard Dense networks for images because they preserve spatial structure and use fewer parameters.
  • Data normalization (0 to 1 range) is non-negotiable for stable and efficient training.
  • The 'Flatten' layer acts as the critical bridge between spatial feature maps and the final logical classification decision.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • CNNs use Conv2D filters to detect spatial patterns — edges, textures, shapes — preserving pixel locality that Dense layers destroy
  • MaxPooling reduces spatial dimensions, making the model translation-invariant and computationally lighter
  • Always normalize pixel values to [0, 1] before training — raw 0–255 values cause gradient explosion
  • Final layer activation: softmax for multi-class, sigmoid for binary — wrong choice produces nonsensical probabilities
  • Overfitting signal: training accuracy 99%, validation accuracy 60% — add Dropout and data augmentation
  • Biggest mistake: wrong input shape to Conv2D — (32, 32) instead of (32, 32, 3) crashes immediately
Production Incident: Validation Accuracy 70%, Production Accuracy 41% — A Preprocessing Mismatch
A CIFAR-10 CNN hit 70% validation accuracy in training but dropped to 41% in the production REST endpoint. The model was not broken — the preprocessing was.
Symptom: Every online prediction returned high-confidence wrong answers. Confidence scores were in the 0.8–0.95 range, but classifications were consistently incorrect.
Assumption: The team assumed that since the model output probabilities confidently, the preprocessing must be fine. High confidence was interpreted as correctness.
Root cause: Training data was divided by 255.0 (normalization to [0, 1]). The production endpoint received JPEG bytes, decoded them with PIL, and passed raw uint8 arrays (range 0–255) directly to the model. The model's first Conv2D layer received inputs 255x larger than anything it had seen during training, producing huge activations far outside the distribution the downstream layers were trained on.
Fix: Add explicit normalization at the model level — tf.keras.layers.Rescaling(1.0/255) as the first layer. This bakes preprocessing into the SavedModel, making it impossible to skip at inference. Validate inference inputs with tf.debugging.assert_less_equal(input_tensor, tf.ones_like(input_tensor)).
Key Lesson
  • Never rely on external preprocessing code matching training preprocessing — they will diverge.
  • Bake normalization into the Keras model as a Rescaling layer so it is part of the saved artifact.
  • High model confidence does not imply correct predictions — always validate against a labeled holdout set in production.
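The input-range check described in the fix can also live in the serving code itself. Here is a minimal plain-NumPy sketch of such a guard; check_inference_input is a hypothetical helper name, not part of any framework:

```python
import numpy as np

def check_inference_input(batch):
    """Reject raw uint8 batches before they reach a model trained on [0, 1] inputs."""
    batch = np.asarray(batch, dtype=np.float32)
    if batch.max() > 1.0:
        raise ValueError(
            f"Input max is {batch.max():.1f}; expected normalized [0, 1] pixels. "
            "Did the caller skip the /255 step?"
        )
    return batch

# A correctly normalized batch passes through unchanged
ok = check_inference_input(np.full((1, 32, 32, 3), 0.5, dtype=np.float32))
print(ok.shape)  # (1, 32, 32, 3)

# A raw uint8 batch is caught immediately instead of producing confident nonsense
try:
    check_inference_input(np.full((1, 32, 32, 3), 128, dtype=np.uint8))
except ValueError as e:
    print("caught:", e)
```

The Rescaling layer remains the primary defense; a guard like this just turns a silent accuracy collapse into a loud error.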
Production Debug Guide: Diagnosing the most common failures when deploying image classifiers
Model accuracy is near random (10% for 10-class CIFAR-10)
Check class balance in your training data. Verify that labels are correctly aligned with images — a shuffled dataset without re-pairing labels/images causes exactly this. Print a sample batch: for x, y in train_ds.take(1): print(x.shape, y)
Training loss decreases but validation loss immediately diverges
Classic overfitting. Add Dropout(0.3–0.5) after Dense layers. Add data augmentation: tf.keras.layers.RandomFlip(), RandomRotation(0.1). Reduce model capacity (fewer filters) or reduce epochs.
Conv2D layer crashes with ValueError on input shape
Verify your input has 3 dimensions (height, width, channels). Grayscale PIL images convert to (H, W) arrays with no channel axis. Fix: np.expand_dims(img, axis=-1) for grayscale, or ensure RGB conversion: img = img.convert('RGB').
GPU memory OOM during training on large images
Reduce batch size, reduce image resolution with tf.image.resize(), or use mixed precision: tf.keras.mixed_precision.set_global_policy('mixed_float16'). This halves VRAM usage with negligible accuracy impact.

Image classification is the 'Hello World' of Computer Vision. While a standard neural network sees an image as just a flat list of numbers, TensorFlow uses Convolutional Neural Networks (CNNs) to maintain the spatial relationship between pixels. This allows the model to 'see' patterns like ears on a cat or wheels on a bus regardless of where they appear in the photo.

In this guide, we will build a CNN using the Keras Sequential API, explain the 'magic' behind convolution layers, and train a model to recognize objects from the CIFAR-10 dataset. At TheCodeForge, we emphasize that a robust model isn't just about the code—it's about how you manage the data and the environment it lives in.

1. The Architecture of a CNN

A typical image classifier consists of three main parts: Convolutional layers (feature extractors), Pooling layers (data compressors), and Dense layers (the final decision makers). Each Convolutional layer applies a set of learnable filters to the input image. These filters slide across the image to create 'feature maps' that highlight specific visual patterns.

cnn_structure.py · PYTHON
from tensorflow.keras import layers, models

# io.thecodeforge: Standard CNN Architecture for CIFAR-10
def build_forge_cnn():
    model = models.Sequential([
        # Bake normalization into the model — never skip at inference
        layers.Rescaling(1.0/255, input_shape=(32, 32, 3)),

        # First Layer: 32 filters, 3x3 size, ReLU activation
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),

        # Second Layer: Extracting more complex features
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),

        # Third Layer: Deeper feature extraction
        layers.Conv2D(64, (3, 3), activation='relu'),

        # Flattening the 2D maps into a 1D vector for the final classifier
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(10, activation='softmax') # 10 output classes for CIFAR-10
    ])
    return model

model = build_forge_cnn()
model.summary()
▶ Output
Model: "sequential" | Total params: 122,570 | Trainable params: 122,570
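The 122,570 figure from model.summary() can be reproduced by hand: each Conv2D layer holds filters * (kernel_h * kernel_w * in_channels + 1) weights, where the +1 is the bias per filter.

```python
# Parameter count for the CNN above, computed by hand.
conv1 = 32 * (3 * 3 * 3 + 1)    # RGB input, 32 filters -> 896
conv2 = 64 * (3 * 3 * 32 + 1)   # 18,496
conv3 = 64 * (3 * 3 * 64 + 1)   # 36,928

# Spatial size with 'valid' padding: 32 -> 30 (conv) -> 15 (pool)
# -> 13 (conv) -> 6 (pool) -> 4 (conv), so Flatten emits 4*4*64 values
flat = 4 * 4 * 64               # 1,024 features
dense1 = flat * 64 + 64         # 65,600
dense2 = 64 * 10 + 10           # 650

total = conv1 + conv2 + conv3 + dense1 + dense2
print(total)  # 122570
```

Note that Rescaling, MaxPooling2D, Flatten, and Dropout contribute zero parameters; all the weight lives in the Conv2D and Dense layers.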
💡 Why Pooling?
MaxPooling reduces the dimensions of the image. This makes the model 'translation invariant,' meaning it can recognize a cat whether it's in the top-left or bottom-right corner. It also significantly reduces the computational load for the following layers.
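To see the dimension reduction concretely, here is a minimal pure-NumPy sketch of 2x2 max pooling with stride 2. This is not how TensorFlow implements MaxPooling2D internally, just the same arithmetic on a toy feature map:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Naive 2x2 max pooling with stride 2 on a (H, W) feature map."""
    h, w = fmap.shape
    # Trim odd edges, group into 2x2 blocks, take the max of each block
    return fmap[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
pooled = max_pool_2x2(fmap)
print(pooled.shape)  # (2, 2) -- a quarter of the original 4x4 values
print(pooled)        # [[ 5.  7.] [13. 15.]]
```

Each output value only records that the strongest response occurred *somewhere* in its 2x2 window, which is exactly where the translation tolerance comes from.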
📊 Production Insight
Baking Rescaling(1.0/255) into the model is the most important production discipline for image models.
Externalized preprocessing inevitably drifts between training and serving — the Rescaling layer eliminates the class of bugs entirely.
For reference implementations, see the transfer-learning-with-tensorflow guide where this pattern is applied with MobileNetV2.
🎯 Key Takeaway
Conv2D + MaxPooling builds hierarchical feature detectors — early layers detect edges, deep layers detect objects.
Bake normalization into the model itself — external preprocessing is a liability.
Dropout after Dense layers is non-negotiable for CIFAR-10 scale datasets.

2. Data Preprocessing & Training

Computers struggle with large raw numbers. Image pixels range from 0 to 255; scaling them to a range of 0 to 1 helps the model converge (learn) much faster. Without this step, your weights might become unstable early in the training process.
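The scaling itself is a single division, shown here in NumPy; the Rescaling layer inside the model performs the same operation:

```python
import numpy as np

raw = np.array([0, 128, 255], dtype=np.uint8)   # raw pixel intensities
scaled = raw.astype(np.float32) / 255.0          # normalize to [0, 1]

print(scaled.min(), scaled.max())  # 0.0 1.0
```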

train_model.py · PYTHON
import tensorflow as tf
from tensorflow.keras.datasets import cifar10

# io.thecodeforge: Scalable Data Loading and Training
# Load raw data — Rescaling layer handles normalization inside the model
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

# Build tf.data pipeline with augmentation for training set
train_ds = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
train_ds = (
    train_ds
    .shuffle(buffer_size=10000)
    # Augment before batching so every image gets an independent random flip
    .map(
        lambda x, y: (tf.image.random_flip_left_right(tf.cast(x, tf.float32)), y),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

test_ds = (
    tf.data.Dataset.from_tensor_slices((test_images, test_labels))
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# Compile with Adam and sparse labels (integer class indices)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Early stopping prevents wasted compute on overfit models
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(train_ds, epochs=50, validation_data=test_ds, callbacks=[early_stop])
▶ Output
Epoch 28/50: loss: 0.68 - accuracy: 0.76 - val_loss: 0.82 - val_accuracy: 0.72
📊 Production Insight
tf.data with .prefetch(AUTOTUNE) overlaps preprocessing and GPU computation — this alone gives 2x–3x throughput on large datasets.
EarlyStopping with restore_best_weights=True is mandatory in production pipelines — saves the best checkpoint, not the last one.
Data augmentation (random flips, rotations) during training, never during inference — the test pipeline must be deterministic.
🎯 Key Takeaway
tf.data.Dataset is not optional for production — loading NumPy batches manually bottlenecks the GPU.
prefetch(AUTOTUNE) + EarlyStopping is the minimum viable training pipeline.
Augment training data only; test data must be clean and deterministic.

3. Deployment and Persistence

In a professional environment, once your model achieves acceptable accuracy, you must persist it. We use SQL to track model versions and Docker to ensure the inference environment is consistent across all production clusters.

io/thecodeforge/db/model_registry.sql · SQL
-- io.thecodeforge: Registering trained CNN artifacts
INSERT INTO io.thecodeforge.model_registry (
    model_uid,
    architecture_type,
    val_accuracy,
    artifact_path,
    training_date
) VALUES (
    'cnn_cifar10_v1_2',
    'Sequential-CNN',
    0.7042,
    's3://forge-ml-artifacts/models/cnn_v1_2.h5',
    CURRENT_TIMESTAMP
);
📊 Production Insight
Store the data_augmentation_config alongside the model artifact — if you cannot reproduce training exactly, you cannot debug production regressions.
For full serialization patterns, the tensorflow-save-load-model guide covers SavedModel format (preferred over H5) for cross-platform loading including Java backends.
🎯 Key Takeaway
The model artifact without its training config is an archaeological mystery after six months.
Store val_accuracy, data_hash, and augmentation config — not just the weights path.
H5 is convenient; SavedModel is the production standard.

4. Packaging for Production

To serve this model at scale, we containerize the prediction engine. This Docker setup includes the necessary libraries to handle high-concurrency image inference requests.

Dockerfile · DOCKERFILE
# io.thecodeforge: Standardized CNN Inference Container
FROM tensorflow/tensorflow:2.14.0-gpu

WORKDIR /app

# Copy requirements and trained model
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY trained_cnn_v1.h5 /app/model.h5
COPY serve.py /app/serve.py

EXPOSE 8080
CMD ["python", "serve.py"]
▶ Output
Successfully built image thecodeforge/cnn-inference:latest
📊 Production Insight
GPU TF images are 2–4 GB — use multi-stage builds to separate training and inference environments.
For inference-only deployments, the CPU TF image (tensorflow:2.14.0) is sufficient for most latency budgets and is 4x smaller.
For containerization best practices in the ML context, see docker-ml-models.
🎯 Key Takeaway
Use CPU-only TF image for inference if p99 latency target is above 100ms — 4x smaller image, same accuracy.
Multi-stage Docker builds keep your inference image lean.
Pin the exact model artifact path — never load 'the latest model' without a version reference.
🗂 CNN Layer Types Explained
What each layer does and when to reach for it
Layer Type | Purpose | Analogy
Conv2D | Feature Extraction | Looking through a magnifying glass for edges.
MaxPooling | Downsampling | Squinting to see the main shape while ignoring noise.
Flatten | Data Prep | Unrolling a 2D map into a single line of data.
Dense | Classification | The final 'brain' making a logical guess based on features.
Dropout | Regularization | Testing a student by randomly hiding parts of the textbook.

🎯 Key Takeaways

  • CNNs are superior to standard Dense networks for images because they preserve spatial structure and use fewer parameters.
  • Data normalization (0 to 1 range) is non-negotiable for stable and efficient training.
  • The 'Flatten' layer acts as the critical bridge between spatial feature maps and the final logical classification decision.
  • Keras makes it easy to experiment with different architectures, but production deployment requires SQL tracking and Docker containerization.
  • Always monitor validation loss to detect overfitting early in the training lifecycle.

⚠ Common Mistakes to Avoid

    Not normalizing pixel values before training
    Symptom

    Training loss immediately explodes to NaN or oscillates between very large values in the first few epochs — the gradients overflow

    Fix

    Either divide raw images by 255.0 in preprocessing, or add tf.keras.layers.Rescaling(1.0/255) as the first layer in the model to bake normalization in permanently.

    Using the wrong activation on the output layer
    Symptom

    For multi-class: loss decreases but accuracy never exceeds 1/num_classes. For binary: loss plateaus while predictions hover around 0.5 and never commit to either class. Probabilities do not sum to 1.

    Fix

    Multi-class classification (10 CIFAR-10 classes): use softmax. Binary classification (cat vs. dog): use sigmoid with binary_crossentropy. Never mix these — wrong activation produces nonsensical probability distributions.
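The difference is easy to verify numerically. A plain-NumPy sketch of both activations applied to the same logits:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])

# Softmax: one distribution over mutually exclusive classes, always sums to 1
softmax = np.exp(logits) / np.exp(logits).sum()

# Sigmoid: an independent probability per logit, no constraint on the sum
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.sum())   # 1.0
print(sigmoid.sum())   # ~2.14 -- not a valid distribution over three classes
```

Feed sigmoid outputs to a pipeline expecting a class distribution and the "probabilities" will silently sum to more than 1.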

    Training accuracy 99%, validation accuracy 60% — classic overfitting
    Symptom

    The model memorizes training samples instead of learning generalizable patterns. Performance on any unseen data is near-random.

    Fix

    Add Dropout(0.3–0.5) after Dense layers. Add data augmentation layers (RandomFlip, RandomRotation) at the start of the model. Reduce model capacity if the problem warrants it. Use EarlyStopping(patience=5, restore_best_weights=True).

    Passing wrong input shape to Conv2D
    Symptom

    ValueError: Input 0 of layer conv2d is incompatible with the layer: expected ndim=4, found ndim=3 — crashes on the first forward pass

    Fix

    Color images must have shape (batch, H, W, 3). Grayscale must be (batch, H, W, 1) — not (batch, H, W). Use np.expand_dims(img, axis=-1) or tf.expand_dims(img, axis=-1) to add the channel dimension.
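The shape repair is two calls to expand_dims, sketched here with a dummy grayscale array:

```python
import numpy as np

gray = np.zeros((32, 32), dtype=np.uint8)   # what a grayscale PIL image converts to
img = np.expand_dims(gray, axis=-1)         # add the channel dimension
batch = np.expand_dims(img, axis=0)         # add the batch dimension

print(gray.shape, img.shape, batch.shape)   # (32, 32) (32, 32, 1) (1, 32, 32, 1)
```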

Interview Questions on This Topic

  • Q: What is a 'Kernel' in a Convolutional layer, and how does its size affect feature extraction? (Mid-level)
    A kernel (also called a filter) is a small weight matrix — typically 3x3 or 5x5 — that slides across the input image performing element-wise multiplication and summing the result into a single output value per position. This operation is a discrete convolution. Smaller kernels (3x3) capture fine-grained local patterns like edges and corners with fewer parameters. Larger kernels (5x5, 7x7) capture wider spatial context but require more parameters and computation. Modern architectures (VGG, ResNet) prefer stacking multiple 3x3 Conv layers over single large kernels — two 3x3 layers see a 5x5 receptive field with fewer parameters and more non-linearity.
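The parameter tradeoff in that answer is simple arithmetic, per input/output channel pair and ignoring biases:

```python
# Two stacked 3x3 convolutions cover the same 5x5 receptive field
# as a single 5x5 convolution, with fewer weights and two ReLUs instead of one.
two_3x3 = 2 * 3 * 3   # 18 weights
one_5x5 = 5 * 5       # 25 weights

print(two_3x3, one_5x5)  # 18 25
```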
  • Q: Why do we use Dropout layers during training but disable them during inference? (Junior)
    Dropout randomly sets a fraction of neuron outputs to zero during each forward pass, forcing the network to learn redundant representations and preventing co-adaptation of neurons. This is a regularization technique — it reduces overfitting by preventing any single neuron from becoming indispensable. During inference, we want deterministic, reproducible predictions — randomly dropping neurons would change the prediction every time the same image is fed. Keras handles this automatically: model.fit() sets the training flag to True (Dropout active), model.predict() and model.evaluate() set it to False (Dropout disabled, all neurons active with scaled weights).
  • Q: Explain the difference between 'sparse_categorical_crossentropy' and 'categorical_crossentropy'. In what format should labels be for each? (Junior)
    Both loss functions compute cross-entropy between the predicted probability distribution and the true label, but they expect different label formats. categorical_crossentropy expects one-hot encoded labels: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] for class 2. sparse_categorical_crossentropy expects integer class indices: 2 for class 2. Sparse is more memory-efficient for many classes — a single integer per sample vs. a vector of length num_classes. Standard practice: keep CIFAR-10 labels as integers (0–9) and use sparse_categorical_crossentropy to avoid the to_categorical() conversion step. Both produce mathematically identical gradients.
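The "mathematically identical" claim can be checked by hand for a single sample:

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution
label_int = 2                             # sparse format: integer class index
label_onehot = np.eye(4)[label_int]       # categorical format: one-hot vector

# Sparse cross-entropy indexes directly; categorical multiplies by the one-hot mask
sparse_loss = -np.log(probs[label_int])
categorical_loss = -np.sum(label_onehot * np.log(probs))

print(np.isclose(sparse_loss, categorical_loss))  # True
```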
  • Q: What is 'Global Average Pooling' and how does it differ from a standard Flatten layer in deep CNN architectures? (Senior)
    Flatten converts a feature map of shape (H, W, C) into a 1D vector of length HWC by concatenating all values. For a 7x7x512 map, that is 25,088 parameters feeding into the Dense layer — substantial memory and overfitting risk. Global Average Pooling (GAP) takes the spatial average of each channel: a (7, 7, 512) map becomes a (512,) vector by averaging each 7x7 slice. This is 49x fewer parameters connecting to the Dense layer, dramatically reducing overfitting risk. GAP is standard in transfer learning architectures (MobileNet, ResNet, EfficientNet) and is used in the transfer-learning-with-tensorflow guide.
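The shape arithmetic from that answer, sketched in NumPy on a dummy feature map:

```python
import numpy as np

fmap = np.random.rand(7, 7, 512)    # final feature map for one sample

flat = fmap.reshape(-1)             # Flatten: keep every spatial value
gap = fmap.mean(axis=(0, 1))        # GlobalAveragePooling2D: one average per channel

print(flat.shape)                   # (25088,)
print(gap.shape)                    # (512,)
print(flat.shape[0] // gap.shape[0])  # 49x fewer features entering the Dense head
```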
  • Q: How does a 1x1 Convolution work, and why is it used for dimensionality reduction in networks like Inception? (Senior)
    A 1x1 convolution applies a kernel of size 1x1 across the spatial dimensions, performing a linear transformation only along the channel axis. It does not capture spatial patterns — it mixes information across channels at each pixel independently. The key use: if you have a (H, W, 256) feature map and apply 64 1x1 filters, the output is (H, W, 64) — you have reduced the channel count by 4x with minimal computation. In the Inception architecture, 1x1 convolutions act as 'bottleneck' layers before expensive 3x3 and 5x5 operations, dramatically reducing the computational cost. They are also used in ResNet bottleneck blocks for the same reason.
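Because a 1x1 convolution only mixes channels, it reduces to a matrix multiply along the channel axis, which NumPy broadcasting makes explicit (biases omitted for brevity):

```python
import numpy as np

fmap = np.random.rand(8, 8, 256)   # (H, W, C_in) feature map
kernel = np.random.rand(256, 64)   # 64 filters, each of size 1x1x256

# At every pixel, the 256 input channels are linearly mixed into 64 outputs
out = fmap @ kernel                # matmul broadcasts over the H and W axes

print(out.shape)  # (8, 8, 64) -- channel count reduced 4x, spatial size untouched
```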

Frequently Asked Questions

What is scikit-learn vs TensorFlow for image classification?

While scikit-learn is great for tabular data and simpler algorithms like SVMs, TensorFlow is specifically optimized for deep learning and the complex matrix math required for high-accuracy image classification.

How many convolutional layers should I add?

There is no magic number, but deeper is often better for complex images. However, more layers increase training time and the risk of overfitting. Start small and increase complexity only if the model underperforms. For most practical problems, use transfer learning from MobileNetV2 or EfficientNet instead of designing from scratch — see transfer-learning-with-tensorflow.

Can I use this for real-time video classification?

Yes. A video is just a sequence of images. You can apply the same classification logic to individual frames extracted from a video stream using libraries like OpenCV.

What happens if my images have different sizes?

Neural networks require a fixed input size. You must use a preprocessing step to resize all images to the same dimensions (e.g., 32x32 or 224x224) before feeding them into the model. Use tf.image.resize(image, [height, width]) inside your tf.data pipeline for efficient batch resizing.

When should I use transfer learning instead of training a CNN from scratch?

Almost always — unless you have over 100,000 labeled images and a unique visual domain (medical imaging, satellite data). For standard object recognition tasks, MobileNetV2 or EfficientNetB0 with a custom head will outperform a custom CNN trained from scratch in both accuracy and training time. See transfer-learning-with-tensorflow for the implementation pattern.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
