Beginner 6 min · March 09, 2026

Keras First Network — 12 GPU-Hours Lost to No Normalization

Raw inputs up to 50,000 caused flat loss and 50% accuracy.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Keras Sequential API chains layers linearly — each layer transforms the tensor flowing through it before passing it forward
  • model.compile() maps optimizer, loss, and metrics to the architecture before training begins — skip it and model.fit() throws immediately
  • Input shape must match training data dimensions exactly — shape mismatches cause errors at fit() time, not at model definition time
  • Scaling input data to [0, 1] or standardizing it to zero mean and unit variance is mandatory: large raw values silently stall gradient-based training without raising any error
  • Production models need Docker with pinned image versions for reproducibility — CUDA driver mismatches cause silent accuracy drops that take days to diagnose
  • Biggest mistake: reaching for a neural network when Random Forest or XGBoost would outperform on tabular data with a fraction of the compute cost
  • Keras 3.0 is backend-agnostic — import from keras directly, not tensorflow.keras, if you want to keep the option of switching to PyTorch or JAX later

Keras Sequential API lets you define neural networks as a linear stack of layers — no manual gradient computation, no raw TensorFlow graph management, no hand-written backpropagation. It closes the gap between reading a paper about neural networks and having a model that actually trains.

But 'trains' and 'trains correctly' are different things. In production, a misconfigured model compiles without error and runs model.fit() to completion while learning nothing. Wrong loss function, unscaled inputs, shape mismatches that only surface at inference time — these don't produce exceptions. They produce a model with 50% accuracy on a balanced binary classification task, which looks like random guessing because it is.

I've watched teams spend two weeks tuning architecture hyperparameters on a model that was silently broken at the data preprocessing step. The fix took four lines of code. The diagnosis took twelve engineering days.

This guide covers the architectural decisions, the failure modes that don't announce themselves, the operational infrastructure that separates a prototype notebook from a deployable model, and the honest answer to 'do I actually need a neural network for this problem.' By the end, you'll be able to build, train, debug, and reason about a Keras model in a production context — not just copy-paste one from a tutorial.

What Is the Keras Sequential API and Why Does It Exist?

Before Keras, writing a neural network in Python meant manually implementing matrix multiplications, writing gradient computation by hand, managing weight update loops, and debugging raw numerical operations at every step. Theano and early TensorFlow required you to define computational graphs as static objects before execution — not as Python code you could inspect and modify interactively.

Keras was created to close that abstraction gap. The Sequential API specifically exists to handle the most common neural network pattern: a linear pipeline where data flows in one direction, each layer transforming it before passing it to the next. You describe the architecture declaratively — 'I want a Dense layer with 128 units and ReLU activation, then Dropout, then a 10-unit softmax output' — and Keras manages the graph construction, weight initialization, gradient computation, and training loop.

The important mental model: Sequential is a structural constraint, not just a convenience wrapper. It enforces that your model has exactly one input, exactly one output, and no branching. Within that constraint, it does everything for you. The moment your architecture needs to branch — two inputs, skip connections, shared embeddings, multiple output heads — Sequential can't express it and you need the Functional API.
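A minimal sketch of the stack described above (Dense with 128 units and ReLU, then Dropout, then a 10-unit softmax output). The 784-feature input, the 0.2 dropout rate, and the integer-label loss are assumptions chosen for illustration; they are not specified in the text.

```python
import keras
from keras import layers

# Assumption: 784 flattened input features (for example 28x28 images).
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),                      # assumed rate, tune per problem
    layers.Dense(10, activation="softmax"),   # one unit per class
])

# Assumption: labels are integers 0..9, hence the sparse loss.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.summary()
```

One input, one output, no branching: this is exactly the shape of model Sequential is meant for.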

In 2026, there's also the Keras 3.0 reality to account for. Keras is now backend-agnostic. If you import from tensorflow.keras, you're locked to TensorFlow. If you import from keras directly, you can switch backends to PyTorch or JAX with a single environment variable. For new projects, always import from keras — it costs nothing and preserves future flexibility.

Data Pipelines and Database Integration for Production Training

In a production environment, your Keras model doesn't load data from a CSV file on a laptop. It reads from a database, a feature store, or a distributed object store — and the way that data is fetched, ordered, and transformed directly affects whether your training runs are reproducible and whether your model generalizes correctly.

The most common data pipeline mistake I see in ML systems is non-deterministic batch ordering. SQL queries without an explicit ORDER BY clause return rows in an unspecified order that depends on the database engine's internal state — index pages, query planner decisions, concurrent writes. Two training runs on the same logical dataset can produce different models because the batches were ordered differently. In practice this means experiments aren't reproducible, A/B comparisons between model versions are unreliable, and debugging a regression becomes nearly impossible.
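As a rough illustration of deterministic batch ordering, the sketch below uses an in-memory SQLite database. The table name, columns, and seed are hypothetical; the point is only that an explicit ORDER BY plus a seeded shuffle yields the same row order on every run.

```python
import random
import sqlite3

# Hypothetical table standing in for a production feature store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, label INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(i, 100.0 * i, i % 2) for i in range(1, 11)],
)

def load_rows(connection, seed):
    # ORDER BY fixes the order returned by the database engine.
    rows = connection.execute(
        "SELECT id, amount, label FROM transactions ORDER BY id"
    ).fetchall()
    # A seeded shuffle still randomizes batches, but reproducibly.
    random.Random(seed).shuffle(rows)
    return rows

run_a = load_rows(conn, seed=42)
run_b = load_rows(conn, seed=42)
print(run_a == run_b)
```

Shuffling after a stable sort keeps the statistical benefit of random batches while making two runs with the same seed comparable.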

The second most common mistake is not versioning training data alongside model weights. When a model starts underperforming in production, the root cause is usually one of two things: a code change or a data distribution shift. If you can't answer 'what dataset was this model trained on, and how does its distribution compare to today's data,' you've lost the ability to diagnose the regression.
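One lightweight way to tie a model to its training data is to store a content hash of the dataset next to the weights. The sketch below is an assumption about how that could look; the function name, the manifest fields, and the file name are hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest over the rows, in the order given."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the serialization independent of dict key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

rows = [
    {"id": 1, "amount": 120.5, "label": 0},
    {"id": 2, "amount": 49999.0, "label": 1},
]

# Hypothetical manifest saved alongside the model file.
manifest = {
    "model_file": "model.keras",
    "dataset_sha256": dataset_fingerprint(rows),
    "row_count": len(rows),
}
print(json.dumps(manifest, indent=2))
```

When a regression shows up later, comparing the stored digest with a digest of the current data answers the first diagnostic question: did the data change?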

A production-grade data pipeline for Keras training has three non-negotiable properties: deterministic ordering, version tracking, and feature normalization that is versioned and shipped with the model rather than left in an ad hoc script. Normalization that lives inside a Keras preprocessing layer is portable with the model. Normalization that lives in an external script that gets modified and re-run is a future bug waiting to happen.
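A sketch of normalization that travels with the model, using the keras.layers.Normalization preprocessing layer. The synthetic data (four features with raw values up to 50,000) and the small network after the normalizer are assumptions for illustration.

```python
import numpy as np
import keras
from keras import layers

# Synthetic stand-in for raw, unscaled training features.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 50_000, size=(256, 4)).astype("float32")

# adapt() computes per-feature mean and variance from the training data only.
normalizer = layers.Normalization(axis=-1)
normalizer.adapt(X_train)

# The normalizer is the first layer, so it is saved and loaded with the model.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    normalizer,
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Inspect what the first Dense layer actually receives.
preprocess_only = keras.Sequential([keras.Input(shape=(4,)), normalizer])
scaled = preprocess_only.predict(X_train, verbose=0)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Because the statistics are stored as layer weights, inference code passes raw values and cannot apply a different scaling by accident.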

Dockerizing Your Training Environment for Reproducibility

TensorFlow's GPU support depends on three things aligning perfectly: the TensorFlow version, the CUDA version, and the cuDNN version. If any of these three diverges between the machine where you train and the machine where you serve, you're in one of two situations: either the model fails to load (the obvious case), or the model loads and produces subtly different numerical outputs because floating-point operations are handled differently across CUDA versions (the silent case).

I've seen the silent case in production. A model trained on TF 2.13 with CUDA 11.8 was deployed to a server running TF 2.15 with CUDA 12.2 because 'it's just a minor version bump.' Accuracy dropped from 94.2% to 91.7% — 2.5 percentage points on a fraud detection model. No error. No warning. The serving API returned 200 OK on every request. The accuracy regression was discovered three weeks later when someone audited the confusion matrix.

Docker solves this by making the environment a deployable artifact, not a configuration assumption. The training environment and the serving environment are the same container image. There is no version drift because there is nothing to drift.

The critical mistake in Docker-based ML setups is using :latest tags. tensorflow/tensorflow:latest-gpu can resolve to a different image tomorrow than it does today. Pin to an exact versioned tag. Better: pin to the image digest, which is immutable even if the tag is later moved to a different image.
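A pinning rule like this can be enforced mechanically, for example in a CI check. The helper below is a hypothetical sketch, not part of Docker or any library: it treats a reference as pinned if it carries a sha256 digest or a tag that does not start with "latest".

```python
import re

# A digest reference ends with @sha256: followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_pinned(image_ref):
    """Return True if the image reference cannot silently float to a new build.

    Note: a version tag is pinned by convention only; a digest is immutable.
    """
    if DIGEST_RE.search(image_ref):
        return True
    # Only look at the last path segment so registry ports are not
    # mistaken for tags (for example localhost:5000/tf).
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means an implicit :latest
    tag = last_segment.split(":", 1)[1]
    return not tag.startswith("latest")

for ref in [
    "tensorflow/tensorflow:latest-gpu",
    "tensorflow/tensorflow:2.16.1-gpu",
    "tensorflow/tensorflow",
]:
    print(ref, is_pinned(ref))
```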

Common Mistakes, Loss Function Selection, and When Not to Use Neural Networks

The most expensive mistake in applied machine learning is not choosing the wrong architecture or the wrong learning rate. It's choosing the wrong model family entirely. Neural networks require large amounts of data, significant compute, careful hyperparameter tuning, and non-trivial infrastructure to deploy reliably. On tabular data with fewer than 100,000 samples, a well-tuned gradient boosting model (XGBoost, LightGBM, CatBoost) will often match or outperform a neural network, and it will typically do so in seconds of training time rather than hours.

I've watched teams spend three weeks building a Keras pipeline for a churn prediction problem with 15,000 training samples and 40 features. The final model achieved 82% AUC after extensive tuning. A default XGBClassifier with no tuning achieved 84% AUC in 4 seconds. The neural network was the wrong tool, and nobody stopped to benchmark the alternative first.

Beyond model selection, the loss function and output activation pairing is where most Keras configurations go silently wrong. This combination must be consistent: the math of the loss function assumes a specific probability interpretation of the model's output. Softmax outputs sum to 1.0 across all classes and represent a categorical distribution — pairing this with binary_crossentropy produces gradient updates based on incorrect mathematical assumptions. The model will train. The loss will decrease. The output probabilities will be meaningless.
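The consistent pairings can be written down as a small lookup so the output layer and the loss are always chosen together. This is a hypothetical helper (the task names and the function are invented for illustration); the activation and loss strings are standard Keras identifiers.

```python
# Hypothetical lookup: each task maps to an output layer configuration and
# the loss whose math matches that output's probability interpretation.
OUTPUT_CONFIGS = {
    "binary": {
        "units": 1,
        "activation": "sigmoid",
        "loss": "binary_crossentropy",
    },
    "multiclass_int_labels": {
        "units": "num_classes",
        "activation": "softmax",
        "loss": "sparse_categorical_crossentropy",
    },
    "multiclass_one_hot": {
        "units": "num_classes",
        "activation": "softmax",
        "loss": "categorical_crossentropy",
    },
}

def output_config(task, num_classes=None):
    """Return a copy of the config with the unit count filled in."""
    config = dict(OUTPUT_CONFIGS[task])
    if config["units"] == "num_classes":
        if num_classes is None or num_classes < 2:
            raise ValueError("multiclass tasks need num_classes >= 2")
        config["units"] = num_classes
    return config

print(output_config("binary"))
print(output_config("multiclass_int_labels", num_classes=10))
```

Selecting both values from one place makes a softmax output paired with binary_crossentropy impossible to configure by accident.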

Data normalization errors are the third most common issue, and they are the cause of the incident in this article's headline. The mechanics are worth restating clearly: neural network weights are initialized in a small range (Glorot uniform, the Keras default for Dense layers, draws from ±sqrt(6 / (fan_in + fan_out)), which is roughly ±0.08 for a 784-to-128 layer), and the mathematical stability of gradient descent assumes input magnitudes are in a comparable range. Raw pixel values (0-255) and raw financial amounts (0-50,000) both violate this assumption. The fix is always normalization, and it always happens before model.fit().
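A minimal NumPy sketch of the audit-then-normalize step, assuming synthetic financial amounts in the 0 to 50,000 range and an 800/200 train/validation split. The statistics come from the training split only and are reused for validation.

```python
import numpy as np

# Synthetic stand-in for raw financial features.
rng = np.random.default_rng(0)
amounts = rng.uniform(0, 50_000, size=(1_000, 3))
X_train, X_val = amounts[:800], amounts[800:]

# Audit first: a range this wide is the warning sign.
print("raw range:", X_train.min(), X_train.max())

# Fit the statistics on the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same statistics to every split.
X_train_scaled = (X_train - mean) / std
X_val_scaled = (X_val - mean) / std

print("scaled range:", X_train_scaled.min(), X_train_scaled.max())
```

The mean and std arrays are the normalization parameters that need to be persisted with the model, since inference must apply exactly the same transform.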

Traditional ML vs Neural Networks — Choosing the Right Tool
| Aspect | Traditional ML (scikit-learn, XGBoost) | Neural Networks (Keras) |
| --- | --- | --- |
| Feature Engineering | High effort: domain expertise required to hand-craft features. But those features are interpretable and debuggable. | Lower effort: the network learns feature interactions automatically. But you lose insight into what it's learning. |
| Data Volume | Works well with 1,000–100,000 samples. Often matches or beats neural networks in this range even without tuning. | Typically requires 100,000+ samples to outperform tree ensembles on tabular data. Excels at scale. |
| Hardware Requirements | Standard CPU. A 4-core machine runs a Random Forest in seconds. Training is reproducible across hardware. | GPU or TPU strongly preferred for anything beyond toy datasets. CUDA version management is a real operational burden. |
| Training Time | Seconds to low minutes for most tabular datasets. Fast iteration cycle means more experiments per day. | Minutes to days depending on architecture and data size. Slow iteration cycle raises the cost of each experiment. |
| Interpretability | High: Decision Tree feature importances, SHAP values for Random Forest/XGBoost are production-grade. Regulators accept them. | Low by default: saliency maps and LIME provide approximations but no ground truth explanation. Harder to audit in regulated industries. |
| When to use | Tabular data under 100k samples, regulated industries requiring model explanations, tight latency budgets, small engineering teams. | Images, audio, text, time series with complex patterns, tabular data above 100k samples, when automatic feature learning provides a measurable lift over feature-engineered baselines. |
| Operational complexity | Low: model is a Python object serialized to disk. No GPU infrastructure, no CUDA management, no Docker required for serving. | High: requires GPU infrastructure, pinned CUDA/cuDNN versions, Docker for reproducibility, model serving infrastructure (TF Serving, FastAPI + Keras). |

Key Takeaways

  • Keras Sequential is a linear stack — if your architecture needs to branch at any point (multiple inputs, skip connections, multiple output heads), use the Functional API from day one. Rewriting Sequential to Functional after the fact costs more than starting Functional would have.
  • model.compile() is not optional ceremony: it binds the optimizer, loss function, and metrics to the model. Call it immediately after model definition and before any call to fit() or evaluate(). predict() runs without it, but an uncompiled model cannot be trained or evaluated. Structure model-building code so compile() is the last line of the builder function.
  • Input normalization is the single highest-ROI preprocessing step in deep learning. A flat loss curve in epochs 1-2 is a data problem 80% of the time. Check np.min(X_train) and np.max(X_train) before submitting any training job. If range exceeds 10, normalize first.
  • Save the preprocessing scaler alongside the model weights — they are co-dependencies. A model loaded without its scaler will receive out-of-distribution inputs at inference time and produce degraded predictions with no error raised. Treat the scaler as part of the model artifact.
  • Docker reproducibility with pinned image versions is the mechanism that makes ML experiments verifiable. Using :latest is non-determinism. Pin to exact versioned tags (2.16.1-gpu). The training environment is the model's runtime contract.
  • Benchmark against scikit-learn and XGBoost before building a neural network pipeline for tabular data. On datasets under 100k samples, tree ensembles frequently match or beat neural networks with a fraction of the compute cost and operational complexity. Run the benchmark. Let the numbers decide.

Common Mistakes to Avoid

  • Using a neural network when XGBoost or Random Forest would suffice
    Symptom: Model trains for hours on tabular CSV data with 20,000 rows and achieves 81% F1 after extensive hyperparameter tuning. A colleague runs sklearn RandomForestClassifier with default settings in 8 seconds and gets 83% F1. The team spent two weeks on infrastructure for a result that was worse than the baseline.
    Fix: Run the benchmark first. Use sklearn.ensemble.RandomForestClassifier and xgboost.XGBClassifier with no tuning before building any Keras pipeline. If the baseline meets the business requirement, deliver the baseline. Use neural networks for unstructured data (images, text, audio) or tabular datasets with 100k+ samples where feature interaction complexity genuinely exceeds what tree ensembles can capture. The decision should be driven by benchmarks, not by which tool feels more sophisticated.
  • Calling model.fit() before model.compile()
    Symptom: RuntimeError: You must compile your model before training/testing it. Use model.compile(optimizer, loss). (The exact exception type and wording vary between Keras versions.) This at least fails loudly. The larger risk is calling compile() with incorrect arguments and not noticing until training behavior looks wrong.
    Fix: Always call model.compile(optimizer='adam', loss='...', metrics=[...]) immediately after model definition and before any call to fit() or evaluate(). predict() does not require compile(), so an uncompiled model can pass an inference-only smoke test unnoticed. Structure your training code so compile() is the last line of the model-building function; this way it's impossible to return an uncompiled model.
  • Ignoring input shape mismatches — wrong reshape before model.fit()
    Symptom: ValueError: Input 0 of layer 'sequential' is incompatible with the layer: expected shape=(None, 784), found shape=(32, 28, 28). The batch dimension is correct (32) but the spatial dimensions are not flattened.
    Fix: Always print both model.input_shape and X_train.shape before the first model.fit() call. If data is (N, 28, 28), either add layers.Flatten() as the first layer in the model (preferred — the reshape is part of the model and is applied automatically at inference too), or reshape manually with X_train.reshape(-1, 784). The Flatten-in-model approach is safer because it ensures inference code applies the same reshape consistently.
  • Not normalizing input data — feeding raw values to the first Dense layer
    Symptom: Training loss is flat from epoch 1. Accuracy stuck at the random baseline. Adding more layers and epochs makes no difference. The model is not learning — it is saturated.
    Fix: Normalize before model.fit(). For pixel data: X = X.astype('float32') / 255.0. For tabular data: apply StandardScaler or MinMaxScaler fit on training data only, then transform both train and validation sets. Always audit with np.min(X_train), np.max(X_train) before training — if range exceeds 10, normalize. Save the scaler to disk alongside the model weights — inference requires identical preprocessing.
  • Fitting the preprocessing scaler on the full dataset instead of training data only
    Symptom: Model performs well on the validation set during training but accuracy degrades when the model is evaluated on a truly held-out test set or in production. The validation metrics during training were artificially optimistic.
    Fix: This is data leakage. The scaler's fit() method computes mean and std from whatever data you pass it. If you pass the full dataset, validation and test samples contribute to the normalization parameters — the model has seen statistical information about those samples during preprocessing. Always call scaler.fit() on X_train only, then scaler.transform() on X_val and X_test separately. The training set defines the normalization contract. Everything else is transformed to match that contract.
  • Not using callbacks — training to a fixed epoch count without EarlyStopping
    Symptom: Model overfits after epoch 20 but continues training to epoch 100. The saved model weights are from epoch 100, not epoch 20 where validation AUC peaked. The deployed model performs worse than the best checkpoint that was never saved.
    Fix: Always use EarlyStopping with restore_best_weights=True and ModelCheckpoint to save the best model to disk. These two callbacks together ensure that training stops at the optimal point and the best weights are both restored in memory and persisted to disk. Without restore_best_weights=True, EarlyStopping stops training but leaves the model at the weights from the last epoch, not the best epoch.
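The callback fix from the last item can be sketched as follows. The synthetic data, the tiny model, the patience of 3, and the temporary checkpoint path are all assumptions for illustration; only the two callbacks and their arguments come from the text.

```python
import os
import tempfile

import numpy as np
import keras
from keras import layers

# Synthetic binary task: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)).astype("float32")
y = (X[:, 0] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_model.keras")
callbacks = [
    # Stop when val_loss has not improved for 3 epochs and roll the
    # in-memory weights back to the best epoch.
    keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    ),
    # Persist the best model to disk as training progresses.
    keras.callbacks.ModelCheckpoint(
        checkpoint_path, monitor="val_loss", save_best_only=True
    ),
]

history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=20,
    batch_size=32,
    callbacks=callbacks,
    verbose=0,
)
print("epochs run:", len(history.history["val_loss"]))
```

The epochs argument becomes an upper bound rather than a fixed schedule, and the file at checkpoint_path always holds the best weights seen so far.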

Interview Questions on This Topic

  • Q: What is the mathematical purpose of the ReLU activation function, and why is it preferred over Sigmoid in hidden layers of deep networks? (Junior)
    ReLU (Rectified Linear Unit) is defined as f(x) = max(0, x). For negative inputs it outputs zero; for positive inputs it outputs the input unchanged. The gradient of ReLU is 1 for positive inputs and 0 for negative inputs, so it is piecewise constant. ReLU is preferred over Sigmoid in hidden layers for three concrete reasons. First, vanishing gradients. Sigmoid squashes its output to the range (0, 1) and its derivative approaches zero for large positive and negative inputs: the gradient saturates. In a 20-layer network, each backpropagation step multiplies gradients through Sigmoid derivatives, and the product of 20 factors that are each at most 0.25 (the maximum of the Sigmoid derivative) is at most 0.25^20, about 10^-12. Early layers stop receiving meaningful gradient signal and stop learning. ReLU's gradient is exactly 1 for positive inputs, so it doesn't contribute to gradient shrinkage. Second, computational cost. ReLU is a single max() operation. Sigmoid computes an exponential: 1/(1+e^-x). At the scale of millions of activations per forward pass, this difference is measurable. Third, sparsity. Roughly half of all ReLU units output zero for any given input (those receiving negative pre-activations). This sparse activation pattern creates efficient internal representations where only relevant neurons fire for each input. The trade-off is the 'dying ReLU' problem: neurons that receive large negative inputs in early training may never activate again, creating permanently dead units. Mitigation: Leaky ReLU (f(x) = max(0.01x, x)), He Normal initialization to keep activation variance stable across layers at the start of training, and careful learning rate selection.
  • Q: Explain the vanishing gradient problem. How do He Normal initialization and BatchNormalization work together to mitigate it in deep Keras networks? (Mid-level)
    The vanishing gradient problem occurs when gradients computed during backpropagation shrink exponentially as they propagate through layers toward the input. If each layer multiplies the gradient by a value less than 1 (which happens with Sigmoid/Tanh activations and poorly initialized weights), a 50-layer network's early layers receive gradients on the order of 10^-20, which is effectively zero. Those layers cannot update their weights and don't learn. He Normal initialization addresses the weight initialization component. It samples initial weights from a normal distribution with variance = 2/fan_in, where fan_in is the number of input connections to the layer. The factor of 2 accounts for ReLU zeroing approximately half of all inputs; without this correction, the output variance of each layer would halve with every layer, and signal magnitude would decay exponentially through depth. He Normal keeps the variance of activations approximately constant across layers at initialization, preventing the signal from dying before training even begins. BatchNormalization addresses the runtime component. It normalizes the pre-activation values within each mini-batch to have zero mean and unit variance, then applies learned scale and shift parameters. This means regardless of how weight updates shift the distribution of layer inputs during training, each layer always receives a normalized input distribution. BN effectively decouples layer training: each layer optimizes against a stable input distribution rather than a shifting one. It also allows higher learning rates because the normalization prevents runaway activation magnitudes. In practice, He Normal + ReLU + BatchNormalization is the standard combination for training convolutional networks with 20-100+ layers. Together with skip connections, this combination enabled ResNet (152 layers) and DenseNet to train stably. Transformer architectures such as vision transformers typically use LayerNormalization instead.
  • Q: When should you use the Keras Functional API over the Sequential API? Provide a concrete architectural example with code structure. (Mid-level)
    Use the Functional API whenever your model cannot be expressed as a strictly linear layer stack. There are four clear triggers:
      1. Multiple inputs: a model that takes both a text embedding and a set of tabular features.
      2. Multiple outputs: a model that simultaneously predicts user intent (classification) and session duration (regression).
      3. Skip connections: ResNet-style architectures where a layer's input is added to its output before being passed forward.
      4. Shared weights: Siamese networks where two inputs are processed by the same layer with the same weights.
    Concrete example, a fraud detection model with two input branches:
        import keras
        from keras import layers

        # Two independent inputs: tabular features and an image.
        inputs_tabular = keras.Input(shape=(40,), name='transaction_features')
        inputs_image = keras.Input(shape=(64, 64, 3), name='merchant_logo')

        # Tabular branch.
        tabular_branch = layers.Dense(64, activation='relu')(inputs_tabular)
        tabular_branch = layers.Dense(32, activation='relu')(tabular_branch)

        # Image branch.
        image_branch = layers.Conv2D(32, 3, activation='relu')(inputs_image)
        image_branch = layers.GlobalAveragePooling2D()(image_branch)
        image_branch = layers.Dense(32, activation='relu')(image_branch)

        # Merge both branches and predict a single probability.
        merged = layers.Concatenate()([tabular_branch, image_branch])
        output = layers.Dense(1, activation='sigmoid', name='fraud_probability')(merged)

        model = keras.Model(
            inputs=[inputs_tabular, inputs_image],
            outputs=output,
            name='multi_modal_fraud_detector'
        )
    Sequential cannot express this. The two branches are processed independently and merged; that's a directed acyclic graph, not a linear sequence. The Functional API makes the data flow explicit and the model graph inspectable with keras.utils.plot_model().
  • Q: What is the difference between sparse_categorical_crossentropy and categorical_crossentropy in Keras, and when does choosing the wrong one cause a training failure? (Junior)
    Both compute the same cross-entropy loss: -sum(y_true * log(y_pred)). The difference is entirely in how y_true is expected to be formatted. categorical_crossentropy expects y_true to be one-hot encoded: for 5 classes, class 2 is represented as [0, 0, 1, 0, 0]. sparse_categorical_crossentropy expects y_true to be an integer: for 5 classes, class 2 is the integer 2. Using the wrong one causes a specific failure mode depending on the direction of the mismatch: If you use categorical_crossentropy with integer labels (e.g., y_train = [0, 2, 1, 3]), Keras interprets each integer as a probability vector of length 1 — a scalar probability for one class. This shape mismatch typically raises a ValueError or produces incorrect loss values that look numerically plausible but are wrong. If you use sparse_categorical_crossentropy with one-hot labels (e.g., y_train = [[1,0,0], [0,1,0]]), Keras treats each one-hot vector as a sequence of class indices. With 3 classes, the one-hot vector [0,1,0] gets interpreted as indices 0, 1, 0 — meaningless for classification. Training may proceed without error but the loss calculation is wrong. Practical guideline: use sparse_categorical_crossentropy by default — it avoids the memory overhead of storing one-hot matrices (10,000 samples × 1,000 classes = 10M floats vs 10,000 integers) and is the natural format for SQL-sourced labels and tf.data pipelines.
  • Q: Describe how the Adam optimizer works mechanically. Why does it outperform vanilla SGD on most Keras training jobs, and when might SGD with momentum be the better choice? (Senior)
    Backpropagation computes gradients: the direction and magnitude of change needed for each weight to reduce the loss. The optimizer decides how to translate those gradients into actual weight updates.
    Vanilla SGD applies the update:
        w = w - lr * gradient
    Every parameter uses the same learning rate and the gradient is applied directly, which means updates are noisy on mini-batches and the learning rate must be tuned carefully for each problem.
    Adam (Adaptive Moment Estimation) maintains two exponentially weighted running averages per parameter:
      - m (first moment): mean of gradients, similar to momentum in SGD
      - v (second moment): mean of squared gradients, similar to RMSProp
    The weight update is:
        w = w - lr * m_hat / (sqrt(v_hat) + epsilon)
    where m_hat and v_hat are bias-corrected estimates that account for the initialization of m and v at zero.
    The practical effect: parameters that receive consistently large gradients (like weights connected to high-variance features) get smaller effective learning rates. Parameters that receive small or infrequent gradients get larger effective learning rates. Adam adapts the step size per parameter based on historical gradient information.
    Why Adam outperforms SGD on most Keras tasks: it usually converges faster with little learning rate tuning, handles sparse gradients naturally (important for embedding layers in NLP), and is robust to noisy gradients from mini-batch sampling.
    When SGD with momentum is better: large-scale convolutional network training (for example ResNet trained from scratch on ImageNet) often shows that SGD + momentum with a carefully tuned learning rate schedule achieves slightly higher final accuracy than Adam, though it converges more slowly. The intuition: Adam's adaptive rates can steer it toward sharper minima, while SGD tends to settle in flatter minima, which often generalize better. 'The Marginal Value of Adaptive Gradient Methods in Machine Learning' (Wilson et al., 2017) documents this generalization gap.
    For most Keras beginners and production tabular models, Adam is the correct default.
  • Q: What is training-serving skew in a Keras ML system, and what specific implementation decisions prevent it? (Senior)
    Training-serving skew occurs when the data transformation pipeline at inference time differs from the pipeline used during training. The model was optimized for inputs with specific statistical properties. If the serving pipeline produces inputs with different properties (different normalization parameters, different feature ordering, missing features filled with different defaults), the model receives out-of-distribution inputs and produces degraded predictions with no error raised. Common causes and prevention:
      1. Scaler not saved: StandardScaler computed from training data is used during training but not saved. At inference time, a new scaler is fit on the serving data, producing different mean and std. Fix: joblib.dump(scaler, 'scaler.pkl') alongside model.save('model.keras'). Inference code loads both and applies scaler.transform() to incoming data.
      2. Feature order mismatch: training data has features in columns [A, B, C]; serving data arrives as a dict and is assembled in order [B, A, C]. The model receives the wrong values for each weight connection. Fix: enforce feature ordering at the pipeline level with a declared schema, not by relying on column order in a DataFrame.
      3. Missing value handling differs: training replaces NaN with column mean; serving replaces NaN with -1. Fix: embed the imputation strategy in a sklearn Pipeline or a Keras preprocessing layer so the same logic runs in both environments.
      4. Keras preprocessing layers vs external preprocessing: wrapping normalization inside a keras.layers.Normalization layer means the normalization is part of the saved model and applied automatically at inference, which removes normalization as a source of skew. This is the architecturally cleanest solution when using TensorFlow backend.
    In production: treat the preprocessing pipeline as part of the model artifact, not as surrounding infrastructure. The model should be a self-contained unit that accepts raw inputs and produces predictions. Every transformation that happens outside the model is a potential source of skew.
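The Adam update rule from the optimizer question above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the published formula, not Keras's internal code; the default hyperparameters shown (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used values and are an assumption here.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Two parameters whose gradients differ by four orders of magnitude.
w = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])
m = np.zeros_like(w)
v = np.zeros_like(w)

w_new, m, v = adam_step(w, grad, m, v, t=1)
print("step sizes:", w - w_new)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) reduces to the sign of the gradient, so both parameters move by roughly lr even though their raw gradients differ by a factor of 10,000. That is the per-parameter adaptivity the answer describes.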

Frequently Asked Questions

Can I use Keras without TensorFlow?

Yes, as of Keras 3.0. Keras is now backend-agnostic and supports TensorFlow, PyTorch, and JAX. To switch backends, set the KERAS_BACKEND environment variable before importing Keras: os.environ['KERAS_BACKEND'] = 'jax'. Import from keras directly rather than tensorflow.keras — importing from tensorflow.keras locks you to the TensorFlow backend regardless of the environment variable.

The most production-mature backend remains TensorFlow, which offers TF Serving for model deployment, TFLite for mobile/edge inference, and the widest ecosystem of deployment tooling. PyTorch backend is the correct choice if your team's inference infrastructure is already PyTorch-based. JAX backend is useful for research and TPU-heavy workloads.

For new projects in 2026: import from keras, not tensorflow.keras, even if you plan to use TensorFlow. It costs nothing and preserves the option to switch backends without rewriting model code.

How do I know if my model is overfitting, and what's the correct order of interventions?

The diagnostic is straightforward: plot training loss and validation loss on the same graph across epochs. Overfitting is present when training loss continues decreasing while validation loss stops decreasing or starts increasing. The gap between the two curves is the overfitting signal.

Interventions in order of invasiveness — try each before escalating to the next:

  1. EarlyStopping with restore_best_weights=True: stops training at the best validation point. Costs nothing. Always do this first.
  2. Dropout: add Dropout(0.3-0.5) after Dense layers. Randomly zeroes activations during training, forcing the network to learn redundant representations.
  3. L2 regularization: add kernel_regularizer=keras.regularizers.l2(1e-4) to Dense layers. Penalizes large weights, discouraging memorization.
  4. Reduce model complexity: fewer layers or smaller layer widths reduce capacity for memorization.
  5. More training data or data augmentation: the fundamental fix — overfitting is a capacity-to-data ratio problem.

If validation loss is consistently improving alongside training loss, you're not overfitting — you're in the normal training regime and should let it run until EarlyStopping triggers.
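The diagnostic can also be checked programmatically from the per-epoch losses that model.fit() returns in history.history. The helper below is a hypothetical sketch with made-up loss values; the patience threshold of 3 epochs is an assumption.

```python
def diagnose(train_loss, val_loss, patience=3):
    """Flag overfitting: val loss stalled while train loss kept falling."""
    # Index of the epoch with the lowest validation loss.
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_index
    train_still_improving = train_loss[-1] < train_loss[best_index]
    return {
        "best_epoch": best_index + 1,  # 1-based, as Keras logs epochs
        "overfitting": epochs_since_best >= patience and train_still_improving,
    }

# Made-up curves: validation loss bottoms out at epoch 4, training keeps falling.
overfit_train = [0.90, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10]
overfit_val = [0.95, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]

# Made-up curves: both losses still improving together.
healthy_train = [0.90, 0.70, 0.50]
healthy_val = [0.95, 0.75, 0.60]

print(diagnose(overfit_train, overfit_val))
print(diagnose(healthy_train, healthy_val))
```

With a real run, pass history.history["loss"] and history.history["val_loss"]; the reported best epoch is the one EarlyStopping with restore_best_weights=True would roll back to.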

Should I use GPU for my first neural network, and how do I verify it's being used?

For learning on small datasets like MNIST (60k samples, 28x28 images), a CPU is sufficient — training completes in under a minute. For datasets with 100k+ samples, image data, or any architecture with convolutional or recurrent layers, a GPU reduces training time by 10-50x and is effectively required to iterate at a useful pace.

Verify GPU availability: python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))". If this returns an empty list, TensorFlow is using CPU — check your CUDA installation with nvidia-smi. If nvidia-smi shows a GPU but TensorFlow doesn't see it, the TensorFlow version and CUDA version are mismatched — this is exactly the problem Docker with pinned base images prevents.

For free GPU access during experimentation: Google Colab provides T4 GPUs in the free tier and A100s in Colab Pro. Kaggle Notebooks provide weekly GPU quota. Both are sufficient for learning and small project work. For production training: cloud GPUs (AWS p3/p4, GCP A100s, Azure NC-series) or on-premise GPU servers with Docker are the correct infrastructure.

What is the correct file format to save a Keras model, and what's the difference between .keras, .h5, and SavedModel?

In 2026 with Keras 3.0, the recommended format is .keras (the native Keras format). Use model.save('model.keras') and keras.models.load_model('model.keras').

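.keras (recommended): the native Keras 3.0 format. Stores architecture, weights, optimizer state, and training configuration in a single file. Custom layers and functions are saved by configuration only, so their code must still be importable (registered or passed via custom_objects) at load time. Correct choice for Keras-to-Keras save/load.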

SavedModel (TensorFlow-specific): the format used by TF Serving and TFLite conversion. Use model.export('saved_model_dir') to produce a SavedModel from a Keras model. Required when deploying to TensorFlow Serving or converting to TFLite for mobile inference.

.h5 (legacy): the HDF5 format from Keras 1/2. Still supported but deprecated — does not support all Keras 3.0 features. If you're loading an existing .h5 model, it works; for new projects, use .keras.

Critical detail: always save the preprocessing scaler (joblib.dump(scaler, 'scaler.pkl')) alongside the model file. The model file contains the network itself (architecture, weights, and training state) but not any preprocessing done outside it. The scaler contains the normalization parameters required to transform raw inputs into the format the model expects. Both files are required for a complete inference artifact.

How should I structure a Keras model for deployment to a REST API in production?

The two-artifact model — a .keras file and a scaler .pkl — is not the ideal serving architecture. A better approach wraps both into a single callable that accepts raw inputs and returns predictions.

Option 1: keras.layers.Normalization preprocessing layer. Adapt the normalization layer to your training data statistics (normalization_layer.adapt(X_train)), then include it as the first layer inside the model. The saved model applies normalization automatically — no external scaler needed. Correct when using TensorFlow backend.

Option 2: sklearn Pipeline with the Keras model wrapped in a KerasClassifier (scikeras library). The pipeline chains StandardScaler and the Keras model — joblib.dump(pipeline, 'pipeline.pkl') saves the complete preprocessing + model bundle.

For REST API serving: FastAPI + Uvicorn is the most common pattern in 2026 for Python-based ML serving. Load the model at startup (not per-request), apply preprocessing, call model.predict(), and return the result. Avoid TensorFlow Serving unless you need gRPC or the specific features it provides — it adds infrastructure complexity that a well-structured FastAPI service doesn't need for most use cases.
