Senior 4 min · March 06, 2026
Activation Functions in Neural Networks

Sigmoid Hidden Layers Cause NaN Loss — Activation Fix

TF 2.15 mixed precision + sigmoid caused float16 underflow and CrashLoopBackOff.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Production
production tested
May 24, 2026
last updated
1,554
articles · all by Naren
Quick Answer
  • Activation functions introduce non-linearity, allowing neural networks to learn complex patterns.
  • Key types include ReLU (fast, sparsely activated), Sigmoid (probabilistic output), and Softmax (multi-class probability).
  • ReLU avoids vanishing gradients but can cause dead neurons if learning rates are too high.
  • Sigmoid outputs between 0 and 1, making it ideal for binary classification output layers.
  • Softmax ensures all class probabilities sum to 1, perfect for multi-class classification.
  • In production, ReLU variants like Leaky ReLU often outperform vanilla ReLU for stability.
Definition · ~90s read
What Are Activation Functions in Neural Networks?

Activation functions are the non-linear decision gates inside neural networks that determine whether a neuron should fire. Without them, stacking layers would collapse into a single linear transformation, making deep networks useless. The choice of activation function directly impacts training stability, gradient flow, and whether your loss explodes to NaN.

Sigmoid and Tanh, once standard, suffer from vanishing gradients in deep networks: their outputs saturate at the ends of their range (0 or 1 for sigmoid, -1 or 1 for tanh), killing gradient updates and often causing NaN loss when combined with certain loss functions or weight initializations. ReLU (Rectified Linear Unit) largely replaced them in hidden layers because its gradient is 1 for positive inputs, avoiding saturation and enabling faster, more stable training.

Softmax is a different beast: it's used exclusively in the output layer for multi-class classification, converting raw logits into a probability distribution that sums to 1. It's not an alternative to ReLU — it solves a different problem. If you're hitting NaN loss, the first suspect is often a sigmoid hidden layer paired with a loss like cross-entropy, especially with poor initialization or unnormalized inputs.

ReLU or its variants (Leaky ReLU, ELU) are the standard fix for hidden layers, while Softmax remains the go-to for probabilistic outputs.

Plain-English First

Imagine your brain deciding whether to feel excited about something — a tiny stimulus barely registers, but a loud noise makes you jump. Activation functions are that decision-maker inside every artificial neuron. They take in a raw number and decide: 'Is this signal strong enough to pass forward, and if so, how strongly?' Without them, your entire neural network is just a fancy calculator doing basic multiplication — it can't learn curves, patterns, or anything complex. They're the on/off switches (and everything in between) that give neural networks their power.

Your neural network is dead on arrival without the right activation function. It's that simple. Pick wrong, and your model won't train. It'll just sit there, stuck. The math collapses.

Activation functions aren't just math. They're the decision engine. They take a raw input signal and decide how much of it gets passed on. No activation function means your multi-layer network is mathematically identical to a single layer. It can only draw straight lines. Real data isn't linear.

You need non-linearity. That's what these functions inject. They let networks model curves, complex boundaries, and the messy patterns of real-world data. We'll cut through the theory and show you exactly which one to use, where, and why. You'll see the code, understand the trade-offs, and fix the common pitfalls that kill models in production.

Why Sigmoid Hidden Layers Cause NaN Loss

Activation functions in neural networks are the non-linear transformations applied to each neuron's output. Without them, a network collapses into a single linear transformation, regardless of depth. The core mechanic: an activation function introduces non-linearity, enabling the network to learn complex patterns by deciding which signals pass forward. In practice, the sigmoid function squashes any input to a value between 0 and 1, but its derivative peaks at 0.25 and vanishes near the tails. This causes gradients to shrink exponentially with depth — the vanishing gradient problem — making deep networks untrainable. For hidden layers, sigmoid is a poor choice because its output is not zero-centered, leading to zigzagging updates and slower convergence. Worse, when outputs saturate near 0 or 1, gradients become effectively zero, and loss can explode to NaN as weights oscillate or diverge. Use ReLU or its variants for hidden layers; reserve sigmoid for binary classification output layers where probability interpretation is needed.
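To put numbers on the 0.25 ceiling, here is a minimal NumPy sketch. It only multiplies activation derivatives and ignores the weight matrices, so treat the 10-layer figure as a rough upper bound under that assumption, not a measurement from a real network. The helper names are ours.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)    # the derivative is largest at x = 0
tail = sigmoid_grad(10.0)   # deep in the saturated region

# Best case for a 10-layer stack: every neuron sits exactly at the peak.
bound_10_layers = peak ** 10

print(f"peak derivative:      {peak}")
print(f"derivative at x = 10: {tail:.2e}")
print(f"10-layer upper bound: {bound_10_layers:.2e}")
```

Even in the best case, the activation-derivative factor reaching the first layer is below one in a million. Saturated neurons make it far smaller.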

Sigmoid in Hidden Layers
Sigmoid's non-zero-centered output and vanishing gradient make it a common cause of NaN loss in deep networks. Use ReLU instead.
Production Insight
A team trained a 10-layer network with sigmoid hidden layers on a fraud detection dataset. After 50 epochs, loss became NaN and the model output all zeros. The root cause: gradients vanished in layers 5–8, causing weight updates to amplify noise until overflow. Rule: never use sigmoid in hidden layers of networks with more than 2–3 layers.
Key Takeaway
Activation functions are the only source of non-linearity in neural networks.
Sigmoid causes vanishing gradients and NaN loss in deep hidden layers.
Use ReLU for hidden layers; reserve sigmoid for binary output layers only.
[Figure: Activation Functions: From Sigmoid NaN to Swish (thecodeforge.io). Panels: Sigmoid hidden layers (vanishing gradients → NaN loss); ReLU, Sigmoid, Tanh (classic activations with gradient issues); Softmax (probability output for classification); Leaky ReLU (fixes dead ReLU with a small negative slope); Swish (smooth, non-monotonic).]

The Big Three: ReLU, Sigmoid, and Tanh

ReLU's the king for a reason—we've watched models using Sigmoid in hidden layers stall for days. Its zero gradient for negatives isn't a bug, it's the feature that finally lets deep networks learn. You'll see training loss plummet once you swap those S-shaped functions out.

Don't get me wrong, Sigmoid and Tanh still have their place in output layers. But put them anywhere else and you'll be debugging vanishing gradients all night. Those tiny slopes multiply across layers until your early weights barely budge.

We learned this the hard way when our LSTM's first layer refused to update. The fix? Swapping Tanh for ReLU in the hidden states gave us 3x faster convergence. Your team should treat saturating activations (sigmoid, tanh) in hidden layers as a performance red flag.

activations.py · PYTHON
# Package: io.thecodeforge.ml.core
import numpy as np
import torch
import torch.nn as nn

class ActivationForge:
    @staticmethod
    def manual_relu(input_tensor):
        """Standard ReLU: f(x) = max(0, x)"""
        return np.maximum(0, input_tensor)

    @staticmethod
    def manual_sigmoid(input_tensor):
        """Sigmoid: 1 / (1 + exp(-x))"""
        return 1 / (1 + np.exp(-input_tensor))

# Production usage with PyTorch
model_layer = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),           # Use ReLU for hidden layers
    nn.Linear(128, 10),
    nn.Softmax(dim=1)    # Softmax for multi-class probabilities at inference;
                         # drop it when training with nn.CrossEntropyLoss,
                         # which expects raw logits
)
Output
# Ready for high-performance training
Forge Tip: The Dying ReLU Problem
While ReLU is fast, it has a 'dead' zone for negative values. If too many neurons receive negative inputs, they stop updating (gradient is 0). If your model stops learning, try 'Leaky ReLU', which adds a tiny slope (like 0.01) to negative inputs.
Production Insight
Sigmoid in hidden layers caused 72-hour training stalls.
Vanishing gradients killed early layer updates in production LSTMs.
Default to ReLU for hidden layers—treat others as exceptions.
Key Takeaway
ReLU enables deep learning by avoiding gradient saturation.
Sigmoid/Tanh gradients vanish, stalling early layer training.
Hidden layers belong to ReLU.

Softmax — The Probability Architect

Softmax doesn't work on one neuron — it sees the whole output layer at once. It takes every raw score (logit) and squashes them into a probability distribution that sums to exactly 1.0. That's what makes it the right call for multi-class classification. Your model isn't just picking a winner — it's telling you how confident it is across every possible class.

Here's the problem you'll hit in production: Softmax amplifies small logit differences into near-certainty. A model that's barely leaning toward 'Cat' will report 94% confidence. We've seen this burn teams using Softmax outputs directly in risk-scoring pipelines. It's great for ranking, terrible for calibration.

Watch out for extreme logits during training too. When logit values get large, Softmax saturates — gradients vanish and the model stops learning. That's why you'll see temperature scaling and label smoothing in real production systems. They're not optional polish — they're fixes for Softmax's core overconfidence problem.

softmax_impl.py · PYTHON
# Package: io.thecodeforge.ml.output
import torch

def calculate_probabilities(logits):
    """
    Convert raw model scores (logits) into probabilities.
    Formula: softmax(x)_i = exp(x_i) / sum_j exp(x_j)
    """
    # Use torch.softmax for numerical stability (prevents overflow)
    probabilities = torch.softmax(logits, dim=0)
    return probabilities

# Example output for 3 classes: [Dog, Cat, Bird]
raw_scores = torch.tensor([2.0, 1.0, 0.1])
probs = calculate_probabilities(raw_scores)
print(f"Probabilities: {[round(p, 3) for p in probs.tolist()]}")
Output
Probabilities: [0.659, 0.242, 0.099]
Interview Gold: Softmax vs Sigmoid
Never use Softmax for binary classification (Yes/No). Use Sigmoid. Softmax is strictly for multi-class problems where classes are mutually exclusive.
Production Insight
Overconfident wrong predictions are a classic Softmax failure.
Use temperature scaling to calibrate outputs post-training.
Never treat raw Softmax outputs as true probabilities for risk decisions.
Key Takeaway
Turns scores into a probability distribution.
Amplifies small differences into high confidence.
Great for picking a winner, terrible for assessing true uncertainty.
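Temperature scaling is easier to reason about with a small example. The sketch below divides the logits by a temperature before applying softmax. The function name and the value T = 2.0 are hypothetical, chosen only for illustration. In practice the temperature is normally fitted on a held-out validation set.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature. T > 1 flattens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]       # same scores as the Dog/Cat/Bird example
p_default = softmax_with_temperature(logits, temperature=1.0)
p_soft = softmax_with_temperature(logits, temperature=2.0)  # hypothetical T

print("T=1.0:", np.round(p_default, 3))
print("T=2.0:", np.round(p_soft, 3))
```

The ranking of classes does not change. Only the confidence attached to the winner drops, which is why temperature scaling affects calibration but not accuracy.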

The Leaky ReLU Fix: What Your Gradients Are Begging For

Dead neurons happen. You train a network, loss plateaus, and half your ReLU units sit at zero forever, contributing nothing. That's the 'dying ReLU' problem: negative inputs get clamped to zero, gradients stop flowing, and those neurons are effectively dead. A ReLU with a slope of exactly zero for negative values means no gradient update, ever.

Enter Leaky ReLU. It replaces that flat zero with a small positive slope, typically 0.01, allowing a tiny gradient to flow even when the input is negative. That small change keeps gradients alive and neurons trainable. Parametric ReLU goes further by making that slope a learnable parameter, so the network can decide how much to leak. You don't have to guess α. The optimizer will learn it.

Why does this matter? Because in deep networks with many layers, dead ReLUs compound. One dead neuron in early layers can kill signal for the entire downstream path. Use Leaky ReLU as your default for hidden layers in regression and classification tasks. The performance gain is small but real, and it eliminates a silent failure mode.

leaky_relu_demo.py · PYTHON
# Package: io.thecodeforge
import torch
import torch.nn as nn

class LeakyNet(nn.Module):
    def __init__(self, alpha=0.01):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.leaky = nn.LeakyReLU(negative_slope=alpha)

    def forward(self, x):
        x = self.leaky(self.fc1(x))
        x = self.leaky(self.fc2(x))
        return self.fc3(x)

model = LeakyNet()
x = torch.randn(32, 784)
print(f"Output shape: {model(x).shape}")
# Gradients flow even for negative activations
Output
Output shape: torch.Size([32, 10])
Production Trap:
Never use a standard ReLU in the output layer. If your target range includes negative values, the output will be capped at zero. Use linear activation for regression, or sigmoid/tanh for bounded outputs.
Key Takeaway
Leaky ReLU prevents dead neurons by letting a trickle of gradient pass through negative inputs. Use it as the default hidden-layer activation in deep networks.
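The difference shows up directly in the local gradients. This is a small NumPy sketch with hand-written derivative helpers (our names, not a library API) and made-up pre-activation values.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, otherwise 0."""
    return (x > 0).astype(np.float64)

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for x > 0, otherwise alpha."""
    return np.where(x > 0, 1.0, alpha)

# Hypothetical pre-activations for four neurons; the first two are negative.
pre_activations = np.array([-3.0, -0.5, 0.2, 4.0])

relu_grads = relu_grad(pre_activations)
leaky_grads = leaky_relu_grad(pre_activations)

print("ReLU gradients:      ", relu_grads)
print("Leaky ReLU gradients:", leaky_grads)
```

With ReLU, the two negative neurons pass no gradient back, so their incoming weights get no update from this batch. With Leaky ReLU they still receive 1% of the signal, which is enough for the optimizer to pull them back into the active region.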

Swish: The Activation That Just Works (And Why It's Not a Gimmick)

Google's Swish activation, x * sigmoid(x), isn't just a trendy ReLU replacement. It's a smooth, non-monotonic function that empirically outperforms ReLU on deep nets, especially with batch normalization. Why? Because it doesn't zero out negative values completely. Instead, it allows a small negative output that can help regularize the network and smooth the loss landscape. The non-monotonic dip near zero is the secret sauce: it provides a gentle 'off ramp' for negative inputs rather than a sharp cutoff. This means gradients are more stable and training is less sensitive to initialization.

In practice, Swish often matches or beats ReLU on ImageNet-scale tasks with no hyperparameter tuning. The trade-off? It's computationally heavier, because sigmoid is expensive. But with modern GPU hardware, the cost is negligible for most architectures. Try swapping ReLU for Swish in your next model and compare validation loss. If you're worried about compute, use its close cousin, Hard Swish, which replaces sigmoid with a piecewise linear approximation. Hard Swish is quantization-friendly and mobile-optimized.

swish_vs_relu.py · PYTHON
# Package: io.thecodeforge
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

class HardSwish(nn.Module):
    def forward(self, x):
        # From MobileNetV3: x * ReLU6(x+3) / 6
        return x * nn.functional.relu6(x + 3) / 6

model_swish = nn.Sequential(
    nn.Linear(784, 256),
    Swish(),
    nn.Linear(256, 10)
)

x = torch.randn(1, 784)
print(f"Swish output: {[round(v, 3) for v in model_swish(x).tolist()[0][:3]]}...")

# Hard Swish is ~15% faster on mobile CPUs
Output
Swish output: [-0.017, 0.023, 0.091]... (illustrative; values vary with random weights and input)
Deep Dive:
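Swish's non-monotonicity is often described as a mild regularizer. For moderately negative inputs the output dips slightly below zero (to roughly -0.28 near x ≈ -1.28) before climbing back toward zero, so negative pre-activations still carry a small signal and a nonzero gradient instead of being clamped flat.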
Key Takeaway
Swish outperforms ReLU on deep architectures by smoothing gradients and reducing sensitivity to initialization. Use Hard Swish for mobile or quantized models.
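You can locate the non-monotonic dip numerically. The NumPy sketch below evaluates Swish and the Hard Swish formula from the listing above on a grid. The grid range and resolution are arbitrary choices for illustration.

```python
import numpy as np

def swish(x):
    """Swish: x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + np.exp(-x))

def hard_swish(x):
    """Hard Swish: x * ReLU6(x + 3) / 6."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

xs = np.linspace(-6.0, 6.0, 12001)   # grid step of 0.001
ys = swish(xs)

i = int(np.argmin(ys))
x_min, y_min = xs[i], ys[i]          # bottom of the dip

# Largest pointwise difference between Swish and Hard Swish on the grid.
max_gap = float(np.max(np.abs(ys - hard_swish(xs))))

print(f"Swish minimum: f({x_min:.3f}) = {y_min:.4f}")
print(f"Max |Swish - Hard Swish| on [-6, 6]: {max_gap:.3f}")
```

To the left of the minimum the output rises back toward zero, which is the non-monotonic part. The gap figure shows how closely the piecewise-linear version tracks the smooth one over this range.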
Production incident · POST-MORTEM · severity: high

Model Degradation After TensorFlow 2.15 Upgrade: NaN Losses and Vanishing Gradients

Symptom
Kubernetes pods restarting with 'CrashLoopBackOff', logs show 'loss = nan', Prometheus metrics show inference latency spikes from 50ms to 2s, model outputs all zeros.
Assumption
Initial assumption was GPU memory leak or data corruption in feature pipeline.
Root cause
Mixed precision policy (tf.keras.mixed_precision.set_global_policy('mixed_float16')) combined with sigmoid activations in hidden layers caused underflow in gradient calculations during backpropagation. The TF 2.15 update changed default gradient scaling behavior for float16 operations.
Fix
1) Rolled back to TF 2.14 immediately via kubectl rollout undo deployment/model-serving. 2) Replaced all sigmoid activations with tf.keras.layers.LeakyReLU(alpha=0.01) in hidden layers. 3) Added explicit gradient scaling: optimizer = tf.keras.optimizers.Adam(learning_rate=0.001); optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer). 4) Added validation: assert not tf.math.reduce_any(tf.math.is_nan(loss)), 'NaN loss detected'.
Key lesson
  • Never use sigmoid/tanh in deep network hidden layers - they saturate and cause vanishing gradients
  • Always validate mixed precision configurations with gradient scaling when using float16
  • Test activation function outputs range: assert tf.reduce_max(activations) < 100.0
  • Monitor gradient norms during training: tf.summary.scalar('grad_norm', tf.linalg.global_norm(gradients))
  • Deploy model updates with canary releases: kubectl set image deployment/model-serving model=image:v2 --record
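The float16 underflow behind this incident can be reproduced in a few lines of NumPy. This is a sketch of the mechanism only, with a made-up gradient value and a fixed, hypothetical loss scale. Keras's LossScaleOptimizer manages the scale for you, so this is not a description of its internals.

```python
import numpy as np

# Hypothetical tiny gradient, e.g. from a saturated sigmoid layer.
true_grad = 1e-8

# float16 cannot represent values this small: the result flushes to zero.
as_fp16 = np.float16(true_grad)

# Loss scaling: multiply before the float16 cast, divide afterwards in float32.
loss_scale = 1024.0  # hypothetical static scale, for illustration only
scaled_fp16 = np.float16(true_grad * loss_scale)
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)

print(f"float16 without scaling: {float(as_fp16)}")
print(f"float16 with scaling:    {float(scaled_fp16):.3e}")
print(f"recovered in float32:    {float(recovered):.3e}")
```

Without scaling, the gradient is exactly zero and the weight never moves. With scaling, it survives the float16 round trip with a small rounding error.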
Production debug guide: Symptom → Action for training failures and inference issues (4 entries)
Symptom · 01
Training loss plateaus or diverges to NaN after 1000 steps
Fix
Check gradient flow: add gradient norm logging tf.print('Grad norm:', tf.linalg.global_norm(gradients)). Visualize activation distributions: tf.summary.histogram('layer_output', activations). Disable mixed precision temporarily: tf.keras.mixed_precision.set_global_policy('float32'). Add gradient clipping: optimizer = tf.keras.optimizers.Adam(clipvalue=1.0).
Symptom · 02
Model outputs all zeros or constant values during inference
Fix
Check for dead ReLU neurons — inspect weight norms. Replace ReLU with LeakyReLU(alpha=0.01). Verify weight initializer is HeNormal or GlorotUniform. Restart the serving pod after model reload.
Symptom · 03
Sigmoid activations causing training stall in hidden layers
Fix
Swap all hidden-layer sigmoid activations to ReLU or LeakyReLU. Gradient norms near zero confirm vanishing gradients; check the grad_norm scalar you log with tf.summary.scalar. Reserve sigmoid for the output layer of binary classifiers.
Symptom · 04
Softmax outputs showing extreme overconfidence on wrong predictions
Fix
Apply temperature scaling post-training: logits / temperature before softmax. Use label smoothing during training: tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1). Do not treat raw softmax outputs as calibrated probabilities for risk decisions.
Activation Function Quick Debug: Fast triage for dead neurons, NaN losses, and inference failures
Model outputs all zeros during inference
Immediate action
Check for dead ReLU neurons and weight initialization
Commands
docker exec -it model-serving python -c "import tensorflow as tf; model = tf.keras.models.load_model('/models/production'); print(tf.reduce_sum(tf.abs(model.layers[2].weights[0])).numpy())"
kubectl logs deployment/model-serving --tail=100 | grep -E '(nan|inf|zero|gradient)'
Fix now
Change kernel_initializer to he_normal: tf.keras.layers.Dense(64, kernel_initializer='he_normal'). Replace ReLU with LeakyReLU: tf.keras.layers.LeakyReLU(alpha=0.01). Restart: kubectl rollout restart deployment/model-serving.
Loss = NaN after mixed precision upgrade
Immediate action
Disable float16 policy and re-enable explicit loss scaling
Commands
kubectl rollout undo deployment/model-serving
grep -r 'mixed_float16\|float16' src/ --include='*.py'
Fix now
Add LossScaleOptimizer: optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(learning_rate=0.001)). Replace sigmoid in hidden layers with LeakyReLU. Add NaN guard: assert not tf.math.reduce_any(tf.math.is_nan(loss)).
Training loss decreasing but validation accuracy stuck
Immediate action
Check activation outputs for saturation across all hidden layers
Commands
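In a Python session where model and a sample batch x are already loaded: hooks = [m.register_forward_hook(lambda mod, inp, out, n=n: print(f'{n}: mean={out.mean():.4f} std={out.std():.4f}')) for n, m in model.named_modules() if hasattr(m, 'weight')]; model.eval(); model(x); [h.remove() for h in hooks]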
tensorboard --logdir=runs/ --port=6006
Fix now
Add batch normalization before each activation: nn.Sequential(nn.Linear(in,out), nn.BatchNorm1d(out), nn.ReLU()). Reduce learning rate by 10x and retrain.
Activation Functions
Function | Output Range | Best Use Case | Major Drawback | Gradient Behavior
ReLU | [0, ∞) | Hidden layers in CNNs/MLPs | Dying ReLU (zero gradients for negatives) | Flat for x<0, constant 1 for x>0
Leaky ReLU | (-∞, ∞) | When Dying ReLU is a concern | Extra hyperparameter (α) | Small α for x<0, 1 for x>0
Sigmoid | (0, 1) | Binary classification output | Vanishing gradients at tails | Small when |x| is large
Tanh | (-1, 1) | RNN hidden states | Vanishing gradients | Small when |x| is large
Softmax | (0, 1), sum = 1 | Multi-class output layer | Computationally heavy, sensitive to outliers | Depends on all inputs
Linear | (-∞, ∞) | Regression output | No non-linearity | Constant 1

Key takeaways

1. No non-linearity, no deep learning: linear activations collapse networks to single-layer models.
2. ReLU's simplicity and sparsity make it the default, but watch for dead neurons in deep nets.
3. Softmax forces a probability distribution; sigmoid doesn't. Pick based on problem structure.
4. Initialization isn't an afterthought; it dictates whether your network trains or dies.
5. Numerical stability isn't optional: always implement softmax with max subtraction.
6. Dying ReLU and exploding gradients stem from weight dynamics, not just activation choice.
7. Output layer activation is dictated by your loss function and task type.
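The max-subtraction point is worth seeing fail once. The sketch below compares a naive softmax with a shifted one on deliberately large logits. Both function names are ours, and the logit values are chosen only to trigger overflow.

```python
import numpy as np

def naive_softmax(x):
    """Direct formula: overflows once exp(x) exceeds the float64 range."""
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    """Subtract the max first; the result is mathematically identical."""
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings so the NaN result is easy to read.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big)

stable = stable_softmax(big)

print("naive: ", naive)
print("stable:", np.round(stable, 4))
```

Subtracting the maximum makes the largest exponent exactly zero, so exp never overflows. Softmax depends only on differences between logits, so the probabilities are unchanged.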

Common mistakes to avoid

6 patterns
×

Using Sigmoid in hidden layers of deep networks

Symptom
Training loss decreases very slowly then plateaus, gradients approach zero (norm < 1e-7), model accuracy stuck at random chance
Fix
Replace hidden-layer tf.keras.activations.sigmoid with tf.keras.layers.LeakyReLU(alpha=0.01) or tf.keras.layers.ReLU(); keep sigmoid only on a binary output layer
×

Forgetting that Softmax requires the dimension (dim) to be specified in PyTorch

Symptom
CrossEntropyLoss returns NaN or extremely large values, model outputs don't sum to 1.0 across classes
Fix
Always specify the class dimension: torch.nn.functional.softmax(logits, dim=-1), which is the same as dim=1 for a (batch, classes) tensor. Pass raw logits, not softmax outputs, to nn.CrossEntropyLoss.
×

Applying an activation function to the input data before the first linear layer

Symptom
Model cannot learn simple linear relationships, training loss oscillates wildly, feature importance scores show zero variance
Fix
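Feed the features straight into the first Dense layer and apply the activation after it: model = tf.keras.Sequential([tf.keras.layers.Input(shape=(features,)), tf.keras.layers.Dense(64, activation='relu')]) # no activation applied to the inputs themselves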
×

Using ReLU in the output layer for a probability task

Symptom
Model outputs values > 1.0, probability predictions sum to > 100%, downstream services throw 'probability out of range' errors
Fix
Use appropriate output activation: tf.keras.layers.Dense(1, activation='sigmoid') for binary, tf.keras.layers.Dense(classes, activation='softmax') for multi-class
×

Not setting PyTorch manual seed for reproducibility

Symptom
Training results differ between runs with same hyperparameters, model performance varies by ±2% accuracy across identical experiments
Fix
Set all random seeds at start: torch.manual_seed(42); torch.cuda.manual_seed_all(42); np.random.seed(42); random.seed(42)
×

Leaving Adam's learning rate at the default without tuning

Symptom
Training converges extremely slowly (100+ epochs for simple tasks), loss oscillates without decreasing, requires excessive compute
Fix
Treat the default as a starting point and tune it: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) # 1e-3 is PyTorch's default; adjust based on validation loss
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Walk me through the mathematical proof of why a multi-layer neural netwo...
Q02 · SENIOR
What is the 'Dying ReLU' problem? Under what conditions does it occur, a...
Q03 · SENIOR
Explain the 'Exploding Gradient' problem. Does changing the activation f...
Q04 · JUNIOR
In a multi-class classification problem with 1,000 classes, why is Softm...
Q05 · SENIOR
LeetCode Style: Implement a numerically stable Softmax function in Pytho...
Q01 of 05 · SENIOR

Walk me through the mathematical proof of why a multi-layer neural network with only linear activation functions is equivalent to a single-layer perceptron.

ANSWER
Let's say we have a 2-layer network with linear activation f(x) = x. Layer 1: z₁ = W₁x + b₁, a₁ = z₁. Layer 2: z₂ = W₂a₁ + b₂, a₂ = z₂. Substituting: a₂ = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). That's just W'x + b' where W' = W₂W₁ and b' = W₂b₁ + b₂. So it's still linear! Any deeper network collapses to a single linear transform. Without non-linearities, depth adds zero expressive power.
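The algebra above can be checked numerically. This sketch uses random weights with arbitrary shapes (3 inputs, 4 hidden units, 2 outputs, all chosen for illustration) and confirms that the two-layer linear network equals the collapsed single layer W'x + b'.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary shapes for illustration: 3 inputs -> 4 hidden -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two layers with linear activation f(x) = x.
two_layer = W2 @ (W1 @ x + b1) + b2

# The collapsed single layer from the derivation.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print("two-layer output:", two_layer)
print("one-layer output:", one_layer)
print("identical:", np.allclose(two_layer, one_layer))
```

Put a ReLU between the layers, for example np.maximum(0, W1 @ x + b1), and no single (W', b') pair reproduces the network for every input.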
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What's the practical difference between Tanh and Sigmoid?
02
When should I use Swish or GELU over ReLU?
03
Can activation functions cause overfitting?
04
Why don't we use Softmax in hidden layers?