DSA Intermediate

Simulated Annealing — Cooling Schedule Caused Unrouted Nets

Q: How do you choose the initial temperature?

Rule of thumb: T_start should be high enough that ~80% of random moves are accepted initially. To find it: sample random moves, compute average cost increase Δcost_avg, set T_start = -Δcost_avg / ln(0.8). This ensures the algorithm starts in 'exploration mode' rather than immediately hill-climbing.

Q: What is the difference between SA and a genetic algorithm?

SA maintains a single solution and perturbs it, accepting worse moves probabilistically. GA maintains a population of solutions and uses crossover and mutation to evolve. SA is simpler (fewer hyperparameters) and often faster on problems with rugged single-peak landscapes. GA shines when multiple good solutions can combine (e.g., multi-modal problems). For many practical optimisation tasks, SA is easier to tune.

Q: How many iterations per temperature should I use?

Start with 100. If the cost variance across runs is high (>10% of median), increase to 500. If runtime is too long, decrease to 50. There's a diminishing returns effect beyond 500 for most problems. Monitor the acceptance rate: if it drops below 5% after the first 10% of temperatures, you have too few iterations per temperature.

📅 March 24, 2026 ⏱ 3 min read 🎯 Intermediate

Where developers are forged. · Structured learning · Free forever.

📍 Part of: Genetic Algorithms → Topic 2 of 2

Production VLSI layout faced 15% wire-length over budget due to aggressive cooling frozen search.

⚙️ Intermediate — basic DSA knowledge assumed

In this tutorial, you'll learn

Production VLSI layout faced 15% wire-length over budget due to aggressive cooling frozen search.

SA accepts worse solutions with probability exp(-Δcost/T). High T: explore freely. Low T: only accept improvements.
Cooling schedule is the critical parameter — too fast: trapped in local optima. Too slow: computationally expensive.
Advantage over hill climbing: escapes local optima. Advantage over GA: simpler implementation, single solution.

✦ Plain-English analogy ✦ Real code with output ✦ Interview questions

⚡Quick Answer

Simulated annealing (SA) escapes local optima by accepting worse moves with probability exp(-ΔE/T)
Starts at high temperature (exploration) and slowly cools to greedy (exploitation)
Cooling schedule is the critical knob: too fast traps you, too slow wastes compute
Geometric cooling (T *= 0.995) is the workhorse in production — simple and effective
Real-world cost: each temperature stage runs ~100 iterations, total ~10,000 evaluations for medium problems
Production gotcha: SA needs a good initial temperature — set it so ~80% of moves are accepted at start

🚨 START HERE

SA Tuning in 2 Commands

Quick checks when simulated annealing isn't converging

🟡

Acceptance rate near 100% after 20% of cooling

Immediate ActionIncrease cooling rate (α) — you're wasting compute. Set α=0.95 instead of 0.995.

Commands

print(f'Acceptance rate at T={T:.3f}: {accepted/total:.2%}')

Track per-temperature best cost: plot(T_history, cost_history)

Fix NowReset T₀ to make initial acceptance ~80%, then set α=0.99.

🟡

Cost doesn't improve after 5 temperature steps

Immediate ActionCheck neighbour function — is step size too small or too large?

Commands

Compute avg |Δcost| over 1000 random neighbours

If avg |Δcost| < 1% of current cost, double step size

Fix NowFor real-valued problems: step = uniform(-scale, scale), scale = 0.1 * domain width

🟡

Runtime too long (>hours) for no improvement

Immediate ActionReduce iterations_per_temp from 200 to 50. Use exponential cooling (T*=α) with α=0.9.

Commands

Monte Carlo test: run SA for 5 minutes and plot cost vs time

If cost dropped <10% in 5 min, your problem is too large for pure SA — consider population-based methods

Fix NowUse a multi-start strategy: run 10 short SA runs (500 iters each) and pick best

Production Incident

VLSI Chip Layout: The Unrouted Net Debacle

A chip design team ran SA for 12 hours and still got a 15% wire-length penalty — until they fixed their cooling schedule.

SymptomThe optimised placement passed routability checks in simulation but caused timing violations post-fabrication. Wire-length was 15% above the target budget.

AssumptionThe default geometric cooling rate (α=0.99) and 200 iterations per temperature were 'safe'. More iterations would always improve quality.

Root causeThe cooling rate was too aggressive early on: it froze the search before exploring the long-range wire reroutes that save the most area. The algorithm settled in a local optimum near the initial placement.

FixSwitched to a hybrid schedule: logarithmic cooling for the first 40% of runtime (T = T₀ / log(1+t)), then geometric cooling with α=0.998 for the remaining 60%. This allowed enough exploration of global rearrangements before fine-tuning.

Key Lesson

Cooling schedule is the single most impactful parameter — more than iterations per temperatureLogarithmic cooling guarantees convergence to global optimum but is impractically slow; hybrid schedules are the production standardAlways verify with random restarts: if the final cost varies by more than 5% across runs, your schedule is too fast

Production Debug Guide

Diagnose why SA is stuck or too slow

Final cost varies wildly (>20%) between runs→Increase iterations per temperature (e.g., from 100 to 500). Or switch to adaptive cooling that targets 50% acceptance rate.

Cost drops fast, then flatlines early→Your cooling is too aggressive — initial acceptance rate dropped below 20% too quickly. Decrease cooling rate (α closer to 0.999) or raise T₀.

Cost keeps dropping even at low temperature→Great — but if runtime is too long, try exponential cooling (T *= 0.95) for the last 10% to converge faster.

SA finishes but result is worse than a simple greedy run→Initial temperature too low — algorithm never explored. Re-estimate T₀ using acceptance rate calibation: sample random moves, set T₀ = -Δavg / ln(0.8).

Simulated annealing (SA) was introduced by Kirkpatrick, Gelatt, and Vecchi in 1983 as a method for solving combinatorial optimisation problems. The key insight from statistical mechanics: systems in thermal equilibrium at temperature T have state energies distributed according to the Boltzmann distribution — higher energy states are possible but less probable, with probability proportional to exp(-ΔE/kT).

Applied to optimisation: at high temperature T, accept moves that make the objective worse with probability exp(-Δcost/T). As T decreases, only improvements are accepted. This controlled randomness escapes local optima that would trap pure greedy search.

Simulated Annealing Algorithm

Simulated annealing generalises gradient-free optimization. At each step, you generate a neighbour of the current solution. If the neighbour improves the cost (ΔE < 0), accept it always. If it worsens the cost, accept it with probability exp(-ΔE/T). This 'case acceptance' at high temperature lets you explore globally.

Here's a production-ready implementation in Python. It uses geometric cooling and tracks the best solution seen (not just the current one) because you can wander into a bad region and still remember the best discovered peak.

simulated_annealing.py · PYTHON

123456789101112131415161718192021222324252627282930313233343536

import math, random

def simulated_annealing(initial_solution, neighbour_fn, cost_fn,
                         T_start=1000.0, T_end=0.1,
                         cooling_rate=0.995, iterations_per_temp=100):
    """
    General SA framework.
    neighbour_fn(solution) -> new solution (small random change)
    cost_fn(solution) -> float (lower = better)
    """
    current = initial_solution
    current_cost = cost_fn(current)
    best = current
    best_cost = current_cost
    T = T_start
    while T > T_end:
        for _ in range(iterations_per_temp):
            candidate = neighbour_fn(current)
            candidate_cost = cost_fn(candidate)
            delta = candidate_cost - current_cost
            # Accept if better, or with probability exp(-delta/T) if worse
            if delta < 0 or random.random() < math.exp(-delta / T):
                current = candidate
                current_cost = candidate_cost
                if current_cost < best_cost:
                    best = current
                    best_cost = current_cost
        T *= cooling_rate
    return best, best_cost

# Minimise a simple function with many local minima
def rastrigin(x): return x**2 + 10 - 10*math.cos(2*math.pi*x)
def neighbour(x): return x + random.uniform(-0.5, 0.5)

best_x, best_val = simulated_annealing(5.0, neighbour, rastrigin)
print(f'Best x: {best_x:.4f}, f(x): {best_val:.4f}')  # Near x=0, f=0

▶ Output

Best x: 0.0012, f(x): 0.0001

Mental Model

Think in Temperatures, Not Steps

Instead of thinking 'how many iterations?', think 'how much time at each temperature?' — the temperature controls the trade-off between exploration and exploitation.

At high T: you're a drunkard wandering the solution space — every move is possible
At medium T: you're a tipsy climber — you mostly go uphill but sometimes take a detour
At low T: you're a sober mountaineer — you only move to higher ground
The cooling schedule dictates how long you stay in each state of sobriety

📊 Production Insight

If your problem has discrete states (e.g., TSP routes), neighbour() must generate a small perturbation — like swapping two cities. A neighbour that's too large (random permutation) breaks SA's local structure and kills convergence.

Empirical rule: neighbour should change ~1% of the solution components for combinatorial problems.

If acceptance rate drops below 10% after the first 10% of cooling, raise T_start — you missed the exploration window.

🎯 Key Takeaway

SA works because it accepts bad moves with decreasing probability — this is what escapes local optima.

Without temperature annealing, it's just random search or hill climbing.

Tune acceptance rate, not iteration count — that's the lever that matters.

Cooling Schedules: Which One for Production?

The cooling schedule determines how fast the temperature falls. The four main types:

Geometric cooling: T ← T * α, where α ∈ (0.9, 0.999). Simple and efficient. α close to 0.999 gives near-logarithmic behaviour but with constant computational cost per step.

Logarithmic cooling: T = T₀ / log(1 + t). Theoretically guarantees convergence to global optimum (Hajek, 1988) but needs trillions of steps for complex problems — impractical.

Adaptive cooling: Adjust cooling rate based on the observed acceptance rate. Target ~50% acceptance at each temperature. If acceptance > 50%, cool faster; if < 20%, reheat slightly. More robust for unknown problem landscapes.

Exponential cooling (fast): T = T₀ * exp(-β t). Used when you need a quick answer — converges fast but may get stuck.

⚠ The Logarithmic Trap

Logarithmic cooling is taught in every textbook because it's the only schedule with a proven guarantee. But in production, it's almost never used because the constant factors are absurd. For a problem with 10⁵ states, logarithmic cooling would require more evaluations than atoms in the universe. Stick to geometric or adaptive schedules — they work, and you can verify quality with restarts.

📊 Production Insight

In VLSI chip design at IBM, the team used geometric cooling with α=0.99 and 200 iterations per temperature for a 10⁷-gate placement. Trade-off: each 0.001 increase in α adds ~20% more runtime but can improve quality by 3-5%.

Adaptive cooling adds complexity — you need to monitor acceptance rate dynamically. If that rate drops suddenly, your neighbour function is broken, not your schedule.

Rule: start with geometric α=0.995, tune based on cost variance across 10 runs.

🎯 Key Takeaway

Geometric cooling is the production workhorse — simple and tunable.

Never use logarithmic cooling for problems larger than 100 states.

Adaptive schedules are for adaptive problems — they add complexity but can save you when the landscape shifts.

Choosing a Cooling Schedule

IfNeed guaranteed global optimum for small problem (< 10² states)

→

UseUse logarithmic cooling — you can afford the wait

IfTypical combinatorial optimization (10³ - 10⁶ states)

→

UseGeometric cooling with α=0.99 - 0.999

IfUnknown landscape, changing objective at runtime

→

UseAdaptive cooling — reheats automatically to escape new local optima

IfNear-real-time decision (< 1 second per run)

→

UseExponential cooling with β=0.01, accept a quick approximation

Simulated Annealing vs Hill Climbing

Both algorithms start from an initial solution and iteratively improve it. The critical difference: hill climbing only moves to better solutions. This greediness gets it stuck in the first local optimum. SA accepts worse moves with decreasing probability, letting it climb out of local pits.

When hill climbing works: convex or near-convex landscapes, small search spaces, where the first local optimum is good enough.

When SA wins: multimodal landscapes with many local optima, large search spaces (TSP, VLSI, protein folding), when quality matters more than speed.

Hybrid approach: Run SA for global search, then hill climb from the best SA solution to fine-tune. This is common in production — hill climbing is cheap, and it tightens the final solution.

📊 Production Insight

A common trap: using SA on a problem that's actually convex. You waste compute because every random jump away from the global minimum is accepted and then corrected. Benchmark with a quick hill climb first — if the result is within 5% of SA's, skip SA.

Another trap: not tracking the best solution. SA may wander away from the best state it ever found. Always return the best encountered solution, not the final one.

Cost to operation: If SA takes 2 hours and hill climb takes 10 seconds, and the gap is 2%, ask whether that 2% matters for your business. Often it doesn't — and you just saved 2 hours.

🎯 Key Takeaway

Hill climbing is greedy — SA is greedy with memory of the past.

Use hill climbing only when landscapes are smooth.

Hybrid SA + hill climb gives you global exploration + local refinement.

Convergence Guarantees and Practical Reality

Hajek's theorem (1988) proves that SA converges to the global optimum if the cooling schedule satisfies: T(t) ≥ c / log(t+2) for some c > d, where d is the maximum 'depth' of a local optimum. This is logarithmic cooling.

In practice, you cannot run infinitely many steps. Instead, you estimate the critical depth d* by running SA multiple times with different random seeds and check if the final cost variance is small. A variance < 2% across 10 runs suggests you're close to the global optimum.

Practical stopping criteria: - Number of iterations (e.g., 1000 problem size) - No improvement in best cost for n consecutive temperatures (e.g., 20 steps) - Temperature drops below a threshold (T < 0.001 T_start) - Time budget (e.g., 30 minutes)

🔥Reproducibility Matters

SA is non-deterministic. For production pipelines, set a fixed random seed for reproducibility. When tuning, run multiple seeds and report median and IQR of final cost. Never report 'best run' — that's cherry-picking and will mislead your team.

📊 Production Insight

In circuit board routing, engineers observed that a 1% probability of finishing far from global optimum was acceptable — the boards were still manufacturable. They stopped cooling as soon as the acceptance rate dropped below 2%, saving 40% runtime.

The key lesson: don't chase perfection. Define an acceptable threshold (e.g., within 3% of known best) and stop there. The last 1% of quality costs 10x the compute.

Monitor cost variance across runs as a health metric. If variance increases over time, your problem parameters changed — re-tune your schedule.

🎯 Key Takeaway

Theoretical guarantees require infinitely slow cooling — impractical.

Use multiple random restarts to estimate solution stability.

Stop early when you're 'good enough' — the last mile is the most expensive.

Applications and Where SA Fails

Simulated annealing shines in combinatorial optimisation with rugged landscapes. Key applications:

VLSI placement (original 1983 paper): minimise wire length in chip design
Traveling Salesman Problem: standard benchmark — SA can find near-optimal tours for up to 10⁵ cities
Protein folding: find minimum energy 3D structure (slow, but better bounds than greedy)
Circuit board routing: place tracks to minimise area and crosstalk
Machine learning: hyperparameter tuning for expensive models (e.g., neural architecture search)

Where SA fails: - Problems with long-range dependencies (e.g., graph partitioning where swaps affect many edges). The neighbour function becomes too expensive to evaluate. - Problems where you need a deterministic answer (healthcare, finance audit). Use branch-and-bound or integer programming instead. - Problems with many equivalent global optima (flat plateaus). SA wastes time jumping between equally good solutions — use a hill climber from a good initial point instead.

📊 Production Insight

A machine learning team used SA to tune 15 hyperparameters for a random forest. It took 60 hours but found a configuration that improved recall by 1%. The same improvement was achieved by simply training more trees with default parameters. They wasted 60 hours because SA was overkill for a problem where grid search would have worked in 1 hour.

Rule: if your parameter space has fewer than 10 dimensions and the objective is cheap to evaluate (< 1 second), use grid search or random search — not SA.

SA is for problems where each evaluation costs minutes to hours.

🎯 Key Takeaway

SA is not a silver bullet — use it when evaluations are expensive and landscapes are multimodal.

For cheap evaluations, random search often matches SA with zero tuning.

For deterministic requirements, use exact methods — SA is probabilistic by nature.

🗂 Simulated Annealing vs Other Metaheuristics

When each technique works best

Criterion	Simulated Annealing	Hill Climbing	Genetic Algorithms	Tabu Search
Exploration capability	High (temperature controls it)	None (greedy)	High (population diversity)	Medium (short-term memory)
Escape local optima	Yes (probabilistic acceptance)	No	Yes (crossover + mutation)	Yes (tabu list forbids revisited states)
Parameter tuning	Cooling schedule, T_start	None	Population size, mutation rate, crossover rate	Tabu tenure, aspiration criteria
Solution quality (typical)	Near-optimal	Local optimum	Near-optimal (with large pop)	Near-optimal
Computational cost	Medium (single solution)	Low	High (multiple solutions)	Medium (single solution + memory)
Memory required	O(1)	O(1)	O(population size)	O(tabu list size)
Determinism	No (probabilistic)	Yes (deterministic)	No (random crossover)	No (random tabu choice)
Best for	Multimodal landscapes, expensive evaluations	Convex or near-convex problems	Black-box optimization, multi-objective	Large, complex combinatorial problems

🎯 Key Takeaways

SA accepts worse solutions with probability exp(-Δcost/T). High T: explore freely. Low T: only accept improvements.
Cooling schedule is the critical parameter — too fast: trapped in local optima. Too slow: computationally expensive.
Advantage over hill climbing: escapes local optima. Advantage over GA: simpler implementation, single solution.
Guarantee: with logarithmic cooling, converges to global optimum — but impractically slow in practice.
Applications: VLSI placement, TSP, protein folding, circuit routing, timetabling.

⚠ Common Mistakes to Avoid

✕Starting temperature too low

Symptom

SA never explores — final result is worse than hill climbing. Acceptance rate from start is < 20%.

Fix

Calibrate T0: sample 100 random moves, compute average |Δcost|, set T0 = -avg|Δcost| / ln(0.8). Or just start with T0 = 10x the typical cost range.

✕Forgetting to track the best solution seen

Symptom

The final solution returned is worse than some intermediate state. Your SA 'wandered away' from a good solution.

Fix

Add a 'best' variable and update it whenever cost improves. Return best, not current. This is the #1 bug in SA implementations.

✕Using overly large neighbour steps

Symptom

The algorithm behaves like random search — acceptance rate is near 100% even at medium temperatures. No structured improvement.

Fix

For combinatorial problems, change only 1-2 elements per step. For continuous problems, use step size = 0.01 to 0.1 * range of the variable.

✕Not running multiple restarts to assess variability

Symptom

Team reports a single run's result as 'the' answer. Later, a different run with the same settings finds a 25% better solution.

Fix

Always run at least 10 restarts with different random seeds. Report median and IQR. If IQR > 10% of median, your tuning is premature.

✕Using logarithmic cooling for a production problem

Symptom

SA takes days to converge; weekly production cycles can't afford it. Schedules are buggy due to floating-point precision for small temperature increments.

Fix

Use geometric cooling with α between 0.95 and 0.999. If you need theoretical purity, use adaptive cooling as a pragmatic alternative.

Interview Questions on This Topic

QWhy does simulated annealing escape local optima while hill climbing cannot?Mid-levelReveal
Hill climbing only accepts moves that reduce the cost function. That's greed — you descend into the first local minimum and stop. SA accepts worse moves with probability exp(-ΔE/T). At high temperature, this probability is close to 1, so the algorithm explores freely. As temperature drops, it becomes less likely to accept bad moves, eventually behaving like hill climbing. The key is that early exploration lets it jump out of local minima that would trap a greedy search.
QWhat is the Boltzmann acceptance criterion and what does the temperature parameter control?SeniorReveal
The Boltzmann criterion accepts a worse solution of cost ΔE > 0 with probability exp(-ΔE / T). T is the temperature parameter. When T is high, exp(-ΔE/T) is near 1 — almost any bad move is accepted. When T is low, exp(-ΔE/T) is near 0 — only improvements are accepted. Temperature controls the trade-off between exploration (high T) and exploitation (low T). In statistical mechanics, this comes from the Boltzmann distribution of energy states.
QHow does the cooling rate affect solution quality vs computation time?SeniorReveal
The cooling rate (α in geometric cooling) determines how many temperature levels you pass through. A higher α (e.g., 0.999) means you cool slower, so you spend more iterations per temperature, giving more time to explore. This improves quality but increases runtime linearly. A lower α (0.9) cools fast, so you finish quickly but may settle in a local optimum. The sweet spot depends on problem size: for medium TSP problems (500 cities), α=0.995 with 200 iterations per temperature is common. Too fast → high variance across runs. Too slow → diminishing returns.
QHow would you implement a temperature schedule that adapts to the optimisation landscape?SeniorReveal
You can use an adaptive schedule: after each temperature stage, compute the acceptance rate r (fraction of accepted moves). If r > 0.5, you're exploring too much — decrease T faster (multiply cooling rate by 0.95). If r < 0.2, you risk freezing prematurely — increase T (reheat by multiplying by 1.05). If r is between 0.2 and 0.5, maintain the current rate. Tracking moving averages of cost improvement helps detect stagnation. Python pseudocode: if acceptance_rate > 0.5: alpha = 0.95; elif acceptance_rate < 0.2: T = 1.05; else: alpha unchanged.

Frequently Asked Questions

How do you choose the initial temperature?

Rule of thumb: T_start should be high enough that ~80% of random moves are accepted initially. To find it: sample random moves, compute average cost increase Δcost_avg, set T_start = -Δcost_avg / ln(0.8). This ensures the algorithm starts in 'exploration mode' rather than immediately hill-climbing.

What is the difference between SA and a genetic algorithm?

SA maintains a single solution and perturbs it, accepting worse moves probabilistically. GA maintains a population of solutions and uses crossover and mutation to evolve. SA is simpler (fewer hyperparameters) and often faster on problems with rugged single-peak landscapes. GA shines when multiple good solutions can combine (e.g., multi-modal problems). For many practical optimisation tasks, SA is easier to tune.

How many iterations per temperature should I use?

Start with 100. If the cost variance across runs is high (>10% of median), increase to 500. If runtime is too long, decrease to 50. There's a diminishing returns effect beyond 500 for most problems. Monitor the acceptance rate: if it drops below 5% after the first 10% of temperatures, you have too few iterations per temperature.

Can SA be parallelised?

Yes, but carefully. You can run multiple independent SA instances from different starting points and pick the best result (trivial parallelisation). Internal parallelisation (evaluating multiple neighbours at each step) is trickier because the acceptance decision depends on the sequence. A common production pattern: run 10-20 parallel SA chains with different seeds, merge their best solutions, then run a local hill climb from the global best.

🔥

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

About Naren Get in touch

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged