Simulated Annealing — Cooling Schedule Caused Unrouted Nets
- Simulated annealing (SA) escapes local optima by accepting worse moves with probability exp(-ΔE/T)
- Starts at high temperature (exploration) and slowly cools to greedy (exploitation)
- Cooling schedule is the critical knob: too fast traps you, too slow wastes compute
- Geometric cooling (T *= 0.995) is the workhorse in production — simple and effective
- Real-world cost: each temperature stage runs ~100 iterations, total ~10,000 evaluations for medium problems
- Production gotcha: SA needs a good initial temperature — set it so ~80% of moves are accepted at start
Production Debug Guide: Diagnose why SA is stuck or too slow

| Symptom | Diagnosis and fix |
|---|---|
| Acceptance rate near 100% after 20% of cooling | Print it per stage: `print(f'Acceptance rate at T={T:.3f}: {accepted/total:.2%}')`. Track per-temperature best cost: `plot(T_history, cost_history)` |
| Cost doesn't improve after 5 temperature steps | Compute avg \|Δcost\| over 1000 random neighbours. If avg \|Δcost\| < 1% of current cost, double the neighbour step size |
| Runtime too long (hours) with no improvement | Run SA for 5 minutes and plot cost vs time. If cost dropped <10% in 5 minutes, your problem is too large for pure SA; consider population-based methods |
Simulated annealing (SA) was introduced by Kirkpatrick, Gelatt, and Vecchi in 1983 as a method for solving combinatorial optimisation problems. The key insight from statistical mechanics: systems in thermal equilibrium at temperature T have state energies distributed according to the Boltzmann distribution — higher energy states are possible but less probable, with probability proportional to exp(-ΔE/kT).
Applied to optimisation: at high temperature T, accept moves that make the objective worse with probability exp(-Δcost/T). As T decreases, only improvements are accepted. This controlled randomness escapes local optima that would trap pure greedy search.
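To make the acceptance rule concrete, here is a quick numeric check (the +10 cost increase and the three temperatures are illustrative values, not from any benchmark):

```python
import math

def accept_probability(delta, T):
    # Better moves (delta < 0) are always accepted; worse moves are
    # accepted with the Boltzmann probability exp(-delta / T).
    return 1.0 if delta < 0 else math.exp(-delta / T)

# The same +10 cost increase at three temperatures:
for T in (100.0, 10.0, 1.0):
    print(f'T={T}: p={accept_probability(10.0, T):.3f}')
# Hot: the move is almost always taken. Cold: almost never.
```

The same worsening move goes from being accepted ~90% of the time at T=100 to essentially never at T=1, which is exactly the exploration-to-exploitation transition the cooling schedule controls.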
Simulated Annealing Algorithm
Simulated annealing is gradient-free local search with a probabilistic twist. At each step, you generate a neighbour of the current solution. If the neighbour improves the cost (ΔE < 0), always accept it. If it worsens the cost, accept it with probability exp(-ΔE/T). This probabilistic acceptance at high temperature lets you explore globally.
Here's a production-ready implementation in Python. It uses geometric cooling and tracks the best solution seen (not just the current one), because the walk can wander into a bad region and you still want to remember the best solution discovered so far.
```python
import math, random

def simulated_annealing(initial_solution, neighbour_fn, cost_fn,
                        T_start=1000.0, T_end=0.1,
                        cooling_rate=0.995, iterations_per_temp=100):
    """
    General SA framework.
    neighbour_fn(solution) -> new solution (small random change)
    cost_fn(solution)      -> float (lower = better)
    """
    current = initial_solution
    current_cost = cost_fn(current)
    best, best_cost = current, current_cost
    T = T_start
    while T > T_end:
        for _ in range(iterations_per_temp):
            candidate = neighbour_fn(current)
            candidate_cost = cost_fn(candidate)
            delta = candidate_cost - current_cost
            # Accept if better, or with probability exp(-delta/T) if worse
            if delta < 0 or random.random() < math.exp(-delta / T):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        T *= cooling_rate
    return best, best_cost

# Minimise a simple function with many local minima (1-D Rastrigin)
def rastrigin(x):
    return x**2 + 10 - 10*math.cos(2*math.pi*x)

def neighbour(x):
    return x + random.uniform(-0.5, 0.5)

best_x, best_val = simulated_annealing(5.0, neighbour, rastrigin)
print(f'Best x: {best_x:.4f}, f(x): {best_val:.4f}')  # Near x=0, f=0
```
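The `neighbour_fn` is the other half of the design. For combinatorial problems the canonical small move is a swap; here is a sketch for TSP-style tours (the random coordinate setup and the `tour_length` helper are illustrative choices, not from any library):

```python
import math
import random

def tour_length(tour, coords):
    # Total length of the closed tour: an illustrative cost_fn for SA.
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def swap_neighbour(tour):
    # Small perturbation: swap two random cities, leave the rest intact.
    i, j = random.sample(range(len(tour)), 2)
    new_tour = list(tour)
    new_tour[i], new_tour[j] = new_tour[j], new_tour[i]
    return new_tour

random.seed(0)
coords = [(random.random(), random.random()) for _ in range(20)]
tour = list(range(20))
candidate = swap_neighbour(tour)
print(sorted(candidate) == sorted(tour))  # True: still a valid permutation
```

With these two functions, `simulated_annealing(tour, swap_neighbour, lambda t: tour_length(t, coords))` plugs straight into the framework above.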
- At high T: you're a drunkard wandering the solution space — every move is possible
- At medium T: you're a tipsy climber, mostly moving toward better solutions but sometimes taking a detour
- At low T: you're a sober mountaineer, only moving to better ground
- The cooling schedule dictates how long you stay in each state of sobriety
neighbour() must generate a small perturbation, like swapping two cities. A neighbour move that's too large (a full random permutation) breaks SA's local structure and kills convergence.

Cooling Schedules: Which One for Production?
The cooling schedule determines how fast the temperature falls. The four main types:
Geometric cooling: T ← T * α, where α ∈ (0.9, 0.999). Simple and efficient. α close to 0.999 gives near-logarithmic behaviour but with constant computational cost per step.
Logarithmic cooling: T = T₀ / log(1 + t). Theoretically guarantees convergence to global optimum (Hajek, 1988) but needs trillions of steps for complex problems — impractical.
Adaptive cooling: Adjust cooling rate based on the observed acceptance rate. Target ~50% acceptance at each temperature. If acceptance > 50%, cool faster; if < 20%, reheat slightly. More robust for unknown problem landscapes.
Exponential cooling (fast): T = T₀ * exp(-β t). Used when you need a quick answer — converges fast but may get stuck.
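The adaptive variant is easy to sketch. The 50%/20% thresholds follow the rule above, while the specific α values and the reheat factor are assumptions you would tune per problem:

```python
def next_temperature(T, acceptance_rate,
                     fast_alpha=0.98, slow_alpha=0.999, reheat=1.05):
    # Acceptance still high: the walk moves freely, so cool faster.
    if acceptance_rate > 0.50:
        return T * fast_alpha
    # Acceptance collapsed: reheat slightly to restore mobility.
    if acceptance_rate < 0.20:
        return T * reheat
    # In the target band: cool gently.
    return T * slow_alpha

print(next_temperature(10.0, 0.80) < 10.0)  # True: cooled
print(next_temperature(10.0, 0.05) > 10.0)  # True: reheated
```

Call this once per temperature stage with the acceptance rate measured over that stage, in place of the fixed `T *= cooling_rate` line.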
Simulated Annealing vs Hill Climbing
Both algorithms start from an initial solution and iteratively improve it. The critical difference: hill climbing only moves to better solutions. This greediness gets it stuck in the first local optimum. SA accepts worse moves with decreasing probability, letting it climb out of local pits.
When hill climbing works: convex or near-convex landscapes, small search spaces, where the first local optimum is good enough.
When SA wins: multimodal landscapes with many local optima, large search spaces (TSP, VLSI, protein folding), when quality matters more than speed.
Hybrid approach: Run SA for global search, then hill climb from the best SA solution to fine-tune. This is common in production — hill climbing is cheap, and it tightens the final solution.
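The refinement step can be sketched with the same `neighbour_fn`/`cost_fn` conventions as the SA framework above (the `max_stall` stopping rule is our assumption):

```python
import random

def hill_climb(solution, neighbour_fn, cost_fn, max_stall=200):
    # Greedy local search: accept only strict improvements, and stop
    # after max_stall consecutive non-improving neighbours.
    best, best_cost = solution, cost_fn(solution)
    stall = 0
    while stall < max_stall:
        candidate = neighbour_fn(best)
        candidate_cost = cost_fn(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
            stall = 0
        else:
            stall += 1
    return best, best_cost

# Hybrid pattern: refine the best SA solution with a cheap greedy pass.
random.seed(0)
x, fx = hill_climb(3.0, lambda v: v + random.uniform(-0.5, 0.5),
                   lambda v: v * v)
print(fx < 9.0)  # True: strictly better than the starting cost
```

In the hybrid, you would pass the best solution returned by `simulated_annealing` as the starting point instead of a raw initial guess.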
Convergence Guarantees and Practical Reality
Hajek's theorem (1988) proves that SA converges to the global optimum if the cooling schedule satisfies: T(t) ≥ c / log(t+2) for some c > d, where d is the maximum 'depth' of a local optimum. This is logarithmic cooling.
In practice, you cannot run infinitely many steps, and the critical depth d is unknown. A practical proxy: run SA multiple times with different random seeds and check whether the final costs agree. A variance < 2% across 10 runs suggests you're close to the global optimum.
Practical stopping criteria:
- A fixed iteration budget (e.g., 1000 × problem size)
- No improvement in best cost for n consecutive temperatures (e.g., 20 steps)
- Temperature drops below a threshold (e.g., T < 0.001 × T_start)
- A wall-clock time budget (e.g., 30 minutes)
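The no-improvement criterion is the one people most often get wrong (track the best cost per stage, not the current cost); a minimal sketch, with `patience` as our name for the step count:

```python
def should_stop(best_cost_history, patience=20):
    # best_cost_history: best cost recorded at the end of each
    # temperature stage. Stop once the last `patience` stages have
    # produced no new best.
    if len(best_cost_history) <= patience:
        return False
    return min(best_cost_history[-patience:]) >= min(best_cost_history[:-patience])

print(should_stop([10, 9, 8] + [8] * 20))  # True: 20 flat stages
print(should_stop([10, 9, 8, 7]))          # False: still improving
```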
Applications and Where SA Fails
Simulated annealing shines in combinatorial optimisation with rugged landscapes. Key applications:
- VLSI placement (original 1983 paper): minimise wire length in chip design
- Traveling Salesman Problem: standard benchmark — SA can find near-optimal tours for up to 10⁵ cities
- Protein folding: find minimum energy 3D structure (slow, but better bounds than greedy)
- Circuit board routing: place tracks to minimise area and crosstalk
- Machine learning: hyperparameter tuning for expensive models (e.g., neural architecture search)
Where SA fails:
- Problems with long-range dependencies (e.g., graph partitioning where swaps affect many edges): the neighbour function becomes too expensive to evaluate.
- Problems where you need a deterministic answer (healthcare, finance audit): use branch-and-bound or integer programming instead.
- Problems with many equivalent global optima (flat plateaus): SA wastes time jumping between equally good solutions; use a hill climber from a good initial point instead.
| Criterion | Simulated Annealing | Hill Climbing | Genetic Algorithms | Tabu Search |
|---|---|---|---|---|
| Exploration capability | High (temperature controls it) | None (greedy) | High (population diversity) | Medium (short-term memory) |
| Escape local optima | Yes (probabilistic acceptance) | No | Yes (crossover + mutation) | Yes (tabu list forbids revisited states) |
| Parameter tuning | Cooling schedule, T_start | None | Population size, mutation rate, crossover rate | Tabu tenure, aspiration criteria |
| Solution quality (typical) | Near-optimal | Local optimum | Near-optimal (with large pop) | Near-optimal |
| Computational cost | Medium (single solution) | Low | High (multiple solutions) | Medium (single solution + memory) |
| Memory required | O(1) | O(1) | O(population size) | O(tabu list size) |
| Determinism | No (probabilistic) | Yes (deterministic) | No (random crossover) | No (random tabu choice) |
| Best for | Multimodal landscapes, expensive evaluations | Convex or near-convex problems | Black-box optimization, multi-objective | Large, complex combinatorial problems |
🎯 Key Takeaways
- SA accepts worse solutions with probability exp(-Δcost/T). High T: explore freely. Low T: only accept improvements.
- Cooling schedule is the critical parameter — too fast: trapped in local optima. Too slow: computationally expensive.
- Advantage over hill climbing: escapes local optima. Advantage over GA: simpler implementation, single solution.
- Guarantee: with logarithmic cooling, converges to global optimum — but impractically slow in practice.
- Applications: VLSI placement, TSP, protein folding, circuit routing, timetabling.
Interview Questions on This Topic
- Q: Why does simulated annealing escape local optima while hill climbing cannot? (Mid-level)
- Q: What is the Boltzmann acceptance criterion and what does the temperature parameter control? (Senior)
- Q: How does the cooling rate affect solution quality vs computation time? (Senior)
- Q: How would you implement a temperature schedule that adapts to the optimisation landscape? (Senior)
Frequently Asked Questions
How do you choose the initial temperature?
Rule of thumb: T_start should be high enough that ~80% of random moves are accepted initially. To find it: sample random moves, compute average cost increase Δcost_avg, set T_start = -Δcost_avg / ln(0.8). This ensures the algorithm starts in 'exploration mode' rather than immediately hill-climbing.
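That rule of thumb translates directly into code; a sketch (the sample count and the random-walk sampling strategy are our choices):

```python
import math
import random

def estimate_T_start(solution, neighbour_fn, cost_fn,
                     samples=200, target_accept=0.8):
    # Sample random moves, average the cost increase of the worsening
    # ones, then solve exp(-avg_inc / T) = target_accept for T.
    increases = []
    current = solution
    current_cost = cost_fn(current)
    for _ in range(samples):
        candidate = neighbour_fn(current)
        candidate_cost = cost_fn(candidate)
        delta = candidate_cost - current_cost
        if delta > 0:
            increases.append(delta)
        current, current_cost = candidate, candidate_cost  # random walk
    if not increases:
        return 1.0  # every sampled move improved; any small T will do
    avg_inc = sum(increases) / len(increases)
    return -avg_inc / math.log(target_accept)  # ln(0.8) < 0, so T > 0

random.seed(1)
T0 = estimate_T_start(5.0,
                      lambda x: x + random.uniform(-0.5, 0.5),
                      lambda x: x * x)
print(T0 > 0)  # True
```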
What is the difference between SA and a genetic algorithm?
SA maintains a single solution and perturbs it, accepting worse moves probabilistically. GA maintains a population of solutions and uses crossover and mutation to evolve. SA is simpler (fewer hyperparameters) and often faster on problems with rugged single-peak landscapes. GA shines when multiple good solutions can combine (e.g., multi-modal problems). For many practical optimisation tasks, SA is easier to tune.
How many iterations per temperature should I use?
Start with 100. If the cost variance across runs is high (>10% of median), increase to 500. If runtime is too long, decrease to 50. There's a diminishing returns effect beyond 500 for most problems. Monitor the acceptance rate: if it drops below 5% after the first 10% of temperatures, you have too few iterations per temperature.
Can SA be parallelised?
Yes, but carefully. You can run multiple independent SA instances from different starting points and pick the best result (trivial parallelisation). Internal parallelisation (evaluating multiple neighbours at each step) is trickier because the acceptance decision depends on the sequence. A common production pattern: run 10-20 parallel SA chains with different seeds, merge their best solutions, then run a local hill climb from the global best.
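The multi-chain pattern can be sketched without any parallel machinery; each chain is independent, so the loop below maps cleanly onto a process pool in production (the chain count and the stand-in `fake_run` are illustrative):

```python
import random

def best_of_chains(run_sa_once, n_chains=10):
    # Trivial parallelisation: independent chains with different seeds;
    # keep the best (solution, cost) pair across all of them.
    results = []
    for seed in range(n_chains):
        random.seed(seed)
        results.append(run_sa_once())
    return min(results, key=lambda r: r[1])

# Stand-in for a full SA run: sample one point and report its cost.
def fake_run():
    x = random.uniform(-5, 5)
    return x, x * x  # (solution, cost)

sol, cost = best_of_chains(fake_run)
print(cost >= 0)  # True
```

In a real setup, `run_sa_once` would wrap `simulated_annealing` with the problem's neighbour and cost functions, and the loop would become a `ProcessPoolExecutor.map` over seeds.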
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.