Simulated Annealing — Cooling Schedule Caused Unrouted Nets
Production VLSI layout faced 15% wire-length over budget due to aggressive cooling frozen search.
20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.
- Simulated annealing (SA) escapes local optima by accepting worse moves with probability exp(-ΔE/T)
- Starts at high temperature (exploration) and slowly cools to greedy (exploitation)
- Cooling schedule is the critical knob: too fast traps you, too slow wastes compute
- Geometric cooling (T *= 0.995) is the workhorse in production — simple and effective
- Real-world cost: each temperature stage runs ~100 iterations, total ~10,000 evaluations for medium problems
- Production gotcha: SA needs a good initial temperature — set it so ~80% of moves are accepted at start
When you heat metal and let it cool slowly (annealing), atoms arrange into a low-energy crystalline structure — the global minimum. Simulated annealing applies this to optimisation: start hot (accept bad moves freely), cool slowly (accept bad moves less and less often), freeze (only accept improvements). The random acceptance at high temperature lets the algorithm escape local optima that trap pure hill-climbing.
Simulated annealing (SA) was introduced by Kirkpatrick, Gelatt, and Vecchi in 1983 as a method for solving combinatorial optimisation problems. The key insight from statistical mechanics: systems in thermal equilibrium at temperature T have state energies distributed according to the Boltzmann distribution — higher energy states are possible but less probable, with probability proportional to exp(-ΔE/kT).
Applied to optimisation: at high temperature T, accept moves that make the objective worse with probability exp(-Δcost/T). As T decreases, only improvements are accepted. This controlled randomness escapes local optima that would trap pure greedy search.
Simulated Annealing as a Stochastic Search Engine
Simulated annealing is a probabilistic optimization algorithm that mimics the physical annealing process in metallurgy. It starts with a high "temperature" parameter, allowing the algorithm to accept worse solutions with a certain probability, then gradually cools to zero, at which point only improvements are accepted. The core mechanic is the Metropolis acceptance criterion: a worse solution is accepted with probability exp(-ΔE / T), where ΔE is the change in cost and T is the current temperature. This controlled randomness lets the search escape local optima early while converging to a near-global optimum later.
In practice, the cooling schedule—how T decreases over iterations—determines success. Common schedules include exponential cooling (T = T₀ * α^k, α ≈ 0.95–0.99) and logarithmic cooling (T = T₀ / log(k)), which guarantees convergence to the global optimum given infinite time. The algorithm has no memory, so each step depends only on the current state and temperature. Key properties: it is simple to implement, works on discrete and continuous spaces, and requires only a cost function and a neighbor generation mechanism.
Use simulated annealing when the search space is large, multimodal, and gradient-free—for example, VLSI routing, traveling salesman, or protein folding. It matters in production because it often finds good-enough solutions faster than exact methods (which are NP-hard) and more robustly than greedy hill-climbing. The trade-off: tuning the cooling schedule and initial temperature is art, not science, and a bad schedule can produce results worse than random.
Simulated Annealing Algorithm
Simulated annealing generalises gradient-free optimization. At each step, you generate a neighbour of the current solution. If the neighbour improves the cost (ΔE < 0), accept it always. If it worsens the cost, accept it with probability exp(-ΔE/T). This 'case acceptance' at high temperature lets you explore globally.
Here's a production-ready implementation in Python. It uses geometric cooling and tracks the best solution seen (not just the current one) because you can wander into a bad region and still remember the best discovered peak.
- At high T: you're a drunkard wandering the solution space — every move is possible
- At medium T: you're a tipsy climber — you mostly go uphill but sometimes take a detour
- At low T: you're a sober mountaineer — you only move to higher ground
- The cooling schedule dictates how long you stay in each state of sobriety
neighbour() must generate a small perturbation — like swapping two cities. A neighbour that's too large (random permutation) breaks SA's local structure and kills convergence.Cooling Schedules: Which One for Production?
The cooling schedule determines how fast the temperature falls. The four main types:
Geometric cooling: T ← T * α, where α ∈ (0.9, 0.999). Simple and efficient. α close to 0.999 gives near-logarithmic behaviour but with constant computational cost per step.
Logarithmic cooling: T = T₀ / log(1 + t). Theoretically guarantees convergence to global optimum (Hajek, 1988) but needs trillions of steps for complex problems — impractical.
Adaptive cooling: Adjust cooling rate based on the observed acceptance rate. Target ~50% acceptance at each temperature. If acceptance > 50%, cool faster; if < 20%, reheat slightly. More robust for unknown problem landscapes.
Exponential cooling (fast): T = T₀ * exp(-β t). Used when you need a quick answer — converges fast but may get stuck.
Simulated Annealing vs Hill Climbing
Both algorithms start from an initial solution and iteratively improve it. The critical difference: hill climbing only moves to better solutions. This greediness gets it stuck in the first local optimum. SA accepts worse moves with decreasing probability, letting it climb out of local pits.
When hill climbing works: convex or near-convex landscapes, small search spaces, where the first local optimum is good enough.
When SA wins: multimodal landscapes with many local optima, large search spaces (TSP, VLSI, protein folding), when quality matters more than speed.
Hybrid approach: Run SA for global search, then hill climb from the best SA solution to fine-tune. This is common in production — hill climbing is cheap, and it tightens the final solution.
Convergence Guarantees and Practical Reality
Hajek's theorem (1988) proves that SA converges to the global optimum if the cooling schedule satisfies: T(t) ≥ c / log(t+2) for some c > d, where d is the maximum 'depth' of a local optimum. This is logarithmic cooling.
In practice, you cannot run infinitely many steps. Instead, you estimate the critical depth d* by running SA multiple times with different random seeds and check if the final cost variance is small. A variance < 2% across 10 runs suggests you're close to the global optimum.
Practical stopping criteria: - Number of iterations (e.g., 1000 problem size) - No improvement in best cost for n consecutive temperatures (e.g., 20 steps) - Temperature drops below a threshold (T < 0.001 T_start) - Time budget (e.g., 30 minutes)
Applications and Where SA Fails
Simulated annealing shines in combinatorial optimisation with rugged landscapes. Key applications:
- VLSI placement (original 1983 paper): minimise wire length in chip design
- Traveling Salesman Problem: standard benchmark — SA can find near-optimal tours for up to 10⁵ cities
- Protein folding: find minimum energy 3D structure (slow, but better bounds than greedy)
- Circuit board routing: place tracks to minimise area and crosstalk
- Machine learning: hyperparameter tuning for expensive models (e.g., neural architecture search)
Where SA fails: - Problems with long-range dependencies (e.g., graph partitioning where swaps affect many edges). The neighbour function becomes too expensive to evaluate. - Problems where you need a deterministic answer (healthcare, finance audit). Use branch-and-bound or integer programming instead. - Problems with many equivalent global optima (flat plateaus). SA wastes time jumping between equally good solutions — use a hill climber from a good initial point instead.
The Physics Behind the Algorithm (Why Hot Metal Finds Better Bugs)
Simulated Annealing rips off thermodynamics. Specifically, the way metals cool and settle into a low-energy crystal structure. Heat a metal past its melting point and atoms scramble randomly; cool slowly and they lock into a near-perfect lattice. Cool too fast and you get defects — a local minimum in energy terms.
SA mimics that controlled chaos. High temperature means the algorithm accepts bad moves freely, exploring the search space like a drunk surveyor. As temperature drops, acceptance tightens. The algorithm starts rejecting uphill moves, eventually settling into a decent local minimum. The magic is that slow cooling allows escape from shallow local minima that would trap greedy hill climbing.
Kirkpatrick patented this in 1983 for the Traveling Salesman Problem. It's not a proof of optimality — it's statistical mechanics as optimization heuristic. The cost function becomes your energy landscape. The acceptance probability follows the Boltzmann distribution: exp(-\Delta E / T). Higher delta or lower temp kills acceptance probability exponentially fast.
Perturbation Functions: Tuning the Random Shove
You need a way to mutate a candidate solution. In SA, that's the perturbation function. This isn't a random number generator — it's a domain-specific operator that should produce valid, nearby solutions.
For continuous optimization, a common approach is Gaussian perturbation: new_x = current_x + N(0, sigma). The sigma parameter matters: too small and you inchworm through the space; too large and you leap into garbage territory. Adaptive schemes adjust sigma based on acceptance rate. If you're accepting everything, increase sigma; if accepting nothing, shrink it.
For combinatorial problems like TSP, use swaps, reverses, or insertions on the tour. The perturbation must preserve solution legality. Don't generate an invalid tour and then burn cycles repairing it — that's a different kind of hell. Measure the average delta-cost per perturbation type. Pick the one that yields the best trade-off between exploration and exploitation.
Production trap: Never hard-code perturbation magnitude. Parameterize it and tune via a grid search or Bayesian optimization. Yes, you now have a meta-optimizer. Welcome to senior engineering.
VLSI Chip Layout: The Unrouted Net Debacle
- Cooling schedule is the single most impactful parameter — more than iterations per temperature
- Logarithmic cooling guarantees convergence to global optimum but is impractically slow; hybrid schedules are the production standard
- Always verify with random restarts: if the final cost varies by more than 5% across runs, your schedule is too fast
print(f'Acceptance rate at T={T:.3f}: {accepted/total:.2%}')Track per-temperature best cost: plot(T_history, cost_history)Key takeaways
Common mistakes to avoid
5 patternsStarting temperature too low
Forgetting to track the best solution seen
Using overly large neighbour steps
Not running multiple restarts to assess variability
Using logarithmic cooling for a production problem
Practice These on LeetCode
Interview Questions on This Topic
Why does simulated annealing escape local optima while hill climbing cannot?
Frequently Asked Questions
20+ years shipping performance-critical code where algorithms decide the bill. Drawn from code that ran under real load.
That's Genetic Algorithms. Mark it forged?
6 min read · try the examples if you haven't