Advanced 8 min · March 06, 2026

ONNX — Open Neural Network Exchange

ONNX Opset Mismatch — Latency Spikes to 340ms in Production

Q: Can I convert any PyTorch model to ONNX?

Not all models are directly convertible. Models with dynamic control flow (if/else, loops) that depend on data may fail tracing. Scripting (torch.jit.script) can capture some, but not all. Also, custom C++ operations (e.g., custom CUDA kernels) need manual ONNX operator registration.

Q: How do I know which ONNX opset version my runtime supports?

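Check the compatibility table in the ONNX Runtime documentation or release notes; each release lists the highest ONNX opset it implements. For example, ONNX Runtime 1.14 supports up to opset 18, and older releases top out lower. `import onnxruntime; print(onnxruntime.__version__)` tells you which release is installed, which you then look up in that table. For a model file, `onnx.load(path).opset_import` shows the opset it was exported with.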

Q: What is the file size limit for an ONNX protobuf model?

A single protobuf message is limited to 2GB. For models exceeding that (e.g., large language models with billions of parameters), use the external data format, for example by saving with `onnx.save_model(..., save_as_external_data=True)`. The model file then contains pointers to separate tensor files.

Q: Why does my ONNX model run slower on GPU than expected?

Common causes: (1) Opset mismatch or missing GPU kernels causing CPU fallback for some operators. (2) Graph optimization not enabled or not suitable. (3) Inputs and outputs held in host memory, so every call pays host-to-device copies (IO binding avoids this). (4) Small batch size where GPU overhead dominates. Profile with `sess_options.enable_profiling = True` and check each node's execution provider in the resulting trace.

Q: How do I troubleshoot a mismatch between PyTorch and ONNX outputs?

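Start by disabling all graph optimizations: `sess_options.graph_optimization_level = GraphOptimizationLevel.ORT_DISABLE_ALL`. Then compare intermediate tensors: add them to the model's graph outputs and request them by name in the first argument of `session.run()`. Compare with a tolerance such as `atol=1e-3`. If outputs match with optimizations off but differ with optimizations on, the difference comes from the optimized graph (for example, fused kernels changing the rounding order).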

Q: Does ONNX Runtime support TensorRT?

Yes, ONNX Runtime has a TensorRT execution provider that leverages NVIDIA TensorRT for optimized inference on NVIDIA GPUs. It supports layer fusion, fp16 and INT8 precision. However, not all ONNX operators have TensorRT kernels; unsupported ops fall back to CUDA or CPU, which can silently hurt performance. Enable profiling (`sess_options.enable_profiling = True`) or verbose logging to identify which ops ran on TensorRT.

Q: How do I benchmark ONNX Runtime performance against native PyTorch?

Export the model to ONNX, then create benchmark scripts for both frameworks with the same input shapes and batch sizes. Use time.perf_counter() to measure inference time, include warmup runs, and average over 100+ iterations. Compare throughput (inferences per second) and latency at different batch sizes. Pay attention to memory usage and peak GPU utilization. Always test with production-like configurations.

Latency jumped 28x from 12ms to 340ms when ONNX Runtime 1.12 (max opset 15) met opset 18 Attention.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production tested · last updated July 27, 2026


Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems


⚡Quick Answer

ONNX is an open intermediate representation (IR) for ML models, stored as protobuf
Standard operator opset ensures cross-framework compatibility
ONNX Runtime provides hardware-optimized execution via providers (CPU, CUDA, TensorRT)
Exporting from PyTorch requires torch.onnx.export() and dynamic axes handling
Quantization in ONNX can reduce model size by 75% with <1% accuracy loss
Biggest production mistake: opset version mismatch between export and target runtime

✦ Definition~90s read

What is ONNX?

ONNX is core to ML/AI. Skip the textbook — here's the real deal.


At its heart, ONNX defines a computation graph as a directed acyclic graph (DAG) of standardised operators. Each node represents an operation like Conv, Relu, or MatMul, with typed inputs, outputs, and attributes. Because the schema is framework-agnostic, any training library that can trace a model to this DAG and any runtime that can execute it can interoperate — no proprietary format bindings needed.

That's the portability. ONNX is the PDF of ML models: you write once, deploy anywhere.

In practice, ONNX also includes the model's weights (initializers), shape information, and optional metadata like author or training framework. All packed into a single protobuf file. That means your deployment pipeline has one artifact to manage, not one per target.


Plain-English First

Imagine you write a recipe in French, but the kitchen you're cooking in only understands Spanish. ONNX is the universal recipe card — a format every ML framework can both read and write. You train your model in PyTorch (French), export it to ONNX (universal), and then any inference engine — on a phone, a server, or an edge chip — can cook the meal. It's the PDF of machine learning models: everyone can open it, regardless of the app that created it.

Every production ML team eventually hits the same wall: the framework you love for research is terrible for deployment. PyTorch is brilliant for experimentation — dynamic graphs, Pythonic debugging, a huge ecosystem. But ship that model to a mobile app, an NVIDIA Triton server, or an ARM microcontroller, and suddenly you're fighting framework overhead, Python interpreter costs, and platform incompatibilities. TensorFlow Serving, TensorRT, OpenVINO, Core ML — they all want the model in their own format. Without a neutral exchange format, you'd need a separate export pipeline for every target platform. That's exactly the chaos ONNX was built to eliminate.

ONNX — Open Neural Network Exchange — is an open-source, vendor-neutral intermediate representation (IR) for ML models. Introduced jointly by Microsoft and Facebook in 2017, it defines a computation graph format, a standard set of operators, and a typed data model that any framework can target. When you export a model to ONNX, you're compiling it down to a directed acyclic graph (DAG) of primitive operations — matrix multiplies, convolutions, activations — described in a protobuf file. Any runtime that implements the ONNX operator spec can then execute that graph, hardware-optimized, with zero dependency on the original training framework.

By the end of this article you'll understand the internal structure of an ONNX model graph well enough to debug export failures yourself, know how to pick the right opset version for your target runtime, run models with ONNX Runtime and benchmark them against native PyTorch, apply dynamic quantization through the ONNX pipeline, and avoid the three most expensive production mistakes teams make when they first go to deploy.

What is ONNX — Open Neural Network Exchange?


Here's something most tutorials skip: the protobuf schema allows for subgraphs — nested graphs for control flow like If and Loop. Exporters from PyTorch's scripting path often produce them, but not all runtimes handle subgraphs with the same performance. If your model uses torch.jit.script and contains conditional logic, inspect the resulting ONNX for subgraph nodes. Some runtimes fall back to a naive interpreter for those subgraphs, killing latency.

io/thecodeforge/onnx_export_simple.pyPYTHON

import torch
import torch.nn as nn
import onnx

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.linear(x))

model = SimpleModel()
dummy = torch.randn(1, 10)
torch.onnx.export(model, dummy, 'simple.onnx',
                  input_names=['input'],
                  output_names=['output'],
                  opset_version=18)

# Inspect the graph
onnx_model = onnx.load('simple.onnx')
print(onnx.helper.printable_graph(onnx_model.graph))

Output (representative; graph and node names vary with PyTorch and onnx versions)

graph main_graph (
  %input[FLOAT, 1x10]
) initializers (
  %linear.weight[FLOAT, 5x10]
  %linear.bias[FLOAT, 5]
) {
  %/linear/Gemm_output_0 = Gemm[alpha = 1, beta = 1, transB = 1](%input, %linear.weight, %linear.bias)
  %output = Relu(%/linear/Gemm_output_0)
  return %output
}

🔥Forge Tip:

Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.

📊 Production Insight

Production models often mix control flow constructs like loops and ifs—they break static graph export.

PyTorch's torch.jit.trace() can capture only data-flow, not control flow.

Rule: for data-dependent control flow, use scripting (torch.jit.script); for variable input sizes (e.g., BERT with variable sequence length), declare dynamic axes.

🎯 Key Takeaway

ONNX decouples training from inference.

Export is never a straight line—expect to debug op mapping.

The first export always fails. Plan for it.

thecodeforge.io

Onnx Open Neural Network Exchange

ONNX Graph IR — The Protobuf Model Format

An ONNX model is a protobuf file that describes a computation graph: a DAG of nodes (operators) connected by typed tensors. The schema is defined in the [onnx.proto](https://github.com/onnx/onnx/blob/main/onnx/onnx.proto3) file. Each node has an op_type (e.g., Conv, Relu, MatMul), inputs, outputs, and optional attributes (kernel size, strides, etc.). The graph includes initializers (constant tensors like weights) and value_info (tensor shapes and types). By default this makes an ONNX file self-contained: the weights travel inside the protobuf, with no external weight files.

When you export a PyTorch model, torch.onnx.export() traces the execution with a dummy input, captures the graph, and writes it as a protobuf. You can inspect the model with onnx.load() and onnx.helper.printable_graph() — essential for debugging mismatches.

One nuance: the graph uses a topological order, but the IR also supports subgraphs (e.g., for If and Loop ops). That's rare but can trip up exporters that nest control flow inside a single node. Always check if your model produces nested subgraphs — not all runtimes handle them equally.

Another practical detail: a single protobuf message cannot exceed 2GB. At 4 bytes per FP32 parameter that is roughly 500 million parameters, so large language models and big vision transformers can cross it. ONNX supports an external data format: tensors are stored in separate binary files, and the protobuf contains pointers to them. You can enable it when saving, for example with onnx.save_model(..., save_as_external_data=True). Without it, serializing or loading the model fails with a protobuf size error. Check your model size early.

io/thecodeforge/onnx_inspect.pyPYTHON

import onnx
from onnx import helper

model = onnx.load('model.onnx')
print(helper.printable_graph(model.graph))
# Look for op_type, input names, output names, and initializer shapes

# Check opset version
print('IR version:', model.ir_version)
print('Producer:', model.producer_name, model.producer_version)
print('Opset imports:')
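for opset in model.opset_import:
    # Each entry is an OperatorSetIdProto; the default ONNX domain is an empty string
    domain = opset.domain or 'ai.onnx'
    print(f'  {domain}: version {opset.version}')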

Output

graph torch-jit-export (
  %input: float32[1,3,224,224]
) {
  %/conv1/weight = Conv[auto_pad='SAME_UPPER', kernel_shape=[7,7], strides=[2,2]](%input, %conv1_weight)
  %/relu = Relu(%/conv1/weight)
  ...
}
IR version: 8
Producer: pytorch 2.2.0
Opset imports:
  ai.onnx: version 18
  ai.onnx.ml: version 2

Mental Model

Mental Model: ONNX as a Universal IR

Think of ONNX as an intermediate language for neural networks — like LLVM IR for compilers.

Training frameworks (PyTorch, TensorFlow) are the frontends.
ONNX Runtime with execution providers (CPU, CUDA, TensorRT) is the backend.
The protobuf graph is the serialized IR — inspectable, modifiable, and optimizable.
Third-party tools like onnxsim and onnxoptimizer can transform the graph before deployment.

📊 Production Insight

Large models with many initializers (e.g., transformers with 1B+ parameters) produce protobuf files over 2GB, which is the protobuf limit.

Solution: the external data format, which stores tensors as separate files (for example, onnx.save_model(..., save_as_external_data=True)).

Always check file size before deploying; onnxruntime cannot load a single protobuf larger than 2GB.

🎯 Key Takeaway

ONNX protobuf is a DAG – no cycles allowed.

Inspect the graph with printable_graph, not just blind trust.

If file >2GB, external data or model partitioning required.

Opset Versions and Operator Compatibility

Each ONNX operator (e.g., Conv, Relu) is versioned. An opset is a snapshot of the operator set at a given point. Newer opsets add operators and revise existing ones: GridSample arrived in opset 16 and GroupNormalization in opset 18, so neither exists in opset 15. When you export, you choose an opset version. The target runtime must implement that opset, and each execution provider must have kernels for the operators you use; otherwise the model fails to load or the affected nodes fall back to a CPU implementation. This is the single biggest source of production surprises.

torch.onnx.export() picks a default opset that depends on the PyTorch release, and it can be newer than what your production ONNX Runtime accepts. Always set opset_version explicitly to a version your deployment target supports: no higher than the runtime's maximum, and only as high as your operators require. Check onnxruntime.__version__ against the opset support listed in the release notes.

Another gotcha: some operators (e.g., GroupNormalization, or attention implemented as a single fused op) are only available in newer opsets or as runtime-specific contrib ops. If you need them but must target an older runtime, you may have to decompose them into multiple primitive ops. That's manual and error-prone.

Pro tip: use onnxruntime.get_available_providers() to see which providers your installed build offers, and session.get_providers() to see which ones a session actually registered. The provider list tells you only what's compiled in, not which opsets are supported per provider. For that, consult the ORT version table.

io/thecodeforge/onnx_export_opset.pyPYTHON

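import torch
import torchvision.models as models

# `weights=` replaces the deprecated `pretrained=True` in recent torchvision
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()  # export in inference mode so BatchNorm and Dropout are frozen
dummy = torch.randn(1, 3, 224, 224)

# Pin the opset explicitly (15 here) so the export matches an older production runtime
# (common in enterprise data centers with older GPU drivers)
torch.onnx.export(
    model,
    dummy,
    "resnet50_opset15.onnx",
    opset_version=15,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)
print("Model exported successfully to resnet50_opset15.onnx.")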

Output

Model exported successfully to resnet50_opset15.onnx.

⚠ Opset Fallback Trap

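ONNX Runtime assigns operators that the GPU provider cannot run to the CPU provider. Inference still succeeds, with at most a warning in the logs, while your GPU sits idle. Monitor runtime logs for node-assignment warnings, or enable profiling (sess_options.enable_profiling = True) to see which nodes executed on which provider.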

📊 Production Insight

A team deploying a transformer model used opset 18 but the production cluster had ONNX Runtime 1.10 (max opset 14).

Latency jumped 15x because self-attention fell back to CPU.

Rule: call session.get_providers() in a health check after model load.

🎯 Key Takeaway

Pin the opset to a version your runtime supports, and no higher than your operators require.

Never rely on default opset — it will break in prod.

Check runtime opset support in your CI pipeline, not after deploy.

Choose the Right Opset Version

If: You control both export and runtime (e.g., same team ships both)
→ Use: The latest opset from the exporter, then pin the runtime to match.

If: Target runtime is fixed (e.g., Triton with ORT 1.12)
→ Use: Pin export to the runtime's max opset. Verify no unsupported ops remain.

If: Model uses new operators (e.g., GroupNorm, Attention)
→ Use: Try decomposing into primitives or upgrade runtime. If neither works, consider TensorRT path.


ONNX Runtime — Execution Providers and Performance

ONNX Runtime (ORT) is the reference inference engine. It supports multiple execution providers — CPU, CUDA, TensorRT, OpenVINO, DirectML, etc. Each provider implements the operator kernels optimized for that hardware. ORT also applies graph optimizations: constant folding, operator fusion (e.g., Conv+BN+Relu into one kernel), and layout transformation. You can control the optimization level via GraphOptimizationLevel.

To achieve peak performance, you need to choose the right provider and set session options: enable parallel execution, set intra/inter op threads, and pick memory optimization. Benchmarking between native PyTorch, ONNX Runtime on CPU, and ONNX Runtime on CUDA is essential before picking a runtime.

A common pitfall: TensorRT provider requires a separate NVIDIA TensorRT installation and may not support all ops. When a TensorRT kernel is missing, ORT falls back to CUDA or CPU — but the fallback can silently degrade latency. Always test with a representative set of inputs and monitor per-node execution providers.

Provider ordering matters: specify providers as a priority list. When the session is created, ORT assigns each node to the first provider in the list that has a kernel for it, and moves down the list otherwise. This means you can end up with mixed-provider execution: some nodes on TensorRT, some on CUDA, some on CPU. That's hard to diagnose without logging. Enable profiling (sess_options.enable_profiling = True) or verbose logging to see which provider each node ran on.
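To turn a profile into a per-provider summary, something like the sketch below may help. It assumes the profile is the JSON event list written when `sess_options.enable_profiling = True` is set and `session.end_profiling()` is called, and that node kernel events carry `args.provider`; the sample events here are synthetic and the helper names are hypothetical.

```python
import json
from collections import Counter

KERNEL_SUFFIX = "_kernel_time"

def load_profile(path):
    """Read the JSON event list written by ONNX Runtime's profiler."""
    with open(path) as f:
        return json.load(f)

def provider_by_node(events):
    """Map node name -> execution provider, using events that carry a provider field."""
    placement = {}
    for event in events:
        args = event.get("args") or {}
        if event.get("cat") == "Node" and "provider" in args:
            name = event.get("name", "")
            if name.endswith(KERNEL_SUFFIX):
                name = name[: -len(KERNEL_SUFFIX)]
            placement[name] = args["provider"]  # repeated runs overwrite, not double count
    return placement

# Synthetic events shaped like a profile that covers two inference runs
sample_events = [
    {"cat": "Session", "name": "model_run", "args": {}},
    {"cat": "Node", "name": "Conv_0_kernel_time",
     "args": {"op_name": "Conv", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "Relu_1_kernel_time",
     "args": {"op_name": "Relu", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "NonZero_2_kernel_time",
     "args": {"op_name": "NonZero", "provider": "CPUExecutionProvider"}},
    {"cat": "Node", "name": "Conv_0_kernel_time",
     "args": {"op_name": "Conv", "provider": "CUDAExecutionProvider"}},
]

placement = provider_by_node(sample_events)
summary = Counter(placement.values())
print(dict(summary))
```

With a real session, the events would come from `load_profile(session.end_profiling())` after a few representative runs; any unexpected `CPUExecutionProvider` entries are the nodes to investigate.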

io/thecodeforge/onnx_benchmark.pyPYTHON

import onnxruntime as ort
import numpy as np
import time

model_path = 'resnet50_opset15.onnx'

# Session with GPU and optimization
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4
# Keep the default sequential execution mode: ORT_PARALLEL targets CPU graphs
# with independent branches and is not meant for the CUDA provider

session = ort.InferenceSession(
    model_path,
    sess_options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

input_name = session.get_inputs()[0].name
dummy = np.random.randn(1,3,224,224).astype(np.float32)

# Warmup
for _ in range(10):
    session.run(None, {input_name: dummy})

# Benchmark
start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: dummy})
elapsed = time.perf_counter() - start
print(f'100 inferences: {elapsed:.2f}s, avg {elapsed/100*1000:.2f}ms')

Output

100 inferences: 1.23s, avg 12.30ms

🔥Provider Order Matters

List providers in priority order: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']. ORT tries each in sequence. If TensorRT fails for a node, it moves to CUDA, then CPU.

📊 Production Insight

Graph optimization can cause numerical differences (e.g., fused ops change rounding order).

Always validate outputs against original model with tolerance (atol=1e-3) after optimization.

For finicky models (e.g., quantized), use ORT_ENABLE_BASIC instead of ALL.
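The validation step above can be sketched with plain NumPy. `compare_outputs` is a hypothetical helper; in practice `reference` would be the PyTorch output and `candidate` the ONNX Runtime output for the same input, and the tolerances are the article's suggested starting point rather than a universal threshold.

```python
import numpy as np

def compare_outputs(reference, candidate, atol=1e-3, rtol=1e-3):
    """Summarize how far a candidate output drifts from the reference output."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    close = np.isclose(candidate, reference, atol=atol, rtol=rtol)
    return {
        "max_abs_diff": float(np.abs(reference - candidate).max()),
        "mismatched": int((~close).sum()),  # number of elements outside tolerance
        "ok": bool(close.all()),
    }

reference = np.array([1.0, 2.0, 3.0])
within = compare_outputs(reference, reference + np.array([0.0, 5e-4, 0.0]))
drifted = compare_outputs(reference, reference + np.array([0.0, 0.05, 0.0]))
print(within)
print(drifted)
```

Reporting the mismatch count alongside the maximum difference helps separate a single unstable element from systematic drift.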

🎯 Key Takeaway

Profile providers: GPU not always faster (e.g., small models).

Graph optimization is a trade-off: speed vs numerical stability.

Always benchmark with real production batch sizes and shapes.

Choose Execution Provider

If: Model has standard ops (Conv, Relu, MatMul), GPU available
→ Use: CUDAExecutionProvider. Benchmark against CPU at batch size 1; the GPU may be slower for small models.

If: Model has many fusion-friendly operators, need max throughput
→ Use: Try TensorrtExecutionProvider. Requires TRT installation and op compatibility check.

If: Deploying on CPU-only environments or with variable batch sizes
→ Use: CPUExecutionProvider with ORT_ENABLE_ALL optimizations, and set intra_op_num_threads to match cores.

Quantization and Model Optimization

Quantization reduces model precision (e.g., FP32 to INT8) to shrink size and speed up inference. ONNX Runtime supports dynamic quantization (weights quantized ahead of time, activation ranges computed on the fly at inference) and static quantization (both weights and activations quantized ahead of time, requires calibration data). Static quantization can give roughly 3-4x speedup on CPU with <1% accuracy loss, depending on the model and hardware.

The ONNX quantization workflow: (1) export FP32 ONNX, (2) calibrate with representative data, (3) use onnxruntime.quantization.quantize_static() to produce INT8 model, (4) compare accuracy against FP32 baseline. Beware of operators not supported for quantization (e.g., Softmax, LayerNormalization in some opsets) — those remain FP32 and become conversion bottlenecks.

A less-discussed detail: per-channel quantization can significantly improve accuracy for convolutional layers. ONNX Runtime offers two output formats: QDQ (Quantize-Dequantize), which wraps the original operators in QuantizeLinear/DequantizeLinear pairs, and the older QOperator format, which replaces them with dedicated quantized operators. Prefer QDQ for production INT8 deployments.

Another critical detail: static quantization requires a calibration dataset that represents real-world inputs. If your calibration data is too small or unrepresentative, the computed scales and zero points will be off, and accuracy degradation can exceed 5%. Always use at least 500 samples from the actual production distribution.
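To see why unrepresentative calibration hurts, here is a small NumPy sketch of min-max calibration for asymmetric INT8. It illustrates the arithmetic only and is not ONNX Runtime's implementation; the function names and sample values are made up.

```python
import numpy as np

def minmax_qparams(samples, qmin=-128, qmax=127):
    """Derive scale and zero point from the observed min/max of calibration samples."""
    lo = min(float(np.min(s)) for s in samples)
    hi = max(float(np.max(s)) for s in samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to INT8, clipping anything outside the calibrated range."""
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Calibration only ever saw activations in [0, 1]
calibration = [np.linspace(0.0, 1.0, 11)]
scale, zero_point = minmax_qparams(calibration)

in_range = np.array([0.1, 0.5, 0.9], dtype=np.float32)
out_of_range = np.array([4.0], dtype=np.float32)  # a production value calibration never saw

in_range_error = float(
    np.abs(dequantize(quantize(in_range, scale, zero_point), scale, zero_point) - in_range).max()
)
clipped_error = float(
    np.abs(dequantize(quantize(out_of_range, scale, zero_point), scale, zero_point) - out_of_range).max()
)
print(scale, zero_point, in_range_error, clipped_error)
```

Values inside the calibrated range lose at most about half a quantization step, while the unseen value is clipped to the top of the range, which is the failure mode a too-narrow calibration set produces.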

io/thecodeforge/onnx_quantize.pyPYTHON

import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    CalibrationMethod,
    QuantFormat,
    QuantType,
    quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds calibration batches. Replace the random tensors with real production samples."""

    def __init__(self, input_name='input', num_samples=100):
        self._batches = iter(
            {input_name: np.random.randn(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        )

    def get_next(self):
        # One feed dict per call, then None once the data is exhausted
        return next(self._batches, None)

quantize_static(
    model_input='resnet50_opset15.onnx',
    model_output='resnet50_int8.onnx',
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,  # Quantize-Dequantize format for ORT
    per_channel=True,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    calibrate_method=CalibrationMethod.MinMax,
    extra_options={'ActivationSymmetric': True}
)

Output

Quantization completed. Model size reduced from 98MB to 25MB.

Mental Model

Mental Model: Quantization = Lossy Compression

Just like JPEG loses detail you can't see, quantization loses numerical precision you (usually) can't detect.

Dynamic quant: weights only, easy, 2x speedup, no calibration needed.
Static quant: weights+activations, harder, 4x speedup, requires calibration.
Quantization-aware training (QAT) yields best accuracy but requires retraining.
Not all operators support INT8 — check ort.quantization.get_qdq_config() for op list.

📊 Production Insight

Static quantization silently skips ops the quantizer has no INT8 handling for (e.g., PReLU), leaving them FP32.

This creates a mixed-precision model that runs slower than expected.

Use onnxruntime.quantization.get_qdq_config() to see which ops will be quantized.

🎯 Key Takeaway

Quantize last — only after you have validated FP32 model end-to-end.

Always benchmark quantized model accuracy against FP32 on a hold-out set.

Not all ops quantize — check operator support before production.

Production Pitfalls and How to Avoid Them

The three most expensive mistakes teams make with ONNX in production:

1. Opset version mismatch: exporting with the latest opset but deploying on an older runtime. Silent CPU fallback kills latency. Fix: pin opset version in CI, verify runtime version in deployment health check.
2. Dynamic shapes not declared: models with variable batch size or sequence length need the dynamic_axes parameter. Without it, ONNX freezes the input shape, and the first inference with a different size fails or produces garbage.
3. Ignoring the effect of graph optimization: enabling all optimizations can change numerical outputs. For safety-critical apps (e.g., credit risk), validate with atol=1e-4 before enabling ORT_ENABLE_EXTENDED or ORT_ENABLE_ALL.

Also: monitor ONNX Runtime logs for warnings about unsupported operators, and never assume GPU provider is used — always verify with session.get_providers().

One more hidden trap: TensorRT provider may silently fall back to CUDA or CPU for unsupported ops, but the fallback is per-node. You might see mixed providers in the same model, causing unpredictable latency. The only way to catch it is to record per-node execution providers, either through ORT's profiler (sess_options.enable_profiling = True) or its verbose logging.

A final warning: when using external data format, ensure the binary tensor files are accessible at the same relative path as the protobuf. The model stores each tensor's location relative to the model file, so the safest layout is to keep the tensor files next to it. A missing tensor file surfaces as a load error that may not name the real cause. Always validate the model loads successfully after moving it to the deployment server.

io/thecodeforge/onnx_production_check.pyPYTHON

import onnxruntime as ort

def healthcheck(model_path, expected_provider='CUDAExecutionProvider'):
    try:
        # Request the provider explicitly; a session created without a
        # providers list will not report the GPU provider as active
        session = ort.InferenceSession(
            model_path,
            providers=[expected_provider, 'CPUExecutionProvider'],
        )
        providers = session.get_providers()
        if expected_provider not in providers:
            print(f'WARNING: {expected_provider} not active. Providers: {providers}')
            return False
        print(f'OK: {expected_provider} active')
        return True
    except Exception as e:
        print(f'FAIL: {e}')
        return False

healthcheck('resnet50_opset15.onnx', 'CUDAExecutionProvider')

Output

OK: CUDAExecutionProvider active

⚠ External Data File Trap

If your model uses external tensor files, ensure they exist at the same relative paths. Do not rely on a successful copy alone to prove the deployment is intact: create the session and run a sample inference on the target host as part of the rollout, because a missing or stale tensor file is a silent production killer.

📊 Production Insight

A fintech company deployed an ONNX model for loan default prediction. Graph optimization changed a rounding behavior, causing 0.5% prediction flip. They caught it because they had a validation test comparing FP32 ONNX vs PyTorch outputs.

Rule: never promote a new ONNX model to production without a regression test suite that compares outputs.

🎯 Key Takeaway

Opset version, dynamic shapes, and optimization effects are the three killers.

Add a health check that verifies provider and runs a sample inference.

The production pipeline must include ONNX validation, not just PyTorch validation.

ONNX with TensorRT and Hardware Acceleration

NVIDIA TensorRT is a high-performance inference optimization SDK for NVIDIA GPUs. ONNX Runtime integrates TensorRT as an execution provider, allowing you to leverage TensorRT's layer fusion, kernel auto-tuning, fp16/INT8 precision, and memory management. However, the TensorRT provider has specific requirements and limitations.

To use it, install TensorRT (version 8.6+ recommended) and a build of ONNX Runtime that includes the TensorRT provider (the onnxruntime-gpu package ships it in its official builds). Then specify TensorrtExecutionProvider in the provider list, ideally as the first priority. TensorRT will attempt to build an optimized engine from the ONNX graph. This build can take several minutes for large models, so consider caching the engine with the trt_engine_cache_enable provider option.

Not all ONNX operators have TensorRT kernels. Unsupported ops fall back to CUDA or CPU. This per-node fallback splits the graph across providers, adds copies between them, and can mix fp16 and fp32 layers, causing unexpected latency spikes. To identify which nodes run on TensorRT, enable verbose logging or ORT's profiler (sess_options.enable_profiling = True).

Another limitation: TensorRT requires fixed input shapes unless you enable dynamic shape support (newer TensorRT versions support this). If your model has dynamic axes, you may need to specify optimization profiles with the trt_profile_min_shapes, trt_profile_opt_shapes, and trt_profile_max_shapes provider options. Missing this can cause engine build failure or shape mismatch at inference.

Production tip: always benchmark TensorRT vs CUDA provider on your specific model and batch size. TensorRT excels at large batch sizes and models with many fusion opportunities. For small, simple models, the engine build overhead may not be worth it.

io/thecodeforge/onnx_tensorrt_session.pyPYTHON

import onnxruntime as ort
import numpy as np

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# TensorRT settings are provider options, passed alongside the provider name
trt_options = {
    # Enable TensorRT engine caching
    'trt_engine_cache_enable': True,
    'trt_engine_cache_path': './trt_cache',
    # For dynamic shapes, set an optimization profile (min / opt / max)
    'trt_profile_min_shapes': 'input:1x3x224x224',
    'trt_profile_opt_shapes': 'input:8x3x224x224',
    'trt_profile_max_shapes': 'input:32x3x224x224',
}

session = ort.InferenceSession(
    'resnet50_opset15.onnx',
    sess_options,
    providers=[
        ('TensorrtExecutionProvider', trt_options),
        'CUDAExecutionProvider',
        'CPUExecutionProvider',
    ],
)

input_name = session.get_inputs()[0].name
dummy = np.random.randn(1,3,224,224).astype(np.float32)
output = session.run(None, {input_name: dummy})
print('Inference succeeded with TensorRT')

Output

Inference succeeded with TensorRT

TensorRT engine built and cached (took 45s on first run)

🔥TensorRT Engine Build Time

The first inference with TensorRT provider triggers an engine build that can take minutes. Use engine caching (trt_engine_cache_enable) to avoid rebuilding on every session creation. Monitor the build time in staging before deploying to prod.

📊 Production Insight

A team deployed a BERT model with TensorRT provider but didn't set optimization profiles for dynamic sequence lengths. The engine built with fixed 128 tokens, but production inputs varied from 32 to 256. ONNX Runtime fell back to CUDA for every inference, making TensorRT useless.

Always test with production-like input shapes and verify actual provider usage with ORT's profiler or verbose logs.

🎯 Key Takeaway

TensorRT provider can deliver 2-5x speedup on NVIDIA GPUs but requires careful configuration.

Engine build time and dynamic shape profiles are the main production hurdles.

Always verify which provider each node actually ran on — don't assume TensorRT coverage.

Decide When to Use TensorRT Provider

If: Model is large (ResNet-50 or bigger), GPU is NVIDIA, batch size >= 8
→ Use: TensorRT provider. Benchmark against CUDA provider for throughput.

If: Model has many small custom ops or dynamic shapes that vary widely
→ Use: Stick with CUDAExecutionProvider. TensorRT may degrade performance due to fallbacks.

If: Deploying on edge devices (Jetson) or need INT8 quantization
→ Use: TensorRT is the best choice. It provides hardware-specific optimizations not available in the CUDA provider.

Exporting PyTorch/TF Models: The Hidden Graph Breaks

Exporting a model to ONNX looks trivial — one API call. In practice, it’s a minefield of dynamic control flow, unsupported ops, and silent shape collapses. PyTorch’s torch.onnx.export() or TF’s tf2onnx will happily produce a .onnx file that fails at runtime.

The root cause: a traced export records a static computation graph. Your training code may have if x > 0: branches, dynamic loops, or in-place operations that don't translate. The exporter traces the graph once, so for any data-dependent branch only the path taken by the dummy input is baked in. This breaks production inference when an input should take the other path.

Workaround: Use torch.jit.script for dynamic models first, then export the scripted module. Or swap dynamic operators with ONNX-compatible alternatives (e.g., torch.where vs if-else). Always validate with ONNX Runtime — not just shape checks, but output value comparisons (rtol=1e-3).

export_debug.pyPYTHON

# io.thecodeforge.onnx.export_debug
import torch
import onnxruntime as ort
import numpy as np

class BadModel(torch.nn.Module):
    def forward(self, x):
        # Dynamic branch: ONNX will only trace the True path
        if x.sum() > 0:
            return x * 2
        return x * -1

model = BadModel()
torch.onnx.export(model, torch.tensor([1.0]), "bad.onnx",
                  input_names=["input"], output_names=["output"])

# Now test with negative input
ort_session = ort.InferenceSession("bad.onnx")
inp = np.array([-5.0], dtype=np.float32)
out = ort_session.run(None, {"input": inp})
print(f"Expected: {inp * -1}, Got: {out[0]}")

Output

Expected: [5.], Got: [-10.] # Wrong! The traced True branch (x * 2) was baked in

⚠ Production Trap:

ONNX exports are static traces. Every dynamic branch becomes a silent bug. Always run differential testing (PyTorch vs ONNX Runtime) over your full input distribution before deployment.
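One way to apply the torch.where workaround is sketched below. It uses torch.jit.trace rather than a full ONNX export so the effect is visible with PyTorch alone; the assumption is that the tracing-based ONNX exporter freezes the Python branch in the same way. `BranchModel` and `WhereModel` are illustrative names.

```python
import torch

class BranchModel(torch.nn.Module):
    def forward(self, x):
        # Python-level branch: a trace records only the path the example input takes
        if x.sum() > 0:
            return x * 2
        return x * -1

class WhereModel(torch.nn.Module):
    def forward(self, x):
        # Both outcomes are tensor ops, so the selection stays inside the graph
        return torch.where(x.sum() > 0, x * 2, x * -1)

example = torch.tensor([1.0])    # positive: the trace follows the True branch
negative = torch.tensor([-5.0])  # should take the other branch

traced_branch = torch.jit.trace(BranchModel(), example)  # emits a TracerWarning
traced_where = torch.jit.trace(WhereModel(), example)

branch_out = traced_branch(negative)
where_out = traced_where(negative)
print("python branch:", branch_out, "torch.where:", where_out)
```

The cost of `torch.where` is that both branches are always computed, which is usually acceptable for cheap elementwise paths and worth measuring for expensive ones.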

🎯 Key Takeaway

ONNX exports are frozen graph snapshots — if your model has dynamic behavior, you must make it static first.

Quantization in ONNX Runtime: 2x Speed for 1% Accuracy Loss

Model quantization reduces numerical precision (FP32 → INT8) to shrink size and accelerate inference. ONNX Runtime offers two paths: Dynamic Quantization (easy, post-training) and Static Quantization (harder, needs calibration data).

Dynamic quantization works out of the box. Weights are quantized ahead of time, and activation scales are computed on the fly during inference. You can typically expect a 2-3x speedup on CPU with <1% accuracy degradation for most transformer models. Static quantization requires a representative calibration dataset to compute activation ranges ahead of time: more work, but usually better performance on convolutional models.
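To make the role of calibration concrete, here is a toy sketch of affine (asymmetric) 8-bit quantization in plain Python. It is the textbook scheme, not ONNX Runtime's actual implementation, and the function names and sample activations are hypothetical. Calibration amounts to observing the range of activations so that a scale and zero point can be fixed before inference.

```python
def calibrate(samples, qmin=0, qmax=255):
    """Derive scale and zero point from observed values (range widened to include 0)."""
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero data
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an 8-bit integer code, clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return (q - zero_point) * scale


activations = [-1.0, -0.25, 0.0, 0.5, 3.0]  # stand-in for calibration data
scale, zp = calibrate(activations)
for x in activations:
    q = quantize(x, scale, zp)
    print(f"{x:+.3f} -> {q:3d} -> {dequantize(q, scale, zp):+.3f}")
```

Values outside the calibrated range are clamped, which is why an unrepresentative calibration set costs accuracy.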

Critical detail: Quantize after ONNX export, not before. ONNX Runtime’s quantization tool (onnxruntime.quantization.quantize_dynamic) handles operator fusion and graph rewriting automatically. Trying to quantize source framework weights then export often breaks.

Benchmarking tip: Profile with onnxruntime_perf_test to compare latency between FP32 and INT8 models. Don’t trust just accuracy — measure p99 latency under realistic load.

quantize_onnx.py · PYTHON

# io.thecodeforge.onnx.quantize
import time

import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Original FP32 model
fp32_model = "bert_uncased.onnx"
quantized_model = "bert_uncased_int8.onnx"

# Dynamic quantization: no calibration data needed
quantize_dynamic(fp32_model, quantized_model,
                 weight_type=QuantType.QInt8)

session_fp32 = ort.InferenceSession(fp32_model)
session_int8 = ort.InferenceSession(quantized_model)

# Placeholder input. A real BERT export usually takes int64 token IDs under
# other names; match the name, dtype, and shape to session.get_inputs().
inp = np.random.randn(1, 128).astype(np.float32)

# Benchmark latency (100 timed runs after a short warm-up)
for name, session in [("FP32", session_fp32), ("INT8", session_int8)]:
    for _ in range(10):  # warm-up runs are excluded from the timing
        session.run(None, {"input": inp})
    start = time.perf_counter()
    for _ in range(100):
        session.run(None, {"input": inp})
    elapsed = (time.perf_counter() - start) / 100 * 1000
    print(f"{name}: {elapsed:.2f} ms/run")

Output

FP32: 12.34 ms/run

INT8: 5.67 ms/run # 2.2x speedup
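The loop above reports a mean, which hides tail latency. Below is a hedged sketch of a percentile-based measurement using only the standard library. `percentile` and `measure_latency_ms` are hypothetical helper names, and `fake_inference` is a stand-in for a call such as `lambda: session.run(None, {"input": inp})`.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure_latency_ms(fn, runs=200, warmup=20):
    """Time `fn` per call after a warm-up and return p50/p99/mean in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "mean": sum(samples) / len(samples),
    }


def fake_inference():
    # Placeholder workload; swap in the real session.run call.
    sum(i * i for i in range(1000))


stats = measure_latency_ms(fake_inference)
print({k: round(v, 4) for k, v in stats.items()})
```

Timing each call individually costs a little overhead but is what makes the p99 visible; a single timer around the whole loop can only produce an average.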

🔥Production Insight:

Static quantization can yield 4x speedups on CNNs but requires ~1000 calibration samples. Dynamic quantization is safer for NLP models — test accuracy on your validation set before deploying.

🎯 Key Takeaway

Dynamic quantization is the easiest path to roughly 2x inference speed with minimal accuracy loss. Try it before any architecture-level optimization.

● Production incident · POST-MORTEM · severity: high

Opset 18 Export Killed Triton Inference at 2 AM

Symptom

Latency jumped from 12ms to 340ms per inference. No errors, just slow. GPU utilization dropped to 5%.

Assumption

The team assumed later opsets are backward compatible. They are — but only if the runtime implements every operator in that opset. Older runtimes may have partial operator sets.

Root cause

The model used torch.nn.functional.scaled_dot_product_attention, which the opset 18 export expressed with operators that the production ONNX Runtime 1.12 had no GPU kernels for, so those nodes fell back to CPU.

Fix

Downgraded export to opset 15 by setting opset_version=15 in torch.onnx.export(). Re-exported and verified GPU operators were used. Alternatively, upgrade ONNX Runtime to 1.16+.

Key lesson

Always check the target runtime's supported opset version before export.
Use onnxruntime.get_available_providers() and onnxruntime.get_device() to confirm GPU is active.
Pin opset version to the minimum common denominator unless you control both sides.
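The last point can be sketched as a small pre-export check. The helper name `pick_export_opset` and the numbers in `targets` are hypothetical; take the real maximum opset for each deployment from the release notes of the ONNX Runtime version it runs.

```python
def pick_export_opset(max_opset_by_target, required_min=None):
    """Return the highest opset that every target runtime can load.

    max_opset_by_target: mapping of deployment name -> max supported opset.
    required_min: optional lowest opset the model's operators need.
    """
    if not max_opset_by_target:
        raise ValueError("no target runtimes given")
    common = min(max_opset_by_target.values())
    if required_min is not None and common < required_min:
        raise ValueError(
            f"model needs opset >= {required_min}, "
            f"but the most limited target only supports {common}"
        )
    return common


# Hypothetical fleet; the values are placeholders, not documented limits.
targets = {"triton-prod": 15, "batch-workers": 18, "edge-box": 17}
opset = pick_export_opset(targets)
print(opset)  # 15
# Then: torch.onnx.export(model, example_input, "model.onnx", opset_version=opset)
```

Failing loudly when the model needs a newer opset than the weakest target supports turns a silent 2 AM slowdown into a build-time error.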

Production debug guide

Symptom → Action flow for the four most common production ONNX problems

Symptom · 01

Model exports successfully but ONNX Runtime returns wrong outputs or NaN

→

Fix

Compare intermediate tensor values between PyTorch and ONNX Runtime: expose the intermediate tensors as graph outputs, then request them by name via session.run(output_names, input_feed). Use torch.onnx.export(..., verbose=True) to dump the operator list.

Symptom · 02

Export fails with torch.onnx.errors.OnnxExporterError about unsupported operator

→

Fix

Identify the unsupported op: search ONNX operator docs for equivalent op. Often you need to replace custom ops with ONNX-compatible alternatives (e.g., torch.where vs custom masking). Use dynamic_axes to handle variable-length inputs.

Symptom · 03

ONNX Runtime runs on CPU despite GPU availability

→

Fix

Verify provider list: providers=['CUDAExecutionProvider', 'CPUExecutionProvider']. Ensure CUDA version matches ORT's build. Check onnxruntime.get_device() returns 'GPU'. Use session.get_providers() to confirm CUDA is active.

Symptom · 04

Model loads but first inference fails with external data file error

→

Fix

Check that external tensor files exist at the expected relative path from the .onnx file. Use onnx.load(model_path, load_external_data=False) to inspect external data references. Ensure the deployment process copies both the .onnx and the associated .bin (or .data) files.
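A sketch of that pre-deployment check using only the standard library. It assumes you already have the list of relative `location` values the model references (with the `onnx` package these live in each initializer's `external_data` entries); the helper name and the file names are hypothetical.

```python
from pathlib import Path
import tempfile


def missing_external_files(model_path, locations):
    """Return the referenced external data files that are absent.

    Locations are resolved relative to the directory holding the .onnx file.
    """
    base = Path(model_path).parent
    return [loc for loc in locations if not (base / loc).is_file()]


# Demo in a throwaway directory with one tensor file present and one missing.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model.onnx"
    model.write_bytes(b"")  # placeholder, not a real model
    (Path(tmp) / "weights_0.bin").write_bytes(b"\x00")
    missing = missing_external_files(model, ["weights_0.bin", "weights_1.bin"])
    print(missing)  # ['weights_1.bin']
```

Running a check like this as a deployment gate catches a forgotten tensor file before the first inference request does.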

★ Quick Debug: ONNX Export & Runtime

Run these commands in order when an ONNX model behaves unexpectedly in production.

Wrong predictions

Immediate action

Disable graph optimization to rule out fusion bugs

Commands

import numpy as np
import onnxruntime as ort
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession(model_path, sess_options=opts, providers=['CPUExecutionProvider'])
np.testing.assert_allclose(onnx_output, pytorch_output, atol=1e-3)

Fix now

If outputs match without optimization, re-enable optimization level ORT_ENABLE_BASIC and retest.

Export fails with unsupported op

Runtime uses CPU despite GPU available

External data file missing at inference

ONNX vs Other Model Formats

Feature	ONNX	TensorFlow SavedModel	PyTorch TorchScript	TensorRT Engine
Vendor neutrality	Open standard (Microsoft/Facebook)	TensorFlow-specific	PyTorch-specific	NVIDIA-specific
Target hardware	Any (via providers)	CPU/GPU/TPU	CPU/GPU (no mobile optimised)	NVIDIA GPU only
Graph optimization	Built-in ORT optimizations	Grappler, XLA	JIT optimizations	Layer fusion, fp16/INT8
Quantization support	Dynamic, static, QAT via onnxruntime.quantization	TFLite, QAT	Not natively; PyTorch provides torch.quantization	INT8, fp16 with calibration
Production deployment	ONNX Runtime, Triton, Azure ML	TF Serving, TF Lite	TorchServe, LibTorch	TensorRT (standalone or integrated)
Opset versioning	Explicit, versioned operators	No versioning per op; graph versioned	Not applicable (script/trace)	Not applicable (compiled engine)

⚙ Quick Reference

9 commands from this guide

File	Command / Code	Purpose
iothecodeforgeonnx_export_simple.py	class SimpleModel(nn.Module):	What is ONNX
iothecodeforgeonnx_inspect.py	from onnx import helper	ONNX Graph IR
iothecodeforgeonnx_export_opset.py	model = models.resnet50(pretrained=True)	Opset Versions and Operator Compatibility
iothecodeforgeonnx_benchmark.py	model_path = 'resnet50_opset15.onnx'	ONNX Runtime
iothecodeforgeonnx_quantize.py	from onnxruntime.quantization import quantize_static, QuantType, CalibrationMeth...	Quantization and Model Optimization
iothecodeforgeonnx_production_check.py	def healthcheck(model_path, expected_provider='CUDAExecutionProvider'):	Production Pitfalls and How to Avoid Them
iothecodeforgeonnx_tensorrt_session.py	sess_options = ort.SessionOptions()	ONNX with TensorRT and Hardware Acceleration
export_debug.py	class BadModel(torch.nn.Module):	Exporting PyTorch/TF Models
quantize_onnx.py	from onnxruntime.quantization import quantize_dynamic, QuantType	Quantization in ONNX Runtime

Key takeaways

ONNX is an open intermediate representation that decouples model training from inference.

Opset version mismatch is the #1 production killer: always pin and verify.

Always declare dynamic axes for variable-length inputs.

Graph optimization can change numerical outputs: validate against baseline.

Quantization (especially static) gives large speedups but requires calibration data and operator support.

External data format solves the 2GB protobuf limit but introduces path dependency issues.

TensorRT provider requires careful configuration and validation: don't assume all ops run on TensorRT.

Common mistakes to avoid

5 patterns

Exporting with default opset without checking runtime version

Symptom

Model runs fine in dev but on prod server latency spikes 10-20x because GPU kernels fall back to CPU. No error logs.

Fix

Set opset_version explicitly to a version your target runtime supports (no higher than its maximum supported opset). Verify the runtime version in CI with ort.__version__.

Forgetting to declare dynamic axes for variable-length inputs

Symptom

First inference with different batch size or sequence length fails with shape mismatch error or produces garbage outputs.

Fix

Use dynamic_axes parameter in torch.onnx.export() to mark dimensions that can vary. Always test with multiple batch sizes after export.

Assuming all operators are supported in ONNX

Symptom

Export fails with unsupported operator error, especially for custom or legacy PyTorch ops (e.g., torch.einsum with certain patterns).

Fix

Search ONNX operator list or replace with equivalent ops (e.g., use torch.bmm for batch matrix multiply). For custom ops, implement a custom operator in ONNX Runtime.

Using external data format without preserving relative paths

Symptom

Model loads successfully but first inference fails with cryptic protobuf error or segfault. Tensor files missing or path mismatch.

Fix

Always validate that external data files are present and at the expected relative paths before deploying. Use onnx.load(model_path, load_external_data=False) to check the external data references.

Using TensorRT provider without verifying operator compatibility or engine caching

Symptom

First inference takes 5+ minutes (engine build) and subsequent runs still fall back to CUDA for many nodes because TensorRT doesn't support some ops.

Fix

Enable TensorRT engine caching with trt_engine_cache_enable. Turn on profiling (SessionOptions.enable_profiling) and inspect the trace to check which nodes ran on TensorRT. Let unsupported ops fall back to CUDA only if necessary, or rewrite the model to avoid them.
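One way to see where nodes actually ran is ONNX Runtime's profiler, which writes a JSON trace when `enable_profiling` is set on the session options. The sketch below tallies node events per execution provider. It assumes the trace is a list of events whose node entries carry `"cat": "Node"` and an `args.provider` field; check this against a trace from your own runtime version. The helper name and the sample events are hypothetical.

```python
import json
from collections import Counter


def nodes_per_provider(profile_events):
    """Count executed node events per execution provider."""
    counts = Counter()
    for event in profile_events:
        if event.get("cat") != "Node":
            continue
        provider = event.get("args", {}).get("provider")
        if provider:
            counts[provider] += 1
    return counts


# Synthetic trace for illustration; a real one would be loaded from the
# file path returned by session.end_profiling().
sample = json.loads("""
[
  {"cat": "Session", "name": "model_run", "args": {}},
  {"cat": "Node", "name": "conv1_kernel_time",
   "args": {"op_name": "Conv", "provider": "TensorrtExecutionProvider"}},
  {"cat": "Node", "name": "nms_kernel_time",
   "args": {"op_name": "NonMaxSuppression", "provider": "CPUExecutionProvider"}}
]
""")
print(dict(nodes_per_provider(sample)))
```

A non-trivial count under CPUExecutionProvider on a GPU deployment is the signal to look for unsupported ops.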

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

What is ONNX and why was it created?

Q02 · SENIOR

How do you handle control flow (if/else, loops) when exporting a PyTorch...

Q03 · SENIOR

What is the difference between dynamic and static quantization in ONNX R...

Q04 · SENIOR

Your ONNX model runs on GPU in dev but falls back to CPU in production. ...

Q05 · SENIOR

How does ONNX handle large models that exceed the 2GB protobuf limit?

Q06 · SENIOR

Explain how to set up ONNX Runtime with TensorRT and what are the key pi...

Q01 of 06 · JUNIOR

What is ONNX and why was it created?

ANSWER

ONNX is an open intermediate representation for machine learning models. It was created to decouple training frameworks from inference engines, enabling model portability. Instead of maintaining separate export pipelines for TensorFlow Serving, TensorRT, Core ML, etc., you export once to ONNX and any compliant runtime can execute it.

FAQ · 7 QUESTIONS

Frequently Asked Questions

Can I convert any PyTorch model to ONNX?

How do I know which ONNX opset version my runtime supports?

What is the file size limit for an ONNX protobuf model?

Why does my ONNX model run slower on GPU than expected?

How do I troubleshoot a mismatch between PyTorch and ONNX outputs?

Does ONNX Runtime support TensorRT?

How do I benchmark ONNX Runtime performance against native PyTorch?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Verified · production tested

Last updated: July 27, 2026

1,713 articles · all by Naren
