Senior 8 min · March 06, 2026

ONNX Opset Mismatch — Latency Spikes to 340ms in Production

Latency jumped 28x from 12ms to 340ms when ONNX Runtime 1.

Naren · Founder
Quick Answer
  • ONNX is an open intermediate representation (IR) for ML models, stored as protobuf
  • Standard operator opset ensures cross-framework compatibility
  • ONNX Runtime provides hardware-optimized execution via providers (CPU, CUDA, TensorRT)
  • Exporting from PyTorch requires torch.onnx.export() and dynamic axes handling
  • Quantization in ONNX can reduce model size by 75% with <1% accuracy loss
  • Biggest production mistake: opset version mismatch between export and target runtime
Plain-English First

Imagine you write a recipe in French, but the kitchen you're cooking in only understands Spanish. ONNX is the universal recipe card — a format every ML framework can both read and write. You train your model in PyTorch (French), export it to ONNX (universal), and then any inference engine — on a phone, a server, or an edge chip — can cook the meal. It's the PDF of machine learning models: everyone can open it, regardless of the app that created it.

Every production ML team eventually hits the same wall: the framework you love for research is terrible for deployment. PyTorch is brilliant for experimentation — dynamic graphs, Pythonic debugging, a huge ecosystem. But ship that model to a mobile app, an NVIDIA Triton server, or an ARM microcontroller, and suddenly you're fighting framework overhead, Python interpreter costs, and platform incompatibilities. TensorFlow Serving, TensorRT, OpenVINO, Core ML — they all want the model in their own format. Without a neutral exchange format, you'd need a separate export pipeline for every target platform. That's exactly the chaos ONNX was built to eliminate.

ONNX — Open Neural Network Exchange — is an open-source, vendor-neutral intermediate representation (IR) for ML models. Introduced jointly by Microsoft and Facebook in 2017, it defines a computation graph format, a standard set of operators, and a typed data model that any framework can target. When you export a model to ONNX, you're compiling it down to a directed acyclic graph (DAG) of primitive operations — matrix multiplies, convolutions, activations — described in a protobuf file. Any runtime that implements the ONNX operator spec can then execute that graph, hardware-optimized, with zero dependency on the original training framework.

By the end of this article you'll understand the internal structure of an ONNX model graph well enough to debug export failures yourself, know how to pick the right opset version for your target runtime, run models with ONNX Runtime and benchmark them against native PyTorch, apply dynamic quantization through the ONNX pipeline, and avoid the three most expensive production mistakes teams make when they first go to deploy.

What is ONNX — Open Neural Network Exchange?

ONNX is core to ML/AI. Skip the textbook — here's the real deal.

At its heart, ONNX defines a computation graph as a directed acyclic graph (DAG) of standardised operators. Each node represents an operation like Conv, Relu, or MatMul, with typed inputs, outputs, and attributes. Because the schema is framework-agnostic, any training library that can trace a model to this DAG and any runtime that can execute it can interoperate — no proprietary format bindings needed. That's the portability. ONNX is the PDF of ML models: you write once, deploy anywhere.

In practice, ONNX also includes the model's weights (initializers), shape information, and optional metadata like author or training framework. All packed into a single protobuf file. That means your deployment pipeline has one artifact to manage, not one per target.

Here's something most tutorials skip: the protobuf schema allows for subgraphs — nested graphs for control flow like If and Loop. Exporters from PyTorch's scripting path often produce them, but not all runtimes handle subgraphs with the same performance. If your model uses torch.jit.script and contains conditional logic, inspect the resulting ONNX for subgraph nodes. Some runtimes fall back to a naive interpreter for those subgraphs, killing latency.

io/thecodeforge/onnx_export_simple.py · PYTHON
import torch
import torch.nn as nn
import onnx

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.linear(x))

model = SimpleModel()
dummy = torch.randn(1, 10)
torch.onnx.export(model, dummy, 'simple.onnx',
                  input_names=['input'],
                  output_names=['output'],
                  opset_version=18)

# Inspect the graph
onnx_model = onnx.load('simple.onnx')
print(onnx.helper.printable_graph(onnx_model.graph))
Output
graph torch-jit-export (
%input: float32[1,10]
) {
%linear_weight = Initializer(...)
%linear_bias = Initializer(...)
%/linear/MatMul = MatMul[transpose=1](%input, %linear_weight)
%/linear/Add = Add(%/linear/MatMul, %linear_bias)
%/relu/Relu = Relu(%/linear/Add)
return %/relu/Relu
}
Forge Tip:
Type this code yourself rather than copy-pasting. The muscle memory of writing it will help it stick.
Production Insight
Production models often mix control flow constructs like loops and ifs—they break static graph export.
PyTorch's torch.jit.trace() can capture only data-flow, not control flow.
Rule: for dynamic models (BERT with variable sequence length), use scripting (torch.jit.script) or set dynamic axes.
Key Takeaway
ONNX decouples training from inference.
Export is never a straight line—expect to debug op mapping.
The first export always fails. Plan for it.

ONNX Graph IR — The Protobuf Model Format

An ONNX model is a protobuf file that describes a computation graph: a DAG of nodes (operators) connected by typed tensors. The schema is defined in the [onnx.proto](https://github.com/onnx/onnx/blob/main/onnx/onnx.proto3) file. Each node has an op_type (e.g., Conv, Relu, MatMul), inputs, outputs, and optional attributes (kernel size, strides, etc.). The graph includes initializers (constant tensors like weights) and value_info (tensor shapes and types). By default this makes an ONNX model self-contained: the weights live in the same file, with no external weight files unless you opt into the external data format.

When you export a PyTorch model, torch.onnx.export() traces the execution with a dummy input, captures the graph, and writes it as a protobuf. You can inspect the model with onnx.load() and onnx.helper.printable_graph() — essential for debugging mismatches.

One nuance: the graph uses a topological order, but the IR also supports subgraphs (e.g., for If and Loop ops). That's rare but can trip up exporters that nest control flow inside a single node. Always check if your model produces nested subgraphs — not all runtimes handle them equally.
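The topological-order requirement can be illustrated without the onnx package at all. The sketch below models a graph as a plain list of nodes and checks that every node only consumes tensors that already exist. `Node` and `is_topologically_sorted` are hypothetical names for illustration, not part of the ONNX API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for an ONNX node: an op type plus named tensors."""
    op_type: str
    inputs: list
    outputs: list
    attributes: dict = field(default_factory=dict)


def is_topologically_sorted(nodes, graph_inputs, initializers):
    """Return True if each node only reads tensors produced earlier."""
    available = set(graph_inputs) | set(initializers)
    for node in nodes:
        if any(name not in available for name in node.inputs):
            return False
        available.update(node.outputs)
    return True


# A Linear + ReLU model written as three primitive nodes.
linear_relu = [
    Node("MatMul", ["input", "weight"], ["matmul_out"]),
    Node("Add", ["matmul_out", "bias"], ["add_out"]),
    Node("Relu", ["add_out"], ["output"]),
]

print(is_topologically_sorted(linear_relu, ["input"], ["weight", "bias"]))
print(is_topologically_sorted(linear_relu[::-1], ["input"], ["weight", "bias"]))
```

The first call reports a valid ordering; the reversed list fails because Relu would read `add_out` before any node has produced it.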

Another practical detail: protobuf has a 2GB limit. At 4 bytes per FP32 weight, that is roughly 500 million parameters, so large language models and big vision transformers can exceed it. ONNX supports an external data format: tensors are stored in separate binary files, and the protobuf contains pointers. In the onnx Python package you enable this when saving, with onnx.save_model(..., save_as_external_data=True). Without it, serializing or loading the oversized protobuf fails. Check your model size early.
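A rough size estimate is enough to tell early whether the 2GB limit will matter. This is a back-of-the-envelope sketch that counts only weight bytes and ignores graph structure and metadata; the helper names are hypothetical.

```python
# protobuf messages cannot exceed 2 GiB
PROTOBUF_LIMIT_BYTES = 2**31


def estimated_weight_bytes(num_parameters, bytes_per_parameter=4):
    """Weights only: parameters times bytes per element (4 for FP32)."""
    return num_parameters * bytes_per_parameter


def needs_external_data(num_parameters, bytes_per_parameter=4):
    """True when the weights alone would reach the protobuf limit."""
    return estimated_weight_bytes(num_parameters, bytes_per_parameter) >= PROTOBUF_LIMIT_BYTES


for name, params in [("ResNet-50", 25_600_000), ("1B-parameter model", 1_000_000_000)]:
    size_gb = estimated_weight_bytes(params) / 1e9
    print(f"{name}: ~{size_gb:.2f} GB in FP32, external data needed: {needs_external_data(params)}")
```

The same 1B-parameter model stored as INT8 (1 byte per weight) would fit again, which is one reason to estimate per precision.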

io/thecodeforge/onnx_inspect.py · PYTHON
import onnx
from onnx import helper

model = onnx.load('model.onnx')
print(helper.printable_graph(model.graph))
# Look for op_type, input names, output names, and initializer shapes

# Check opset version
print('IR version:', model.ir_version)
print('Producer:', model.producer_name, model.producer_version)
print('Opset imports:')
for opset in model.opset_import:
    # An empty domain string means the default ai.onnx operator set
    print(f"  {opset.domain or 'ai.onnx'}: version {opset.version}")
Output
graph torch-jit-export (
%input: float32[1,3,224,224]
) {
%/conv1/weight = Conv[auto_pad='SAME_UPPER', kernel_shape=[7,7], strides=[2,2]](%input, %conv1_weight)
%/relu = Relu(%/conv1/weight)
...
}
IR version: 8
Producer: pytorch 2.2.0
Opset imports:
ai.onnx: version 18
ai.onnx.ml: version 2
Mental Model: ONNX as a Universal IR
  • Training frameworks (PyTorch, TensorFlow) are the frontends.
  • ONNX Runtime with execution providers (CPU, CUDA, TensorRT) is the backend.
  • The protobuf graph is the serialized IR — inspectable, modifiable, and optimizable.
  • Third-party tools like onnxsim and onnxoptimizer can transform the graph before deployment.
Production Insight
Large models with many initializers (e.g., a 1B-parameter model in FP32) produce protobuf files larger than 2GB, which is the protobuf limit.
Solution: external data format (stores tensors as separate files), enabled with onnx.save_model(..., save_as_external_data=True).
Always check file size before deploying; onnxruntime cannot load a >2GB protobuf directly.
Key Takeaway
ONNX protobuf is a DAG – no cycles allowed.
Inspect the graph with printable_graph, not just blind trust.
If file >2GB, external data or model partitioning required.

Opset Versions and Operator Compatibility

Each ONNX operator (e.g., Conv, Relu) has versions. An opset is a snapshot of the operator set at a given point. Newer opsets add operators: GridSample arrived in opset 16 and GroupNormalization in opset 18, so neither exists in opset 15. When you export, you choose an opset version. The target runtime must support that opset and have a kernel for every operator on your chosen provider; otherwise the model fails to load or the affected nodes fall back to a CPU implementation. This is the single biggest source of production surprises.
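To make "operators have versions" concrete: for a given opset, an operator resolves to the most recent definition introduced at or before that opset. The sketch below shows that lookup. The version history in `OP_SINCE_VERSIONS` is a small illustrative sample and may be incomplete; check the official operator list before relying on it.

```python
from bisect import bisect_right

# Illustrative sample: opsets in which each operator's definition was introduced or changed.
OP_SINCE_VERSIONS = {
    "Relu": [1, 6, 13, 14],
    "GroupNormalization": [18],
}


def resolve_op_version(op_type, opset):
    """Return the operator version used at `opset`, or None if it does not exist yet."""
    versions = OP_SINCE_VERSIONS[op_type]
    index = bisect_right(versions, opset)
    if index == 0:
        return None
    return versions[index - 1]


for op_type, opset in [("Relu", 15), ("GroupNormalization", 15), ("GroupNormalization", 18)]:
    print(op_type, "at opset", opset, "->", resolve_op_version(op_type, opset))
```

A `None` result is the export-time version of the production problem: the operator simply has no definition in the opset you pinned.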

torch.onnx.export() picks a default opset that depends on your PyTorch version, and that default may be newer than what your production ONNX Runtime supports. Always set opset_version explicitly, to a version no higher than the maximum your deployment target supports. Check onnxruntime.__version__ and its opset support in the release notes.

Another gotcha: some operators (e.g., Attention, GroupNorm) are only available in newer opsets. If you need them but must target an older runtime, you may have to decompose them into multiple primitive ops. That's manual and error-prone.

Pro tip: use onnxruntime.get_available_providers() to see which providers your installed build offers. That list tells you only what is compiled in, not which opsets each provider supports. For that, consult the ORT version table.

io/thecodeforge/onnx_export_opset.py · PYTHON
import torch
import torchvision.models as models

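# `pretrained=True` is deprecated in torchvision; request weights explicitly.
# eval() puts BatchNorm and Dropout into inference mode before export.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()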
dummy = torch.randn(1,3,224,224)

# Pin opset version to 15 for compatibility with ONNX Runtime 1.12
# (common in enterprise data centers with older GPU drivers)
torch.onnx.export(
    model,
    dummy,
    "resnet50_opset15.onnx",
    opset_version=15,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)
print("Model exported successfully to resnet50_opset15.onnx.")
Output
Model exported successfully to resnet50_opset15.onnx.
Opset Fallback Trap
ONNX Runtime falls back to CPU for unsupported operators without raising an error, so your GPU sits idle. Monitor runtime logs for 'fallback', or enable verbose logging (SessionOptions.log_severity_level = 0) or profiling (SessionOptions.enable_profiling = True) to see which nodes were assigned to which provider.
Production Insight
A team deploying a transformer model used opset 18 but the production cluster had ONNX Runtime 1.10 (which supports up to opset 15).
Latency jumped 15x because self-attention fell back to CPU.
Rule: call session.get_providers() in a health check after model load.
Key Takeaway
Pin opset to a version your target runtime supports, no higher than its maximum.
Never rely on default opset — it will break in prod.
Check runtime opset support in your CI pipeline, not after deploy.
Choose the Right Opset Version
If: You control both export and runtime (e.g., same team ships both)
Use: The latest opset from the exporter, then pin the runtime to match.
If: Target runtime is fixed (e.g., Triton with ORT 1.12)
Use: Pin export to the runtime's max opset. Verify no unsupported ops remain.
If: Model uses new operators (e.g., GroupNorm, Attention)
Use: Try decomposing into primitives or upgrade runtime. If neither works, consider TensorRT path.
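The decision rules above can be sketched as a small CI guard. The functions are hypothetical helpers, and the opset numbers in the usage lines are placeholders: take the real maxima from your exporter and from the release notes of every runtime you deploy to.

```python
def pick_export_opset(exporter_max_opset, runtime_max_opsets):
    """Highest opset that the exporter and every target runtime all support."""
    if not runtime_max_opsets:
        raise ValueError("at least one target runtime is required")
    return min(exporter_max_opset, *runtime_max_opsets)


def assert_opset_supported(model_opset, runtime_max_opset):
    """Fail the CI job when the exported model is newer than the runtime."""
    if model_opset > runtime_max_opset:
        raise RuntimeError(
            f"model uses opset {model_opset}, runtime supports up to {runtime_max_opset}"
        )


# Placeholder numbers for illustration only.
chosen = pick_export_opset(exporter_max_opset=18, runtime_max_opsets=[15, 17])
print("export with opset_version =", chosen)
assert_opset_supported(chosen, runtime_max_opset=15)
```

Running the guard in CI moves the failure from a 2 AM latency alert to a red build.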

ONNX Runtime — Execution Providers and Performance

ONNX Runtime (ORT) is the reference inference engine. It supports multiple execution providers — CPU, CUDA, TensorRT, OpenVINO, DirectML, etc. Each provider implements the operator kernels optimized for that hardware. ORT also applies graph optimizations: constant folding, operator fusion (e.g., Conv+BN+Relu into one kernel), and layout transformation. You can control the optimization level via GraphOptimizationLevel.

To achieve peak performance, you need to choose the right provider and set session options: enable parallel execution, set intra/inter op threads, and pick memory optimization. Benchmarking between native PyTorch, ONNX Runtime on CPU, and ONNX Runtime on CUDA is essential before picking a runtime.
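When benchmarking, an average alone hides tail latency, which is usually what production alerts fire on. Below is a minimal, runtime-agnostic harness that reports mean, median, and p99. `benchmark` is a hypothetical helper; in practice `run_once` would wrap a `session.run(...)` call or a PyTorch forward pass rather than the stand-in workload used here.

```python
import statistics
import time


def benchmark(run_once, warmup=10, iterations=100):
    """Time a zero-argument callable and summarize latency in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm caches, allocators, and lazy initialization

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
    }


# Stand-in workload so the sketch runs anywhere.
stats = benchmark(lambda: sum(range(1000)), warmup=2, iterations=50)
print(stats)
```

Using the same harness for every runtime keeps the comparison fair: identical warmup, identical iteration count, identical statistics.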

A common pitfall: TensorRT provider requires a separate NVIDIA TensorRT installation and may not support all ops. When a TensorRT kernel is missing, ORT falls back to CUDA or CPU — but the fallback can silently degrade latency. Always test with a representative set of inputs and monitor per-node execution providers.

Provider ordering matters: specify providers in priority list. ORT tries each provider in sequence per node. If the first provider doesn't have a kernel for that node, it moves to the next. This means you can end up with a mixed-provider execution: some nodes on TensorRT, some on CUDA, some on CPU. That's hard to diagnose without logging. Enable verbose logging (SessionOptions.log_severity_level = 0) to see node placements at session creation, or enable profiling (SessionOptions.enable_profiling = True) and read the provider recorded for each node in the resulting JSON trace.
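The priority-list behavior can be modeled in a few lines. This is a deliberately simplified simulation, not ONNX Runtime's actual partitioning code, and the per-provider operator sets are made up for illustration.

```python
def assign_providers(node_op_types, provider_priority, supported_ops):
    """Assign each node to the first provider in priority order that supports its op."""
    assignment = []
    for op_type in node_op_types:
        for provider in provider_priority:
            if op_type in supported_ops.get(provider, set()):
                assignment.append((op_type, provider))
                break
        else:
            raise RuntimeError(f"no provider has a kernel for {op_type}")
    return assignment


# Hypothetical kernel coverage, for illustration only.
supported = {
    "TensorrtExecutionProvider": {"Conv", "Relu", "MatMul"},
    "CUDAExecutionProvider": {"Conv", "Relu", "MatMul", "LayerNormalization"},
    "CPUExecutionProvider": {"Conv", "Relu", "MatMul", "LayerNormalization", "NonZero"},
}
priority = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

for op_type, provider in assign_providers(
    ["Conv", "Relu", "LayerNormalization", "NonZero"], priority, supported
):
    print(f"{op_type:20s} -> {provider}")
```

Four nodes end up spread over three providers, and every boundary between providers implies a potential device-to-host copy. That is where the unexpected latency comes from.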

io/thecodeforge/onnx_benchmark.py · PYTHON
import onnxruntime as ort
import numpy as np
import time

model_path = 'resnet50_opset15.onnx'

# Session with GPU and optimization
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

session = ort.InferenceSession(
    model_path,
    sess_options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

input_name = session.get_inputs()[0].name
dummy = np.random.randn(1,3,224,224).astype(np.float32)

# Warmup
for _ in range(10):
    session.run(None, {input_name: dummy})

# Benchmark
start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: dummy})
elapsed = time.perf_counter() - start
print(f'100 inferences: {elapsed:.2f}s, avg {elapsed/100*1000:.2f}ms')
Output
100 inferences: 1.23s, avg 12.3ms
Provider Order Matters
List providers in priority order: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']. ORT tries each in sequence. If TensorRT fails for a node, it moves to CUDA, then CPU.
Production Insight
Graph optimization can cause numerical differences (e.g., fused ops change rounding order).
Always validate outputs against original model with tolerance (atol=1e-3) after optimization.
For finicky models (e.g., quantized), use ORT_ENABLE_BASIC instead of ALL.
Key Takeaway
Profile providers: GPU not always faster (e.g., small models).
Graph optimization is a trade-off: speed vs numerical stability.
Always benchmark with real production batch sizes and shapes.
Choose Execution Provider
If: Model has standard ops (Conv, Relu, MatMul), GPU available
Use: CUDAExecutionProvider. Benchmark against CPU at batch size 1: GPU may be slower for small models.
If: Model has many fusion-friendly operators, need max throughput
Use: Try TensorrtExecutionProvider. Requires TRT installation and op compatibility check.
If: Deploying on CPU-only environments or with variable batch sizes
Use: CPUExecutionProvider with ORT_ENABLE_ALL optimizations and set intra_op_num_threads to match cores.

Quantization and Model Optimization

Quantization reduces model precision (e.g., FP32 to INT8) to shrink size and speed up inference. ONNX Runtime supports dynamic quantization (weights quantized ahead of time, activation quantization parameters computed on the fly at inference) and static quantization (both weights and activations quantized with precomputed parameters, which requires calibration data). Static quantization can give a 3-4x speedup on CPU with <1% accuracy loss, though the gain depends on the model and hardware.

The ONNX quantization workflow: (1) export FP32 ONNX, (2) calibrate with representative data, (3) use onnxruntime.quantization.quantize_static() to produce INT8 model, (4) compare accuracy against FP32 baseline. Beware of operators not supported for quantization (e.g., Softmax, LayerNormalization in some opsets) — those remain FP32 and become conversion bottlenecks.

A less-discussed detail: per-channel quantization can significantly improve accuracy for convolutional layers but requires the QDQ (Quantize-Dequantize) format. The older QOperator format is simpler but less accurate. Prefer QDQ for production INT8 deployments.

Another critical detail: static quantization requires a calibration dataset that represents real-world inputs. If your calibration data is too small or unrepresentative, the computed scales and zero points will be off, and accuracy degradation can exceed 5%. Always use at least 500 samples from the actual production distribution.

io/thecodeforge/onnx_quantize.py · PYTHON
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    CalibrationMethod,
    QuantFormat,
    QuantType,
    quantize_static,
)


class RandomCalibrationReader(CalibrationDataReader):
    """Feeds calibration batches. Swap the random tensors for real production samples."""

    def __init__(self, input_name='input', num_samples=100):
        self._batches = iter(
            {input_name: np.random.randn(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        )

    def get_next(self):
        # quantize_static keeps calling this until it returns None
        return next(self._batches, None)


quantize_static(
    model_input='resnet50_opset15.onnx',
    model_output='resnet50_int8.onnx',
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,  # Quantize-Dequantize format for ORT
    per_channel=True,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    calibrate_method=CalibrationMethod.MinMax,
    extra_options={'ActivationSymmetric': True},
)
Output
Quantization completed. Model size reduced from 98MB to 25MB.
Mental Model: Quantization = Lossy Compression
  • Dynamic quant: weights quantized offline, activations at runtime; easy, roughly 2x speedup, no calibration needed.
  • Static quant: weights+activations, harder, 4x speedup, requires calibration.
  • Quantization-aware training (QAT) yields best accuracy but requires retraining.
  • Not all operators support INT8 — check ort.quantization.get_qdq_config() for op list.
Production Insight
Static quantization on a model with custom ops (e.g., PReLU) silently skips those ops, leaving them FP32.
This creates a mixed-precision model that runs slower than expected.
Use onnxruntime.quantization.get_qdq_config() to see which ops will be quantized.
Key Takeaway
Quantize last — only after you have validated FP32 model end-to-end.
Always benchmark quantized model accuracy against FP32 on a hold-out set.
Not all ops quantize — check operator support before production.

Production Pitfalls and How to Avoid Them

The three most expensive mistakes teams make with ONNX in production:

  1. Opset version mismatch – export with latest opset but deploy on older runtime. Silent CPU fallback kills latency. Fix: pin opset version in CI, verify runtime version in deployment health check.
  2. Dynamic shapes not declared – models with variable batch size or sequence length need dynamic_axes parameter. Without it, ONNX freezes input shape. First inference with different size fails or produces garbage.
  3. Ignoring graph optimization effect – enabling all optimizations can change numerical outputs. For safety-critical apps (e.g., credit risk), validate with atol=1e-4 before enabling level 2 or 3.
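Pitfall 2 can be caught before deployment by comparing the shapes the model declares against the shapes production will send. ONNX Runtime reports symbolic dimensions as strings in `session.get_inputs()[0].shape`; the sketch below takes such a declared shape as a plain list so it runs without a model file. `shape_is_accepted` is a hypothetical helper.

```python
def shape_is_accepted(declared_shape, actual_shape):
    """True if `actual_shape` fits a declared shape with optional symbolic dims."""
    if len(declared_shape) != len(actual_shape):
        return False
    for declared, actual in zip(declared_shape, actual_shape):
        if declared is None or isinstance(declared, str):
            continue  # symbolic or unknown dimension: any size is allowed
        if declared != actual:
            return False
    return True


frozen = [1, 3, 224, 224]               # exported without dynamic_axes
dynamic = ["batch_size", 3, 224, 224]   # exported with dynamic_axes on dim 0

for batch in (1, 8):
    shape = (batch, 3, 224, 224)
    print(shape, "frozen:", shape_is_accepted(frozen, shape), "dynamic:", shape_is_accepted(dynamic, shape))
```

Running this check over the batch sizes and sequence lengths you expect in production is cheaper than discovering the frozen dimension on the first real request.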

Also: monitor ONNX Runtime logs for warnings about unsupported operators, and never assume GPU provider is used — always verify with session.get_providers().

One more hidden trap: TensorRT provider may silently fall back to CUDA or CPU for unsupported ops, but the fallback is per-node. You might see mixed providers in the same model, causing unpredictable latency. The only way to catch it is to inspect per-node provider assignment, either through ORT's verbose logging or through the profiling trace produced when SessionOptions.enable_profiling is set.
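One way to turn a profiling trace into a pass/fail signal is to count nodes per provider. The sketch below assumes the trace is a JSON list of events in which node events carry `"cat": "Node"`, a name ending in `_kernel_time`, and a `provider` entry under `args`. Treat those field names as assumptions and confirm them against a trace from your own ORT version. The sample events are invented.

```python
import json


def load_profile(path):
    """Read a profiling trace written by ONNX Runtime (a JSON list of events)."""
    with open(path) as handle:
        return json.load(handle)


def nodes_per_provider(profile_events):
    """Count distinct nodes per execution provider in a list of trace events."""
    nodes_by_provider = {}
    for event in profile_events:
        name = event.get("name", "")
        if event.get("cat") != "Node" or not name.endswith("_kernel_time"):
            continue
        provider = event.get("args", {}).get("provider", "unknown")
        nodes_by_provider.setdefault(provider, set()).add(name)
    return {provider: len(names) for provider, names in nodes_by_provider.items()}


# Invented events mimicking the assumed trace layout; the Conv node appears in two runs.
sample_events = [
    {"cat": "Session", "name": "model_run", "args": {}},
    {"cat": "Node", "name": "Conv_0_kernel_time", "args": {"op_name": "Conv", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "Relu_1_kernel_time", "args": {"op_name": "Relu", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "NonZero_2_kernel_time", "args": {"op_name": "NonZero", "provider": "CPUExecutionProvider"}},
    {"cat": "Node", "name": "Conv_0_kernel_time", "args": {"op_name": "Conv", "provider": "CUDAExecutionProvider"}},
]

print(nodes_per_provider(sample_events))
```

A staging check could then fail the rollout when the CPU count exceeds an agreed budget.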

A final warning: when using external data format, ensure the binary tensor files are accessible at the same relative path as the protobuf. External data locations are stored relative to the model file, so the tensor files must travel with the .onnx file. A missing tensor file makes loading fail with an error that may not name the missing file clearly. Always validate the model loads successfully after moving it to the deployment server.

io/thecodeforge/onnx_production_check.py · PYTHON
import onnxruntime as ort

def healthcheck(model_path, expected_provider='CUDAExecutionProvider'):
    try:
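        # Request the provider explicitly; GPU builds of ORT expect a providers list.
        session = ort.InferenceSession(
            model_path,
            providers=[expected_provider, 'CPUExecutionProvider'],
        )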
        providers = session.get_providers()
        if expected_provider not in providers:
            print(f'WARNING: {expected_provider} not active. Providers: {providers}')
            return False
        print(f'OK: {expected_provider} active')
        return True
    except Exception as e:
        print(f'FAIL: {e}')
        return False

healthcheck('resnet50_opset15.onnx', 'CUDAExecutionProvider')
Output
OK: CUDAExecutionProvider active
External Data File Trap
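If your model uses external tensor files, ensure they exist at the same relative paths on the deployment host. Do not treat a successful file copy as proof: create the session and run one sample inference during the deployment health check, so a missing file surfaces before production traffic does. Otherwise it is a silent production killer.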
Production Insight
A fintech company deployed an ONNX model for loan default prediction. Graph optimization changed a rounding behavior, causing 0.5% prediction flip. They caught it because they had a validation test comparing FP32 ONNX vs PyTorch outputs.
Rule: never promote a new ONNX model to production without a regression test suite that compares outputs.
Key Takeaway
Opset version, dynamic shapes, and optimization effects are the three killers.
Add a health check that verifies provider and runs a sample inference.
The production pipeline must include ONNX validation, not just PyTorch validation.
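A regression check of the kind described above comes down to a tolerance comparison between the reference outputs and the ONNX outputs. The sketch below works on nested Python lists so it runs anywhere; with NumPy arrays, `np.testing.assert_allclose(reference, candidate, atol=...)` does the same job. `outputs_match` is a hypothetical helper and the sample values are invented.

```python
import math


def _flatten(values):
    """Yield floats from arbitrarily nested lists or tuples."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from _flatten(value)
        else:
            yield float(value)


def outputs_match(reference, candidate, atol=1e-3):
    """Return (ok, max_abs_diff) for two nested sequences of numbers."""
    ref = list(_flatten(reference))
    cand = list(_flatten(candidate))
    if len(ref) != len(cand):
        return False, math.inf
    if any(math.isnan(value) for value in cand):
        return False, math.nan  # NaN never passes a tolerance check
    max_diff = max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)
    return max_diff <= atol, max_diff


pytorch_out = [[0.1000, 0.7000, 0.2000]]   # invented reference values
onnx_out = [[0.1004, 0.6998, 0.1998]]      # invented candidate values

ok, diff = outputs_match(pytorch_out, onnx_out, atol=1e-3)
print("match:", ok, "max abs diff:", diff)
```

Returning the maximum difference alongside the verdict makes it easy to track drift across ONNX Runtime upgrades instead of only seeing pass or fail.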

ONNX with TensorRT and Hardware Acceleration

NVIDIA TensorRT is a high-performance inference optimization SDK for NVIDIA GPUs. ONNX Runtime integrates TensorRT as an execution provider, allowing you to leverage TensorRT's layer fusion, kernel auto-tuning, fp16/INT8 precision, and memory management. However, the TensorRT provider has specific requirements and limitations.

To use it, install TensorRT (version 8.6+ recommended) and a GPU build of ONNX Runtime (pip install onnxruntime-gpu; the TensorRT libraries themselves are installed separately). Then specify TensorrtExecutionProvider in the provider list, ideally as the first priority. TensorRT will attempt to build an optimized engine from the ONNX graph. This build can take several minutes for large models, so consider caching the engine with the trt_engine_cache_enable provider option.

Not all ONNX operators have TensorRT kernels. Unsupported ops fall back to CUDA or CPU. This per-node fallback can lead to mixed-precision execution where some layers run in fp16 and others in fp32, causing unexpected latency spikes. To identify which nodes run on TensorRT, enable verbose logging or inspect the profiling trace.

Another limitation: TensorRT requires fixed input shapes unless you enable dynamic shape support (newer TensorRT versions support this). If your model has dynamic axes, you may need to specify optimization profiles with the trt_profile_min_shapes, trt_profile_opt_shapes, and trt_profile_max_shapes provider options. Missing this can cause engine build failure or shape mismatch at inference.
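The profile options take shape strings of the form `name:d1xd2x...`, with multiple inputs separated by commas. Building them from tuples avoids typos and lets you check that min <= opt <= max holds per dimension. The helpers below are hypothetical, and the exact string format is worth confirming against the TensorRT execution provider documentation for your ORT version.

```python
def format_trt_profile_shapes(shapes):
    """Render {'input': (1, 3, 224, 224)} as 'input:1x3x224x224'."""
    return ",".join(
        f"{name}:{'x'.join(str(dim) for dim in dims)}"
        for name, dims in shapes.items()
    )


def build_trt_profile_options(min_shapes, opt_shapes, max_shapes):
    """Validate min <= opt <= max per dimension and return provider options."""
    for name in min_shapes:
        triples = zip(min_shapes[name], opt_shapes[name], max_shapes[name])
        if not all(low <= opt <= high for low, opt, high in triples):
            raise ValueError(f"profile for {name!r} must satisfy min <= opt <= max")
    return {
        "trt_profile_min_shapes": format_trt_profile_shapes(min_shapes),
        "trt_profile_opt_shapes": format_trt_profile_shapes(opt_shapes),
        "trt_profile_max_shapes": format_trt_profile_shapes(max_shapes),
    }


options = build_trt_profile_options(
    min_shapes={"input": (1, 3, 224, 224)},
    opt_shapes={"input": (8, 3, 224, 224)},
    max_shapes={"input": (32, 3, 224, 224)},
)
print(options)
```

The resulting dictionary is meant to be passed as the options half of a `('TensorrtExecutionProvider', options)` tuple in the `providers` list.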

Production tip: always benchmark TensorRT vs CUDA provider on your specific model and batch size. TensorRT excels at large batch sizes and models with many fusion opportunities. For small, simple models, the engine build overhead may not be worth it.

io/thecodeforge/onnx_tensorrt_session.py · PYTHON
import onnxruntime as ort
import numpy as np

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# TensorRT settings are provider options, passed alongside the provider name
trt_options = {
    'trt_engine_cache_enable': True,        # reuse built engines across sessions
    'trt_engine_cache_path': './trt_cache',
    # Optimization profile covering the dynamic batch dimension
    'trt_profile_min_shapes': 'input:1x3x224x224',
    'trt_profile_opt_shapes': 'input:8x3x224x224',
    'trt_profile_max_shapes': 'input:32x3x224x224',
}

session = ort.InferenceSession(
    'resnet50_opset15.onnx',
    sess_options,
    providers=[
        ('TensorrtExecutionProvider', trt_options),
        'CUDAExecutionProvider',
        'CPUExecutionProvider',
    ],
)

input_name = session.get_inputs()[0].name
dummy = np.random.randn(1,3,224,224).astype(np.float32)
output = session.run(None, {input_name: dummy})
print('Inference succeeded with TensorRT')
Output
Inference succeeded with TensorRT
TensorRT engine built and cached (took 45s on first run)
TensorRT Engine Build Time
The first inference with TensorRT provider triggers an engine build that can take minutes. Use engine caching (trt_engine_cache_enable) to avoid rebuilding on every session creation. Monitor the build time in staging before deploying to prod.
Production Insight
A team deployed a BERT model with TensorRT provider but didn't set optimization profiles for dynamic sequence lengths. The engine built with fixed 128 tokens, but production inputs varied from 32 to 256. ONNX Runtime fell back to CUDA for every inference, making TensorRT useless.
Always test with production-like input shapes and verify actual provider usage with verbose logging or a profiling trace.
Key Takeaway
TensorRT provider can deliver 2-5x speedup on NVIDIA GPUs but requires careful configuration.
Engine build time and dynamic shape profiles are the main production hurdles.
Always verify which provider each node actually ran on — don't assume TensorRT coverage.
Decide When to Use TensorRT Provider
If: Model is large (ResNet-50 or bigger), GPU is NVIDIA, batch size >= 8
Use: TensorRT provider. Benchmark against CUDA provider for throughput.
If: Model has many small custom ops or dynamic shapes that vary widely
Use: Stick with CUDAExecutionProvider. TensorRT may degrade performance due to fallbacks.
If: Deploying on edge devices (Jetson) or need INT8 quantization
Use: TensorRT is the best choice, since it provides hardware-specific optimizations not available in the CUDA provider.
● Production incident · POST-MORTEM · severity: high

Opset 18 Export Killed Triton Inference at 2 AM

Symptom
Latency jumped from 12ms to 340ms per inference. No errors, just slow. GPU utilization dropped to 5%.
Assumption
The team assumed later opsets are backward compatible. They are — but only if the runtime implements every operator in that opset. Older runtimes may have partial operator sets.
Root cause
The model used torch.nn.functional.scaled_dot_product_attention, and the opset 18 export produced attention operators that ONNX Runtime 1.12 (which supports up to opset 17) had no GPU kernel for, so those nodes fell back to CPU.
Fix
Downgraded export to opset 15 by setting opset_version=15 in torch.onnx.export(). Re-exported and verified GPU operators were used. Alternatively, upgrade ONNX Runtime to 1.16+.
Key lesson
  • Always check the target runtime's supported opset version before export.
  • Use onnxruntime.get_available_providers() and onnxruntime.get_device() to confirm GPU is active.
  • Pin opset version to the minimum common denominator unless you control both sides.
Production debug guide
Symptom → Action flow for the four most common production ONNX problems
Symptom · 01
Model exports successfully but ONNX Runtime returns wrong outputs or NaN
Fix
Compare intermediate tensor values between PyTorch and ONNX Runtime using onnxruntime.InferenceSession with output_names and input_feed. Use torch.onnx.export(..., verbose=True) to dump operator list.
Symptom · 02
Export fails with torch.onnx.errors.OnnxExporterError about unsupported operator
Fix
Identify the unsupported op: search ONNX operator docs for equivalent op. Often you need to replace custom ops with ONNX-compatible alternatives (e.g., torch.where vs custom masking). Use dynamic_axes to handle variable-length inputs.
Symptom · 03
ONNX Runtime runs on CPU despite GPU availability
Fix
Verify provider list: providers=['CUDAExecutionProvider', 'CPUExecutionProvider']. Ensure CUDA version matches ORT's build. Check onnxruntime.get_device() returns 'GPU'. Use session.get_providers() to confirm CUDA is active.
Symptom · 04
Model loads but first inference fails with external data file error
Fix
Check that external tensor files exist at the expected relative path from the .onnx file. Use onnx.load(model_path, load_external_data=False) to inspect external data references. Ensure the deployment process copies both the .onnx and the associated .bin (or .data) files.
★ Quick Debug: ONNX Export & Runtime
Run these commands in order when an ONNX model behaves unexpectedly in production.
Wrong predictions
Immediate action
Disable graph optimization to rule out fusion bugs
Commands
import numpy as np, onnxruntime as ort
opts = ort.SessionOptions(); opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession(model_path, sess_options=opts, providers=['CPUExecutionProvider'])
np.testing.assert_allclose(pytorch_output, onnx_output, atol=1e-3)
Fix now
If outputs match without optimization, re-enable optimization level ORT_ENABLE_BASIC and retest.
Export fails with unsupported op
Immediate action
Find the offending PyTorch operation
Commands
python export_model.py 2>&1 | grep -i 'unsupported'   # export_model.py is a placeholder for your script calling torch.onnx.export(model, dummy_input, 'model.onnx', opset_version=18, verbose=True)
Check ONNX operator registry: https://github.com/onnx/onnx/blob/main/docs/Operators.md
Fix now
Replace the unsupported op with a combination of supported ops (e.g., use torch.where + torch.mul instead of custom masking).
Runtime uses CPU despite GPU available
Immediate action
List available providers and device
Commands
ort.get_available_providers()
ort.get_device()
Fix now
If 'CUDAExecutionProvider' not in providers, reinstall ONNX Runtime with CUDA: pip install onnxruntime-gpu and verify CUDA version matches.
External data file missing at inference
Immediate action
Check the ONNX model file size and look for .data or .bin files in the deployment directory
Commands
ls -la $(dirname model.onnx)/*.data
onnx.load('model.onnx', load_external_data=False).graph.initializer[0].external_data
Fix now
Ensure all external data files are present in the same directory as the .onnx file, or re-save the model with onnx.save_model(..., save_as_external_data=True, location=...) so the references match the deployed layout.
ONNX vs Other Model Formats
| Feature | ONNX | TensorFlow SavedModel | PyTorch TorchScript | TensorRT Engine |
| --- | --- | --- | --- | --- |
| Vendor neutrality | Open standard (Microsoft/Facebook) | TensorFlow-specific | PyTorch-specific | NVIDIA-specific |
| Target hardware | Any (via providers) | CPU/GPU/TPU | CPU/GPU (no mobile optimised) | NVIDIA GPU only |
| Graph optimization | Built-in ORT optimizations | Grappler, XLA | JIT optimizations | Layer fusion, fp16/INT8 |
| Quantization support | Dynamic, static, QAT via onnxruntime.quantization | TFLite, QAT | Not natively; PyTorch provides torch.quantization | INT8, fp16 with calibration |
| Production deployment | ONNX Runtime, Triton, Azure ML | TF Serving, TF Lite | TorchServe, LibTorch | TensorRT (standalone or integrated) |
| Opset versioning | Explicit, versioned operators | No versioning per op; graph versioned | Not applicable (script/trace) | Not applicable (compiled engine) |

Key takeaways

  1. ONNX is an open intermediate representation that decouples model training from inference.
  2. Opset version mismatch is the #1 production killer: always pin and verify.
  3. Always declare dynamic axes for variable-length inputs.
  4. Graph optimization can change numerical outputs: validate against baseline.
  5. Quantization (especially static) gives large speedups but requires calibration data and operator support.
  6. External data format solves the 2GB protobuf limit but introduces path dependency issues.
  7. TensorRT provider requires careful configuration and validation: don't assume all ops run on TensorRT.

Common mistakes to avoid

5 patterns
×

Exporting with default opset without checking runtime version

Symptom
Model runs fine in dev but on prod server latency spikes 10-20x because GPU kernels fall back to CPU. No error logs.
Fix
Set opset_version explicitly to a version your target runtime supports, no higher than its maximum. Verify runtime version in CI with ort.__version__.
×

Forgetting to declare dynamic axes for variable-length inputs

Symptom
First inference with different batch size or sequence length fails with shape mismatch error or produces garbage outputs.
Fix
Use dynamic_axes parameter in torch.onnx.export() to mark dimensions that can vary. Always test with multiple batch sizes after export.
×

Assuming all operators are supported in ONNX

Symptom
Export fails with unsupported operator error, especially for custom or legacy PyTorch ops (e.g., torch.einsum with certain patterns).
Fix
Search ONNX operator list or replace with equivalent ops (e.g., use torch.bmm for batch matrix multiply). For custom ops, implement a custom operator in ONNX Runtime.
×

Using external data format without preserving relative paths

Symptom
Model loads successfully but first inference fails with cryptic protobuf error or segfault. Tensor files missing or path mismatch.
Fix
Always validate that external data files are present and at the expected relative paths before deploying. Use onnx.load(model_path, load_external_data=False) to check the external data references.
×

Using TensorRT provider without verifying operator compatibility or engine caching

Symptom
First inference takes 5+ minutes (engine build) and subsequent runs still fall back to CUDA for many nodes because TensorRT doesn't support some ops.
Fix
Enable TensorRT engine caching with trt_engine_cache_enable. Use verbose logging or a profiling trace to check which ops ran on TensorRT. Consider adding fallback ops to CUDA only if necessary, or rewrite model to avoid unsupported ops.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
What is ONNX and why was it created?
Q02 · SENIOR
How do you handle control flow (if/else, loops) when exporting a PyTorch...
Q03 · SENIOR
What is the difference between dynamic and static quantization in ONNX R...
Q04 · SENIOR
Your ONNX model runs on GPU in dev but falls back to CPU in production. ...
Q05 · SENIOR
How does ONNX handle large models that exceed the 2GB protobuf limit?
Q06 · SENIOR
Explain how to set up ONNX Runtime with TensorRT and what are the key pi...
Q01 of 06 · JUNIOR

What is ONNX and why was it created?

ANSWER
ONNX is an open intermediate representation for machine learning models. It was created to decouple training frameworks from inference engines, enabling model portability. Instead of maintaining separate export pipelines for TensorFlow Serving, TensorRT, Core ML, etc., you export once to ONNX and any compliant runtime can execute it.
FAQ · 7 QUESTIONS

Frequently Asked Questions

  1. Can I convert any PyTorch model to ONNX?
  2. How do I know which ONNX opset version my runtime supports?
  3. What is the file size limit for an ONNX protobuf model?
  4. Why does my ONNX model run slower on GPU than expected?
  5. How do I troubleshoot a mismatch between PyTorch and ONNX outputs?
  6. Does ONNX Runtime support TensorRT?
  7. How do I benchmark ONNX Runtime performance against native PyTorch?