MLOps study - Raviraja Week 4: ONNX

22 Oct 2022

Week 4 is about ONNX. We generally train models using PyTorch or TensorFlow. But where does inference take place? It depends on the user. It could be a PC, or it could be a mobile environment. Training might be in PyTorch, but inference might have to be performed in TensorFlow. For situations like this, let’s look at model packaging using ONNX.

Start ONNX
ONNX runtime
Netron

Start ONNX

Model packaging means converting a trained model into an ONNX model so that users can run inference with whichever framework they want.

The conversion is simple for both PyTorch and PyTorch-lightning models. With plain PyTorch you call torch.onnx.export, while a PyTorch-lightning model has its own to_onnx method. Either way, the information needed to create an ONNX model is as follows.

Path of the ONNX model (where the .onnx file is saved)
Input sample
Input names (These are assigned in order to the graph inputs. You can specify names not only for the initial inputs but also for the inputs of each layer. If you specify fewer names than there are inputs, the remaining ones are named automatically.)
Output names (the names of the outputs)
Dynamic axes (The axes whose size may vary from input to input. Typically this is axis 0, the batch axis, or the sequence-length axis in an RNN.)

If you run the code below, a .onnx file is created!

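#### Preparing the model and sample
import torch  # needed for torch.onnx.export below
# ColaModel and DataModule are this project's own classes; import them from wherever they are defined.
model_path = "models/best-checkpoint.ckpt"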
cola_model = ColaModel.load_from_checkpoint(model_path)
data_model = DataModule()
data_model.prepare_data()
data_model.setup()

input_batch = next(iter(data_model.train_dataloader()))
input_sample = {
    "input_ids": input_batch["input_ids"][0].unsqueeze(0),
    "attention_mask": input_batch["attention_mask"][0].unsqueeze(0),
}

#### When using PyTorch
torch.onnx.export(
    cola_model,  # model being run
    (
        input_sample["input_ids"],
        input_sample["attention_mask"],
    ),  # model input (or a tuple for multiple inputs)
    "models/model.onnx",  # where to save the model
    export_params=True,
    opset_version=10,
    verbose=True,
    input_names=["input_ids", "attention_mask"],  # the model's input names
    output_names=["output"],  # the model's output names
    dynamic_axes={            # variable length axes
        "input_ids": {0: "batch_size"},
        "attention_mask": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)

#### When using PyTorch-lightning
cola_model.to_onnx(
    "models/model.onnx",  # where to save the model
    input_sample,  # input sample with a batch size of at least 1
    export_params=True,
    opset_version=10,
    input_names=["input_ids", "attention_mask"],  # the model's input names
    output_names=["output"],  # the model's output names
    dynamic_axes={            # variable length axes
        "input_ids": {0: "batch_size"},
        "attention_mask": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)
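The export calls above mark only axis 0 (the batch axis) as dynamic. As the list earlier mentions, sequence length is another axis that often varies. Below is a minimal sketch of building a dynamic_axes dictionary that marks both. The helper build_dynamic_axes and the label "sequence_length" are hypothetical names of mine, and the sketch assumes the inputs have shape (batch, sequence). Whether the exported model really handles variable sequence lengths depends on the model itself.

```python
def build_dynamic_axes(input_names, output_names, dynamic_sequence=False):
    """Build a dynamic_axes mapping: tensor name -> {axis index: axis label}.

    Hypothetical helper for illustration; assumes inputs are shaped
    (batch, sequence) and each output has one row per example.
    """
    input_axes = {0: "batch_size"}
    if dynamic_sequence:
        # Axis 1 is the sequence axis under the (batch, sequence) assumption.
        input_axes[1] = "sequence_length"
    axes = {name: dict(input_axes) for name in input_names}
    # Outputs vary only with the batch size in this sketch.
    axes.update({name: {0: "batch_size"} for name in output_names})
    return axes


dynamic_axes = build_dynamic_axes(
    ["input_ids", "attention_mask"], ["output"], dynamic_sequence=True
)
print(dynamic_axes)
```

The resulting dictionary has the same structure as the dynamic_axes argument used above, so it could be passed to either export call in its place.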

ONNX runtime

ONNX Runtime is the inference engine for ONNX models. First, find and install the onnxruntime-gpu version that matches your CUDA version. If the versions don't match, you can't do GPU-based inference. (See the CUDA-ONNX Runtime version table.)

python -m pip install onnxruntime-gpu==<version>

ONNX Runtime is built to operate easily across different OSes and HW (accelerators). By HW, I mean things like GPUs, NPUs, and TPUs. Also, the official homepage explains that the library can be used from a variety of languages, including not only Python but also C++, Java, and Ruby. You can check all the execution providers (HW backends) ONNX Runtime knows about, and the ones actually available on the current platform, as below. Unfortunately, my lab's server has CUDA 9.1, so it seems onnxruntime-gpu isn't supported there... :(

from onnxruntime import get_all_providers, get_available_providers
print(get_all_providers())
print(get_available_providers())
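Once you know which providers are available, you usually want the GPU one when it exists and the CPU one otherwise. Here is a minimal sketch of that choice. The helper choose_providers is a hypothetical name of mine, and the available list is hard-coded to imitate a CPU-only machine so that the example runs without onnxruntime installed.

```python
def choose_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that are actually available, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {list(preferred)} found in {list(available)}")
    return chosen


# Assumption: this stands in for the result of get_available_providers()
# on a machine without a usable GPU build.
available = ["CPUExecutionProvider"]
providers = choose_providers(available)
print(providers)  # ['CPUExecutionProvider']

# With onnxruntime installed, the list could be passed on like this:
# ort_session = ort.InferenceSession("models/model.onnx", providers=providers)
```

Listing the CPU provider last keeps a fallback in place when the GPU provider can't be used.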

Let's try doing inference with this. You just load the ONNX model into an InferenceSession and call run with the inputs. The first argument of run (output_name in the code below) is a list of output names, used when you want only specific outputs as the result. If you set it to None, it returns all outputs.

import onnxruntime as ort
onnx_model_path = 'models/model.onnx'
ort_session = ort.InferenceSession(onnx_model_path)
ort_inputs = {
    "input_ids": input_sample["input_ids"].numpy(),
    "attention_mask": input_sample["attention_mask"].numpy(),
}
output_name = None
ort_output = ort_session.run(output_name, ort_inputs)

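What's interesting is that ONNX Runtime inference is much faster than PyTorch inference (about 3 times in the measurement below). Yet on Google, nobody seems to question why it's faster... (Maybe it's just well optimized...??) My first guess was that PyTorch runs quite a lot of unnecessary operations, such as storing all the computed gradients. But the PyTorch call below already runs under torch.no_grad(), so gradient storage alone can't explain the gap. More likely the rest comes from the per-operation overhead of eager execution in Python, while ONNX Runtime works on a fixed graph and applies graph optimizations such as constant folding and node fusion when it loads the model. So if you strip the overhead away and leave only the computation part, it makes sense that you can make it about 3 times faster (just my speculation).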

from time import time

ort_time = time()
ort_output = ort_session.run(output_name, ort_inputs)
print("ONNX inference time:", time()-ort_time, "sec")

pt_time = time()
with torch.no_grad():
    pt_output = cola_model(**input_sample)
print("PyTorch inference time:", time()-pt_time, "sec")

Result

ONNX inference time: 0.004585742950439453 sec
PyTorch inference time: 0.015786170959472656 sec
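The numbers above come from a single call each, so they can include one-off costs and noise. A slightly more careful comparison would warm up first and take the median of several runs. Below is a minimal sketch of such a timer. The benchmark helper and the dummy_inference workload are hypothetical stand-ins of mine so that the block runs on its own, and I haven't re-measured the models with it.

```python
import statistics
from time import perf_counter


def benchmark(fn, warmup=5, repeats=50):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return statistics.median(timings)


def dummy_inference():
    """Stand-in workload (assumption); replace with a real inference call."""
    return sum(i * i for i in range(10_000))


median_sec = benchmark(dummy_inference)
print(f"median time: {median_sec:.6f} sec")

# With the objects from this post, the calls would look like:
# benchmark(lambda: ort_session.run(output_name, ort_inputs))
# benchmark(lambda: cola_model(**input_sample))  # run inside a torch.no_grad() block
```

The median is used instead of the mean so that a few slow outlier runs don't dominate the result.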

Netron

Netron is a program that shows the structure of a model. TensorBoard is a very useful visualization tool for the structure of a model you implemented yourself. But once ONNX has packaged the model away, it's hard to figure out its structure, because you can't check the code directly. Also, if you only have the .onnx file without knowing the names of the inputs and outputs, the model is hard to use. In this case, you can use a tool called Netron. It supports a variety of model files such as .onnx and .pt. With just a simple upload, it draws the model for you! It shows most of the information you need, such as the name of each layer and the shapes of the inputs and outputs.


Jae-Kyung Cho Being unique is better than being perfect
