Intermediate results with (ONNX) ReferenceEvaluator

Let’s assume onnxruntime crashes without saying why or where. The first thing to do is to locate where. For that, we use a Python runtime that executes the model until it fails.

A failing model

The issue here is an operator Cast trying to convert a result into a non-existing type.

import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime
from onnx_diagnostic import doc
from onnx_diagnostic.helpers.onnx_helper import from_array_extended
from onnx_diagnostic.reference import ExtendedReferenceEvaluator

TFLOAT = onnx.TensorProto.FLOAT

model = oh.make_model(
    oh.make_graph(
        [
            oh.make_node("Mul", ["X", "Y"], ["xy"], name="n0"),
            oh.make_node("Sigmoid", ["xy"], ["sy"], name="n1"),
            oh.make_node("Add", ["sy", "one"], ["C"], name="n2"),
            oh.make_node("Cast", ["C"], ["X999"], to=999, name="failing"),
            oh.make_node("CastLike", ["X999", "Y"], ["Z"], name="n4"),
        ],
        "-nd-",
        [
            oh.make_tensor_value_info("X", TFLOAT, ["a", "b", "c"]),
            oh.make_tensor_value_info("Y", TFLOAT, ["a", "b", "c"]),
        ],
        [oh.make_tensor_value_info("Z", TFLOAT, ["a", "b", "c"])],
        [from_array_extended(np.array([1], dtype=np.float32), name="one")],
    ),
    opset_imports=[oh.make_opsetid("", 18)],
    ir_version=9,
)

We check it is failing.

try:
    onnxruntime.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
except onnxruntime.capi.onnxruntime_pybind11_state.Fail as e:
    print(e)
[ONNXRuntimeError] : 1 : FAIL : Node (failing) Op (Cast) [TypeInferenceError] Attribute to does not specify a valid type in .

ExtendedReferenceEvaluator

This class extends onnx.reference.ReferenceEvaluator with operators outside the standard but defined by onnxruntime. verbose=10 tells the class to print as much as possible, verbose=0 prints nothing. Values in between give intermediate levels of verbosity.

ref = ExtendedReferenceEvaluator(model, verbose=10)
feeds = dict(
    X=np.random.rand(3, 4).astype(np.float32), Y=np.random.rand(3, 4).astype(np.float32)
)
try:
    ref.run(None, feeds)
except Exception as e:
    print("ERROR", type(e), e)
 +C one: float32:(1,):[1.0]
 +I X: float32:(3, 4):0.19803810119628906,0.10446123033761978,0.7083548903465271,0.5814882516860962,0.7943190932273865...
 +I Y: float32:(3, 4):0.3212786614894867,0.5833426117897034,0.8408930897712708,0.8432177901268005,0.45993563532829285...
Mul(X, Y) -> xy
 + xy: float32:(3, 4):0.06362541764974594,0.060936685651540756,0.5956507325172424,0.49032124876976013,0.3653356432914734...
Sigmoid(xy) -> sy
 + sy: float32:(3, 4):0.5159009695053101,0.5152294635772705,0.6446606516838074,0.6201820969581604,0.5903314352035522...
Add(sy, one) -> C
 + C: float32:(3, 4):1.51590096950531,1.5152294635772705,1.6446607112884521,1.6201820373535156,1.5903314352035522...
Cast(C) -> X999
ERROR <class 'KeyError'> 999

We can see it runs until it reaches Cast and stops. The error message is not always easy to interpret, although it gets improved from time to time. This runtime is also useful when a model fails for a numerical reason. Because it is written in Python, it is possible to insert prints in the code to display more information or to debug if needed.
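The principle behind this approach, evaluating nodes one at a time and stopping at the first failure, fits in a few lines of plain Python. The code below is a toy working on Python floats with made-up names (`KERNELS`, `run_until_failure`, `toy_nodes`). It is not how ExtendedReferenceEvaluator is implemented, only an illustration of why a step-by-step runtime can name the failing node.

```python
import math

# Toy kernels working on Python floats; a real runtime works on tensors.
KERNELS = {
    "Mul": lambda a, b: a * b,
    "Sigmoid": lambda a: 1.0 / (1.0 + math.exp(-a)),
    "Add": lambda a, b: a + b,
}


def run_until_failure(nodes, feeds, verbose=False):
    """Evaluate nodes in order; return (results, name of the failing node or None)."""
    results = dict(feeds)
    for name, op_type, inputs, output in nodes:
        try:
            kernel = KERNELS[op_type]  # raises KeyError for an unknown operator
            results[output] = kernel(*(results[i] for i in inputs))
        except Exception as exc:
            if verbose:
                print(f"node {name!r} ({op_type}) failed: {type(exc).__name__}: {exc}")
            return results, name
        if verbose:
            print(f"{op_type}({', '.join(inputs)}) -> {output} = {results[output]}")
    return results, None


# Each node is (name, operator type, input names, output name).
toy_nodes = [
    ("n0", "Mul", ["X", "Y"], "xy"),
    ("n1", "Sigmoid", ["xy"], "sy"),
    ("failing", "NoSuchOp", ["sy"], "Z"),
]
toy_results, toy_failed = run_until_failure(toy_nodes, {"X": 2.0, "Y": 0.5}, verbose=True)
```

Every result computed before the failure stays available in `toy_results`, which is what makes it possible to inspect the inputs of the failing node.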

doc.plot_legend("Python Runtime\nfor ONNX", "ExtendedReferenceEvaluator", "lightgrey")

Related examples

Intermediate results with onnxruntime

Find where a model is failing by running submodels

Export microsoft/phi-2
