Expected API#

yobx.sklearn.to_onnx() accepts a builder_cls parameter that defaults to yobx.xbuilder.GraphBuilder. Any object can be substituted as long as it exposes the two-part API described on this page.

The API is split into two groups that mirror the cross-references used in the source code:

  • Construction API (Building a graph from scratch) — methods to declare inputs, outputs, initializers, and nodes, and to export the finished graph.

  • Shape / type API (Shape and type tracking) — methods to attach and query shape and type metadata on intermediate tensors.

An alternative bridge implementation, OnnxScriptGraphBuilder, shows how the same API can be satisfied on top of onnxscript’s IR.

When any ONNXSTOP* variable triggers an exception, the resulting stack trace points to the exact line of converter code that first assigned a type or shape to that result.

Why use strings to refer to intermediate results?

A user usually only sees the final model and can only investigate an issue based on the names they read there. Keeping explicit, stable names for intermediate results in converter code makes it easy to locate the code where a given name appears. With that in mind, a protocol wrapping each value seems unnecessary, and the creation of the final name should not be delayed. This makes it easier to investigate issues such as those exposed in Debugging with Environment Variables.
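A deterministic naming scheme makes the mapping from an output name back to converter code obvious. The helper below is purely illustrative (it is not part of the package); it mimics the `_onx_sub_X` / `_onx_div_sub_X` style of names that appear in the example output later on this page:

```python
def derive_name(op_type: str, first_input: str) -> str:
    """Hypothetical naming scheme: prefix the lowercased operator name,
    stripping a previous '_onx_' prefix so chained names stay readable."""
    base = first_input[len("_onx_"):] if first_input.startswith("_onx_") else first_input
    return f"_onx_{op_type.lower()}_{base}"

print(derive_name("Sub", "X"))           # _onx_sub_X
print(derive_name("Div", "_onx_sub_X"))  # _onx_div_sub_X
```

Given such a scheme, a name seen in a faulty model immediately tells the reader which operators produced it.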

Construction API#

  • __init__(target_opset or existing ModelProto or FunctionProto) — constructor. target_opset is either an int (main domain) or a Dict[str, int] mapping domain names to versions.

  • make_tensor_input(name, elem_type, shape, device=-1) — declare a graph input tensor. elem_type is an onnx.TensorProto.* integer constant. shape is a tuple whose elements are integers (static) or strings (symbolic/dynamic). device is -1 for CPU.

  • make_tensor_output(name, indexed=False, allow_untyped_output=True) — declare a graph output. When indexed=False the name is used verbatim; set True when the output name is generated by the builder (e.g. "output_0", "output_1", …).

  • make_initializer(name, value, source="") — add a constant tensor to the graph. value can be a numpy.ndarray, a scalar, or an onnx.TensorProto. Returns the name that was assigned (it may differ from name if the builder deduplicates identical constants).

  • make_node(op_type, inputs, num_outputs, *, domain="", name="", **attrs) — low-level node creation. Returns a sequence of output tensor names.

  • op.<OpType>(*inputs, **attrs) — convenience shorthand: g.op.Relu("X") is equivalent to g.make_node("Relu", ["X"], 1). Inline numpy arrays are automatically promoted to initializers.

  • to_onnx(...) — finalise and return an onnx.ModelProto.

Minimal example#

The snippet below builds the same Sub / Div graph emitted by the StandardScaler converter, using the default GraphBuilder:

<<<

import numpy as np
import onnx
from yobx.xbuilder import GraphBuilder, OptimizationOptions
from yobx.helpers.onnx_helper import pretty_onnx

TFLOAT = onnx.TensorProto.FLOAT

opts = OptimizationOptions(constant_folding=False)
g = GraphBuilder(20, ir_version=10, optimization_options=opts)
g.make_tensor_input("X", TFLOAT, ("batch", 4))

mean = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
scale = np.array([0.5, 1.0, 2.0, 4.0], dtype=np.float32)
mean_name = g.make_initializer("mean", mean)
scale_name = g.make_initializer("scale", scale)

centered = g.op.Sub("X", mean_name)
g.set_type(centered, TFLOAT)
g.set_shape(centered, ("batch", 4))

result = g.op.Div(centered, scale_name)
g.set_type(result, TFLOAT)
g.set_shape(result, ("batch", 4))

g.make_tensor_output(result, indexed=False, allow_untyped_output=True)
model = g.to_onnx()
print(pretty_onnx(model))

>>>

    opset: domain='' version=20
    input: name='X' type=dtype('float32') shape=['batch', 4]
    init: name='mean' type=float32 shape=(4,) -- array([1., 2., 3., 4.], dtype=float32)
    init: name='scale' type=float32 shape=(4,) -- array([0.5, 1. , 2. , 4. ], dtype=float32)
    Sub(X, mean) -> _onx_sub_X
      Div(_onx_sub_X, scale) -> _onx_div_sub_X
    output: name='_onx_div_sub_X' type='NOTENSOR' shape=None

Opset API#

Converters frequently need to know which opset versions are active so they can choose the right operator variant or register an additional domain (e.g. "ai.onnx.ml" for scikit-learn models).

  • main_opset — read-only property. Returns the opset version for the main ONNX domain (""). Equivalent to g.opsets[""].

  • has_opset(domain) — returns the opset version (an int) for domain, or 0 if the domain is not registered. Because 0 is falsy and any valid version is truthy, the return value can be used directly in a boolean context: if g.has_opset("ai.onnx.ml"): ....

  • get_opset(domain, exc=True) — returns the opset version for domain. When exc=True (the default) an AssertionError is raised if the domain is not registered; set exc=False to get 0 instead.

  • set_opset(domain, version=1) — registers domain with the given version. If the domain is already registered with the same version the call is a no-op; a version mismatch raises an AssertionError.

  • add_domain(domain, version=1) — deprecated alias for set_opset.

A converter that targets the main ONNX domain only needs to read g.main_opset. A converter that also emits nodes from a secondary domain (e.g. "ai.onnx.ml") should first call g.set_opset(domain, version) to ensure the domain is recorded in the exported model, then query its version with g.get_opset(domain).

from yobx.typing import GraphBuilderExtendedProtocol
from yobx.xbuilder import GraphBuilder

def convert_my_estimator(g: GraphBuilderExtendedProtocol, sts, outputs, estimator, X):
    # Read the main opset to pick the right operator variant.
    opset = g.main_opset

    # Register and query the ai.onnx.ml domain when needed.
    g.set_opset("ai.onnx.ml", 3)
    ml_opset = g.get_opset("ai.onnx.ml")

    # Check whether an optional domain is already registered.
    if g.has_opset("com.microsoft"):
        result = g.op.MicrosoftOp(X)
    elif opset >= 20:
        result = g.op.SomeNewOp(X)
    else:
        result = g.op.SomeLegacyOp(X)
    ...
    return result

Shape and type API#

Converters are expected to propagate shape and type information after each node so that downstream converters (e.g. later pipeline steps) can query them without re-running inference; the emitted graph may differ depending on that information. The required methods are listed below, where g denotes a GraphBuilder implementing the expected API.

  • g.set_type(name, itype) — register the element type (an onnx.TensorProto.* integer) for tensor name.

  • g.get_type(name) — return the previously registered element type.

  • g.has_type(name) — return True if the element type is known.

  • g.set_shape(name, shape) — register the shape for tensor name. Dimensions may be integers (static) or strings (symbolic).

  • g.get_shape(name) — return the shape as a tuple of integers or strings.

  • g.has_shape(name) — return True if the shape is known.

  • g.set_device(name, device) — register the device for tensor name (-1 = CPU).

  • g.get_device(name) — return the device.

  • g.has_device(name) — return True if a device is registered for the tensor.

In addition, it is usually useful to implement the following methods.

  • g.unique_name(prefix) — return a name that starts with prefix and is not yet in use anywhere in the graph.

  • g.set_type_shape_unary_op(name, input_name, itype: int = None) — copy the type, shape, and device registered for input_name onto name; itype can be passed to override the element type.

The current API does not include common operations on shapes (+, -, //, *, %, min, max) or their simplification. These are usually needed to optimize models but are not mandatory to build the model itself, so they are left to the builder. A converter usually needs to know the type, the device, the rank, and sometimes whether a dimension is static or dynamic.
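Because shapes are plain tuples of ints and strings, the checks a converter typically performs (rank, static vs. dynamic dimensions) reduce to a few lines of ordinary Python. A self-contained sketch, independent of any builder class:

```python
from typing import Tuple, Union

Shape = Tuple[Union[int, str], ...]

def rank(shape: Shape) -> int:
    # The rank is simply the number of dimensions.
    return len(shape)

def is_static(shape: Shape) -> bool:
    # A shape is fully static when every dimension is an integer.
    return all(isinstance(dim, int) for dim in shape)

def dynamic_dims(shape: Shape) -> Tuple[int, ...]:
    # Indices of the symbolic (string) dimensions.
    return tuple(i for i, dim in enumerate(shape) if isinstance(dim, str))

shape: Shape = ("batch", 4)
print(rank(shape), is_static(shape), dynamic_dims(shape))  # 2 False (0,)
```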

Shape and Type representation#

This API follows the ONNX standard.

  • A name is a string: it is a unique identifier.

  • A type is an integer: see supported types.

  • A shape is a tuple, empty or filled with integers (static dimension) or strings (dynamic dimension).

Additionally:

  • A device is an integer, -1 for CPU, a value >= 0 for a CUDA device.

  • A rank is an integer and equal to len(shape).
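Concretely, every piece of metadata handled by this API is a plain Python value, so a converter can inspect all of it without any ONNX-specific machinery:

```python
# All metadata handled by the shape/type API uses plain Python values.
name = "X"            # unique string identifier
itype = 1             # onnx.TensorProto.FLOAT is the integer 1
shape = ("batch", 4)  # dynamic first dimension (string), static second (int)
device = -1           # -1 = CPU, >= 0 = CUDA device index
rank = len(shape)     # rank is derived from the shape
print(name, itype, shape, device, rank)
```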

Propagating shape and type in a converter#

The canonical pattern at the end of every converter is:

result = g.op.Relu(X, name=name)
g.set_type_shape_unary_op(result, X)
return result

The helper set_type_shape_unary_op combines the three set_* calls into one call.
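To make the semantics explicit, here is a toy stand-in for the shape/type part of the API (not the real GraphBuilder, which does much more): the helper simply forwards the input's metadata, with itype overriding the type when given.

```python
class MiniBuilder:
    """Toy stand-in for the shape/type tracking part of the expected API.
    It only stores metadata in dictionaries."""

    def __init__(self):
        self._types, self._shapes, self._devices = {}, {}, {}

    def set_type(self, name, itype): self._types[name] = itype
    def get_type(self, name): return self._types[name]
    def has_type(self, name): return name in self._types
    def set_shape(self, name, shape): self._shapes[name] = tuple(shape)
    def get_shape(self, name): return self._shapes[name]
    def has_shape(self, name): return name in self._shapes
    def set_device(self, name, device): self._devices[name] = device
    def get_device(self, name): return self._devices[name]
    def has_device(self, name): return name in self._devices

    def set_type_shape_unary_op(self, name, input_name, itype=None):
        # Copy the metadata of input_name onto name; itype overrides the type.
        if itype is not None or self.has_type(input_name):
            self.set_type(name, itype if itype is not None else self.get_type(input_name))
        if self.has_shape(input_name):
            self.set_shape(name, self.get_shape(input_name))
        if self.has_device(input_name):
            self.set_device(name, self.get_device(input_name))

g = MiniBuilder()
g.set_type("X", 1)              # onnx.TensorProto.FLOAT == 1
g.set_shape("X", ("batch", 4))
g.set_device("X", -1)
g.set_type_shape_unary_op("Y", "X")
print(g.get_type("Y"), g.get_shape("Y"), g.get_device("Y"))
```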

Convert Options#

ConvertOptionsProtocol is a lightweight protocol that lets callers opt-in to extra outputs on a per-estimator basis without changing the core converter signatures.

Protocol contract#

Any object that implements the single method below satisfies the protocol and can be passed to to_onnx() as the convert_options argument:

  • has(option_name: str, piece: object) -> bool — return True when the option identified by option_name should be activated for the estimator piece. The second argument is the fitted scikit-learn estimator currently being converted, which lets callers enable an option only for a specific step inside a Pipeline.

Inside a converter, the options object is accessible via the graph builder’s convert_options property (g.convert_options):

# Inside a converter function:
if g.convert_options.has("decision_path", estimator):
    # emit the extra decision-path output
    ...

Built-in options: ConvertOptions#

ConvertOptions is the default implementation shipped with the package. It currently exposes two boolean flags:

  • decision_path (bool) — when True, an extra output tensor is appended for each tree/ensemble estimator. For a single DecisionTreeClassifier or DecisionTreeRegressor the shape is (N, 1); for ensemble models (RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor) the shape is (N, n_estimators). Each value is a binary string encoding the root-to-leaf path through the tree.

  • decision_leaf (bool) — when True, an extra output tensor (int64) is appended containing the zero-based leaf node index reached by each sample. Shapes follow the same convention as decision_path.

Passing options to to_onnx#

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from yobx.sklearn import to_onnx, ConvertOptions

X = np.random.default_rng(0).standard_normal((20, 4)).astype(np.float32)
y = (X[:, 0] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

opts = ConvertOptions(decision_path=True)
model_onnx = to_onnx(clf, (X,), convert_options=opts)
# The exported model now has three outputs:
#   output_0 – label (int64, shape [N])
#   output_1 – probabilities (float32, shape [N, 2])
#   output_2 – decision path (object/string, shape [N, 1])

Implementing a custom protocol#

You can supply any object with a has method. The simplest way is to subclass DefaultConvertOptions and override has:

from yobx.typing import DefaultConvertOptions

class MyOptions(DefaultConvertOptions):
    def has(self, option_name: str, piece: object) -> bool:
        # Only enable decision_leaf for RandomForestClassifier:
        from sklearn.ensemble import RandomForestClassifier
        if option_name == "decision_leaf":
            return isinstance(piece, RandomForestClassifier)
        return False

Alternatively, any object whose class implements the single-method ConvertOptionsProtocol is accepted directly.
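Since the protocol has a single method, no base class is strictly required. A hypothetical duck-typed example that activates options by name only, ignoring which estimator is being converted:

```python
class NamedOptions:
    """Duck-typed convert options: activates every option listed in
    `enabled`, regardless of the estimator passed as `piece`."""

    def __init__(self, *enabled: str):
        self.enabled = set(enabled)

    def has(self, option_name: str, piece: object) -> bool:
        return option_name in self.enabled

opts = NamedOptions("decision_path")
print(opts.has("decision_path", None))  # True
print(opts.has("decision_leaf", None))  # False
```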

See Exporting sklearn tree models with convert options for a full runnable example of decision_path and decision_leaf on single trees and ensembles.

Alternative implementations#

Any class that satisfies the two-part API above can be passed as builder_cls. The package ships with:

  • GraphBuilder — the default; builds graphs using onnx protobuf objects with built-in optimization passes.

  • OnnxScriptGraphBuilder — a bridge that satisfies the same API while using the onnxscript IR internally. Useful when the rest of the pipeline already works with onnxscript.

<<<

import numpy as np
import onnx
from sklearn.preprocessing import StandardScaler
from yobx.sklearn import to_onnx
from yobx.builder.onnxscript import OnnxScriptGraphBuilder
from yobx.helpers.onnx_helper import pretty_onnx

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4)).astype(np.float32)

scaler = StandardScaler().fit(X)
model = to_onnx(scaler, (X,), builder_cls=OnnxScriptGraphBuilder)
print(pretty_onnx(model))

>>>

    opset: domain='' version=21
    opset: domain='ai.onnx.ml' version=5
    input: name='X' type=dtype('float32') shape=['batch', 4]
    init: name='init_' type=float32 shape=(4,) -- array([-0.448,  0.052, -0.093,  0.247], dtype=float32)
    init: name='init_2' type=float32 shape=(4,) -- array([0.774, 0.641, 0.825, 0.728], dtype=float32)
    Sub(X, init_) -> Sub
      Div(Sub, init_2) -> x
    output: name='x' type=dtype('float32') shape=['batch', 4]

See also

Sklearn Converter — overview of the built-in converters.

Custom Converter — how to write and register a custom converter.

GraphBuilder — the full GraphBuilder reference, including optimization passes and dynamic shapes.