.. _l-design-expected-api:

============
Expected API
============

:func:`yobx.sklearn.to_onnx` accepts a ``builder_cls`` parameter that defaults
to :class:`yobx.xbuilder.GraphBuilder`. Any object can be substituted as long
as it exposes the two-part API described on this page. The API is split into
two groups that mirror the cross-references used in the source code:

* **Construction API** (:ref:`builder-api-make`) — methods to declare inputs,
  outputs, initializers, and nodes, and to export the finished graph.
* **Shape / type API** (:ref:`builder-api`) — methods to attach and query
  shape and type metadata on intermediate tensors.

An alternative bridge implementation, :class:`OnnxScriptGraphBuilder`, shows
how the same API can be satisfied on top of ``onnxscript``'s IR.

When any ``ONNXSTOP*`` variable triggers an exception, the resulting
**stack trace points to the exact line of converter code** that first
assigned a type or shape to that result.

**Why use strings to refer to intermediate results?**
A user usually only sees the final model and can only investigate an issue
based on the names they read. Keeping **explicit, stable names** for
intermediate results in converter code makes it easy to locate the code
where a given name appears. With that in mind, a protocol object wrapping
each value seems unnecessary, and the creation of the final name should not
be delayed. That makes it easier to investigate issues such as those exposed
in :ref:`l-design-sklearn-debug-env-vars`.

Construction API
================

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Method / attribute
     - Description
   * - ``__init__(target_opset or existing ModelProto or FunctionProto)``
     - Constructor. *target_opset* is either an ``int`` (main domain) or a
       ``Dict[str, int]`` mapping domain names to versions.
   * - ``make_tensor_input(name, elem_type, shape, device=-1)``
     - Declare a graph input tensor. *elem_type* is an
       ``onnx.TensorProto.*`` integer constant.
       *shape* is a tuple whose elements are integers (static) or strings
       (symbolic/dynamic). *device* is ``-1`` for CPU.
   * - ``make_tensor_output(name, indexed=False, allow_untyped_output=True)``
     - Declare a graph output. When ``indexed=False`` the name is used
       verbatim; set ``True`` when the output name is generated by the
       builder (e.g. ``"output_0"``, ``"output_1"`` …).
   * - ``make_initializer(name, value, source="")``
     - Add a constant tensor to the graph. *value* can be a
       :class:`numpy.ndarray`, a scalar, or an :class:`onnx.TensorProto`.
       Returns the name that was assigned (it may differ from *name* if the
       builder deduplicates identical constants).
   * - ``make_node(op_type, inputs, num_outputs, *, domain="", name="", **attrs)``
     - Low-level node creation. Returns a sequence of output tensor name(s).
   * - ``op.<OperatorName>(*inputs, **attrs)``
     - Convenience shorthand: ``g.op.Relu("X")`` is equivalent to
       ``g.make_node("Relu", ["X"], 1)``. Inline numpy arrays are
       automatically promoted to initializers.
   * - ``to_onnx(...)``
     - Finalise and return an :class:`onnx.ModelProto`.

Minimal example
---------------

The snippet below builds the same ``Sub`` / ``Div`` graph emitted by the
``StandardScaler`` converter, using the default :class:`GraphBuilder`:

.. runpython::
    :showcode:

    import numpy as np
    import onnx
    from yobx.xbuilder import GraphBuilder, OptimizationOptions
    from yobx.helpers.onnx_helper import pretty_onnx

    TFLOAT = onnx.TensorProto.FLOAT

    opts = OptimizationOptions(constant_folding=False)
    g = GraphBuilder(20, ir_version=10, optimization_options=opts)
    g.make_tensor_input("X", TFLOAT, ("batch", 4))

    mean = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
    scale = np.array([0.5, 1.0, 2.0, 4.0], dtype=np.float32)
    mean_name = g.make_initializer("mean", mean)
    scale_name = g.make_initializer("scale", scale)

    centered = g.op.Sub("X", mean_name)
    g.set_type(centered, TFLOAT)
    g.set_shape(centered, ("batch", 4))

    result = g.op.Div(centered, scale_name)
    g.set_type(result, TFLOAT)
    g.set_shape(result, ("batch", 4))

    g.make_tensor_output(result, indexed=False, allow_untyped_output=True)
    model = g.to_onnx()
    print(pretty_onnx(model))

Opset API
=========

Converters frequently need to know which opset versions are active so they
can choose the right operator variant or register an additional domain
(e.g. ``"ai.onnx.ml"`` for scikit-learn models).

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Method / attribute
     - Description
   * - ``main_opset``
     - Read-only property. Returns the opset version for the main ONNX
       domain (``""``). Equivalent to ``g.opsets[""]``.
   * - ``has_opset(domain)``
     - Returns the opset version (an ``int``) for *domain*, or ``0`` if the
       domain is not registered. Because ``0`` is falsy and any valid
       version is truthy, the return value can be used directly in a
       boolean context: ``if g.has_opset("ai.onnx.ml"): ...``.
   * - ``get_opset(domain, exc=True)``
     - Returns the opset version for *domain*. When ``exc=True`` (default)
       an ``AssertionError`` is raised if the domain is not registered; set
       ``exc=False`` to get ``0`` instead.
   * - ``set_opset(domain, version=1)``
     - Registers *domain* with the given *version*.
       If the domain is already registered with the same version the call
       is a no-op; a version mismatch raises an ``AssertionError``.
   * - ``add_domain(domain, version=1)``
     - Deprecated alias for ``set_opset``.

A converter that targets the main ONNX domain only needs to read
``g.main_opset``. A converter that also emits nodes from a secondary domain
(e.g. ``"ai.onnx.ml"``) should first call ``g.set_opset(domain, version)``
to ensure the domain is recorded in the exported model, then query its
version with ``g.get_opset(domain)``.

.. code-block:: python

    from yobx.typing import GraphBuilderExtendedProtocol
    from yobx.xbuilder import GraphBuilder


    def convert_my_estimator(
        g: GraphBuilderExtendedProtocol, sts, outputs, estimator, X
    ):
        # Read the main opset to pick the right operator variant.
        opset = g.main_opset

        # Register and query the ai.onnx.ml domain when needed.
        g.set_opset("ai.onnx.ml", 3)
        ml_opset = g.get_opset("ai.onnx.ml")

        # Check whether an optional domain is already registered.
        if g.has_opset("com.microsoft"):
            result = g.op.MicrosoftOp(X)
        elif opset >= 20:
            result = g.op.SomeNewOp(X)
        else:
            result = g.op.SomeLegacyOp(X)
        ...
        return result

Shape and type API
==================

Converters are expected to propagate shape and type information after each
node so that downstream converters (e.g. pipeline steps) can query them
without re-running inference. The generated model may differ depending on
that information. The required methods are listed below, where ``g`` is the
``GraphBuilder`` instance implementing the expected API.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Method
     - Description
   * - ``g.set_type(name, itype)``
     - Register the element type (an ``onnx.TensorProto.*`` integer) for
       tensor ``name``.
   * - ``g.get_type(name)``
     - Return the previously registered element type.
   * - ``g.has_type(name)``
     - Return ``True`` if the element type is known.
   * - ``g.set_shape(name, shape)``
     - Register the shape for tensor ``name``. Dimensions may be integers
       (static) or strings (symbolic).
   * - ``g.get_shape(name)``
     - Return the shape as a tuple of integers or strings.
   * - ``g.has_shape(name)``
     - Return ``True`` if the shape is known.
   * - ``g.set_device(name, device)``
     - Register the device for tensor ``name`` (``-1`` = CPU).
   * - ``g.get_device(name)``
     - Return the device.
   * - ``g.has_device(name)``
     - Return ``True`` if a device is registered for the tensor.

In addition, it is usually useful to implement the following methods.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Method
     - Description
   * - ``g.unique_name(prefix)``
     - Return a name that starts with *prefix* and is not yet in use
       anywhere in the graph.
   * - ``g.set_type_shape_unary_op(name, input_name, itype: int = None)``
     - Register for ``name`` the same shape, type, and device as the ones
       registered for ``input_name``; ``itype`` can be used to override the
       type.

The current API does not include common operations on shapes
(``+``, ``-``, ``//``, ``*``, ``%``, ``min``, ``max``) or their
simplification. Those are usually needed to optimize models but are not
mandatory to write the model itself, so they are left to the builder.
A converter usually needs to know the type, the device, the rank, and
sometimes whether a dimension is static or dynamic.

Shape and type representation
-----------------------------

This API follows the ONNX standard.

* A name is a string: it is a unique identifier.
* A type is an integer: one of the element types supported by
  ``onnx.TensorProto``.
* A shape is a tuple, empty or filled with integers (static dimensions) or
  strings (dynamic dimensions).

Additionally:

* A device is an integer: ``-1`` for CPU, a value ``>= 0`` for a CUDA
  device.
* A rank is an integer equal to ``len(shape)``.

Propagating shape and type in a converter
-----------------------------------------

The canonical pattern at the end of every converter is:

.. code-block:: python

    result = g.op.Relu(X, name=name)
    g.set_type_shape_unary_op(result, X)
    return result

The helper :meth:`set_type_shape_unary_op` combines the separate ``set_*``
calls into one.
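For illustration, the shape / type bookkeeping above can be reduced to a
small dict-backed sketch. This is *not* the real ``GraphBuilder``
implementation, only a hypothetical stand-in showing how the
``set_* / get_* / has_*`` triplets and ``set_type_shape_unary_op`` relate
to each other:

```python
TFLOAT = 1  # value of onnx.TensorProto.FLOAT


class ShapeTypeStore:
    """Hypothetical dict-backed sketch of the shape / type part of the API."""

    def __init__(self):
        self._types = {}    # name -> onnx.TensorProto.* integer
        self._shapes = {}   # name -> tuple of int (static) or str (dynamic)
        self._devices = {}  # name -> int, -1 for CPU

    def set_type(self, name, itype):
        self._types[name] = itype

    def get_type(self, name):
        return self._types[name]

    def has_type(self, name):
        return name in self._types

    def set_shape(self, name, shape):
        self._shapes[name] = tuple(shape)

    def get_shape(self, name):
        return self._shapes[name]

    def has_shape(self, name):
        return name in self._shapes

    def set_device(self, name, device):
        self._devices[name] = device

    def get_device(self, name):
        return self._devices[name]

    def has_device(self, name):
        return name in self._devices

    def set_type_shape_unary_op(self, name, input_name, itype=None):
        # Copy the metadata registered for input_name onto name,
        # optionally overriding the element type with itype.
        self.set_type(name, itype if itype is not None else self.get_type(input_name))
        if self.has_shape(input_name):
            self.set_shape(name, self.get_shape(input_name))
        if self.has_device(input_name):
            self.set_device(name, self.get_device(input_name))


g = ShapeTypeStore()
g.set_type("X", TFLOAT)
g.set_shape("X", ("batch", 4))
g.set_device("X", -1)
g.set_type_shape_unary_op("Y", "X")
print(g.get_shape("Y"))  # prints ('batch', 4)
```

Note how the dynamic dimension ``"batch"`` and the rank survive the copy;
this is exactly the information converters are expected to preserve.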
Convert Options
===============

:class:`~yobx.typing.ConvertOptionsProtocol` is a lightweight protocol that
lets callers **opt in to extra outputs** on a per-estimator basis without
changing the core converter signatures.

Protocol contract
-----------------

Any object that implements the single method below satisfies the protocol
and can be passed to :func:`~yobx.sklearn.to_onnx` as the
``convert_options`` argument:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Method
     - Description
   * - ``has(option_name: str, piece: object) -> bool``
     - Return ``True`` when the option identified by *option_name* should
       be activated for the estimator *piece*. The second argument is the
       fitted scikit-learn estimator currently being converted, which lets
       callers enable an option only for a specific step inside a
       :class:`~sklearn.pipeline.Pipeline`.

Inside a converter, the options object is accessible via the graph
builder's ``convert_options`` property (``g.convert_options``):

.. code-block:: python

    # Inside a converter function:
    if g.convert_options.has("decision_path", estimator):
        # emit the extra decision-path output
        ...

Built-in options: ``ConvertOptions``
------------------------------------

:class:`~yobx.sklearn.ConvertOptions` is the default implementation shipped
with the package. It currently exposes two boolean flags:

.. list-table::
   :header-rows: 1
   :widths: 25 20 55

   * - Option name
     - Type
     - Description
   * - ``decision_path``
     - ``bool``
     - When ``True``, an extra output tensor is appended for each
       tree/ensemble estimator. For a single
       :class:`~sklearn.tree.DecisionTreeClassifier` or
       :class:`~sklearn.tree.DecisionTreeRegressor` the shape is
       ``(N, 1)``; for ensemble models
       (:class:`~sklearn.ensemble.RandomForestClassifier`,
       :class:`~sklearn.ensemble.RandomForestRegressor`,
       :class:`~sklearn.ensemble.ExtraTreesClassifier`,
       :class:`~sklearn.ensemble.ExtraTreesRegressor`) the shape is
       ``(N, n_estimators)``.
       Each value is a binary string encoding the root-to-leaf path through
       the tree.
   * - ``decision_leaf``
     - ``bool``
     - When ``True``, an extra output tensor (``int64``) is appended
       containing the zero-based leaf node index reached by each sample.
       Shapes follow the same convention as ``decision_path``.

Passing options to ``to_onnx``
------------------------------

.. code-block:: python

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from yobx.sklearn import to_onnx, ConvertOptions

    X = np.random.default_rng(0).standard_normal((20, 4)).astype(np.float32)
    y = (X[:, 0] > 0).astype(int)
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

    opts = ConvertOptions(decision_path=True)
    model_onnx = to_onnx(clf, (X,), convert_options=opts)

    # The exported model now has three outputs:
    #   output_0 – label (int64, shape [N])
    #   output_1 – probabilities (float32, shape [N, 2])
    #   output_2 – decision path (object/string, shape [N, 1])

Implementing a custom protocol
------------------------------

You can supply any object with a ``has`` method. The simplest way is to
subclass :class:`~yobx.typing.DefaultConvertOptions` and override ``has``:

.. code-block:: python

    from sklearn.ensemble import RandomForestClassifier

    from yobx.typing import DefaultConvertOptions


    class MyOptions(DefaultConvertOptions):
        def has(self, option_name: str, piece: object) -> bool:
            # Only enable decision_leaf for RandomForestClassifier.
            if option_name == "decision_leaf":
                return isinstance(piece, RandomForestClassifier)
            return False

Alternatively, any object whose class implements the single-method
:class:`~yobx.typing.ConvertOptionsProtocol` is accepted directly.

See :ref:`l-plot-sklearn-convert-options` for a full runnable example of
``decision_path`` and ``decision_leaf`` on single trees and ensembles.

Alternative implementations
===========================

Any class that satisfies the two-part API above can be passed as
``builder_cls``.
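To get a feel for how little the construction side of the contract requires,
the sketch below is a hypothetical toy builder (names and record formats are
illustrative, not the package's internals). It only records declared inputs,
initializers, and nodes, and hands out unique result names:

```python
class ToyBuilder:
    """Hypothetical toy builder recording construction API calls."""

    def __init__(self, target_opset: int):
        self.opsets = {"": target_opset}
        self.inputs = []        # (name, elem_type, shape, device)
        self.outputs = []       # declared output names
        self.initializers = {}  # name -> value
        self.nodes = []         # (op_type, inputs, outputs, domain, attrs)
        self._used = set()

    def unique_name(self, prefix: str) -> str:
        # Append a counter until the name is unused anywhere in the graph.
        name, i = prefix, 0
        while name in self._used:
            i += 1
            name = f"{prefix}_{i}"
        self._used.add(name)
        return name

    def make_tensor_input(self, name, elem_type, shape, device=-1):
        self._used.add(name)
        self.inputs.append((name, elem_type, shape, device))
        return name

    def make_initializer(self, name, value, source=""):
        name = self.unique_name(name)
        self.initializers[name] = value
        return name

    def make_node(self, op_type, inputs, num_outputs, *, domain="", name="", **attrs):
        outputs = [
            self.unique_name(f"{op_type.lower()}_out{i}") for i in range(num_outputs)
        ]
        self.nodes.append((op_type, list(inputs), outputs, domain, attrs))
        return outputs

    def make_tensor_output(self, name, indexed=False, allow_untyped_output=True):
        self.outputs.append(name)


g = ToyBuilder(20)
g.make_tensor_input("X", 1, ("batch", 4))  # 1 == onnx.TensorProto.FLOAT
m = g.make_initializer("mean", [1.0, 2.0, 3.0, 4.0])
(sub,) = g.make_node("Sub", ["X", m], 1)
g.make_tensor_output(sub)
```

A real builder would additionally serialize these records into an
:class:`onnx.ModelProto` in ``to_onnx`` and implement the shape / type API,
but this is the entire surface the converters call.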
The package ships with:

* :class:`GraphBuilder` — the default; builds graphs using onnx protobuf
  objects with built-in optimization passes.
* :class:`OnnxScriptGraphBuilder` — a bridge that satisfies the same API
  while using the ``onnxscript`` IR internally. Useful when the rest of the
  pipeline already works with onnxscript.

.. runpython::
    :showcode:

    import numpy as np
    import onnx
    from sklearn.preprocessing import StandardScaler

    from yobx.sklearn import to_onnx
    from yobx.builder.onnxscript import OnnxScriptGraphBuilder
    from yobx.helpers.onnx_helper import pretty_onnx

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 4)).astype(np.float32)
    scaler = StandardScaler().fit(X)

    model = to_onnx(scaler, (X,), builder_cls=OnnxScriptGraphBuilder)
    print(pretty_onnx(model))

.. seealso::

    :ref:`l-design-sklearn-converter` — overview of the built-in converters.

    :ref:`l-design-sklearn-custom-converter` — how to write and register a
    custom converter.

    :ref:`l-design-graph-builder` — the full :class:`GraphBuilder` reference,
    including optimization passes and dynamic shapes.