Sklearn Converter#

yobx.sklearn.to_onnx() converts a fitted scikit-learn estimator into an onnx.ModelProto. The conversion is powered by yobx.xbuilder.GraphBuilder and follows a registry-based design: each estimator class maps to a dedicated converter function that emits the required ONNX nodes.

High-level workflow#

fitted estimator
      │
      ▼
  to_onnx()          ← builds GraphBuilder, looks up converter
      │
      ▼
converter function   ← adds ONNX nodes via GraphBuilder.op.*
      │
      ▼
  GraphBuilder.to_onnx()   ← validates and returns ModelProto

  1. to_onnx accepts the fitted estimator, representative dummy inputs (used to infer dtype and shape), and optional input_names / dynamic_shapes.

  2. It calls register_sklearn_converters (idempotent) to populate the global registry on first use.

  3. It constructs a GraphBuilder and declares one graph input per dummy array via make_tensor_input.

  4. It looks up the converter for type(estimator) and calls it.

  5. Each graph output is declared with make_tensor_output.

  6. GraphBuilder.to_onnx finalises and returns the model.

<<<

import numpy as np
from sklearn.preprocessing import StandardScaler
from yobx.sklearn import to_onnx
from yobx.helpers.onnx_helper import pretty_onnx

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4)).astype(np.float32)

scaler = StandardScaler().fit(X)
model = to_onnx(scaler, (X,))
print(pretty_onnx(model))

>>>

    opset: domain='' version=21
    opset: domain='ai.onnx.ml' version=5
    input: name='X' type=dtype('float32') shape=['batch', 4]
    init: name='init1_s4_' type=float32 shape=(4,) -- array([-0.448,  0.052, -0.093,  0.247], dtype=float32)-- Opset.make_node.0
    init: name='init1_s4_2' type=float32 shape=(4,) -- array([0.774, 0.641, 0.825, 0.728], dtype=float32)-- Opset.make_node.0
    Sub(X, init1_s4_) -> _onx_sub_X
      Div(_onx_sub_X, init1_s4_2) -> x
    output: name='x' type='NOTENSOR' shape=None

Converter registry#

The registry is a plain module-level dictionary SKLEARN_CONVERTERS: Dict[type, Callable] defined in yobx.sklearn.register.

Registering a converter#

Use the register_sklearn_converter decorator. Pass a single class or a tuple of classes as the first argument:

from yobx.sklearn.register import register_sklearn_converter
from yobx.typing import GraphBuilderExtendedProtocol
from yobx.xbuilder import GraphBuilder

@register_sklearn_converter(MyEstimator)
def convert_my_estimator(
    g: GraphBuilderExtendedProtocol,
    sts: dict,
    outputs: list[str],
    estimator: MyEstimator,
    X: str,
    name: str = "my_estimator",
) -> str:
    ...

The decorator raises TypeError if a converter is already registered for the same class, preventing accidental double-registration.

Looking up a converter#

get_sklearn_converter takes a class and returns the registered callable, raising ValueError if none is found.
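The registration and lookup behaviour described above follows a common Python pattern. The sketch below is a standalone illustration of that pattern (not yobx's actual implementation); it mirrors the documented behaviour: a module-level dictionary, a decorator accepting a class or a tuple of classes, TypeError on double registration, and ValueError on a failed lookup.

```python
from typing import Callable, Dict, Tuple, Union

# Module-level registry mapping estimator classes to converter functions.
SKLEARN_CONVERTERS: Dict[type, Callable] = {}


def register_sklearn_converter(cls: Union[type, Tuple[type, ...]]):
    """Decorator registering a converter for one class or a tuple of classes."""
    classes = cls if isinstance(cls, tuple) else (cls,)

    def wrapper(fct: Callable) -> Callable:
        for c in classes:
            if c in SKLEARN_CONVERTERS:
                # Guard against accidental double registration.
                raise TypeError(f"A converter is already registered for {c}.")
            SKLEARN_CONVERTERS[c] = fct
        return fct

    return wrapper


def get_sklearn_converter(cls: type) -> Callable:
    """Return the converter registered for cls, or raise ValueError."""
    if cls not in SKLEARN_CONVERTERS:
        raise ValueError(f"No converter registered for {cls}.")
    return SKLEARN_CONVERTERS[cls]
```

Keeping the registry a plain dictionary makes the design easy to inspect and extend: supporting a new estimator never requires touching the dispatch code, only adding one decorated function.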

Converter function signature#

Every converter follows the same contract:

(g, sts, outputs, estimator, *input_names, name) → output_name(s)

| Parameter | Description |
| --- | --- |
| g | GraphBuilder — call g.op.<OpType>(…) to emit ONNX nodes. |
| sts | Unused. |
| outputs | List[str] of pre-allocated output tensor names that the converter must write to. |
| estimator | The fitted scikit-learn object. |
| *input_names | One positional str argument per graph input (the tensor name in the graph). |
| name | String prefix used when generating unique node names via g.op. |

The function must return the output tensor name (str) for single-output estimators, or a tuple of names for multi-output ones (e.g. classifiers that produce both a label and probabilities).

Output naming#

get_output_names determines the list of output tensor names for an estimator:

  • Transformers that expose get_feature_names_out() use those names (collapsed to a common prefix via longest_prefix when more than one output is expected).

  • Classifiers default to ["label", "probabilities"].

  • Regressors default to ["predictions"].

  • Everything else defaults to ["Y"].
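These rules amount to simple duck typing on the estimator. The following is a simplified sketch of that selection logic (not yobx's actual code; the common-prefix collapsing step is omitted, and classifier/regressor detection via predict_proba/predict is an assumption):

```python
def get_output_names_sketch(estimator) -> list:
    """Pick default output tensor names from the estimator's interface."""
    if hasattr(estimator, "get_feature_names_out"):
        # Transformer: reuse its feature names. The real implementation
        # collapses multiple names to a common prefix via longest_prefix.
        return list(estimator.get_feature_names_out())
    if hasattr(estimator, "predict_proba"):  # classifier-like
        return ["label", "probabilities"]
    if hasattr(estimator, "predict"):  # regressor-like
        return ["predictions"]
    return ["Y"]
```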

Adding a new converter#

To support a new scikit-learn estimator:

  1. Create a new file (e.g. yobx/sklearn/ensemble/random_forest.py).

  2. Implement a converter function following the signature described above.

  3. Decorate it with @register_sklearn_converter(MyEstimator).

  4. Add an import in the matching register() function so the converter is loaded when register_sklearn_converters is called.

# yobx/sklearn/ensemble/random_forest.py
from sklearn.ensemble import RandomForestClassifier
from ...typing import GraphBuilderExtendedProtocol
from ..register import register_sklearn_converter


@register_sklearn_converter(RandomForestClassifier)
def convert_random_forest_classifier(
    g: GraphBuilderExtendedProtocol,
    sts: dict,
    outputs: list[str],
    estimator: RandomForestClassifier,
    X: str,
    name: str = "random_forest",
):
    # ... emit ONNX nodes via g.op.*
    ...

Converting Options#

The user may need extra outputs from the model. This is controlled by the class yobx.sklearn.ConvertOptions, which exposes all the supported ways to change the default behaviour of the converters. See Exporting sklearn tree models with convert options for a worked example.

yobx vs skl2onnx#

Both libraries convert scikit-learn models to ONNX in a similar way, but yobx implements a unified conversion mechanism across packages and is designed to scale better than sklearn-onnx.

  • yobx can easily be extended to other libraries such as sktorch, since both exporters can use the same GraphBuilder.

  • It enables onnxruntime optimizations whenever possible.

  • It allows the user to export estimators in a pipeline as local functions (see Exporting sklearn estimators as ONNX local functions).

  • Because converters target a protocol, the default GraphBuilder can be replaced by another builder based on other packages such as onnxscript or spox.