yobx.sklearn.to_onnx#

yobx.sklearn.to_onnx(estimator: ~sklearn.base.BaseEstimator, args: ~typing.Tuple[~typing.Any], input_names: ~typing.Sequence[str] | None = None, dynamic_shapes: ~typing.Tuple[~typing.Dict[int, str]] | None = None, target_opset: int | ~typing.Dict[str, int] = 21, verbose: int = 0, builder_cls: type | ~typing.Callable = <class 'yobx.xbuilder.graph_builder.GraphBuilder'>, extra_converters: ~typing.Dict[type, ~typing.Callable] | None = None, large_model: bool = False, external_threshold: int = 1024, function_options: ~yobx.xbuilder.function_options.FunctionOptions | None = None, convert_options: ~yobx.typing.ConvertOptionsProtocol | None = None, filename: str | None = None, return_optimize_report: bool = False) ExportArtifact[source]#

Converts a scikit-learn estimator into ONNX. By default, the first dimension of every input is considered dynamic and the others are static.

Parameters:
  • estimator – the fitted scikit-learn estimator to convert

  • args

    dummy inputs; each element may be a numpy array, a pandas.DataFrame, an onnx.ValueInfoProto that explicitly describes the input tensor’s name, element type and shape, or a (name, dtype, shape) tuple. A DataFrame is expanded column-by-column: each column is registered as a separate 1-D ONNX graph input named after the column, and an Unsqueeze + Concat node sequence assembles them back into a 2-D matrix (batch, n_cols) that is passed to the converter. When a ValueInfoProto or a (name, dtype, shape) tuple is provided no actual data is required, and the dynamic_shapes parameter is ignored for that input (the shape is taken directly from the descriptor). The (name, dtype, shape) tuple format uses a plain string for the name, a numpy dtype (or scalar-type class such as np.float32) for the element type, and a sequence of ints and/or strings for the shape (strings denote symbolic / dynamic dimensions). Example:

    to_onnx(estimator, (('x', np.float32, ('N', 4)),))
    

  • dynamic_shapes – dynamic shape specification for each input; when not specified, the first dimension is dynamic and the others are static

  • target_opset – opset to use; either an integer for the default domain (""), or a dictionary mapping domain names to opset versions, e.g. {"": 20, "ai.onnx.ml": 5}. When "ai.onnx.ml" is set to 5 the converter emits the unified TreeEnsemble operator introduced in that opset instead of the older per-task operators. If it includes {'com.microsoft': 1}, the converted model may include optimized kernels specific to onnxruntime.

  • verbose – verbosity level; higher values print more details about the conversion

  • builder_cls – by default the graph builder is a yobx.xbuilder.GraphBuilder, but any builder can be used as long as it implements the APIs described in Shape and type tracking and Building a graph from scratch

  • extra_converters – optional mapping from estimator type to converter function; entries here take priority over the built-in converters and allow converting custom estimators that are not natively supported

  • large_model – if True the returned ExportArtifact has its container attribute set to an ExtendedModelContainer, which lets the user decide later whether weights should be embedded in the model or saved as external data

  • external_threshold – if large_model is True, every tensor whose element count exceeds this threshold is stored as external data

  • function_options – when a FunctionOptions is provided, every non-container estimator is exported as a separate ONNX local function. Pipeline and ColumnTransformer are treated as orchestrators: their individual steps/sub-transformers are each wrapped as a function, not the container itself. Function names for each step are always derived from the estimator’s class name; the name field of the provided FunctionOptions is not used by this helper to customize function naming. Pass None (the default) to disable function wrapping and produce a flat graph.

  • convert_options – see yobx.sklearn.ConvertOptions

  • filename – if set, the exported ONNX model is saved to this path and the ExportReport is written as a companion Excel file (same base name with .xlsx extension).

  • return_optimize_report – if True, the returned ExportArtifact has its report attribute populated with per-pattern optimization statistics
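The args formats accepted above can be illustrated in plain numpy. The first block mimics what the converter does with a DataFrame (each column becomes a 1-D graph input, then Unsqueeze + Concat rebuilds the 2-D matrix); the second shows a data-free (name, dtype, shape) descriptor. This is an illustrative sketch of the semantics, not library code:

```python
import numpy as np

# Mimic the DataFrame expansion: each column is a separate 1-D input,
# then Unsqueeze (c[:, None]) + Concat reassembles a (batch, n_cols) matrix.
cols = {
    "a": np.array([1.0, 2.0, 3.0], dtype=np.float32),
    "b": np.array([4.0, 5.0, 6.0], dtype=np.float32),
}
matrix = np.concatenate([c[:, None] for c in cols.values()], axis=1)
assert matrix.shape == (3, 2)  # (batch, n_cols)

# A data-free input descriptor in the (name, dtype, shape) format;
# strings in the shape denote symbolic (dynamic) dimensions.
descriptor = ("x", np.float32, ("N", 4))
```

With a descriptor like this, `to_onnx(estimator, (descriptor,))` needs no actual data and ignores dynamic_shapes for that input.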

Returns:

ExportArtifact wrapping the exported ONNX proto together with an ExportReport.

Note

scikit-learn==1.8 is stricter about computation types, which reduces the number of discrepancies. Switching to float32 in a matrix multiplication usually introduces discrepancies when the coefficients are large in magnitude; that is often the case when a matrix is the inverse of another one. See Float32 vs Float64: precision loss with PLSRegression.
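The precision loss described in the note can be reproduced directly in numpy: invert a matrix (whose entries may be large) and compare a float64 product against the same product computed in float32. A small standalone sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A_inv = np.linalg.inv(A)          # entries of an inverse may be large in magnitude
x = rng.standard_normal(50)

ref = A_inv @ x                                        # float64 reference
approx = A_inv.astype(np.float32) @ x.astype(np.float32)
max_abs_err = float(np.max(np.abs(ref - approx.astype(np.float64))))
# max_abs_err is small but nonzero: the float32 computation drifted
```

The same effect shows up when an ONNX graph runs a model's coefficients in float32 while scikit-learn computed the reference in float64.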

Example:

import numpy as np
from sklearn.linear_model import LinearRegression
from yobx.sklearn import to_onnx

X = np.random.randn(10, 3).astype(np.float32)
y = X @ np.array([1.0, 2.0, 3.0], dtype=np.float32)
reg = LinearRegression().fit(X, y)

artifact = to_onnx(reg, (X,))
# Access the raw proto:
proto = artifact.proto
# Save to disk:
artifact.save("model.onnx")
yobx.sklearn.wrap_skl2onnx_converter(skl2onnx_op_converter: Callable) Callable[source]#

Wrap a skl2onnx-style converter function so it can be used with yobx.sklearn.to_onnx() via the extra_converters parameter.

Note

This module contains no skl2onnx imports. Only onnx and numpy (both core yobx dependencies) are used inside the mock helper classes.
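A converter in the skl2onnx calling convention conventionally receives (scope, operator, container) and emits nodes into the container. The sketch below uses a placeholder body (it forwards the input unchanged); the wiring at the end is hypothetical, with MyEstimator standing in for a custom estimator class:

```python
# A converter written in the skl2onnx calling convention: it receives
# (scope, operator, container) and emits ONNX nodes into the container.
# The body is a minimal placeholder; a real converter would emit the
# nodes implementing the estimator's computation.
def my_estimator_converter(scope, operator, container):
    container.add_node(
        "Identity",
        [operator.inputs[0].full_name],
        [operator.outputs[0].full_name],
    )

# Hypothetical wiring through extra_converters:
#
#   from yobx.sklearn import to_onnx, wrap_skl2onnx_converter
#   artifact = to_onnx(
#       est, (X,),
#       extra_converters={MyEstimator: wrap_skl2onnx_converter(my_estimator_converter)},
#   )
```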

class yobx.sklearn.ConvertOptions(decision_leaf: bool | Set[str | type | int | Callable] = False, decision_path: bool | Set[str | type | int | Callable] = False)[source]#

Tunes the way every piece of a model is exported.

Pass an instance of this class to yobx.sklearn.to_onnx() via the convert_options keyword argument to request extra outputs from tree and ensemble estimators.

Parameters:
  • decision_leaf – when True, an extra int64 output tensor is appended containing the zero-based leaf node index reached by each input sample. The shape is (N, 1) for single trees and (N, n_estimators) for ensembles. The option applies to every estimator that implements the decision_path method.

  • decision_path – when True, an extra object (string) output tensor is appended containing the binary root-to-leaf path for each input sample. Each value is a byte-string whose i-th character is '1' if node i was visited and '0' otherwise. The shape is (N, 1) for single trees and (N, n_estimators) for ensembles. The option applies to every estimator that implements the decision_path method.

Class attributes

OPTIONS
Type:

dict[str, Callable]

Mapping from each recognised option name to the predicate implementing it. The recognised names are currently "decision_leaf" and "decision_path"; available_options() returns them as a sorted list.

Example:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from yobx.sklearn import ConvertOptions, to_onnx

X = np.random.randn(20, 4).astype(np.float32)
y = (X[:, 0] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Export with both extra outputs enabled
opts = ConvertOptions(decision_leaf=True, decision_path=True)
artifact = to_onnx(clf, (X,), convert_options=opts)
# The model now produces four outputs:
#   label (int64), probabilities (float32),
#   decision_path (object/string), decision_leaf (int64)

See also

Exporting sklearn tree models with convert options — worked examples for single trees and ensemble models.

OPTIONS = {'decision_leaf': <function ConvertOptions.<lambda>>, 'decision_path': <function ConvertOptions.<lambda>>}#
available_options() Sequence[str][source]#

Returns the list of available options.

has(option_name: str, piece: BaseEstimator, name: str | None = None) bool[source]#

Return True if option option_name is active for estimator piece.

Parameters:
  • option_name – name of the option to query. Must be one of the strings listed in OPTIONS. An AssertionError is raised when an unknown name is passed.

  • piece – the fitted BaseEstimator for which the option is being queried.

  • name – optional pipeline step name for the estimator. When the option attribute is a set, string elements in the set are compared against this name to enable an option for a specific named step inside a Pipeline. If name is None the string elements are ignored. Non-string elements (types, integer object ids, callable predicates) are always checked regardless of name.

Returns:

True when the option is enabled for piece: either the attribute value is True, or the attribute is a set and one of its elements matches piece (or the step name, see the name parameter). False when it is disabled (False or any falsy value, or when no element of the set matches).

Raises:

AssertionError – if option_name is not a member of OPTIONS.
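The matching rules described for the name parameter can be sketched as a small standalone predicate. This is a reimplementation for illustration only, not the library's code; it mirrors the documented behaviour of bool values, string step names, types, integer object ids, and callable predicates:

```python
def option_matches(value, piece, name=None):
    """Sketch of ConvertOptions.has() selector matching (illustrative only)."""
    # bool values enable or disable the option globally
    if isinstance(value, bool):
        return value
    # otherwise `value` is a set of selectors
    for selector in value:
        if isinstance(selector, str):
            # string selectors match the pipeline step name; ignored when name is None
            if name is not None and selector == name:
                return True
        elif isinstance(selector, type):
            # type selectors match by estimator class
            if isinstance(piece, selector):
                return True
        elif isinstance(selector, int):
            # integer selectors match one specific object by id()
            if id(piece) == selector:
                return True
        elif callable(selector):
            # callable selectors are predicates over the estimator
            if selector(piece):
                return True
    return False


class DummyTree:  # stand-in for a fitted estimator
    pass

tree = DummyTree()
assert option_matches(True, tree)
assert option_matches({"tree_step"}, tree, name="tree_step")
assert not option_matches({"tree_step"}, tree)      # name is None: strings ignored
assert option_matches({DummyTree}, tree)            # match by type
assert option_matches({id(tree)}, tree)             # match by object id
```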

class yobx.sklearn.NoKnownOutputMixin[source]#

Mixin for custom sklearn estimators that produce a variable or non-standard number of ONNX outputs.

By default the ONNX converter infrastructure infers the expected output names from the estimator type (classifier, regressor, transformer, …) and from get_feature_names_out(). For estimators whose outputs cannot be determined by those heuristics — for example a transformer that returns multiple named columns — this mixin instructs get_output_names() to return None so that the converter is given full control over how many outputs it registers.

Usage#

Inherit from both BaseEstimator (or any sklearn base class) and NoKnownOutputMixin when writing a custom converter that needs to emit an arbitrary set of ONNX outputs:

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from yobx.sklearn import NoKnownOutputMixin

class MyMultiOutputTransformer(BaseEstimator, TransformerMixin, NoKnownOutputMixin):
    def fit(self, X=None, y=None):
        self.input_dtypes_ = {"a": np.dtype("float32"), "b": np.dtype("float32")}
        return self

    def transform(self, df):
        return df[["a", "b"]].assign(total=df["a"] + df["b"])

    def get_feature_names_out(self, input_features=None):
        return ["a", "b", "total"]

The paired extra_converters entry is then free to call g.make_output(...) for each output without the framework complaining about a mismatched output count.
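The paired converter might then look like the following sketch. The converter signature and the make_node builder method are assumptions for illustration; only make_output(...) is mentioned in the text above:

```python
# Hypothetical converter paired with MyMultiOutputTransformer.
# The (g, estimator, inputs) signature and g.make_node are assumed, not
# documented API; the point is that the converter registers as many
# outputs as it needs, since the mixin reports None output names.
def convert_my_multi_output(g, estimator, inputs):
    a, b = inputs
    total = g.make_node("Add", [a, b])  # assumed builder method
    g.make_output(a)
    g.make_output(b)
    g.make_output(total)
```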

class yobx.sklearn.TraceableMixin[source]#

Marks an estimator as traceable: its transform method is traced to export it into ONNX.
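A transform is a good candidate for tracing when it uses only plain array arithmetic. The sketch below shows such a transformer; in real use the class would also inherit yobx.sklearn.TraceableMixin (and typically sklearn's BaseEstimator / TransformerMixin), both omitted here so the sketch stays self-contained. Whether tracing supports a given numpy operation is not documented here:

```python
import numpy as np

class CenterScale:
    """Transformer whose transform is plain array arithmetic (trace-friendly)."""

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0) + 1e-8  # avoid division by zero
        return self

    def transform(self, X):
        # straightforward to trace into ONNX Sub / Div nodes
        return (X - self.mean_) / self.scale_

X = np.arange(12, dtype=np.float32).reshape(4, 3)
Z = CenterScale().fit(X).transform(X)
```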