scikit-learn Export to ONNX
See also
Numpy-Tracing and FunctionTransformer — the numpy-tracing
mechanism used by FunctionTransformer
is documented in the core design section.
A basic scikit-learn model may look like the following:
a scaler followed by an estimator. Every model can be converted
with to_onnx().
Pipeline(steps=[('scaler', StandardScaler()), ('clf', LogisticRegression())])
Models based on scikit-learn are made of a custom collection of known transformers or estimators. The main conversion function has to call a converter for every piece and assemble the results into a single ONNX model.
The custom collection is mainly built through the classes
sklearn.pipeline.Pipeline, sklearn.pipeline.FeatureUnion and
sklearn.compose.ColumnTransformer. Everything else is well defined
and can be mapped to its ONNX counterpart. Composition also happens
through meta-estimators combining other models, such as
sklearn.multiclass.OneVsRestClassifier or sklearn.ensemble.VotingClassifier.
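A small sketch of such a nested collection, using only scikit-learn itself (the column indices and estimators below are arbitrary illustration): a converter has to walk this tree and handle every leaf before assembling a single graph.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A nested model: a ColumnTransformer inside a Pipeline.
pre = ColumnTransformer(
    [("num", StandardScaler(), [0, 1]), ("rng", MinMaxScaler(), [2, 3])]
)
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4)).astype(np.float32)
y = (X[:, 0] > 0).astype(np.int64)
pipe.fit(X, y)

# The leaves a converter would have to visit, one converter per leaf.
leaves = [type(t).__name__ for _, t, _ in pipe.named_steps["pre"].transformers]
leaves.append(type(pipe.named_steps["clf"]).__name__)
print(leaves)
```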
Common API
Putting ONNX nodes together into a model is not difficult, but almost everybody
has already implemented their own way of doing it: ir-py, onnxscript,
spox. Every converting library also has its own: sklearn-onnx,
tensorflow-onnx, onnxmltools… The choice was made not to create
a new one but rather to define what the converters expect to find in a class
called GraphBuilder. It then becomes possible to create a bridge
such as yobx.builder.onnxscript.OnnxScriptGraphBuilder, which implements
this API on top of one of the known toolkits. See Expected API for further details.
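The actual contract lives in Expected API; as a rough, hypothetical sketch (the class and method names below are illustrative, not yobx's real signatures), a builder mostly needs to record nodes and hand fresh output names back to the converters:

```python
class TinyGraphBuilder:
    """Hypothetical, minimal stand-in for a GraphBuilder:
    it only records nodes; a real implementation would also track
    shapes, types, initializers and opsets."""

    def __init__(self):
        self.nodes = []
        self._n = 0

    def unique_name(self, prefix="t"):
        # Fresh intermediate names so converters never collide.
        self._n += 1
        return f"{prefix}_{self._n}"

    def make_node(self, op_type, inputs, n_outputs=1):
        outputs = [self.unique_name(op_type.lower()) for _ in range(n_outputs)]
        self.nodes.append((op_type, list(inputs), outputs))
        return outputs


# A StandardScaler converter could then emit (X - mean) / scale:
g = TinyGraphBuilder()
(sub,) = g.make_node("Sub", ["X", "mean"])
(out,) = g.make_node("Div", [sub, "scale"])
print([n[0] for n in g.nodes], out)
```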
Opsets
yobx.sklearn.to_onnx() converts scikit-learn models into
ONNX. The function exposes an argument target_opset.
The conversion is done for opset 18 if target_opset==18.
The conversion may include optimized kernels for onnxruntime
if target_opsets={'': 18, 'com.microsoft': 1} (see ONNX Runtime Contrib Ops (com.microsoft domain)).
Discrepancies
scikit-learn==1.8 is stricter with computation types and the number of discrepancies is reduced. Switching to float32 in a matrix multiplication when the order of magnitude of the coefficients is quite large usually introduces discrepancies. That is often the case when a matrix is the inverse of another one (see Float32 vs Float64: precision loss with PLSRegression). Prior to that, it was not rare to see huge differences when using a model just after normalizing the data. The normalizer was implicitly switching the type to float64 while ONNX was keeping float32. If followed by a tree, a small difference could make the model choose a different decision path and produce a very different output.
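The effect is easy to reproduce with numpy alone (a small sketch, not tied to yobx): inverting the same matrix in float32 instead of float64 leaves a much larger residual when checked against the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))

# Same inverse computed in both precisions.
inv64 = np.linalg.inv(A.astype(np.float64))
inv32 = np.linalg.inv(A.astype(np.float32))

# Residual of A @ inv(A) against the identity: float64 is close
# to machine precision, float32 is several orders of magnitude worse.
err64 = np.abs(A @ inv64 - np.eye(50)).max()
err32 = np.abs(A.astype(np.float32) @ inv32 - np.eye(50)).max()
print(err64, err32)
```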
Finally, the example given at the top of the page would be converted into the model which follows.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from yobx.sklearn import to_onnx
from yobx.helpers.dot_helper import to_dot
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4)).astype(np.float32)
y = (X[:, 0] > 0).astype(np.int64)
pipe = Pipeline([
("scaler", StandardScaler()),
("clf", LogisticRegression()),
]).fit(X, y)
model = to_onnx(pipe, (X,))
print("DOT-SECTION", to_dot(model))