Converting a scikit-learn KMeans to ONNX#

yobx.sklearn.to_onnx() converts a fitted sklearn.cluster.KMeans into an onnx.ModelProto that can be executed with any ONNX-compatible runtime.

The converted model produces two outputs:

  • label - cluster index for each sample (equivalent to predict()).

  • distances - Euclidean distance from each sample to every centroid (equivalent to transform()).

The workflow is:

  1. Train a KMeans as usual.

  2. Call yobx.sklearn.to_onnx() with a representative dummy input.

  3. Run the ONNX model with any ONNX runtime — this example uses onnxruntime.

  4. Verify that the ONNX outputs match scikit-learn’s predictions.

import numpy as np
import onnxruntime
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from yobx.doc import plot_dot
from yobx.sklearn import to_onnx

1. Train a KMeans model#

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)).astype(np.float32)

km = KMeans(n_clusters=3, random_state=0, n_init=10)
km.fit(X)
KMeans(n_clusters=3, n_init=10, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.


2. Convert to ONNX#

onx = to_onnx(km, (X,))
print(f"ONNX model inputs : {[i.name for i in onx.graph.input]}")
print(f"ONNX model outputs: {[o.name for o in onx.graph.output]}")
ONNX model inputs : ['X']
ONNX model outputs: ['label', 'distances']

3. Run the ONNX model and compare outputs#

ref = onnxruntime.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
label_onnx, distances_onnx = ref.run(None, {"X": X})

label_sk = km.predict(X).astype(np.int64)
distances_sk = km.transform(X).astype(np.float32)

print("\nFirst 5 labels (sklearn):", label_sk[:5])
print("First 5 labels (ONNX)   :", label_onnx[:5])

print("\nFirst 5 distances (sklearn):", distances_sk[:5].round(4))
print("First 5 distances (ONNX)   :", distances_onnx[:5].round(4))

assert (label_sk == label_onnx).all(), "Labels differ!"
assert np.allclose(distances_sk, distances_onnx, atol=1e-4), "Distances differ!"
print("\nAll labels and distances match ✓")
First 5 labels (sklearn): [2 1 0 0 1]
First 5 labels (ONNX)   : [2 1 0 0 1]

First 5 distances (sklearn): [[1.9018 1.3076 1.0513]
 [2.7406 1.1723 2.3066]
 [0.7925 2.3442 2.1114]
 [1.8823 3.3654 3.1916]
 [1.8663 0.9522 2.1298]]
First 5 distances (ONNX)   : [[1.9018 1.3076 1.0513]
 [2.7406 1.1723 2.3066]
 [0.7925 2.3442 2.1114]
 [1.8823 3.3654 3.1916]
 [1.8663 0.9522 2.1298]]

All labels and distances match ✓

4. KMeans inside a Pipeline#

KMeans also works as the final step of a Pipeline.

pipe = Pipeline(
    [("scaler", StandardScaler()), ("km", KMeans(n_clusters=3, random_state=0, n_init=10))]
)
pipe.fit(X)

onx_pipe = to_onnx(pipe, (X,))

ref_pipe = onnxruntime.InferenceSession(
    onx_pipe.SerializeToString(), providers=["CPUExecutionProvider"]
)
label_pipe_onnx, _ = ref_pipe.run(None, {"X": X})
label_pipe_sk = pipe.predict(X).astype(np.int64)

assert (label_pipe_sk == label_pipe_onnx).all(), "Pipeline labels differ!"
print("Pipeline labels match ✓")
Pipeline labels match ✓

5. Visualize the pipeline#

plot_dot(onx_pipe)
plot sklearn kmeans

Total running time of the script: (0 minutes 2.437 seconds)

Related examples

Converting a scikit-learn Pipeline to ONNX

Converting a scikit-learn Pipeline to ONNX

Exporting sklearn tree models with convert options

Exporting sklearn tree models with convert options

Exporting FunctionTransformer with numpy tracing

Exporting FunctionTransformer with numpy tracing

Gallery generated by Sphinx-Gallery