
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples_sklearn/plot_sklearn_dataframe_pipeline.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_sklearn_plot_sklearn_dataframe_pipeline.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_sklearn_plot_sklearn_dataframe_pipeline.py:


.. _l-plot-sklearn-dataframe-pipeline:

DataFrame input to a Pipeline with ColumnTransformer
=====================================================

This example shows how to convert a :epkg:`scikit-learn`
:class:`~sklearn.pipeline.Pipeline` whose first step is a
:class:`~sklearn.compose.ColumnTransformer` when the training data is a
:class:`pandas.DataFrame`.

When a :class:`~pandas.DataFrame` is passed as the dummy input to
:func:`yobx.sklearn.to_onnx`, each column is registered as a **separate
ONNX graph input** named after the column; in this example every input has
shape ``(batch, 1)``. An ``Unsqueeze`` + ``Concat`` node sequence assembles
the per-column tensors back into the 2-D matrix that the rest of the pipeline
expects.

The :class:`~sklearn.compose.ColumnTransformer` may reference columns by
**name** (strings) rather than by integer position; *yobx* resolves the names
to integer indices using the ``feature_names_in_`` attribute that scikit-learn
stores during fitting.

This example covers:

1. **ColumnTransformer only**: two scalers applied to different named columns.
2. **Pipeline**: ColumnTransformer followed by a classifier, taking a
   DataFrame as input.
3. **Validation**: confirming that ONNX and scikit-learn produce identical
   predictions.
4. **Visualisation**: inspecting the ONNX graph.

See :ref:`l-design-dataframe-pipeline` for a deeper explanation of how
DataFrame inputs are handled during conversion.

.. GENERATED FROM PYTHON SOURCE LINES 35-47

.. code-block:: Python


    import numpy as np
    import pandas as pd
    import onnxruntime
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, StandardScaler
    from yobx.doc import plot_dot
    from yobx.sklearn import to_onnx


.. GENERATED FROM PYTHON SOURCE LINES 48-54

1. Build a labelled dataset
----------------------------

We create a small :class:`~pandas.DataFrame` with four named columns that
mimic a typical tabular dataset: two numeric features that will be
standardised and two that will be min-max scaled.

.. GENERATED FROM PYTHON SOURCE LINES 54-66

.. code-block:: Python


    rng = np.random.default_rng(0)
    n_samples = 120

    X_raw = rng.standard_normal((n_samples, 4)).astype(np.float32)
    df = pd.DataFrame(X_raw, columns=["age", "income", "score", "balance"])
    y = ((df["age"] + df["income"]) > 0).astype(int).to_numpy()

    print("Dataset shape:", df.shape)
    print("Column dtypes:\n", df.dtypes)
    print("Class distribution:", np.bincount(y))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Dataset shape: (120, 4)
    Column dtypes:
     age        float32
    income     float32
    score      float32
    balance    float32
    dtype: object
    Class distribution: [58 62]


.. GENERATED FROM PYTHON SOURCE LINES 67-77

2. Build and fit the pipeline
------------------------------

:class:`~sklearn.compose.ColumnTransformer` references columns by **name**:

* ``age`` and ``income`` → :class:`~sklearn.preprocessing.StandardScaler`
* ``score`` and ``balance`` → :class:`~sklearn.preprocessing.MinMaxScaler`

A :class:`~sklearn.linear_model.LogisticRegression` classifier is appended as
the final step.

.. GENERATED FROM PYTHON SOURCE LINES 77-89

.. code-block:: Python


    ct = ColumnTransformer(
        [("std", StandardScaler(), ["age", "income"]), ("mm", MinMaxScaler(), ["score", "balance"])]
    )
    pipe = Pipeline([("preprocessor", ct), ("clf", LogisticRegression(max_iter=500))])
    # max_iter=500 avoids ConvergenceWarnings on some random seeds.
    pipe.fit(df, y)

    print("\nPipeline steps:")
    for step_name, step in pipe.steps:
        print(f"  {step_name}: {step}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Pipeline steps:
      preprocessor: ColumnTransformer(transformers=[('std', StandardScaler(), ['age', 'income']),
                                    ('mm', MinMaxScaler(), ['score', 'balance'])])
      clf: LogisticRegression(max_iter=500)


.. GENERATED FROM PYTHON SOURCE LINES 90-97

3. Convert to ONNX using a DataFrame as the dummy input
--------------------------------------------------------

Passing *df* directly to :func:`~yobx.sklearn.to_onnx` triggers per-column
ONNX inputs. The ColumnTransformer's string column selectors are resolved to
integer positions via the ``feature_names_in_`` attribute that scikit-learn
sets during ``fit``.

.. GENERATED FROM PYTHON SOURCE LINES 97-109

.. code-block:: Python


    onx = to_onnx(pipe, (df,))

    print("\nONNX graph inputs:")
    for inp in onx.proto.graph.input:
        shape = inp.type.tensor_type.shape
        dims = [d.dim_param or d.dim_value for d in shape.dim]
        print(f"  {inp.name!r:12s} shape={dims}")

    print("\nONNX graph outputs:", [out.name for out in onx.proto.graph.output])
    print("Number of nodes    :", len(onx.proto.graph.node))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ONNX graph inputs:
      'age'        shape=['batch', 1]
      'income'     shape=['batch', 1]
      'score'      shape=['batch', 1]
      'balance'    shape=['batch', 1]

    ONNX graph outputs: ['label', 'probabilities']
    Number of nodes    : 13


.. GENERATED FROM PYTHON SOURCE LINES 110-115

4. Run the ONNX model and compare with scikit-learn
----------------------------------------------------

ONNX Runtime expects one array per column, of shape ``(n_samples, 1)``,
matching the graph inputs registered during conversion. Selecting with
``df_test[[col]]`` (double brackets) keeps that second dimension.

.. GENERATED FROM PYTHON SOURCE LINES 115-136

.. code-block:: Python


    X_test_raw = rng.standard_normal((30, 4)).astype(np.float32)
    df_test = pd.DataFrame(X_test_raw, columns=df.columns)

    feed = {col: df_test[[col]].values for col in df.columns}

    sess = onnxruntime.InferenceSession(
        onx.proto.SerializeToString(), providers=["CPUExecutionProvider"]
    )
    label_onnx, proba_onnx = sess.run(None, feed)

    label_sk = pipe.predict(df_test)
    proba_sk = pipe.predict_proba(df_test).astype(np.float32)

    print("\nFirst 5 labels (sklearn):", label_sk[:5])
    print("First 5 labels (ONNX)   :", label_onnx[:5])

    assert np.array_equal(label_sk, label_onnx), "Label mismatch!"
    assert np.allclose(proba_sk, proba_onnx, atol=1e-5), "Probability mismatch!"
    print("\nAll predictions match ✓")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    First 5 labels (sklearn): [1 0 1 0 1]
    First 5 labels (ONNX)   : [1 0 1 0 1]

    All predictions match ✓


.. GENERATED FROM PYTHON SOURCE LINES 137-141

5. Standalone ColumnTransformer (no classifier)
-------------------------------------------------

The same pattern works when converting only the preprocessing part.

.. GENERATED FROM PYTHON SOURCE LINES 141-157

.. code-block:: Python


    onx_ct = to_onnx(ct, (df,))

    print("\nColumnTransformer ONNX inputs:")
    for inp in onx_ct.proto.graph.input:
        print(f"  {inp.name!r}")

    feed_ct = {col: df_test[[col]].values for col in df.columns}
    (ct_out_onnx,) = onnxruntime.InferenceSession(
        onx_ct.proto.SerializeToString(), providers=["CPUExecutionProvider"]
    ).run(None, feed_ct)
    ct_out_sk = ct.transform(df_test).astype(np.float32)

    assert np.allclose(ct_out_sk, ct_out_onnx, atol=1e-5), "ColumnTransformer output mismatch!"
    print("ColumnTransformer output matches sklearn ✓")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ColumnTransformer ONNX inputs:
      'age'
      'income'
      'score'
      'balance'
    ColumnTransformer output matches sklearn ✓


.. GENERATED FROM PYTHON SOURCE LINES 158-164

6. Visualize the ONNX graph
----------------------------

The graph starts with one ``Unsqueeze`` node per column input, followed by
a single ``Concat`` that assembles the 2-D matrix handed to the
ColumnTransformer's ``Gather`` nodes.

.. GENERATED FROM PYTHON SOURCE LINES 164-166

.. code-block:: Python


    plot_dot(onx)


.. image-sg:: /auto_examples_sklearn/images/sphx_glr_plot_sklearn_dataframe_pipeline_001.png
   :alt: plot sklearn dataframe pipeline
   :srcset: /auto_examples_sklearn/images/sphx_glr_plot_sklearn_dataframe_pipeline_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.338 seconds)


.. _sphx_glr_download_auto_examples_sklearn_plot_sklearn_dataframe_pipeline.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_sklearn_dataframe_pipeline.ipynb <plot_sklearn_dataframe_pipeline.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_sklearn_dataframe_pipeline.py <plot_sklearn_dataframe_pipeline.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_sklearn_dataframe_pipeline.zip <plot_sklearn_dataframe_pipeline.zip>`


.. include:: plot_sklearn_dataframe_pipeline.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
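
The column handling described in this example (per-column inputs, a ``Concat``
into one matrix, then ``Gather`` by resolved index) can be pictured with a
small NumPy analogy. This is only a sketch under stated assumptions, not the
converter's implementation: ``resolve_columns`` and ``assemble_matrix`` are
hypothetical helper names, and ``feature_names_in`` is a plain list standing
in for the ``feature_names_in_`` attribute of a fitted estimator.

.. code-block:: Python

    import numpy as np

    # Hypothetical stand-in for ``feature_names_in_`` on a fitted estimator.
    feature_names_in = ["age", "income", "score", "balance"]


    def resolve_columns(names, feature_names):
        """Map column names to integer positions (illustrative helper)."""
        lookup = {name: i for i, name in enumerate(feature_names)}
        return [lookup[name] for name in names]


    def assemble_matrix(feed, feature_names):
        """Concatenate per-column (n, 1) arrays into one (n, n_features) matrix."""
        return np.concatenate([feed[name] for name in feature_names], axis=1)


    rng = np.random.default_rng(0)
    # One (n, 1) float32 array per column, mirroring the feed dictionary above.
    feed = {
        name: rng.standard_normal((5, 1)).astype(np.float32)
        for name in feature_names_in
    }

    X = assemble_matrix(feed, feature_names_in)
    std_idx = resolve_columns(["age", "income"], feature_names_in)
    mm_idx = resolve_columns(["score", "balance"], feature_names_in)

    print(X.shape)          # (5, 4)
    print(std_idx, mm_idx)  # [0, 1] [2, 3]

    # Selecting by resolved index plays the role of a Gather along axis 1.
    std_block = X[:, std_idx]
    mm_block = X[:, mm_idx]

Keeping every input at shape ``(n, 1)`` means the concatenation along axis 1
needs no reshaping, and the name lookup happens once, before any data flows.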