.. _l-design-dataframe-pipeline:

DataFrame Input to a Pipeline with ColumnTransformer
=====================================================

This page explains how :func:`yobx.sklearn.to_onnx` handles a
:class:`pandas.DataFrame` as the dummy input when the model being converted
contains a :class:`~sklearn.compose.ColumnTransformer`.

Overview
--------

A common :epkg:`scikit-learn` pattern is to build a
:class:`~sklearn.pipeline.Pipeline` whose first step is a
:class:`~sklearn.compose.ColumnTransformer`. The transformer selects subsets
of columns by **name**, applies a different preprocessing step to each subset,
and concatenates the results. For example:

.. code-block:: python

    import pandas as pd
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame(
        np.random.randn(100, 4).astype(np.float32),
        columns=["age", "income", "score", "balance"],
    )
    y = (df["age"] > 0).astype(int).to_numpy()

    ct = ColumnTransformer([
        ("std", StandardScaler(), ["age", "income"]),
        ("mm", MinMaxScaler(), ["score", "balance"]),
    ])
    pipe = Pipeline([("preprocessor", ct), ("clf", LogisticRegression())])
    pipe.fit(df, y)

To convert this pipeline to ONNX, pass the :class:`~pandas.DataFrame` used
for fitting directly as the dummy input:

.. code-block:: python

    from yobx.sklearn import to_onnx

    onx = to_onnx(pipe, (df,))

Per-column ONNX inputs
----------------------

When a :class:`~pandas.DataFrame` is detected, :func:`yobx.sklearn.to_onnx`
expands it column by column: each column is registered as a **separate 1-D
ONNX graph input** named after the column. Each input then goes through an
``Unsqueeze`` node, which turns its ``(batch,)`` tensor into ``(batch, 1)``,
and a single ``Concat`` node assembles the results back into the 2-D matrix
``(batch, n_cols)`` expected by the converter.

.. code-block:: text

    "age"     ──Unsqueeze──┐
    "income"  ──Unsqueeze──┤  Concat ──► X (batch, 4) ──► pipeline ...
    "score"   ──Unsqueeze──┤
    "balance" ──Unsqueeze──┘

This produces an ONNX model with four separate inputs instead of a single
``X`` matrix, which matches the interface of a serving endpoint that receives
one scalar value per feature.

String column selectors
-----------------------

scikit-learn's :class:`~sklearn.compose.ColumnTransformer` stores the original
column specification in ``transformers_`` after fitting. When columns are
specified by **name** (strings), those names are preserved in
``transformers_``, whereas ``n_features_in_`` records only the total number
of input features.

The ONNX converter resolves string column names to integer positions using
``feature_names_in_``, which scikit-learn automatically sets on the
:class:`~sklearn.compose.ColumnTransformer` when it is fitted on a DataFrame:

.. code-block:: python

    # After fitting:
    print(ct.feature_names_in_)
    # ['age' 'income' 'score' 'balance']

    print(ct.transformers_)
    # [('std', StandardScaler(), ['age', 'income']),
    #  ('mm', MinMaxScaler(), ['score', 'balance'])]

The mapping ``name → index`` is computed once and reused for every transformer
entry in ``transformers_``.

Inference
---------

At inference time the ONNX model expects one **1-D array** per column,
matching the graph inputs created during conversion:

.. code-block:: python

    import onnxruntime

    sess = onnxruntime.InferenceSession(onx.proto.SerializeToString())

    # Any DataFrame with the training columns and dtypes works;
    # the first rows of df are reused here for illustration.
    df_test = df.head(5)

    # Pass a dict with one key per column:
    feed = {col: df_test[col].to_numpy() for col in df_test.columns}
    labels, probas = sess.run(None, feed)

Full working example
--------------------

See :ref:`l-plot-sklearn-dataframe-pipeline` for a complete runnable example
that trains the pipeline, converts it to ONNX, runs inference with
:epkg:`onnxruntime`, and verifies that the outputs match scikit-learn.