DataFrame Input to a Pipeline with ColumnTransformer

This page explains how yobx.sklearn.to_onnx() handles a pandas.DataFrame as the dummy input when the model being converted contains a ColumnTransformer.

Overview

A common scikit-learn pattern is to build a Pipeline whose first step is a ColumnTransformer. The transformer selects subsets of columns by name, applies a different preprocessing step to each subset, and concatenates the results. For example:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame(
    np.random.randn(100, 4).astype(np.float32),
    columns=["age", "income", "score", "balance"],
)
y = (df["age"] > 0).astype(int).to_numpy()

ct = ColumnTransformer([
    ("std", StandardScaler(),  ["age", "income"]),
    ("mm",  MinMaxScaler(),    ["score", "balance"]),
])
pipe = Pipeline([("preprocessor", ct), ("clf", LogisticRegression())])
pipe.fit(df, y)

To convert this pipeline to ONNX, pass the DataFrame used for fitting directly as the dummy input:

from yobx.sklearn import to_onnx

onx = to_onnx(pipe, (df,))

Per-column ONNX inputs

When a DataFrame is detected, yobx.sklearn.to_onnx() expands it column-by-column: each column is registered as a separate 1-D ONNX graph input named after the column. An Unsqueeze + Concat node sequence assembles the per-column tensors back into the 2-D matrix (batch, n_cols) expected by the converter.

"age"     ──Unsqueeze──┐
"income"  ──Unsqueeze──┤ Concat ──► X (batch, 4) ──► pipeline ...
"score"   ──Unsqueeze──┤
"balance" ──Unsqueeze──┘

This produces an ONNX model with four separate inputs instead of a single X matrix, which matches the interface of a serving endpoint that receives one scalar value per feature.
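
The reshaping performed by the Unsqueeze + Concat sequence can be mimicked in NumPy. The snippet below is only an illustrative sketch, not the converter's code: the sample values are made up, and it assumes both operators act on axis 1, which is what the (batch, n_cols) target shape implies.

```python
import numpy as np

# Hypothetical per-column inputs: one 1-D array of shape (batch,) per feature.
columns = {
    "age": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "income": np.array([1.0, 2.0, 3.0], dtype=np.float32),
    "score": np.array([10.0, 20.0, 30.0], dtype=np.float32),
    "balance": np.array([5.0, 6.0, 7.0], dtype=np.float32),
}

# Unsqueeze on axis 1 (assumed): (batch,) -> (batch, 1)
unsqueezed = [np.expand_dims(col, axis=1) for col in columns.values()]

# Concat on axis 1 (assumed): four (batch, 1) tensors -> (batch, 4)
X = np.concatenate(unsqueezed, axis=1)

print(X.shape)  # (3, 4)
```

Column order in X follows the order in which the columns were registered, so column 1 of X holds the "income" values.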

String column selectors

scikit-learn’s ColumnTransformer stores the original column specification in transformers_ after fitting. When columns are specified by name (strings), those names are kept as-is in transformers_ rather than converted to integer positions; n_features_in_ records only the total number of input features.

The ONNX converter resolves string column names to integer positions using feature_names_in_, which scikit-learn automatically sets on the ColumnTransformer when it is fitted on a DataFrame:

# After fitting:
print(ct.feature_names_in_)
# ['age' 'income' 'score' 'balance']

print(ct.transformers_)
# [('std', StandardScaler(), ['age', 'income']),
#  ('mm',  MinMaxScaler(),   ['score', 'balance'])]

The name → index mapping is computed once and reused for every transformer entry in transformers_.
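
A minimal sketch of this resolution step is shown below. It is not the converter's implementation: resolve_columns is a hypothetical helper, and the feature names and selectors are hard-coded to mirror the example above instead of being read from a fitted ColumnTransformer.

```python
import numpy as np

# Stand-in for ct.feature_names_in_ (scikit-learn stores an object array).
feature_names_in = np.array(["age", "income", "score", "balance"], dtype=object)

# Built once, then shared by every transformer entry.
name_to_index = {name: i for i, name in enumerate(feature_names_in)}


def resolve_columns(selector, name_to_index):
    """Hypothetical helper: map a list of column names (or integer
    positions) to a list of integer positions."""
    if isinstance(selector, str):
        selector = [selector]
    resolved = []
    for col in selector:
        if isinstance(col, str):
            if col not in name_to_index:
                raise KeyError(f"unknown column name: {col!r}")
            resolved.append(name_to_index[col])
        else:
            # Already an integer position; keep it unchanged.
            resolved.append(int(col))
    return resolved


# Stand-in for the (name, columns) parts of ct.transformers_.
transformers = [("std", ["age", "income"]), ("mm", ["score", "balance"])]
positions = {
    name: resolve_columns(cols, name_to_index) for name, cols in transformers
}
print(positions)  # {'std': [0, 1], 'mm': [2, 3]}
```

Raising on an unknown name makes a mismatch between the selectors and the fitted feature names visible at conversion time rather than at inference time.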

Inference

At inference time the ONNX model expects one 1-D array per column, matching the graph inputs created during conversion:

import onnxruntime

sess = onnxruntime.InferenceSession(onx.proto.SerializeToString())

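# Pass a dict with one key per column. df_test stands for any DataFrame with
# the same columns and dtypes as the training data (here, a slice of df):
df_test = df.head(5)
feed = {col: df_test[col].to_numpy() for col in df_test.columns}
labels, probas = sess.run(None, feed)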
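
onnxruntime rejects a feed whose array dtype differs from the declared input type, so it can help to build the feed defensively. The helper below is a hypothetical sketch (build_feed is not part of yobx or onnxruntime). It assumes the graph inputs are float32, as they would be when the dummy DataFrame holds float32 columns, and it accepts any mapping from column name to array-like, such as a dict or a DataFrame.

```python
import numpy as np


def build_feed(columns, input_names, dtype=np.float32):
    """Hypothetical helper: build a feed dict with one 1-D array per
    graph input, cast to the dtype assumed for the ONNX inputs."""
    missing = [name for name in input_names if name not in columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return {name: np.asarray(columns[name], dtype=dtype) for name in input_names}


# Made-up test data stored as float64, as pandas often produces by default.
raw = {
    "age": np.array([0.5, -1.2]),
    "income": np.array([2.0, 0.3]),
    "score": np.array([1.5, 0.1]),
    "balance": np.array([-0.7, 0.9]),
}
feed = build_feed(raw, ["age", "income", "score", "balance"])
print({name: (arr.dtype, arr.shape) for name, arr in feed.items()})
```

With a live session, the input names can be read from the model with [i.name for i in sess.get_inputs()] instead of being listed by hand.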

Full working example

See the example "DataFrame input to a Pipeline with ColumnTransformer" for a complete runnable script that trains the pipeline, converts it to ONNX, runs inference with onnxruntime, and verifies that the outputs match scikit-learn.