DataFrame Input to a Pipeline with ColumnTransformer#
This page explains how yobx.sklearn.to_onnx() handles a
pandas.DataFrame as the dummy input when the model being converted
contains a ColumnTransformer.
Overview#
A common scikit-learn pattern is to build a
Pipeline whose first step is a
ColumnTransformer. The transformer selects
subsets of columns by name, applies a different preprocessing step to
each subset, and concatenates the results. For example:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.DataFrame(
np.random.randn(100, 4).astype(np.float32),
columns=["age", "income", "score", "balance"],
)
y = (df["age"] > 0).astype(int).to_numpy()
ct = ColumnTransformer([
("std", StandardScaler(), ["age", "income"]),
("mm", MinMaxScaler(), ["score", "balance"]),
])
pipe = Pipeline([("preprocessor", ct), ("clf", LogisticRegression())])
pipe.fit(df, y)
To convert this pipeline to ONNX, pass the fitted DataFrame
directly as the dummy input:
from yobx.sklearn import to_onnx
onx = to_onnx(pipe, (df,))
Per-column ONNX inputs#
When a DataFrame is detected, yobx.sklearn.to_onnx()
expands it column-by-column: each column is registered as a separate 1-D
ONNX graph input named after the column. An Unsqueeze + Concat
node sequence assembles the per-column tensors back into the 2-D matrix
(batch, n_cols) expected by the converter.
"age" ──Unsqueeze──┐
"income" ──Unsqueeze──┤ Concat ──► X (batch, 4) ──► pipeline ...
"score" ──Unsqueeze──┤
"balance" ──Unsqueeze──┘
This produces an ONNX model with four separate inputs instead of a single
X matrix, which matches the interface of a serving endpoint that
receives one scalar value per feature.
String column selectors#
scikit-learn’s ColumnTransformer stores the
original column specification in transformers_ after fitting. When
columns are specified by name (strings), those names are preserved in
transformers_; only the overall number of features is stored in
n_features_in_.
The ONNX converter resolves string column names to integer positions using
feature_names_in_, which scikit-learn automatically sets on the
ColumnTransformer when it is fitted on a
DataFrame:
# After fitting:
print(ct.feature_names_in_)
# ['age' 'income' 'score' 'balance']
print(ct.transformers_)
# [('std', StandardScaler(), ['age', 'income']),
# ('mm', MinMaxScaler(), ['score', 'balance'])]
The mapping name → index is computed once and reused for every
transformer entry in transformers_.
Inference#
At inference time the ONNX model expects one 1-D array per column, matching the graph inputs created during conversion:
import onnxruntime
sess = onnxruntime.InferenceSession(onx.proto.SerializeToString())
# Pass a dict with one key per column:
feed = {col: df_test[col].to_numpy() for col in df.columns}
labels, probas = sess.run(None, feed)
Full working example#
See DataFrame input to a Pipeline with ColumnTransformer for a complete runnable example that trains the pipeline, converts it to ONNX, runs inference with onnxruntime, and verifies that the outputs match scikit-learn.