DataFrame Tracing#
yobx can export a Python function that operates on a
TracedDataFrame to ONNX via
dataframe_to_onnx(). Instead of running the function on real
data, the framework passes lightweight proxy objects that record each
DataFrame operation — filtering, column arithmetic, grouping, joining — as an
AST node. Once recording is complete, the accumulated AST is compiled to ONNX
by the same SQL-to-ONNX backend that powers the plain-SQL converter.
Overview#
The mechanism is built on three proxy classes and two driver functions:
TracedDataFrame: the main proxy. It exposes a pandas-inspired API (filter, select, assign, groupby, join, pivot_table, …). Calling any of these methods does not execute the operation; it appends the corresponding AST node and returns a new TracedDataFrame.
TracedSeries: represents a single column expression (a ColumnRef or a derived BinaryExpr). Arithmetic operators (+, -, *, /) and comparisons (>, <, ==, …) return new TracedSeries or TracedCondition objects. The aggregation methods (.sum(), .mean(), .min(), .max(), .count()) wrap the expression in an AggExpr.
TracedCondition: a thin wrapper around a Condition AST node. Boolean operators & (AND) and | (OR) compose conditions.
Driver functions
trace_dataframe(): low-level driver. Builds the proxy frame(s), calls func, and returns the recorded ParsedQuery (or a list of them when the function returns multiple frames).
dataframe_to_onnx(): high-level entry point. Calls trace_dataframe() and then parsed_query_to_onnx() to produce a self-contained ExportArtifact.
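The recording idea described above can be illustrated with a toy model. The sketch below is not yobx code: `Col`, `BinOp`, `ToySeries`, `ToyFrame` and `toy_trace` are hypothetical names invented for this illustration, and the real proxies record richer AST nodes. It only shows how operator overloading lets a function run on proxies and leave behind a list of recorded operations instead of computed values.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Col:
    """Toy AST leaf: a reference to a named column."""
    name: str


@dataclass(frozen=True)
class BinOp:
    """Toy AST node for arithmetic, comparisons and boolean combinations."""
    op: str
    left: Any
    right: Any


class ToySeries:
    """Proxy for a column expression; operators build nodes, nothing is computed."""

    def __init__(self, expr: Any):
        self.expr = expr

    def _bin(self, op: str, other: Any) -> "ToySeries":
        # Unwrap other proxies; plain Python values are stored as literals.
        rhs = other.expr if isinstance(other, ToySeries) else other
        return ToySeries(BinOp(op, self.expr, rhs))

    def __add__(self, other: Any) -> "ToySeries":
        return self._bin("+", other)

    def __gt__(self, other: Any) -> "ToySeries":
        return self._bin(">", other)

    def __and__(self, other: Any) -> "ToySeries":
        return self._bin("AND", other)


class ToyFrame:
    """Proxy frame: each method returns a new frame with one more recorded op."""

    def __init__(self, ops: Tuple = ()):
        self.ops = ops

    def __getitem__(self, name: str) -> ToySeries:
        return ToySeries(Col(name))

    def filter(self, cond: ToySeries) -> "ToyFrame":
        return ToyFrame(self.ops + (("filter", cond.expr),))

    def select(self, exprs: List[ToySeries]) -> "ToyFrame":
        return ToyFrame(self.ops + (("select", tuple(e.expr for e in exprs)),))


def toy_trace(func: Callable[[ToyFrame], ToyFrame]) -> Tuple:
    """Call func on an empty proxy frame and return the recorded operations."""
    return func(ToyFrame()).ops


def transform(df):
    kept = df.filter((df["a"] > 0) & (df["b"] > 0))
    return kept.select([df["a"] + df["b"]])


recorded = toy_trace(transform)
for kind, payload in recorded:
    print(kind, payload)
```

No column data exists at any point in this sketch; `transform` only ever sees proxies, which is why the same function can later be compiled rather than executed.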
High-level entry point#
dataframe_to_onnx() is the recommended way to convert a
DataFrame function to ONNX:
def dataframe_to_onnx(
    func: Callable,
    input_dtypes: Union[Dict[str, dtype], List[Dict[str, dtype]]],
    target_opset: int = DEFAULT_TARGET_OPSET,
    custom_functions: Optional[Dict[str, Callable]] = None,
    builder_cls: type = GraphBuilder,
    filename: Optional[str] = None,
    verbose: int = 0,
) -> ExportArtifact: ...
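| Parameter | Description |
|---|---|
| func | Python callable that accepts one or more TracedDataFrame arguments. |
| input_dtypes | Mapping from column name to dtype for the input frame, or a list of such mappings when func takes several frames. |
| target_opset | ONNX opset version to target. |
| custom_functions | Optional dict mapping SQL function names to Python callables for use inside … |
| builder_cls | Graph builder class; defaults to GraphBuilder. |
| filename | Optional path to write the model file. |
| verbose | Verbosity level (0 = silent). |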
Supported operations#
The table below summarises the TracedDataFrame
operations and their SQL/ONNX equivalents.
| Operation | SQL equivalent |
|---|---|
|  | Column reference |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  |  |
|  | Function composition |
|  | Copy (no-op node) |
|  | Arithmetic expression |
|  | Comparison condition |
|  |  |
|  | Aggregation functions |
|  | Aggregation functions |
|  | Column alias |
Walkthrough examples#
Filter and select#
<<<
import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx
def transform(df):
    filtered = df.filter(df["a"] > 0)
    return filtered.select([(df["a"] + df["b"]).alias("total")])
dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))
>>>
opset: domain='' version=21
input: name='a' type=dtype('float32') shape=['N']
input: name='b' type=dtype('float32') shape=['N']
init: name='filter_mask_r_lit' type=int64 shape=(1,) -- array([0])
CastLike(filter_mask_r_lit, a) -> _onx_castlike_filter_mask_r_lit
Greater(a, _onx_castlike_filter_mask_r_lit) -> _onx_greater_a
Compress(a, _onx_greater_a, axis=0) -> _onx_compress_a
Compress(b, _onx_greater_a, axis=0) -> _onx_compress_b
Add(_onx_compress_a, _onx_compress_b) -> total
output: name='total' type='NOTENSOR' shape=None
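To make the printed graph easier to read, here is a plain-Python walk through the same three steps (Greater, Compress, Add) on made-up sample values. It is a reading aid only; it does not call yobx or an ONNX runtime, and the input values are invented for the illustration.

```python
from itertools import compress

# Hypothetical sample inputs standing in for the ONNX inputs 'a' and 'b'.
a = [1.0, -2.0, 3.0]
b = [10.0, 20.0, 30.0]

# Greater(a, 0): one boolean per row.
mask = [x > 0 for x in a]

# Compress(a, mask) and Compress(b, mask): keep rows where the mask is true.
kept_a = list(compress(a, mask))
kept_b = list(compress(b, mask))

# Add(kept_a, kept_b): the selected expression, evaluated on the surviving rows.
total = [x + y for x, y in zip(kept_a, kept_b)]
print(total)  # [11.0, 33.0]
```

The filter is applied to every input column before the arithmetic runs, which matches the order of the Compress and Add nodes in the listing.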
Group-by aggregation#
The group-by key column must be listed explicitly in the agg output list
alongside the aggregated values:
<<<
import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx
def transform(df):
    return df.groupby("key").agg(
        [df["key"].alias("key"), df["val"].sum().alias("total")]
    )
dtypes = {"key": np.int64, "val": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))
>>>
opset: domain='' version=21
input: name='key' type=dtype('int64') shape=['N']
input: name='val' type=dtype('float32') shape=['N']
init: name='init7_s_0' type=int64 shape=() -- array([0]) -- Opset.make_node.1/Shape
init: name='init7_s1_0' type=int64 shape=(1,) -- array([0]) -- Opset.make_node.1/Shape
Unique(key, sorted=1) -> output_0, _onx_unique_key_1, _onx_unique_key_2, _onx_unique_key_3
Shape(output_0) -> _onx_unique_key_0::Shape:
Gather(_onx_unique_key_0::Shape:, init7_s_0, axis=0) -> _onx_gather_unique_key_0::Shape:
Unsqueeze(_onx_gather_unique_key_0::Shape:, init7_s1_0) -> _onx_gather_unique_key_0::Shape:::UnSq0
ConstantOfShape(_onx_gather_unique_key_0::Shape:::UnSq0) -> _onx_constantofshape_gather_unique_key_0::Shape:::UnSq0
CastLike(_onx_constantofshape_gather_unique_key_0::Shape:::UnSq0, val) -> _onx_castlike_constantofshape_gather_unique_key_0::Shape:::UnSq0
ScatterElements(_onx_castlike_constantofshape_gather_unique_key_0::Shape:::UnSq0, _onx_unique_key_2, val, axis=0, reduction=b'add') -> total
output: name='output_0' type='NOTENSOR' shape=None
output: name='total' type='NOTENSOR' shape=None
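The core of this graph is the pair Unique and ScatterElements with reduction='add'. The plain-Python sketch below mimics that pattern on invented sample data; it assumes, as the ONNX operator specification describes, that the third output of Unique holds the position of each row's key among the sorted distinct keys.

```python
# Hypothetical sample inputs standing in for the ONNX inputs 'key' and 'val'.
key = [2, 1, 2, 3, 1]
val = [1.0, 2.0, 3.0, 4.0, 5.0]

# Unique(key, sorted=1): sorted distinct keys, plus for every row the
# position of its key in that sorted list (the "inverse indices").
unique_keys = sorted(set(key))
position = {k: i for i, k in enumerate(unique_keys)}
inverse = [position[k] for k in key]

# ConstantOfShape: one zero accumulator per distinct key.
total = [0.0] * len(unique_keys)

# ScatterElements(..., reduction='add'): add each value into its key's slot.
for slot, v in zip(inverse, val):
    total[slot] += v

print(unique_keys)  # [1, 2, 3]
print(total)        # [7.0, 4.0, 4.0]
```

The two printed lists correspond to the two graph outputs: the distinct keys and the per-key sums, aligned by position.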
Join two frames#
<<<
import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx
def transform(df1, df2):
    joined = df1.join(df2, left_key="id", right_key="id")
    return joined.select([(df1["a"] + df2["b"]).alias("sum_ab")])
dtypes1 = {"id": np.int64, "a": np.float32}
dtypes2 = {"id": np.int64, "b": np.float32}
artifact = dataframe_to_onnx(transform, [dtypes1, dtypes2])
print(pretty_onnx(artifact.proto))
>>>
opset: domain='' version=21
input: name='id' type=dtype('int64') shape=['N']
input: name='a' type=dtype('float32') shape=['N']
input: name='id_right' type=dtype('int64') shape=['N']
input: name='b' type=dtype('float32') shape=['N']
init: name='init7_s1_1' type=int64 shape=(1,) -- array([1]) -- Opset.make_node.1/Shape##Opset.make_node.1/Shape##ReduceArgTopKPattern.K##ReduceArgTopKPattern.K
init: name='init7_s1_0' type=int64 shape=(1,) -- array([0]) -- Opset.make_node.1/Shape
Unsqueeze(id, init7_s1_1) -> id::UnSq1
Unsqueeze(id_right, init7_s1_0) -> id_right::UnSq0
Equal(id::UnSq1, id_right::UnSq0) -> _onx_equal_id::UnSq1
Cast(_onx_equal_id::UnSq1, to=6) -> _onx_equal_id::UnSq1::C6
TopK(_onx_equal_id::UnSq1::C6, init7_s1_1, axis=1, largest=1) -> ReduceArgTopKPattern__onx_reducemax_equal_id::UnSq1::C6, ReduceArgTopKPattern__onx_argmax_equal_id::UnSq1::C6
Squeeze(ReduceArgTopKPattern__onx_reducemax_equal_id::UnSq1::C6, init7_s1_1) -> _onx_reducemax_equal_id::UnSq1::C6
Cast(_onx_reducemax_equal_id::UnSq1::C6, to=9) -> _onx_reducemax_equal_id::UnSq1::C6::C9
Compress(a, _onx_reducemax_equal_id::UnSq1::C6::C9, axis=0) -> _onx_compress_a
Squeeze(ReduceArgTopKPattern__onx_argmax_equal_id::UnSq1::C6, init7_s1_1) -> _onx_argmax_equal_id::UnSq1::C6
Compress(_onx_argmax_equal_id::UnSq1::C6, _onx_reducemax_equal_id::UnSq1::C6::C9, axis=0) -> _onx_compress_argmax_equal_id::UnSq1::C6
Gather(b, _onx_compress_argmax_equal_id::UnSq1::C6, axis=0) -> _onx_gather_b
Add(_onx_compress_a, _onx_gather_b) -> sum_ab
output: name='sum_ab' type='NOTENSOR' shape=None
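The join graph is the hardest of the listings to follow, so the sketch below replays its steps in plain Python on invented sample data: a pairwise equality matrix, a per-row maximum and its position (the TopK with k=1), a Compress on the left side and a Gather on the right side. This is an interpretation of the printed graph, not yobx code, and the variable names are hypothetical.

```python
# Hypothetical sample inputs for the left frame (id, a) and right frame (id_right, b).
left_id = [1, 2, 3, 4]
a = [10.0, 20.0, 30.0, 40.0]
right_id = [3, 1, 5]
b = [0.5, 0.25, 0.125]

# Unsqueeze + Equal + Cast: match[i][j] is 1 when left row i equals right row j.
match = [[int(lid == rid) for rid in right_id] for lid in left_id]

# TopK(k=1, axis=1): per left row, the largest entry and where it first occurs.
has_match = [max(row) == 1 for row in match]
first_hit = [row.index(max(row)) for row in match]

# Compress: keep only the left rows that found a partner, and their hit positions.
kept_a = [x for x, ok in zip(a, has_match) if ok]
kept_idx = [j for j, ok in zip(first_hit, has_match) if ok]

# Gather: fetch the matching right-hand values, then Add.
gathered_b = [b[j] for j in kept_idx]
sum_ab = [x + y for x, y in zip(kept_a, gathered_b)]
print(sum_ab)  # [10.25, 30.5]
```

Read this way, left rows without a partner are dropped and each remaining left row is paired with a single right row. In this sketch the first matching right row wins; how the exported graph treats duplicate right-hand keys depends on TopK's tie-breaking and is not stated in the text above.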
Multiple output frames#
When func returns a tuple or list of
TracedDataFrame objects, all outputs are gathered
into a single ONNX graph with multiple output tensors:
<<<
import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx
def transform(df):
    out1 = df.select([(df["a"] + df["b"]).alias("sum_ab")])
    out2 = df.select([(df["a"] - df["b"]).alias("diff_ab")])
    return out1, out2
dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))
>>>
opset: domain='' version=21
input: name='a' type=dtype('float32') shape=['N']
input: name='b' type=dtype('float32') shape=['N']
Add(a, b) -> sum_ab
Sub(a, b) -> diff_ab
output: name='sum_ab' type='NOTENSOR' shape=None
output: name='diff_ab' type='NOTENSOR' shape=None
Assign new columns#
assign() adds computed columns while
keeping all existing ones, similar to pandas.DataFrame.assign:
<<<
import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx
def transform(df):
    df = df.assign(scaled=(df["a"] * 2.0).alias("scaled"))
    return df.select(["a", "b", "scaled"])
dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))
>>>
opset: domain='' version=21
input: name='a' type=dtype('float32') shape=['N']
input: name='b' type=dtype('float32') shape=['N']
init: name='select_scaled_r_lit' type=float32 shape=(1,) -- array([2.], dtype=float32)
CastLike(select_scaled_r_lit, a) -> _onx_castlike_select_scaled_r_lit
Mul(a, _onx_castlike_select_scaled_r_lit) -> scaled
Identity(a) -> output_0
Identity(b) -> output_1
output: name='output_0' type='NOTENSOR' shape=None
output: name='output_1' type='NOTENSOR' shape=None
output: name='scaled' type='NOTENSOR' shape=None
Low-level API#
trace_dataframe() is useful when you
want to inspect the recorded AST before compiling to ONNX:
<<<
import numpy as np
from yobx.xtracing.dataframe_trace import trace_dataframe
def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])
pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32})
for op in pq.operations:
    print(type(op).__name__, "—", op)
>>>
FilterOp — FilterOp(condition=Condition(left=ColumnRef(column='a', table=None, dtype=1), op='>', right=Literal(value=0)))
SelectOp — SelectOp(items=[SelectItem(expr=BinaryExpr(left=ColumnRef(column='a', table=None, dtype=1), op='+', right=ColumnRef(column='b', table=None, dtype=1)), alias='total')], distinct=False)
The returned ParsedQuery can then be passed to
parsed_query_to_onnx() to build the ONNX graph.
Relation to the SQL converter#
DataFrame tracing is a front-end for the same SQL-to-ONNX back-end used by
plain SQL strings and Polars LazyFrame inputs. The
TracedDataFrame API records operations as the same
ParsedQuery AST nodes that the SQL parser produces,
so the two paths share identical ONNX code generation.
The unified entry point yobx.sql.to_onnx() accepts all three flavours
and delegates automatically:
a string → SQL parser → sql_to_onnx()
a Polars LazyFrame → lazyframe_to_onnx()
a callable → dataframe_to_onnx()
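A type-based dispatch of this kind can be sketched in a few lines. The code below is a hypothetical illustration, not the implementation of yobx.sql.to_onnx(): `pick_converter` and `FakeLazyFrame` are invented names, the stand-in class replaces polars.LazyFrame so the example has no dependencies, and the function returns only the name of the converter it would delegate to.

```python
from typing import Any


class FakeLazyFrame:
    """Hypothetical stand-in for polars.LazyFrame, used only in this sketch."""


def pick_converter(source: Any) -> str:
    """Return the name of the converter a type-based dispatcher would choose."""
    if isinstance(source, str):
        # SQL text goes through the SQL parser.
        return "sql_to_onnx"
    if isinstance(source, FakeLazyFrame):
        return "lazyframe_to_onnx"
    if callable(source):
        # A Python function is traced with proxy frames.
        return "dataframe_to_onnx"
    raise TypeError(f"unsupported input type: {type(source).__name__}")


print(pick_converter("SELECT a + b AS total FROM t"))  # sql_to_onnx
print(pick_converter(FakeLazyFrame()))                 # lazyframe_to_onnx
print(pick_converter(lambda df: df))                   # dataframe_to_onnx
```

Checking for strings and frames before testing `callable` keeps the order unambiguous if a frame-like object ever defines `__call__`.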
See also
Numpy-Tracing and FunctionTransformer — numpy tracing, used to
convert FunctionTransformer to ONNX.