DataFrame Tracing#

yobx can export a Python function that operates on a TracedDataFrame to ONNX via dataframe_to_onnx(). Instead of running the function on real data, the framework passes lightweight proxy objects that record each DataFrame operation — filtering, column arithmetic, grouping, joining — as an AST node. Once recording is complete, the accumulated AST is compiled to ONNX by the same SQL-to-ONNX backend that powers the plain-SQL converter.

Overview#

The mechanism is built on three proxy classes and two driver functions:

  1. TracedDataFrame — the main proxy. It exposes a pandas-inspired API (filter, select, assign, groupby, join, pivot_table, …). Calling any of these methods does not execute the operation; it appends the corresponding AST node and returns a new TracedDataFrame.

  2. TracedSeries — represents a single column expression (a ColumnRef or a derived BinaryExpr). Arithmetic operators (+, -, *, /) and comparisons (>, <, ==, …) return new TracedSeries or TracedCondition objects. The aggregation methods (.sum(), .mean(), .min(), .max(), .count()) wrap the expression in an AggExpr.

  3. TracedCondition — a thin wrapper around a Condition AST node. Boolean operators & (AND) and | (OR) compose conditions.
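The recording trick behind these proxies is ordinary Python operator overloading: each dunder method returns a new AST node instead of computing a value. The sketch below is a deliberately simplified, hypothetical illustration of that idea — Col, BinOp, and Compare are made-up names, not the yobx classes:

```python
from dataclasses import dataclass
from typing import Any


class TracedExpr:
    """Base for recorded expressions: operators build AST nodes, not values."""

    def __add__(self, other: Any) -> "BinOp":
        return BinOp(self, "+", other)

    def __gt__(self, other: Any) -> "Compare":
        return Compare(self, ">", other)


@dataclass
class Col(TracedExpr):
    name: str


@dataclass
class BinOp(TracedExpr):
    left: Any
    op: str
    right: Any


@dataclass
class Compare:
    left: Any
    op: str
    right: Any

    def __and__(self, other: "Compare") -> "Compare":
        return Compare(self, "AND", other)


# Nothing is executed: each expression just records its own AST node.
expr = Col("a") + Col("b")
cond = (Col("a") > 0) & (Col("b") > 0)
print(expr)  # BinOp(left=Col(name='a'), op='+', right=Col(name='b'))
```

In yobx the same pattern is applied by TracedSeries and TracedCondition, which record the real ColumnRef, BinaryExpr, and Condition nodes described above.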

Driver functions

  • trace_dataframe() — low-level driver. Builds the proxy frame(s), calls func, and returns the recorded ParsedQuery (or a list of them when the function returns multiple frames).

  • dataframe_to_onnx() — high-level entry point. Calls trace_dataframe() and then parsed_query_to_onnx() to produce a self-contained ExportArtifact.

High-level entry point#

dataframe_to_onnx() is the recommended way to convert a DataFrame function to ONNX:

def dataframe_to_onnx(
    func: Callable,
    input_dtypes: Union[Dict[str, dtype], List[Dict[str, dtype]]],
    target_opset: int = DEFAULT_TARGET_OPSET,
    custom_functions: Optional[Dict[str, Callable]] = None,
    builder_cls: type = GraphBuilder,
    filename: Optional[str] = None,
    verbose: int = 0,
) -> ExportArtifact: ...

| Parameter | Description |
| --- | --- |
| func | Python callable that accepts one or more TracedDataFrame objects and returns a TracedDataFrame or a tuple/list of them. |
| input_dtypes | {column: dtype} dict (single input frame) or a list of such dicts (one per input frame). A DataFrame can be passed directly; its column names and dtypes are extracted automatically. |
| target_opset | ONNX opset version to target. |
| custom_functions | Optional dict mapping SQL function names to Python callables for use inside select-level expressions. |
| builder_cls | GraphBuilder subclass to use. |
| filename | Optional path to write the model file. |
| verbose | Verbosity level (0 = silent). |

Supported operations#

The table below summarises the TracedDataFrame operations and their SQL/ONNX equivalents.

| Operation | SQL equivalent |
| --- | --- |
| df["col"] / df.col | Column reference |
| df.filter(cond) / df[cond] | WHERE |
| df.select([…]) / df[[…]] | SELECT |
| df.assign(new_col=expr) | SELECT …, expr AS new_col |
| df.groupby("key").agg({…}) | GROUP BY + aggregation |
| df.join(right, left_key, right_key) | JOIN |
| df.pivot_table(values, index, columns) | PIVOT |
| df.pipe(func) / df.pipe(func, *args) | Function composition |
| df.copy() | Copy (no-op node) |
| series {+, -, *, /} series_or_scalar | Arithmetic expression |
| series {>, <, >=, <=, ==, !=} value | Comparison condition |
| cond1 & cond2 / cond1 \| cond2 | AND / OR |
| series.sum() / .mean() / .min() / .max() / .count() | Aggregation functions |
| series.alias("name") | Column alias |

Walkthrough examples#

Filter and select#

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    filtered = df.filter(df["a"] > 0)
    return filtered.select([(df["a"] + df["b"]).alias("total")])


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='a' type=dtype('float32') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    init: name='filter_mask_r_lit' type=int64 shape=(1,) -- array([0])
    CastLike(filter_mask_r_lit, a) -> _onx_castlike_filter_mask_r_lit
      Greater(a, _onx_castlike_filter_mask_r_lit) -> _onx_greater_a
        Compress(a, _onx_greater_a, axis=0) -> _onx_compress_a
    Compress(b, _onx_greater_a, axis=0) -> _onx_compress_b
      Add(_onx_compress_a, _onx_compress_b) -> total
    output: name='total' type='NOTENSOR' shape=None
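The generated graph can be mimicked in plain NumPy to see what each node does: Greater builds the boolean mask, the two Compress ops apply it to each column, and Add produces the output. This is an illustrative re-creation, not code produced by yobx:

```python
import numpy as np

# NumPy mirror of the exported graph above, for illustration only:
# Greater -> comparison mask, Compress -> boolean indexing, Add -> sum.
a = np.array([-1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0], dtype=np.float32)

mask = a > np.float32(0)   # Greater(a, 0)
total = a[mask] + b[mask]  # Compress(a), Compress(b), Add
print(total)               # [22. 33.]
```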

Group-by aggregation#

The group-by key column must be listed explicitly in the agg output list alongside the aggregated values:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    return df.groupby("key").agg(
        [df["key"].alias("key"), df["val"].sum().alias("total")]
    )


dtypes = {"key": np.int64, "val": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='key' type=dtype('int64') shape=['N']
    input: name='val' type=dtype('float32') shape=['N']
    init: name='init7_s_0' type=int64 shape=() -- array([0])              -- Opset.make_node.1/Shape
    init: name='init7_s1_0' type=int64 shape=(1,) -- array([0])           -- Opset.make_node.1/Shape
    Unique(key, sorted=1) -> output_0, _onx_unique_key_1, _onx_unique_key_2, _onx_unique_key_3
      Shape(output_0) -> _onx_unique_key_0::Shape:
        Gather(_onx_unique_key_0::Shape:, init7_s_0, axis=0) -> _onx_gather_unique_key_0::Shape:
          Unsqueeze(_onx_gather_unique_key_0::Shape:, init7_s1_0) -> _onx_gather_unique_key_0::Shape:::UnSq0
            ConstantOfShape(_onx_gather_unique_key_0::Shape:::UnSq0) -> _onx_constantofshape_gather_unique_key_0::Shape:::UnSq0
              CastLike(_onx_constantofshape_gather_unique_key_0::Shape:::UnSq0, val) -> _onx_castlike_constantofshape_gather_unique_key_0::Shape:::UnSq0
      ScatterElements(_onx_castlike_constantofshape_gather_unique_key_0::Shape:::UnSq0, _onx_unique_key_2, val, axis=0, reduction=b'add') -> total
    output: name='output_0' type='NOTENSOR' shape=None
    output: name='total' type='NOTENSOR' shape=None
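The Unique/ScatterElements pattern above corresponds to a familiar NumPy idiom: np.unique with return_inverse=True yields the sorted keys plus the inverse index that ScatterElements consumes, and reduction='add' behaves like np.add.at on a zero-initialised buffer. An illustrative NumPy re-creation (not yobx code):

```python
import numpy as np

# NumPy equivalent of the exported group-by graph, for illustration only.
key = np.array([1, 2, 1, 2, 3], dtype=np.int64)
val = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

keys, inverse = np.unique(key, return_inverse=True)  # Unique(sorted=1)
total = np.zeros(keys.shape[0], dtype=val.dtype)     # ConstantOfShape + CastLike
np.add.at(total, inverse, val)                       # ScatterElements(reduction='add')
print(keys, total)                                   # [1 2 3] [4. 6. 5.]
```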

Join two frames#

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df1, df2):
    joined = df1.join(df2, left_key="id", right_key="id")
    return joined.select([(df1["a"] + df2["b"]).alias("sum_ab")])


dtypes1 = {"id": np.int64, "a": np.float32}
dtypes2 = {"id": np.int64, "b": np.float32}
artifact = dataframe_to_onnx(transform, [dtypes1, dtypes2])
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='id' type=dtype('int64') shape=['N']
    input: name='a' type=dtype('float32') shape=['N']
    input: name='id_right' type=dtype('int64') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    init: name='init7_s1_1' type=int64 shape=(1,) -- array([1])           -- Opset.make_node.1/Shape##Opset.make_node.1/Shape##ReduceArgTopKPattern.K##ReduceArgTopKPattern.K
    init: name='init7_s1_0' type=int64 shape=(1,) -- array([0])           -- Opset.make_node.1/Shape
    Unsqueeze(id, init7_s1_1) -> id::UnSq1
    Unsqueeze(id_right, init7_s1_0) -> id_right::UnSq0
      Equal(id::UnSq1, id_right::UnSq0) -> _onx_equal_id::UnSq1
        Cast(_onx_equal_id::UnSq1, to=6) -> _onx_equal_id::UnSq1::C6
          TopK(_onx_equal_id::UnSq1::C6, init7_s1_1, axis=1, largest=1) -> ReduceArgTopKPattern__onx_reducemax_equal_id::UnSq1::C6, ReduceArgTopKPattern__onx_argmax_equal_id::UnSq1::C6
            Squeeze(ReduceArgTopKPattern__onx_reducemax_equal_id::UnSq1::C6, init7_s1_1) -> _onx_reducemax_equal_id::UnSq1::C6
              Cast(_onx_reducemax_equal_id::UnSq1::C6, to=9) -> _onx_reducemax_equal_id::UnSq1::C6::C9
                Compress(a, _onx_reducemax_equal_id::UnSq1::C6::C9, axis=0) -> _onx_compress_a
            Squeeze(ReduceArgTopKPattern__onx_argmax_equal_id::UnSq1::C6, init7_s1_1) -> _onx_argmax_equal_id::UnSq1::C6
              Compress(_onx_argmax_equal_id::UnSq1::C6, _onx_reducemax_equal_id::UnSq1::C6::C9, axis=0) -> _onx_compress_argmax_equal_id::UnSq1::C6
                Gather(b, _onx_compress_argmax_equal_id::UnSq1::C6, axis=0) -> _onx_gather_b
                  Add(_onx_compress_a, _onx_gather_b) -> sum_ab
    output: name='sum_ab' type='NOTENSOR' shape=None
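The join pattern in the graph can likewise be read as a NumPy computation: broadcasting Equal over the unsqueezed key vectors builds a match matrix, its row-wise maximum says whether a left row has a match (the Compress mask), and the row-wise argmax picks the index of the matching right row (the Gather indices). A hedged NumPy sketch, assuming an inner join on unique right-side keys:

```python
import numpy as np

# Illustration of the broadcast-equality join pattern; not yobx code.
id_left = np.array([1, 2, 4], dtype=np.int64)
id_right = np.array([2, 1], dtype=np.int64)
a = np.array([10.0, 20.0, 40.0], dtype=np.float32)
b = np.array([200.0, 100.0], dtype=np.float32)

match = id_left[:, None] == id_right[None, :]    # Unsqueeze + Equal
has_match = match.any(axis=1)                    # row-wise max (TopK values)
right_idx = match.argmax(axis=1)                 # row-wise argmax (TopK indices)
sum_ab = a[has_match] + b[right_idx[has_match]]  # Compress + Gather + Add
print(sum_ab)                                    # [110. 220.]
```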

Multiple output frames#

When func returns a tuple or list of TracedDataFrame objects, all outputs are gathered into a single ONNX graph with multiple output tensors:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    out1 = df.select([(df["a"] + df["b"]).alias("sum_ab")])
    out2 = df.select([(df["a"] - df["b"]).alias("diff_ab")])
    return out1, out2


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='a' type=dtype('float32') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    Add(a, b) -> sum_ab
    Sub(a, b) -> diff_ab
    output: name='sum_ab' type='NOTENSOR' shape=None
    output: name='diff_ab' type='NOTENSOR' shape=None

Assign new columns#

assign() adds computed columns while keeping all existing ones, similar to pandas.DataFrame.assign:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    df = df.assign(scaled=(df["a"] * 2.0).alias("scaled"))
    return df.select(["a", "b", "scaled"])


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='a' type=dtype('float32') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    init: name='select_scaled_r_lit' type=float32 shape=(1,) -- array([2.], dtype=float32)
    CastLike(select_scaled_r_lit, a) -> _onx_castlike_select_scaled_r_lit
      Mul(a, _onx_castlike_select_scaled_r_lit) -> scaled
    Identity(a) -> output_0
    Identity(b) -> output_1
    output: name='output_0' type='NOTENSOR' shape=None
    output: name='output_1' type='NOTENSOR' shape=None
    output: name='scaled' type='NOTENSOR' shape=None

Low-level API#

trace_dataframe() is useful when you want to inspect the recorded AST before compiling to ONNX:

<<<

import numpy as np
from yobx.xtracing.dataframe_trace import trace_dataframe


def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])


pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32})
for op in pq.operations:
    print(type(op).__name__, "—", op)

>>>

    FilterOp — FilterOp(condition=Condition(left=ColumnRef(column='a', table=None, dtype=1), op='>', right=Literal(value=0)))
    SelectOp — SelectOp(items=[SelectItem(expr=BinaryExpr(left=ColumnRef(column='a', table=None, dtype=1), op='+', right=ColumnRef(column='b', table=None, dtype=1)), alias='total')], distinct=False)

The returned ParsedQuery can then be passed to parsed_query_to_onnx() to build the ONNX graph.

Relation to the SQL converter#

DataFrame tracing is a front-end for the same SQL-to-ONNX back-end used by plain SQL strings and Polars LazyFrame inputs. The TracedDataFrame API records operations as the same ParsedQuery AST nodes that the SQL parser produces, so the two paths share identical ONNX code generation.

The unified entry point yobx.sql.to_onnx() accepts all three flavours and delegates automatically:

  • a string → SQL parser → sql_to_onnx()

  • a Polars LazyFrame → lazyframe_to_onnx()

  • a callable → dataframe_to_onnx()

See also

Numpy-Tracing and FunctionTransformer — NumPy tracing, used to convert a FunctionTransformer to ONNX.