DataFrame Tracing#

yobx can export a Python function that operates on a TracedDataFrame to ONNX via dataframe_to_onnx(). Instead of running the function on real data, the framework passes lightweight proxy objects that record each DataFrame operation — filtering, column arithmetic, grouping, joining — as an AST node. Once recording is complete, the accumulated AST is compiled to ONNX by the same SQL-to-ONNX backend that powers the plain-SQL converter.

Overview#

The mechanism is built on three proxy classes and two driver functions:

  1. TracedDataFrame — the main proxy. It exposes a pandas-inspired API (filter, select, assign, groupby, join, pivot_table, …). Calling any of these methods does not execute the operation; it appends the corresponding AST node and returns a new TracedDataFrame.

  2. TracedSeries — represents a single column expression (a ColumnRef or a derived BinaryExpr). Arithmetic operators (+, -, *, /) and comparisons (>, <, ==, …) return new TracedSeries or TracedCondition objects. The aggregation methods (.sum(), .mean(), .min(), .max(), .count()) wrap the expression in an AggExpr.

  3. TracedCondition — a thin wrapper around a Condition AST node. Boolean operators & (AND) and | (OR) compose conditions.
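The recording trick behind these proxies is ordinary Python operator overloading: each dunder method returns a new AST node instead of computing a value. The sketch below is a deliberately simplified, hypothetical illustration of that idea — Col, BinOp, and Compare are made-up names, not the yobx classes:

```python
from dataclasses import dataclass
from typing import Any


class TracedExpr:
    """Base for recorded expressions: operators build AST nodes, not values."""

    def __add__(self, other: Any) -> "BinOp":
        return BinOp(self, "+", other)

    def __gt__(self, other: Any) -> "Compare":
        return Compare(self, ">", other)


@dataclass
class Col(TracedExpr):
    name: str


@dataclass
class BinOp(TracedExpr):
    left: Any
    op: str
    right: Any


@dataclass
class Compare:
    left: Any
    op: str
    right: Any

    def __and__(self, other: "Compare") -> "Compare":
        return Compare(self, "AND", other)


# Nothing is executed: each expression just records its own AST node.
expr = Col("a") + Col("b")
cond = (Col("a") > 0) & (Col("b") > 0)
print(expr)  # BinOp(left=Col(name='a'), op='+', right=Col(name='b'))
```

In yobx the same pattern is applied by TracedSeries and TracedCondition, which record the real ColumnRef, BinaryExpr, and Condition nodes described above.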

Driver functions

  • trace_dataframe() — low-level driver. Builds the proxy frame(s), calls func, and returns the recorded ParsedQuery (or a list of them when the function returns multiple frames).

  • dataframe_to_onnx() — high-level entry point. Calls trace_dataframe() and then parsed_query_to_onnx() to produce a self-contained ExportArtifact.

High-level entry point#

dataframe_to_onnx() is the recommended way to convert a DataFrame function to ONNX:

def dataframe_to_onnx(
    func: Callable,
    input_dtypes: Union[Dict[str, dtype], List[Dict[str, dtype]]],
    target_opset: int = DEFAULT_TARGET_OPSET,
    custom_functions: Optional[Dict[str, Callable]] = None,
    builder_cls: type = GraphBuilder,
    filename: Optional[str] = None,
    verbose: int = 0,
) -> ExportArtifact: ...

| Parameter | Description |
| --- | --- |
| func | Python callable that accepts one or more TracedDataFrame objects and returns a TracedDataFrame or a tuple/list of them. |
| input_dtypes | {column: dtype} dict (single input frame) or a list of such dicts (one per input frame). A DataFrame can be passed directly; its column names and dtypes are extracted automatically. |
| target_opset | ONNX opset version to target. |
| custom_functions | Optional dict mapping SQL function names to Python callables for use inside select-level expressions. |
| builder_cls | GraphBuilder subclass to use. |
| filename | Optional path to write the model file. |
| verbose | Verbosity level (0 = silent). |

Supported operations#

The table below summarises the TracedDataFrame operations and their SQL/ONNX equivalents.

| Operation | SQL equivalent |
| --- | --- |
| df["col"] / df.col | Column reference |
| df.filter(cond) / df[cond] | WHERE |
| df.select([…]) / df[[…]] | SELECT |
| df.assign(new_col=expr) | SELECT …, expr AS new_col |
| df.groupby("key").agg({…}) | GROUP BY + aggregation |
| df.join(right, left_key, right_key) | JOIN |
| df.pivot_table(values, index, columns) | PIVOT |
| df.pipe(func) / df.pipe(func, *args) | Function composition |
| df.copy() | Copy (no-op node) |
| series {+, -, *, /} series_or_scalar | Arithmetic expression |
| series {>, <, >=, <=, ==, !=} value | Comparison condition |
| cond1 & cond2 / cond1 \| cond2 | AND / OR |
| series.sum() / .mean() / .min() / .max() / .count() | Aggregation functions |
| series.alias("name") | Column alias |

Walkthrough examples#

Filter and select#

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    filtered = df.filter(df["a"] > 0)
    return filtered.select([(df["a"] + df["b"]).alias("total")])


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='a' type=dtype('float32') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    init: name='filter_mask_r_lit' type=int64 shape=(1,) -- array([0])
    CastLike(filter_mask_r_lit, a) -> _onx_castlike_filter_mask_r_lit
      Greater(a, _onx_castlike_filter_mask_r_lit) -> _onx_greater_a
        Compress(a, _onx_greater_a, axis=0) -> _onx_compress_a
    Compress(b, _onx_greater_a, axis=0) -> _onx_compress_b
      Add(_onx_compress_a, _onx_compress_b) -> total
    output: name='total' type='NOTENSOR' shape=None
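The generated graph can be mimicked in plain NumPy to see what each node does: Greater builds the boolean mask, the two Compress ops apply it to each column, and Add produces the output. This is an illustrative re-creation, not code produced by yobx:

```python
import numpy as np

# NumPy mirror of the exported graph above, for illustration only:
# Greater -> comparison mask, Compress -> boolean indexing, Add -> sum.
a = np.array([-1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0], dtype=np.float32)

mask = a > np.float32(0)   # Greater(a, 0)
total = a[mask] + b[mask]  # Compress(a), Compress(b), Add
print(total)               # [22. 33.]
```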

Group-by aggregation#

The group-by key column must be listed explicitly in the agg output list alongside the aggregated values:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    return df.groupby("key").agg(
        [df["key"].alias("key"), df["val"].sum().alias("total")]
    )


dtypes = {"key": np.int64, "val": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='key' type=dtype('int64') shape=['N']
    input: name='val' type=dtype('float32') shape=['N']
    init: name='init7_s_0' type=int64 shape=() -- array([0])              -- Opset.make_node.1/Shape
    init: name='init7_s1_0' type=int64 shape=(1,) -- array([0])           -- Opset.make_node.1/Shape
    Unique(key, sorted=1) -> output_0, _onx_unique_key_1, _onx_unique_key_2, _onx_unique_key_3
      Shape(output_0) -> _onx_unique_key_0::Shape:
        Gather(_onx_unique_key_0::Shape:, init7_s_0, axis=0) -> _onx_gather_unique_key_0::Shape:
          Unsqueeze(_onx_gather_unique_key_0::Shape:, init7_s1_0) -> _onx_gather_unique_key_0::Shape:::UnSq0
            ConstantOfShape(_onx_gather_unique_key_0::Shape:::UnSq0) -> _onx_constantofshape_gather_unique_key_0::Shape:::UnSq0
              CastLike(_onx_constantofshape_gather_unique_key_0::Shape:::UnSq0, val) -> _onx_castlike_constantofshape_gather_unique_key_0::Shape:::UnSq0
      ScatterElements(_onx_castlike_constantofshape_gather_unique_key_0::Shape:::UnSq0, _onx_unique_key_2, val, axis=0, reduction=b'add') -> total
    output: name='output_0' type='NOTENSOR' shape=None
    output: name='total' type='NOTENSOR' shape=None
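The Unique/ScatterElements pattern above corresponds to a familiar NumPy idiom: np.unique with return_inverse=True yields the sorted keys plus the inverse index that ScatterElements consumes, and reduction='add' behaves like np.add.at on a zero-initialised buffer. An illustrative NumPy re-creation (not yobx code):

```python
import numpy as np

# NumPy equivalent of the exported group-by graph, for illustration only.
key = np.array([1, 2, 1, 2, 3], dtype=np.int64)
val = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

keys, inverse = np.unique(key, return_inverse=True)  # Unique(sorted=1)
total = np.zeros(keys.shape[0], dtype=val.dtype)     # ConstantOfShape + CastLike
np.add.at(total, inverse, val)                       # ScatterElements(reduction='add')
print(keys, total)                                   # [1 2 3] [4. 6. 5.]
```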

Join two frames#

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df1, df2):
    joined = df1.join(df2, left_key="id", right_key="id")
    return joined.select([(df1["a"] + df2["b"]).alias("sum_ab")])


dtypes1 = {"id": np.int64, "a": np.float32}
dtypes2 = {"id": np.int64, "b": np.float32}
artifact = dataframe_to_onnx(transform, [dtypes1, dtypes2])
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='id' type=dtype('int64') shape=['N']
    input: name='a' type=dtype('float32') shape=['N']
    input: name='id_right' type=dtype('int64') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    init: name='init7_s1_1' type=int64 shape=(1,) -- array([1])           -- Opset.make_node.1/Shape##Opset.make_node.1/Shape##ReduceArgTopKPattern.K##ReduceArgTopKPattern.K
    init: name='init7_s1_0' type=int64 shape=(1,) -- array([0])           -- Opset.make_node.1/Shape
    Unsqueeze(id, init7_s1_1) -> id::UnSq1
    Unsqueeze(id_right, init7_s1_0) -> id_right::UnSq0
      Equal(id::UnSq1, id_right::UnSq0) -> _onx_equal_id::UnSq1
        Cast(_onx_equal_id::UnSq1, to=6) -> _onx_equal_id::UnSq1::C6
          TopK(_onx_equal_id::UnSq1::C6, init7_s1_1, axis=1, largest=1) -> ReduceArgTopKPattern__onx_reducemax_equal_id::UnSq1::C6, ReduceArgTopKPattern__onx_argmax_equal_id::UnSq1::C6
            Squeeze(ReduceArgTopKPattern__onx_reducemax_equal_id::UnSq1::C6, init7_s1_1) -> _onx_reducemax_equal_id::UnSq1::C6
              Cast(_onx_reducemax_equal_id::UnSq1::C6, to=9) -> _onx_reducemax_equal_id::UnSq1::C6::C9
                Compress(a, _onx_reducemax_equal_id::UnSq1::C6::C9, axis=0) -> _onx_compress_a
            Squeeze(ReduceArgTopKPattern__onx_argmax_equal_id::UnSq1::C6, init7_s1_1) -> _onx_argmax_equal_id::UnSq1::C6
              Compress(_onx_argmax_equal_id::UnSq1::C6, _onx_reducemax_equal_id::UnSq1::C6::C9, axis=0) -> _onx_compress_argmax_equal_id::UnSq1::C6
                Gather(b, _onx_compress_argmax_equal_id::UnSq1::C6, axis=0) -> _onx_gather_b
                  Add(_onx_compress_a, _onx_gather_b) -> sum_ab
    output: name='sum_ab' type='NOTENSOR' shape=None
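The join pattern in the graph can likewise be read as a NumPy computation: broadcasting Equal over the unsqueezed key vectors builds a match matrix, its row-wise maximum says whether a left row has a match (the Compress mask), and the row-wise argmax picks the index of the matching right row (the Gather indices). A hedged NumPy sketch, assuming an inner join on unique right-side keys:

```python
import numpy as np

# Illustration of the broadcast-equality join pattern; not yobx code.
id_left = np.array([1, 2, 4], dtype=np.int64)
id_right = np.array([2, 1], dtype=np.int64)
a = np.array([10.0, 20.0, 40.0], dtype=np.float32)
b = np.array([200.0, 100.0], dtype=np.float32)

match = id_left[:, None] == id_right[None, :]    # Unsqueeze + Equal
has_match = match.any(axis=1)                    # row-wise max (TopK values)
right_idx = match.argmax(axis=1)                 # row-wise argmax (TopK indices)
sum_ab = a[has_match] + b[right_idx[has_match]]  # Compress + Gather + Add
print(sum_ab)                                    # [110. 220.]
```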

Multiple output frames#

When func returns a tuple or list of TracedDataFrame objects, all outputs are gathered into a single ONNX graph with multiple output tensors:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    out1 = df.select([(df["a"] + df["b"]).alias("sum_ab")])
    out2 = df.select([(df["a"] - df["b"]).alias("diff_ab")])
    return out1, out2


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='a' type=dtype('float32') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    Add(a, b) -> sum_ab
    Sub(a, b) -> diff_ab
    output: name='sum_ab' type='NOTENSOR' shape=None
    output: name='diff_ab' type='NOTENSOR' shape=None

Assign new columns#

assign() adds computed columns while keeping all existing ones, similar to pandas.DataFrame.assign:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.helpers.onnx_helper import pretty_onnx


def transform(df):
    df = df.assign(scaled=(df["a"] * 2.0).alias("scaled"))
    return df.select(["a", "b", "scaled"])


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)
print(pretty_onnx(artifact.proto))

>>>

    opset: domain='' version=21
    input: name='a' type=dtype('float32') shape=['N']
    input: name='b' type=dtype('float32') shape=['N']
    init: name='select_scaled_r_lit' type=float32 shape=(1,) -- array([2.], dtype=float32)
    CastLike(select_scaled_r_lit, a) -> _onx_castlike_select_scaled_r_lit
      Mul(a, _onx_castlike_select_scaled_r_lit) -> scaled
    Identity(a) -> output_0
    Identity(b) -> output_1
    output: name='output_0' type='NOTENSOR' shape=None
    output: name='output_1' type='NOTENSOR' shape=None
    output: name='scaled' type='NOTENSOR' shape=None

Low-level API#

trace_dataframe() is useful when you want to inspect the recorded AST before compiling to ONNX:

<<<

import numpy as np
from yobx.xtracing.dataframe_trace import trace_dataframe


def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])


pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32})
for op in pq.operations:
    print(type(op).__name__, "—", op)

>>>

    FilterOp — FilterOp(condition=Condition(left=ColumnRef(column='a', table=None, dtype=1), op='>', right=Literal(value=0)))
    SelectOp — SelectOp(items=[SelectItem(expr=BinaryExpr(left=ColumnRef(column='a', table=None, dtype=1), op='+', right=ColumnRef(column='b', table=None, dtype=1)), alias='total')], distinct=False)

The returned ParsedQuery can then be passed to parsed_query_to_onnx() to build the ONNX graph.

Relation to the SQL converter#

DataFrame tracing is a front-end for the same SQL-to-ONNX back-end used by plain SQL strings and Polars LazyFrame inputs. The TracedDataFrame API records operations as the same ParsedQuery AST nodes that the SQL parser produces, so the two paths share identical ONNX code generation.

The unified entry point yobx.sql.to_onnx() accepts all three flavours and delegates automatically:

  • a string → SQL parser → sql_to_onnx()

  • a Polars LazyFrame → lazyframe_to_onnx()

  • a callable → dataframe_to_onnx()

See also

Numpy-Tracing and FunctionTransformer — NumPy tracing, used to convert a FunctionTransformer to ONNX.