DataFrame Function Tracer#

Overview#

In addition to accepting SQL strings and polars.LazyFrame objects, yobx.sql provides a lightweight pandas-inspired API for tracing Python functions that operate on a virtual DataFrame.

The tracer works by passing a TracedDataFrame proxy to the user function. Every operation performed on the proxy — column access, arithmetic, filtering, aggregation — is recorded as an AST node rather than being executed. The resulting AST is assembled into a ParsedQuery which is then compiled to ONNX by the existing SQL converter.
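The recording mechanism is a standard Python pattern: the proxy overloads operators so that each use returns an AST node instead of computing a value. The following is a minimal illustrative sketch of that idea, not the library's actual classes:

```python
# Minimal sketch of a recording proxy (illustrative only; the real
# TracedDataFrame / TracedSeries classes are more elaborate).
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        # Arithmetic builds a new node instead of adding numbers.
        return Expr("+", self, other)

    def __gt__(self, other):
        # Comparisons likewise record a node.
        return Expr(">", self, other)

    def __repr__(self):
        if self.op == "col":
            return f"col({self.args[0]!r})"
        return f"({self.args[0]} {self.op} {self.args[1]})"


class Frame:
    def __getitem__(self, name):
        # Column access yields a leaf node, not data.
        return Expr("col", name)


df = Frame()
ast = (df["a"] + df["b"]) > df["c"]  # nothing is computed; an AST is built
print(ast)  # → ((col('a') + col('b')) > col('c'))
```

A converter can then walk this tree and emit one ONNX node per `Expr`, which is exactly the role `ParsedQuery` plays in the real pipeline.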

Architecture#

Python function (TracedDataFrame)
    │
    ▼
trace_dataframe()   ─── ParsedQuery ──► operations list
                                            │
                                            ▼
parsed_query_to_onnx() ─ GraphBuilder ──► ExportArtifact

Key classes and functions#

  • TracedDataFrame — proxy DataFrame with .filter(), .select(), .assign(), .groupby(), .join(), and .pivot_table() operations.

  • TracedSeries — proxy for a column or expression; supports arithmetic (+, -, *, /), comparisons (>, <, >=, <=, ==, !=), aggregations (.sum(), .mean(), .min(), .max(), .count()) and .alias().

  • TracedCondition — boolean predicate proxy; supports & (AND) and | (OR) combination.

  • trace_dataframe() — traces a function and returns a ParsedQuery.

  • dataframe_to_onnx() — end-to-end converter: function → ExportArtifact.

  • parsed_query_to_onnx() — convert an already-built ParsedQuery to ONNX (bypasses SQL string parsing).
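A note on why `TracedCondition` combines with `&` and `|` rather than `and` and `or`: Python's `and`/`or` call `bool()` on their operands, which would force the proxy to produce a concrete truth value mid-trace, while `&`/`|` dispatch to `__and__`/`__or__` and can return a new node. A minimal sketch of the mechanism (not the library's actual class):

```python
# Sketch: a condition proxy that records AND/OR combinations and
# refuses implicit truth testing (illustrative, hypothetical class).
class Cond:
    def __init__(self, desc):
        self.desc = desc

    def __and__(self, other):
        return Cond(f"And({self.desc}, {other.desc})")

    def __or__(self, other):
        return Cond(f"Or({self.desc}, {other.desc})")

    def __bool__(self):
        # `and` / `or` / `if cond:` would land here — fail loudly.
        raise TypeError("use & and | to combine traced conditions")


c = Cond("a > 0") & Cond("b < 1")
print(c.desc)  # → And(a > 0, b < 1)
```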

Tracing a function#

The following example traces a function and prints the list of captured operations before compiling to ONNX:

<<<

import numpy as np
from yobx.sql import trace_dataframe


def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])


pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32})
for op in pq.operations:
    print(type(op).__name__, "—", op)

>>>

    FilterOp — FilterOp(condition=Condition(left=ColumnRef(column='a', table=None, dtype=1), op='>', right=Literal(value=0)))
    SelectOp — SelectOp(items=[SelectItem(expr=BinaryExpr(left=ColumnRef(column='a', table=None, dtype=1), op='+', right=ColumnRef(column='b', table=None, dtype=1)), alias='total')], distinct=False)

End-to-end conversion#

The dataframe_to_onnx() function combines tracing and ONNX emission in a single call:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.reference import ExtendedReferenceEvaluator


def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)

ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)
(total,) = ref.run(None, {"a": a, "b": b})
print(total)  # → [5. 9.]   (rows where a > 0)

>>>

    [5. 9.]

Supported operations#

The following pandas-inspired operations can be traced:

Operation          TracedDataFrame / TracedSeries API                          ONNX nodes emitted
─────────────────  ──────────────────────────────────────────────────────────  ────────────────────────────────────────────────
Row filter         df.filter(condition)                                        Compress, Equal, Less, Greater, …
Column selection   df.select([series, …])                                      Identity, Add, Sub, Mul, Div
Column addition    df.assign(name=series)                                      Add, Sub, Mul, Div
Aggregation        .sum(), .mean(), .min(), .max(), .count()                   ReduceSum, ReduceMean, ReduceMin, ReduceMax
Group by           df.groupby(cols)                                            Unique, ScatterElements (per-group aggregations)
Boolean AND / OR   cond1 & cond2, cond1 | cond2                                And, Or
Equi-join          df.join(right, left_key, right_key, join_type)              Unsqueeze, Equal, ArgMax, Compress, Gather
Pivot table        df.pivot_table(values, index, columns, aggfunc,             Unique, ScatterElements, Gather
                   column_values=…)
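The group-by lowering in the table can be mimicked directly in NumPy: `Unique` with inverse indices assigns each row a group id, and a scatter-style accumulation adds each value into its group's slot. A sketch of the idea (the actual emitted graph may differ):

```python
import numpy as np

# Per-group sum using the same primitives the converter emits:
# Unique (with inverse indices) + scatter-add accumulation.
keys = np.array([10, 20, 10, 30, 20], dtype=np.int64)
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

groups, inverse = np.unique(keys, return_inverse=True)  # ~ ONNX Unique
sums = np.zeros(len(groups), dtype=np.float32)
np.add.at(sums, inverse, vals)  # ~ ScatterElements with add reduction

print(groups)  # → [10 20 30]
print(sums)    # → [4. 7. 4.]
```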

Limitations#

  • Tracing records only the operations performed during a single pass through the function. Python control flow runs eagerly at trace time, so conditional branches (if/else) that depend on traced values are not supported.

  • GROUP BY on multiple columns casts the key columns to float64 before combining them, which causes precision loss for integer keys greater than 2**53.

  • pivot_table requires an explicit column_values list: ONNX graphs are static, so the number of output columns must be known at conversion time.

  • The TracedDataFrame API is a subset of the full pandas/polars API; operations outside this subset will raise NotImplementedError.
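The float64 precision limit in the second bullet is easy to demonstrate: float64 has a 53-bit significand, so integers above 2**53 are no longer exactly representable and distinct group keys can collide after the cast:

```python
import numpy as np

# 2**53 and 2**53 + 1 map to the same float64 value, so two distinct
# int64 group keys would be merged into one group after the cast.
a = np.int64(2**53)
b = np.int64(2**53 + 1)
print(np.float64(a) == np.float64(b))  # → True
```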