DataFrame Function Tracer#

Overview#

In addition to accepting SQL strings and polars.LazyFrame objects, yobx.sql provides a lightweight pandas-inspired API for tracing Python functions that operate on a virtual DataFrame.

The tracer works by passing a TracedDataFrame proxy to the user function. Every operation performed on the proxy — column access, arithmetic, filtering, aggregation — is recorded as an AST node rather than being executed. The resulting AST is assembled into a ParsedQuery which is then compiled to ONNX by the existing SQL converter.
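The recording mechanism is a standard Python pattern: the proxy overloads operators so that each use returns an AST node instead of computing a value. The following is a minimal illustrative sketch of that idea, not the library's actual classes:

```python
# Minimal sketch of a recording proxy (illustrative only; the real
# TracedDataFrame / TracedSeries classes are more elaborate).
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        # Arithmetic builds a new node instead of adding numbers.
        return Expr("+", self, other)

    def __gt__(self, other):
        # Comparisons likewise record a node.
        return Expr(">", self, other)

    def __repr__(self):
        if self.op == "col":
            return f"col({self.args[0]!r})"
        return f"({self.args[0]} {self.op} {self.args[1]})"


class Frame:
    def __getitem__(self, name):
        # Column access yields a leaf node, not data.
        return Expr("col", name)


df = Frame()
ast = (df["a"] + df["b"]) > df["c"]  # nothing is computed; an AST is built
print(ast)  # → ((col('a') + col('b')) > col('c'))
```

A converter can then walk this tree and emit one ONNX node per `Expr`, which is exactly the role `ParsedQuery` plays in the real pipeline.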

Architecture#

Python function (TracedDataFrame)
    │
    ▼
trace_dataframe()   ─── ParsedQuery ──► operations list
                                            │
                                            ▼
parsed_query_to_onnx() ─ GraphBuilder ──► ExportArtifact

Key classes and functions#

  • TracedDataFrame — proxy DataFrame with .filter(), .select(), .assign(), .groupby(), .join(), and .pivot_table() operations.

  • TracedSeries — proxy for a column or expression; supports arithmetic (+, -, *, /), comparisons (>, <, >=, <=, ==, !=), aggregations (.sum(), .mean(), .min(), .max(), .count()) and .alias().

  • TracedCondition — boolean predicate proxy; supports & (AND) and | (OR) combination.

  • trace_dataframe() — traces a function and returns a ParsedQuery.

  • dataframe_to_onnx() — end-to-end converter: function → ExportArtifact.

  • parsed_query_to_onnx() — convert an already-built ParsedQuery to ONNX (bypasses SQL string parsing).
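A note on why `TracedCondition` combines with `&` and `|` rather than `and` and `or`: Python's `and`/`or` call `bool()` on their operands, which would force the proxy to produce a concrete truth value mid-trace, while `&`/`|` dispatch to `__and__`/`__or__` and can return a new node. A minimal sketch of the mechanism (not the library's actual class):

```python
# Sketch: a condition proxy that records AND/OR combinations and
# refuses implicit truth testing (illustrative, hypothetical class).
class Cond:
    def __init__(self, desc):
        self.desc = desc

    def __and__(self, other):
        return Cond(f"And({self.desc}, {other.desc})")

    def __or__(self, other):
        return Cond(f"Or({self.desc}, {other.desc})")

    def __bool__(self):
        # `and` / `or` / `if cond:` would land here — fail loudly.
        raise TypeError("use & and | to combine traced conditions")


c = Cond("a > 0") & Cond("b < 1")
print(c.desc)  # → And(a > 0, b < 1)
```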

Tracing a function#

The following example traces a function and prints the list of captured operations before compiling to ONNX:

<<<

import numpy as np
from yobx.sql import trace_dataframe


def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])


pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32})
for op in pq.operations:
    print(type(op).__name__, "—", op)

>>>

    FilterOp — FilterOp(condition=Condition(left=ColumnRef(column='a', table=None, dtype=1), op='>', right=Literal(value=0)))
    SelectOp — SelectOp(items=[SelectItem(expr=BinaryExpr(left=ColumnRef(column='a', table=None, dtype=1), op='+', right=ColumnRef(column='b', table=None, dtype=1)), alias='total')], distinct=False)

End-to-end conversion#

The dataframe_to_onnx() function combines tracing and ONNX emission in a single call:

<<<

import numpy as np
from yobx.sql import dataframe_to_onnx
from yobx.reference import ExtendedReferenceEvaluator


def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])


dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)

ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)
(total,) = ref.run(None, {"a": a, "b": b})
print(total)  # → [5. 9.]   (rows where a > 0)

>>>

    [5. 9.]

Supported operations#

The following pandas-inspired operations can be traced:

Operation          TracedDataFrame / TracedSeries API                          ONNX nodes emitted
─────────────────  ──────────────────────────────────────────────────────────  ────────────────────────────────────────────────
Row filter         df.filter(condition)                                        Compress, Equal, Less, Greater, …
Column selection   df.select([series, …])                                      Identity, Add, Sub, Mul, Div
Column addition    df.assign(name=series)                                      Add, Sub, Mul, Div
Aggregation        .sum(), .mean(), .min(), .max(), .count()                   ReduceSum, ReduceMean, ReduceMin, ReduceMax
Group by           df.groupby(cols)                                            Unique, ScatterElements (per-group aggregations)
Boolean AND / OR   cond1 & cond2, cond1 | cond2                                And, Or
Equi-join          df.join(right, left_key, right_key, join_type)              Unsqueeze, Equal, ArgMax, Compress, Gather
Pivot table        df.pivot_table(values, index, columns, aggfunc,             Unique, ScatterElements, Gather
                   column_values=…)
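The group-by lowering in the table can be mimicked directly in NumPy: `Unique` with inverse indices assigns each row a group id, and a scatter-style accumulation adds each value into its group's slot. A sketch of the idea (the actual emitted graph may differ):

```python
import numpy as np

# Per-group sum using the same primitives the converter emits:
# Unique (with inverse indices) + scatter-add accumulation.
keys = np.array([10, 20, 10, 30, 20], dtype=np.int64)
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

groups, inverse = np.unique(keys, return_inverse=True)  # ~ ONNX Unique
sums = np.zeros(len(groups), dtype=np.float32)
np.add.at(sums, inverse, vals)  # ~ ScatterElements with add reduction

print(groups)  # → [10 20 30]
print(sums)    # → [4. 7. 4.]
```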

Limitations#

  • Tracing records only the operations performed during a single pass through the function. Python control flow runs eagerly at trace time, so conditional branches (if/else) that depend on traced values are not supported.

  • GROUP BY on multiple columns casts the key columns to float64 before combining them, which causes precision loss for integer keys greater than 2**53.

  • pivot_table requires an explicit column_values list: ONNX graphs are static, so the number of output columns must be known at conversion time.

  • The TracedDataFrame API is a subset of the full pandas/polars API; operations outside this subset will raise NotImplementedError.
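The float64 precision limit in the second bullet is easy to demonstrate: float64 has a 53-bit significand, so integers above 2**53 are no longer exactly representable and distinct group keys can collide after the cast:

```python
import numpy as np

# 2**53 and 2**53 + 1 map to the same float64 value, so two distinct
# int64 group keys would be merged into one group after the cast.
a = np.int64(2**53)
b = np.int64(2**53 + 1)
print(np.float64(a) == np.float64(b))  # → True
```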