# DataFrame Function Tracer

## Overview
In addition to accepting SQL strings and `polars.LazyFrame` objects, `yobx.sql` provides a lightweight pandas-inspired API for tracing Python functions that operate on a virtual DataFrame.

The tracer works by passing a `TracedDataFrame` proxy to the user function. Every operation performed on the proxy (column access, arithmetic, filtering, aggregation) is recorded as an AST node rather than being executed. The resulting AST is assembled into a `ParsedQuery`, which is then compiled to ONNX by the existing SQL converter.
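The recording mechanism is ordinary operator overloading. As a rough illustration of the idea, a proxy can return AST-like nodes from `__getitem__` and comparison operators and append them to an operation list instead of computing anything. The class and field names below are invented for this sketch and are not the actual yobx internals:

```python
# Minimal sketch of trace-by-proxy recording (illustrative names,
# not the real yobx internals).
from dataclasses import dataclass, field

@dataclass
class ColumnRef:
    column: str

@dataclass
class Condition:
    left: ColumnRef
    op: str
    right: object

@dataclass
class MiniTracedFrame:
    operations: list = field(default_factory=list)

    def __getitem__(self, name):
        # Column access returns a lightweight reference, not data.
        return MiniTracedSeries(self, ColumnRef(name))

    def filter(self, cond):
        # Record the predicate as an AST node; nothing is executed.
        self.operations.append(("filter", cond))
        return self

@dataclass
class MiniTracedSeries:
    frame: MiniTracedFrame
    ref: ColumnRef

    def __gt__(self, other):
        # Comparison builds a Condition node instead of a boolean.
        return Condition(self.ref, ">", other)

def transform(df):
    return df.filter(df["a"] > 0)

df = MiniTracedFrame()
transform(df)
print(df.operations)
# → [('filter', Condition(left=ColumnRef(column='a'), op='>', right=0))]
```

Because the proxy never holds data, the function runs in microseconds regardless of eventual input size; only the shape of the computation is captured.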
## Architecture

    Python function (TracedDataFrame)
                │
                ▼
    trace_dataframe() ─── ParsedQuery ──► operations list
                │
                ▼
    parsed_query_to_onnx() ─ GraphBuilder ──► ExportArtifact
## Key classes and functions

- `TracedDataFrame` — proxy DataFrame with `.filter()`, `.select()`, `.assign()`, `.groupby()`, `.join()`, and `.pivot_table()` operations.
- `TracedSeries` — proxy for a column or expression; supports arithmetic (`+`, `-`, `*`, `/`), comparisons (`>`, `<`, `>=`, `<=`, `==`, `!=`), aggregations (`.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`), and `.alias()`.
- `TracedCondition` — boolean predicate proxy; supports `&` (AND) and `|` (OR) combination.
- `trace_dataframe()` — traces a function and returns a `ParsedQuery`.
- `dataframe_to_onnx()` — end-to-end converter: function → `ExportArtifact`.
- `parsed_query_to_onnx()` — converts an already-built `ParsedQuery` to ONNX (bypasses SQL string parsing).
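The `&`/`|` combination on `TracedCondition` relies on Python's `__and__`/`__or__` operator overloads. A minimal sketch of the pattern, with invented class names rather than the real implementation:

```python
# Sketch of boolean-predicate combination via __and__/__or__
# (illustrative names; not the actual TracedCondition implementation).
from dataclasses import dataclass

@dataclass
class Cond:
    expr: str

    def __and__(self, other):
        # cond1 & cond2 builds a combined predicate node.
        return Cond(f"({self.expr} AND {other.expr})")

    def __or__(self, other):
        # cond1 | cond2 likewise.
        return Cond(f"({self.expr} OR {other.expr})")

combined = (Cond("a > 0") & Cond("b < 5")) | Cond("c == 1")
print(combined.expr)  # → ((a > 0 AND b < 5) OR c == 1)
```

Python's `and`/`or` keywords cannot be overloaded, which is why `&` and `|` are used for predicate combination, mirroring pandas and polars.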
## Tracing a function
The following example traces a function and prints the list of captured operations before compiling to ONNX:
```python
import numpy as np

from yobx.sql import trace_dataframe

def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])

pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32})
for op in pq.operations:
    print(type(op).__name__, "—", op)
```
```
FilterOp — FilterOp(condition=Condition(left=ColumnRef(column='a', table=None, dtype=1), op='>', right=Literal(value=0)))
SelectOp — SelectOp(items=[SelectItem(expr=BinaryExpr(left=ColumnRef(column='a', table=None, dtype=1), op='+', right=ColumnRef(column='b', table=None, dtype=1)), alias='total')], distinct=False)
```
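The reprs above suggest plain dataclass-style AST nodes. The following is an approximate reconstruction inferred from that output, shown only to make the node structure concrete; the real yobx definitions may carry extra fields or defaults:

```python
# Approximate AST node definitions inferred from the reprs printed
# above; not the authoritative yobx classes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnRef:
    column: str
    table: Optional[str]
    dtype: int  # numeric dtype tag, e.g. 1 in the traced output above

@dataclass
class Literal:
    value: object

@dataclass
class Condition:
    left: object
    op: str
    right: object

@dataclass
class FilterOp:
    condition: Condition

# Rebuilding the first captured operation reproduces its repr.
op = FilterOp(Condition(ColumnRef("a", None, 1), ">", Literal(0)))
print(op)
```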
## End-to-end conversion

The `dataframe_to_onnx()` function combines tracing and ONNX emission in a single call:
```python
import numpy as np

from yobx.sql import dataframe_to_onnx
from yobx.reference import ExtendedReferenceEvaluator

def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])

dtypes = {"a": np.float32, "b": np.float32}
artifact = dataframe_to_onnx(transform, dtypes)

ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)
(total,) = ref.run(None, {"a": a, "b": b})
print(total)  # → [5. 9.] (rows where a > 0)
```
```
[5. 9.]
```
## Supported operations
The following pandas-inspired operations can be traced:
| Operation | `TracedDataFrame` / `TracedSeries` API | ONNX nodes emitted |
|---|---|---|
| Row filter | `df.filter(...)` | |
| Column selection | `df.select(...)` | |
| Column addition | `df.assign(...)` | |
| Aggregation | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` | |
| Group by | `df.groupby(...)` | |
| Boolean AND / OR | `cond_a & cond_b`, `cond_a \| cond_b` | |
| Equi-join | `df.join(...)` | |
| Pivot table | `df.pivot_table(...)` | |
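The row-filter entry can be made concrete with NumPy standing in for the graph semantics: a comparison produces a boolean mask, a Compress-style selection keeps the matching rows, and the projection runs on the kept rows. This is a semantics sketch, not the literal node sequence yobx emits:

```python
import numpy as np

# NumPy stand-ins for a filter-then-project graph: Greater produces a
# boolean mask, Compress keeps the rows where the mask is true, Add
# computes the projected column.
a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)

mask = np.greater(a, 0)            # Greater
a_kept = np.compress(mask, a, 0)   # Compress along the row axis
b_kept = np.compress(mask, b, 0)
total = np.add(a_kept, b_kept)     # Add

print(total)  # → [5. 9.]
```

Because the mask is applied with a Compress-style selection rather than Python-level branching, the whole pipeline stays expressible as a static dataflow graph.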
## Limitations

- Tracing captures only the operations performed during the single forward pass through the function. Conditional branches (`if`/`else`) are not supported.
- `GROUP BY` on multiple columns casts the key columns to `float64` before combining them, which causes precision loss for integer keys greater than `2**53`.
- `pivot_table` requires an explicit `column_values` list: ONNX static graphs need the output column count known at conversion time.
- The `TracedDataFrame` API is a subset of the full pandas/polars API; operations outside this subset raise `NotImplementedError`.
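The `float64` precision limit on group keys is a general IEEE-754 property: the type has a 53-bit significand, so integers above `2**53` cannot all be represented exactly and adjacent keys collapse to the same float:

```python
# float64 has a 53-bit significand, so integers above 2**53 are not
# all exactly representable; distinct keys can round to the same value.
key1 = 2**53
key2 = 2**53 + 1

print(float(key1) == float(key2))        # → True: distinct ints, same float64
print(float(2**53 - 1) == float(2**53))  # → False: below the limit, exact
```

In a multi-column `GROUP BY`, two rows whose integer keys differ only above this threshold would therefore be grouped together.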