.. _l-design-dataframe-tracing: ================= DataFrame Tracing ================= ``yobx`` can export a Python function that operates on a :class:`~yobx.xtracing.TracedDataFrame` to ONNX via :func:`~yobx.sql.dataframe_to_onnx`. Instead of running the function on real data, the framework passes lightweight *proxy* objects that *record* each DataFrame operation — filtering, column arithmetic, grouping, joining — as an AST node. Once recording is complete the accumulated AST is compiled to ONNX by the same SQL-to-ONNX backend that powers the plain-SQL converter. Overview ======== The mechanism is built on three proxy classes and two driver functions: 1. :class:`~yobx.xtracing.TracedDataFrame` — the main proxy. It exposes a pandas-inspired API (``filter``, ``select``, ``assign``, ``groupby``, ``join``, ``pivot_table``, …). Calling any of these methods does *not* execute the operation; it appends the corresponding AST node and returns a new :class:`~yobx.xtracing.TracedDataFrame`. 2. :class:`~yobx.xtracing.dataframe_trace.TracedSeries` — represents a single column expression (a :class:`~yobx.xtracing.parse.ColumnRef` or a derived :class:`~yobx.xtracing.parse.BinaryExpr`). Arithmetic operators (``+``, ``-``, ``*``, ``/``) and comparisons (``>``, ``<``, ``==``, …) return new :class:`~yobx.xtracing.dataframe_trace.TracedSeries` or :class:`~yobx.xtracing.dataframe_trace.TracedCondition` objects. The aggregation methods (``.sum()``, ``.mean()``, ``.min()``, ``.max()``, ``.count()``) wrap the expression in an :class:`~yobx.xtracing.parse.AggExpr`. 3. :class:`~yobx.xtracing.dataframe_trace.TracedCondition` — a thin wrapper around a :class:`~yobx.xtracing.parse.Condition` AST node. Boolean operators ``&`` (AND) and ``|`` (OR) compose conditions. **Driver functions** * :func:`~yobx.xtracing.dataframe_trace.trace_dataframe` — low-level driver. Builds the proxy frame(s), calls *func*, and returns the recorded :class:`~yobx.sql.parse.ParsedQuery` (or a list of them when the function returns multiple frames). * :func:`~yobx.sql.dataframe_to_onnx` — high-level entry point. Calls :func:`~yobx.xtracing.dataframe_trace.trace_dataframe` and then :func:`~yobx.sql.sql_convert.parsed_query_to_onnx` to produce a self-contained :class:`~yobx.container.ExportArtifact`. High-level entry point ====================== :func:`~yobx.sql.dataframe_to_onnx` is the recommended way to convert a DataFrame function to ONNX: .. code-block:: python def dataframe_to_onnx( func: Callable, input_dtypes: Union[Dict[str, dtype], List[Dict[str, dtype]]], target_opset: int = DEFAULT_TARGET_OPSET, custom_functions: Optional[Dict[str, Callable]] = None, builder_cls: type = GraphBuilder, filename: Optional[str] = None, verbose: int = 0, ) -> ExportArtifact: ... ==================== ================================================================ Parameter Description ==================== ================================================================ ``func`` Python callable that accepts one or more :class:`~yobx.xtracing.TracedDataFrame` objects and returns a :class:`~yobx.xtracing.TracedDataFrame` or a tuple/list of them. ``input_dtypes`` ``{column: dtype}`` dict (single input frame) **or** a list of such dicts (one per input frame). A :class:`~pandas.DataFrame` can be passed directly; its column names and dtypes are extracted automatically. ``target_opset`` ONNX opset version to target. ``custom_functions`` Optional dict mapping SQL function names to Python callables for use inside ``select``-level expressions. ``builder_cls`` :class:`~yobx.xbuilder.GraphBuilder` subclass to use. ``filename`` Optional path to write the model file. ``verbose`` Verbosity level (0 = silent). ==================== ================================================================ Supported operations ==================== The table below summarises the :class:`~yobx.xtracing.TracedDataFrame` operations and their SQL/ONNX equivalents. ================================================== ====================================== Operation SQL equivalent ================================================== ====================================== ``df["col"]`` / ``df.col`` Column reference ``df.filter(cond)`` / ``df[cond]`` ``WHERE`` ``df.select([…])`` / ``df[[…]]`` ``SELECT`` ``df.assign(new_col=expr)`` ``SELECT …, expr AS new_col`` ``df.groupby("key").agg({…})`` ``GROUP BY … + aggregation`` ``df.join(right, left_key, right_key)`` ``JOIN`` ``df.pivot_table(values, index, columns)`` ``PIVOT`` ``df.pipe(func)`` / ``df.pipe(func, *args)`` Function composition ``df.copy()`` Copy (no-op node) ``series + / - / * / series_or_scalar`` Arithmetic expression ``series > / < / >= / <= / == / != value`` Comparison condition ``cond1 & cond2`` / ``cond1 | cond2`` ``AND`` / ``OR`` ``series.sum()`` / ``.mean()`` / ``.min()`` Aggregation functions ``.max()`` / ``.count()`` Aggregation functions ``series.alias("name")`` Column alias ================================================== ====================================== Walkthrough examples ==================== Filter and select ----------------- .. runpython:: :showcode: import numpy as np from yobx.sql import dataframe_to_onnx from yobx.helpers.onnx_helper import pretty_onnx def transform(df): filtered = df.filter(df["a"] > 0) return filtered.select([(df["a"] + df["b"]).alias("total")]) dtypes = {"a": np.float32, "b": np.float32} artifact = dataframe_to_onnx(transform, dtypes) print(pretty_onnx(artifact.proto)) Group-by aggregation -------------------- The group-by key column must be listed explicitly in the ``agg`` output list alongside the aggregated values: .. runpython:: :showcode: import numpy as np from yobx.sql import dataframe_to_onnx from yobx.helpers.onnx_helper import pretty_onnx def transform(df): return df.groupby("key").agg( [df["key"].alias("key"), df["val"].sum().alias("total")] ) dtypes = {"key": np.int64, "val": np.float32} artifact = dataframe_to_onnx(transform, dtypes) print(pretty_onnx(artifact.proto)) Join two frames --------------- .. runpython:: :showcode: import numpy as np from yobx.sql import dataframe_to_onnx from yobx.helpers.onnx_helper import pretty_onnx def transform(df1, df2): joined = df1.join(df2, left_key="id", right_key="id") return joined.select([(df1["a"] + df2["b"]).alias("sum_ab")]) dtypes1 = {"id": np.int64, "a": np.float32} dtypes2 = {"id": np.int64, "b": np.float32} artifact = dataframe_to_onnx(transform, [dtypes1, dtypes2]) print(pretty_onnx(artifact.proto)) Multiple output frames ---------------------- When *func* returns a tuple or list of :class:`~yobx.xtracing.TracedDataFrame` objects, all outputs are gathered into a single ONNX graph with multiple output tensors: .. runpython:: :showcode: import numpy as np from yobx.sql import dataframe_to_onnx from yobx.helpers.onnx_helper import pretty_onnx def transform(df): out1 = df.select([(df["a"] + df["b"]).alias("sum_ab")]) out2 = df.select([(df["a"] - df["b"]).alias("diff_ab")]) return out1, out2 dtypes = {"a": np.float32, "b": np.float32} artifact = dataframe_to_onnx(transform, dtypes) print(pretty_onnx(artifact.proto)) Assign new columns ------------------ :meth:`~yobx.xtracing.TracedDataFrame.assign` adds computed columns while keeping all existing ones, similar to ``pandas.DataFrame.assign``: .. runpython:: :showcode: import numpy as np from yobx.sql import dataframe_to_onnx from yobx.helpers.onnx_helper import pretty_onnx def transform(df): df = df.assign(scaled=(df["a"] * 2.0).alias("scaled")) return df.select(["a", "b", "scaled"]) dtypes = {"a": np.float32, "b": np.float32} artifact = dataframe_to_onnx(transform, dtypes) print(pretty_onnx(artifact.proto)) Low-level API ============= :func:`~yobx.xtracing.dataframe_trace.trace_dataframe` is useful when you want to inspect the recorded AST before compiling to ONNX: .. runpython:: :showcode: import numpy as np from yobx.xtracing.dataframe_trace import trace_dataframe def transform(df): df = df.filter(df["a"] > 0) return df.select([(df["a"] + df["b"]).alias("total")]) pq = trace_dataframe(transform, {"a": np.float32, "b": np.float32}) for op in pq.operations: print(type(op).__name__, "—", op) The returned :class:`~yobx.sql.parse.ParsedQuery` can then be passed to :func:`~yobx.sql.sql_convert.parsed_query_to_onnx` to build the ONNX graph. Relation to the SQL converter ============================== DataFrame tracing is a *front-end* for the same SQL-to-ONNX back-end used by plain SQL strings and Polars ``LazyFrame`` inputs. The :class:`~yobx.xtracing.TracedDataFrame` API records operations as the same :class:`~yobx.sql.parse.ParsedQuery` AST nodes that the SQL parser produces, so the two paths share identical ONNX code generation. The unified entry point :func:`yobx.sql.to_onnx` accepts all three flavours and delegates automatically: * a **string** → SQL parser → :func:`~yobx.sql.sql_convert.sql_to_onnx` * a **Polars LazyFrame** → :func:`~yobx.sql.polars_convert.lazyframe_to_onnx` * a **callable** → :func:`~yobx.sql.dataframe_to_onnx` .. seealso:: :ref:`l-design-function-transformer-tracing` — numpy tracing, used to convert :class:`~sklearn.preprocessing.FunctionTransformer` to ONNX.