Polars LazyFrame to ONNX

Overview

yobx.sql can convert a polars.LazyFrame execution plan directly into a self-contained ONNX model. The conversion works by extracting the logical plan from the LazyFrame via polars.LazyFrame.explain(), translating that plan into an intermediate SQL query, and then delegating to the SQL-to-ONNX pipeline (sql_to_onnx()).

Architecture

polars.LazyFrame
    │
    ▼
lf.explain()          ─── execution plan string
    │
    ▼
_parse_polars_plan()  ─── _PolarsPlan (select / filter / group_by)
    │
    ▼
_plan_to_sql()        ─── SQL query string
    │
    ▼
sql_to_onnx()         ─── GraphBuilder ──► ExportArtifact
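
The middle step of the diagram, turning a parsed plan into a SQL string, can be sketched in a few lines. This is only an illustration: `PlanSketch`, `plan_to_sql_sketch`, and the table name `t` are hypothetical names invented here, not the actual `_PolarsPlan` or `_plan_to_sql()` from yobx, whose fields and output format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanSketch:
    """Hypothetical stand-in for a parsed plan (not yobx's _PolarsPlan)."""

    table: str                                          # assumed table name
    select: List[str] = field(default_factory=list)    # SQL expressions
    filter: Optional[str] = None                        # WHERE predicate
    group_by: List[str] = field(default_factory=list)  # key columns


def plan_to_sql_sketch(plan: PlanSketch) -> str:
    """Assemble one SELECT statement from the three supported steps."""
    columns = ", ".join(plan.select) if plan.select else "*"
    parts = [f"SELECT {columns}", f"FROM {plan.table}"]
    if plan.filter:
        parts.append(f"WHERE {plan.filter}")
    if plan.group_by:
        parts.append("GROUP BY " + ", ".join(plan.group_by))
    return " ".join(parts)


plan = PlanSketch(table="t", select=["a + b AS total"], filter="a > 0")
sql = plan_to_sql_sketch(plan)
print(sql)  # SELECT a + b AS total FROM t WHERE a > 0
```

Because each step maps to exactly one SQL clause, a plan with one filter, one select, and one group_by collapses into a single statement.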

Supported LazyFrame operations

Polars operation	SQL clause generated	ONNX nodes emitted
`select([col, expr, …])`	`SELECT expr [AS alias], …`	`Identity`, `Add`, `Sub`, `Mul`, `Div`
`filter(condition)`	`WHERE condition`	`Compress`, `Equal`, `Less`, `Greater`, …
`group_by(cols).agg([…])`	`SELECT agg … GROUP BY cols`	`ReduceSum`, `ReduceMean`, `ReduceMin`, `ReduceMax`
Arithmetic (`+`, `-`, `*`, `/`)	Inlined into `SELECT` expressions	`Add`, `Sub`, `Mul`, `Div`
Comparisons (`>`, `<`, `>=`, `<=`, `==`, `!=`)	`WHERE` predicates	`Greater`, `Less`, `Equal`, …
Boolean compound (`&`, `|`)	`AND` / `OR` in `WHERE`	`And`, `Or`
`.alias("name")`	`… AS name` in `SELECT`	(rename only)
Aggregation methods (`.sum()`, `.mean()`, `.min()`, `.max()`, `.count()`)	`SUM(…)`, `AVG(…)`, `MIN(…)`, `MAX(…)`, `COUNT(…)`	`ReduceSum`, `ReduceMean`, `ReduceMin`, `ReduceMax`
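
To make the `group_by(cols).agg([…])` row more concrete, the following plain numpy sketch computes per-group `SUM`, `COUNT`, and `AVG` for one key column. It only illustrates the semantics of the aggregations; the sample data is made up, and the ONNX graph produced by the converter is not assumed to use this particular decomposition.

```python
import numpy as np

# Illustrative data: one key column and one value column.
keys = np.array([1, 2, 1, 2, 1], dtype=np.int64)
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0], dtype=np.float64)

# One output row per distinct key; `inverse` maps each input row to its group.
groups, inverse = np.unique(keys, return_inverse=True)

# SUM(values) per group: accumulate each row into its group's slot.
sums = np.zeros(len(groups), dtype=values.dtype)
np.add.at(sums, inverse, values)

# COUNT(values) per group, then AVG(values) = SUM / COUNT.
counts = np.bincount(inverse, minlength=len(groups))
means = sums / counts

print(groups, sums, counts, means)
```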

Columnar input convention

As with the SQL converter, each source column of the plan is represented as a separate 1-D ONNX input tensor. The input_dtypes parameter maps source column names to numpy dtypes and must include every column that appears in the plan.
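
As a rough illustration of the convention, the sketch below turns row-oriented records into one 1-D array per column and rejects columns that have no declared dtype. The helper `to_column_feeds` and the sample records are hypothetical and not part of yobx; the check only mimics the requirement that every column be listed in `input_dtypes`.

```python
import numpy as np

# Hypothetical row-oriented records; the column names are illustrative.
rows = [{"a": 1.0, "b": 4.0}, {"a": -2.0, "b": 5.0}, {"a": 3.0, "b": 6.0}]
input_dtypes = {"a": np.float64, "b": np.float64}


def to_column_feeds(rows, input_dtypes):
    """Build one 1-D array per declared column; fail on undeclared columns."""
    used = set().union(*(row.keys() for row in rows))
    missing = used - set(input_dtypes)
    if missing:
        raise KeyError(f"columns missing from input_dtypes: {sorted(missing)}")
    return {
        name: np.array([row[name] for row in rows], dtype=dtype)
        for name, dtype in input_dtypes.items()
    }


feeds = to_column_feeds(rows, input_dtypes)
print({name: (arr.shape, arr.dtype) for name, arr in feeds.items()})
```

The resulting dictionary has the shape of the feeds passed to the evaluator in the example on this page: column name to 1-D array.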

Polars dtype mapping

The following polars data types are mapped to numpy equivalents:

Polars type	numpy dtype
`pl.Float32`	`float32`
`pl.Float64`	`float64`
`pl.Int8` / `pl.Int16` / `pl.Int32` / `pl.Int64`	`int8` / `int16` / `int32` / `int64`
`pl.UInt8` / `pl.UInt16` / `pl.UInt32` / `pl.UInt64`	`uint8` / `uint16` / `uint32` / `uint64`
`pl.Boolean`	`bool`
`pl.String` / `pl.Utf8`	`object`
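
The table can be written down as a small lookup, which is handy when deriving a dtype dictionary from a known schema. This is a sketch under assumptions: the dictionary and the helper `numpy_dtype_name` are hypothetical, the keys are polars dtype names written as plain strings, and whether `lazyframe_to_onnx` accepts dtype names instead of numpy dtype objects is not confirmed here (`np.dtype(name)` converts a name if needed).

```python
# Keys are polars dtype names, values are numpy dtype names (from the table).
POLARS_TO_NUMPY = {
    "Float32": "float32",
    "Float64": "float64",
    "Int8": "int8",
    "Int16": "int16",
    "Int32": "int32",
    "Int64": "int64",
    "UInt8": "uint8",
    "UInt16": "uint16",
    "UInt32": "uint32",
    "UInt64": "uint64",
    "Boolean": "bool",
    "String": "object",
    "Utf8": "object",
}


def numpy_dtype_name(polars_dtype_name: str) -> str:
    """Return the numpy dtype name for a polars dtype name, or raise."""
    try:
        return POLARS_TO_NUMPY[polars_dtype_name]
    except KeyError:
        raise TypeError(f"unsupported polars dtype: {polars_dtype_name}") from None


# Illustrative schema: column name -> polars dtype name.
schema = {"a": "Float64", "k": "Int64", "label": "String"}
input_dtypes = {col: numpy_dtype_name(t) for col, t in schema.items()}
print(input_dtypes)  # {'a': 'float64', 'k': 'int64', 'label': 'object'}
```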

Example

import numpy as np
import polars as pl
from yobx.sql import lazyframe_to_onnx
from yobx.reference import ExtendedReferenceEvaluator

# Build a lazy plan: keep rows where a > 0, then compute total = a + b.
lf = pl.LazyFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
lf = lf.filter(pl.col("a") > 0).select(
    [(pl.col("a") + pl.col("b")).alias("total")]
)

# Every source column used by the plan needs a numpy dtype.
dtypes = {"a": np.float64, "b": np.float64}
artifact = lazyframe_to_onnx(lf, dtypes)

# Run the exported model on new data, one 1-D array per source column.
ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float64)
b = np.array([4.0, 5.0, 6.0], dtype=np.float64)
(total,) = ref.run(None, {"a": a, "b": b})
# total holds a + b for the rows where a > 0: [5.0, 9.0]
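
The expected output of the example can be cross-checked without yobx. The numpy sketch below applies the same filter and arithmetic by hand; it mirrors only the semantics (`Greater`, `Add`, `Compress`) and makes no claim about the order in which the exported graph applies these nodes.

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0], dtype=np.float64)
b = np.array([4.0, 5.0, 6.0], dtype=np.float64)

mask = a > 0                       # comparison, like the Greater node
total = np.compress(mask, a + b)   # sum the columns, keep rows where mask is True

print(total)
```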

Limitations

GROUP BY on multiple columns casts the key columns to float64 before combining them. float64 represents every integer exactly only up to 2**53 in magnitude, so integer keys beyond that range can lose precision and distinct keys may collide.
Only a single filter step, a single select step, and a single group_by/agg step are handled. Complex multi-step plans may not translate correctly.
join, sort, limit, distinct, pivot, melt, and other advanced polars operations are not yet supported.
The plan text produced by polars.LazyFrame.explain() may change between polars versions; the parser targets the format used by polars ≥ 0.19.
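
The first limitation, precision loss when integer keys are cast to float64, is easy to reproduce with plain Python floats, which are IEEE 754 doubles. The snippet does not involve yobx; it only shows why two distinct integer keys above 2**53 can become indistinguishable after the cast.

```python
big = 2**53  # 9007199254740992

# Two different integers map to the same float64 value above 2**53.
collide = float(big) == float(big + 1)
print(collide)                 # True
print(int(float(big + 1)))     # 9007199254740992

# Below the threshold, neighbouring integers stay distinct.
distinct = float(big - 1) != float(big)
print(distinct)                # True
```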