yobx.sql.to_onnx

yobx.sql.to_onnx(dataframe_or_query: Union[str, Callable[[TracedDataFrame], TracedDataFrame], 'polars.LazyFrame'], args: Optional[Union[np.ndarray, Tuple[np.ndarray, ...], 'pandas.DataFrame', Tuple['pandas.DataFrame', ...], Dict[str, Union[np.dtype, type, str]], List[Dict[str, Union[np.dtype, type, str]]]]] = None, target_opset: int = 21, custom_functions: Optional[Dict[str, Callable]] = None, builder_cls: Union[type, Callable] = <class 'yobx.xbuilder.graph_builder.GraphBuilder'>, filename: Optional[str] = None, verbose: int = 0, input_names: Optional[Sequence[str]] = None, dynamic_shapes: Optional[Tuple[Dict[int, str], ...]] = None, large_model: bool = False, external_threshold: int = 1024, return_optimize_report: bool = False) → ExportArtifact

Convert a SQL string, a DataFrame-tracing function, or a polars LazyFrame to ONNX.

This is the unified entry point that dispatches to:

  • sql_to_onnx() — when dataframe_or_query is a string.

  • dataframe_to_onnx() — when dataframe_or_query is a callable (a Python function that accepts a TracedDataFrame and returns one).

  • trace_numpy_to_onnx() — when dataframe_or_query is a callable and args is a numpy.ndarray or a tuple/list of numpy.ndarray objects (sample inputs for numpy-function tracing).

  • lazyframe_to_onnx() — for any other value (expected to be a polars.LazyFrame).

Each source column is represented as a separate 1-D ONNX input tensor. The model outputs correspond to the SELECT expressions (SQL / callable) or the select / agg step of the LazyFrame plan.

Parameters:
  • dataframe_or_query

    one of:

    • SQL string — supported clauses: SELECT, FROM, [INNER|LEFT|RIGHT|FULL] JOIN ON, WHERE, GROUP BY. Custom Python functions can be called by name in the SELECT and WHERE clauses when registered via custom_functions.

    • callable — a Python function (df: TracedDataFrame) -> TracedDataFrame or a function that accepts multiple TracedDataFrame arguments. The function is traced to capture all filter, select, aggregation, and join operations it performs, which are then compiled to ONNX. When args contains numpy arrays the function is treated as a numpy function and traced via trace_numpy_to_onnx() instead.

    • polars.LazyFrame — the execution plan is extracted via polars.LazyFrame.explain() and translated into SQL before conversion. See lazyframe_to_onnx() for details of supported operations.

  • args

    one of:

    • A single {column: dtype} mapping or a list of such mappings (one per TracedDataFrame argument) for DataFrame-tracing callables or SQL queries.

    • A numpy.ndarray or a tuple/list of numpy.ndarray objects — when dataframe_or_query is a numpy function, these sample arrays are used to determine the element types and shapes of the ONNX graph inputs; the function is then traced via trace_numpy_to_onnx().

    • A pandas DataFrame (or a tuple/list of DataFrames) — column names and per-column dtypes are extracted automatically.

    For SQL queries this maps left-table columns; for a LazyFrame it maps the source DataFrame columns referenced in the plan. Only columns that actually appear in the query / plan need to be listed. Supported numpy dtypes: float16, float32, float64, int8, int16, int32, int64, uint8, uint16, uint32, uint64, bool, object (string).
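For instance, passing a pandas DataFrame is equivalent to passing the explicit mapping built from its per-column dtypes. A minimal sketch of that equivalence, using only pandas and numpy (no yobx call):

```python
import numpy as np
import pandas as pd

# a sample frame whose columns carry the dtypes of interest
df = pd.DataFrame({
    "a": np.array([1.0, 2.0], dtype=np.float32),
    "b": np.array([3, 4], dtype=np.int64),
})

# the equivalent explicit {column: dtype} mapping that could be
# passed as `args` instead of the DataFrame itself
dtypes = {name: dt.type for name, dt in df.dtypes.items()}
assert dtypes == {"a": np.float32, "b": np.int64}
```

Either form gives to_onnx() the same information: column names plus per-column numpy element types.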

  • target_opset – ONNX opset version to target (default: 21, i.e. yobx.DEFAULT_TARGET_OPSET).

  • custom_functions

    an optional mapping from function name (as it appears in the SQL string) to a Python callable. Each callable must accept one or more numpy arrays and return a numpy array. The function body is traced with trace_numpy_function() so that numpy arithmetic is translated into ONNX nodes. Ignored when dataframe_or_query is a polars.LazyFrame or when args contains numpy arrays.

    Example:

    import numpy as np
    from yobx.sql import to_onnx
    
    artifact = to_onnx(
        "SELECT my_sqrt(a) AS r FROM t",
        {"a": np.float32},
        custom_functions={"my_sqrt": np.sqrt},
    )
    

  • builder_cls – the graph-builder class (or factory callable) to use. Defaults to GraphBuilder. Any class that implements the same shape- and type-tracking interface can be supplied here, e.g. a custom subclass that adds extra optimisation passes.

  • filename – if set, the exported ONNX model is saved to this path and the ExportReport is written as a companion Excel file (same base name with .xlsx extension).

  • verbose – verbosity level (0 = silent).

  • input_names – optional list of tensor names for the ONNX graph inputs. Only used when dataframe_or_query is a numpy function (i.e. args contains numpy.ndarray objects); ignored for SQL strings and DataFrame-tracing callables.

  • dynamic_shapes – optional per-input axis-to-dimension-name mappings. Only used when dataframe_or_query is a numpy function; ignored for SQL strings and DataFrame-tracing callables.
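    As a sketch of the structures these two parameters expect for a two-input numpy function (the function, the array shapes, and the dimension name "batch" are illustrative assumptions, not part of the API):

    ```python
    import numpy as np

    def scale(x, w):
        # hypothetical two-input numpy function to be traced
        return x * w

    # sample inputs fix the element types; axis 0 of x is meant to stay
    # symbolic in the exported graph, so it gets a name instead of 4
    x = np.ones((4, 3), dtype=np.float32)
    w = np.ones((1, 3), dtype=np.float32)

    input_names = ["x", "w"]             # one name per graph input
    dynamic_shapes = ({0: "batch"}, {})  # one axis->name dict per input

    # these would be passed as:
    #   to_onnx(scale, (x, w), input_names=input_names,
    #           dynamic_shapes=dynamic_shapes)
    assert len(input_names) == len(dynamic_shapes) == 2
    ```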

  • large_model – if True, the returned ExportArtifact has its container attribute set to an ExtendedModelContainer.

  • external_threshold – if large_model is True, every tensor whose element count exceeds this threshold is stored as external data.

  • return_optimize_report – if True, the returned ExportArtifact has its report attribute populated with per-pattern optimization statistics.

Returns:

ExportArtifact wrapping the exported ONNX model together with an ExportReport.

Example — from a SQL string:

import numpy as np
from yobx.sql import to_onnx
from yobx.reference import ExtendedReferenceEvaluator

dtypes = {"a": np.float32, "b": np.float32}
artifact = to_onnx("SELECT a + b AS total FROM t WHERE a > 0", dtypes)

ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0,  5.0, 6.0], dtype=np.float32)
(total,) = ref.run(None, {"a": a, "b": b})
# total == array([5., 9.], dtype=float32)  (rows where a > 0)
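The expected output can be cross-checked against plain numpy, which mirrors the WHERE-then-SELECT semantics without any yobx machinery:

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 5.0, 6.0], dtype=np.float32)

mask = a > 0           # WHERE a > 0
total = (a + b)[mask]  # SELECT a + b AS total
assert total.tolist() == [5.0, 9.0]
```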

Example — from a DataFrame-tracing callable:

import numpy as np
from yobx.sql import to_onnx
from yobx.reference import ExtendedReferenceEvaluator

def transform(df):
    df = df.filter(df["a"] > 0)
    return df.select([(df["a"] + df["b"]).alias("total")])

dtypes = {"a": np.float32, "b": np.float32}
artifact = to_onnx(transform, dtypes)

ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([4.0,  5.0, 6.0], dtype=np.float32)
(total,) = ref.run(None, {"a": a, "b": b})
# total == array([5., 9.], dtype=float32)  (rows where a > 0)

Example — from a numpy-function callable with sample inputs:

import numpy as np
from yobx.sql import to_onnx
from yobx.reference import ExtendedReferenceEvaluator

def my_func(x):
    return np.sqrt(np.abs(x) + 1)

x = np.array([1.0, -2.0, 3.0], dtype=np.float32)
artifact = to_onnx(my_func, (x,))

ref = ExtendedReferenceEvaluator(artifact)
(result,) = ref.run(None, {"X": x})

Example — from a polars LazyFrame:

import numpy as np
import polars as pl
from yobx.sql import to_onnx
from yobx.reference import ExtendedReferenceEvaluator

lf = pl.LazyFrame({"a": [1.0, -2.0, 3.0], "b": [4.0, 5.0, 6.0]})
lf = lf.filter(pl.col("a") > 0).select(
    [(pl.col("a") + pl.col("b")).alias("total")]
)
dtypes = {"a": np.float64, "b": np.float64}
artifact = to_onnx(lf, dtypes)

ref = ExtendedReferenceEvaluator(artifact)
a = np.array([1.0, -2.0, 3.0], dtype=np.float64)
b = np.array([4.0,  5.0, 6.0], dtype=np.float64)
(total,) = ref.run(None, {"a": a, "b": b})
# total == array([5., 9.])  (rows where a > 0)

Note

GROUP BY produces one output row per unique key combination. Supported aggregations: SUM, AVG, MIN, MAX, COUNT(*). For multi-column GROUP BY the grouping keys are cast to float64 internally, which may lose precision for integers larger than 2^53.
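The reference semantics of this note can be sketched in plain numpy: grouping produces one output row per unique key, and the internal float64 cast cannot represent odd integers above 2**53 (a sketch only; the actual ONNX lowering may differ):

```python
import numpy as np

# reference semantics of: SELECT key, SUM(v) FROM t GROUP BY key
key = np.array([1, 2, 1, 2, 3], dtype=np.int64)
v = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

uniq, inv = np.unique(key, return_inverse=True)  # one output row per key
sums = np.zeros(len(uniq))
np.add.at(sums, inv, v)                          # scatter-add per group
assert uniq.tolist() == [1, 2, 3]
assert sums.tolist() == [40.0, 60.0, 50.0]

# precision caveat: 2**53 + 1 is not representable in float64,
# so a key cast to float64 collides with 2**53
assert np.float64(2**53 + 1) == np.float64(2**53)
```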