.. _l-design-sql: ================================= SQL-to-ONNX Converter Design ================================= Overview ======== ``yobx.sql`` converts a SQL query string into a self-contained :class:`onnx.ModelProto`. The primary design principle is: **Every column referenced in the query is a distinct 1-D ONNX input tensor.** This mirrors the columnar representation used in tabular data pipelines: the caller feeds one vector per column rather than a single 2-D matrix. The ONNX model can therefore be applied to any subset of rows without copying the full table. Architecture ============ The conversion pipeline consists of two stages: 1. **Parsing** (:func:`~yobx.sql.parse.parse_sql`) — turns the SQL string into a :class:`~yobx.sql.parse.ParsedQuery` containing an ordered list of :class:`~yobx.sql.parse.SqlOperation` objects, one per SQL clause. 2. **Emission** (:func:`~yobx.sql.sql_to_onnx`) — iterates over the operations in execution order and appends ONNX nodes to a :class:`~yobx.xbuilder.GraphBuilder`. .. code-block:: text SQL string │ ▼ parse_sql() ─── ParsedQuery ──► operations list │ ▼ sql_to_onnx() ─── GraphBuilder ──► ModelProto Supported SQL clauses ===================== .. list-table:: :header-rows: 1 * - Clause - SqlOperation - ONNX nodes emitted * - ``SELECT expr [AS alias], …`` - :class:`~yobx.sql.parse.SelectOp` - ``Identity``, ``Add``, ``Sub``, ``Mul``, ``Div``, ``ReduceSum``, ``ReduceMean``, ``ReduceMin``, ``ReduceMax`` * - ``WHERE condition`` - :class:`~yobx.sql.parse.FilterOp` - ``Compress``, ``Equal``, ``Less``, ``Greater``, ``LessOrEqual``, ``GreaterOrEqual``, ``And``, ``Or``, ``Not`` * - ``GROUP BY col, …`` - :class:`~yobx.sql.parse.GroupByOp` - (groups are processed together with ``SelectOp`` aggregations) * - ``[INNER|LEFT|RIGHT|FULL] JOIN … ON col = col`` - :class:`~yobx.sql.parse.JoinOp` - ``Unsqueeze``, ``Equal``, ``ArgMax``, ``ReduceMax``, ``Compress``, ``Gather`` Columnar input convention ========================= Each column becomes a separate ONNX graph input whose name matches the column name in the SQL query. Column names are normalised to lower-case. .. code-block:: python import numpy as np from yobx.sql import sql_to_onnx # Columns "a" and "b" → two separate ONNX inputs dtypes = {"a": np.float32, "b": np.float32} onx = sql_to_onnx( "SELECT a + b AS total FROM t WHERE a > 0", dtypes, ) # Inputs of the ONNX model for inp in onx.graph.input: print(inp.name) # → "a", then "b" Execution order =============== Operations are applied in the following logical order: 1. **JoinOp** — merge left and right tables using an equi-join key. 2. **FilterOp** — apply the ``WHERE`` predicate as a boolean mask (``Compress``) to all column tensors simultaneously. 3. **GroupByOp** — record the group keys; referenced by aggregation expressions in the ``SelectOp``. 4. **SelectOp** — compute output expressions (arithmetic, aggregations) over the filtered/joined columns. .. code-block:: text inputs: col_a, col_b, col_key_left, col_key_right │ ├── JoinOp: Unsqueeze/Equal/ArgMax → Compress/Gather aligned columns │ ├── FilterOp: Greater/And/… → Compress (row mask applied to all cols) │ ├── GroupByOp: (records group columns, used by aggregations) │ └── SelectOp: Add/ReduceSum/… → Identity → outputs Parser design ============= The parser (:mod:`yobx.sql.parse`) is a hand-written recursive-descent parser using a single-pass tokeniser. No third-party SQL library is required. Tokens ------ The tokeniser (:func:`yobx.sql.parse._tokenize`) classifies input characters into four token kinds: * ``"num"`` — integer or floating-point literals. * ``"str"`` — single- or double-quoted string literals. * ``"op"`` — operators and punctuation (``+``, ``-``, ``*``, ``/``, ``=``, ``<``, ``>``, ``<=``, ``>=``, ``<>``, ``!=``, ``(``, ``)``, ``,``). * ``"id"`` — identifiers (SQL keywords and column/table names). All identifiers are lowercased so that parsing is case-insensitive. Expression grammar ------------------ .. code-block:: text primary ::= agg_func "(" expr ")" | "(" expr ")" | number | string | identifier multiplicative ::= primary ( ("*" | "/") primary )* additive ::= multiplicative ( ("+" | "-") multiplicative )* expr ::= additive comparison ::= expr ( "=" | "<" | ">" | "<=" | ">=" | "<>" | "!=" ) expr and_pred ::= comparison ( "AND" comparison )* condition ::= and_pred ( "OR" and_pred )* Limitations and future work ============================ * ``GROUP BY`` uses a whole-dataset aggregation (``ReduceSum`` etc.) rather than true per-group aggregation. True per-group semantics require an ONNX ``Loop`` or a custom kernel. * ``SELECT DISTINCT`` is parsed but raises :class:`NotImplementedError` during conversion. * Only equi-joins on a single key column are supported for ``JOIN``. * ``HAVING``, ``ORDER BY``, ``LIMIT``, and subqueries are not yet supported. * String equality (``WHERE name = 'alice'``) is not yet supported (string literals are parsed but ONNX ``Equal`` on strings may need a separate handling path). Example ======= .. runpython:: :showcode: import numpy as np from yobx.helpers.onnx_helper import pretty_onnx from yobx.sql import sql_to_onnx, parse_sql from yobx.reference import ExtendedReferenceEvaluator # ── parse ──────────────────────────────────────────────────────── pq = parse_sql("SELECT a + b AS total FROM t WHERE a > 0") for op in pq.operations: print(type(op).__name__, "—", op) # ── convert ────────────────────────────────────────────────────── dtypes = {"a": np.float32, "b": np.float32} onx = sql_to_onnx( "SELECT a + b AS total FROM t WHERE a > 0", dtypes, ) print(pretty_onnx(onx)) # ── run ────────────────────────────────────────────────────────── ref = ExtendedReferenceEvaluator(onx) a = np.array([ 1.0, -2.0, 3.0], dtype=np.float32) b = np.array([ 4.0, 5.0, 6.0], dtype=np.float32) (total,) = ref.run(None, {"a": a, "b": b}) print(total) # → [5. 9.] (rows where a > 0)