.. _l-cube:

=========================================
Statistics on Logs Through a Cube of Data
=========================================

:class:`CubeLogs` is a lightweight data-analysis layer built on top of
:mod:`pandas`. It treats experiment logs as a *cube*: a structured table
whose columns play one of four roles:

* **time**: a single date/timestamp column (default ``"date"``). It is
  optional but strongly recommended.
* **keys**: categorical identifiers (model name, exporter, hardware device, …).
  Together with *time*, the key tuple must uniquely identify every row.
* **values**: numerical or string measurements that can be aggregated
  (latency, memory, error counts, …).
* **ignored**: columns explicitly excluded from all three roles above.
  Everything else is silently dropped after loading.

:class:`CubeLogsPerformance` is a ready-made subclass whose defaults match the
conventions used by benchmarking scripts in this project (columns prefixed
``time_``, ``disc_``, ``onnx_``, etc.).

Loading data
============

Pass any of the following to the constructor, then call :meth:`load`:

* a :class:`pandas.DataFrame`
* a list of :class:`pandas.DataFrame` objects (concatenated automatically)
* a list of dicts (each dict becomes one row)
* a list of file paths / directory paths to CSV files

.. runpython::
    :showcode:

    import io
    import textwrap
    import pandas
    from yobx.helpers.cube_helper import CubeLogs

    raw = pandas.read_csv(
        io.StringIO(
            textwrap.dedent(
                """
                date,version_python,model_name,model_exporter,time_latency,time_baseline
                2025/01/01,3.13,phi3,export,0.10,0.10
                2025/01/02,3.13,phi3,export,0.11,0.10
                2025/01/01,3.13,phi4,export,0.10,0.105
                2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
                """
            )
        )
    )

    cube = CubeLogs(
        raw,
        time="date",
        keys=["version_.*", "model_.*"],
        values=["time_.*"],
        recent=True,  # keep only the most recent row per key tuple
    ).load()

    print("shape :", cube.shape)
    print("time  :", cube.time)
    print("keys  :", cube.keys_no_time)
    print("values:", cube.values)

When ``recent=True`` the cube keeps only the most recent row (highest *time*
value) for each combination of key columns. This is useful when experiment
logs are appended over time and you only care about the latest run.

Column patterns
---------------

The ``keys``, ``values``, and ``ignored`` arguments accept plain column names
**or** Python regular expressions. A string is treated as a regular expression
when it contains any of the characters ``"^.*+{}"``::

    CubeLogs(df, keys=["^version_.*", "model_name", "device"], ...)

Adding computed columns (formulas)
====================================

Pass a dictionary of ``{new_column_name: callable}`` as the ``formulas``
argument. Each callable receives the full :class:`~pandas.DataFrame` and must
return a :class:`~pandas.Series`. Computed columns are appended to the
*values* list.

.. runpython::
    :showcode:

    import io
    import textwrap
    import pandas
    from yobx.helpers.cube_helper import CubeLogs

    raw = pandas.read_csv(
        io.StringIO(
            textwrap.dedent(
                """
                date,version_python,model_name,model_exporter,time_latency,time_baseline
                2025/01/01,3.13,phi3,export,0.10,0.12
                2025/01/01,3.13,phi4,export,0.10,0.105
                2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
                """
            )
        )
    )

    cube = CubeLogs(
        raw,
        time="date",
        keys=["version_.*", "model_.*"],
        values=["time_.*"],
        formulas={"speedup": lambda df: df["time_baseline"] / df["time_latency"]},
    ).load()

    print("values:", cube.values)
    print(cube.data[["model_name", "model_exporter", "speedup"]])

Pivot views
===========

:meth:`CubeLogs.view` creates a pivot table from the cube. Rows and columns of
the pivot are controlled by a :class:`CubeViewDef` object.

.. runpython::
    :showcode:

    import io
    import textwrap
    import pandas
    from yobx.helpers.cube_helper import CubeLogs, CubeViewDef

    raw = pandas.read_csv(
        io.StringIO(
            textwrap.dedent(
                """
                date,version_python,model_name,model_exporter,time_latency,time_baseline
                2025/01/01,3.13,phi3,export,0.10,0.12
                2025/01/01,3.13,phi4,export,0.10,0.105
                2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
                """
            )
        )
    )

    cube = CubeLogs(
        raw,
        time="date",
        keys=["version_.*", "model_.*"],
        values=["time_latency", "time_baseline"],
    ).load()

    view = cube.view(
        CubeViewDef(
            key_index=["version_python", "model_name"],
            values=["time_latency", "time_baseline"],
            ignore_columns=["date"],
        )
    )
    print(view.to_string())

The result is a :class:`~pandas.DataFrame` with a :class:`~pandas.MultiIndex`
on both the row index (``key_index``) and the columns (metrics × remaining key
values).

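
To picture that layout outside the library, the sketch below builds a
comparable pivot with plain :meth:`pandas.DataFrame.pivot_table`. It only
illustrates the resulting index structure; it is not how :class:`CubeLogs`
computes its views, and the small data frame is a hypothetical copy of the
sample log used above.

.. code-block:: python

    import pandas

    # Hypothetical miniature of the sample log (built by hand, not by CubeLogs).
    df = pandas.DataFrame(
        {
            "version_python": ["3.13", "3.13", "3.12"],
            "model_name": ["phi3", "phi4", "phi4"],
            "model_exporter": ["export", "export", "onnx-dynamo"],
            "time_latency": [0.10, 0.10, 0.14],
            "time_baseline": [0.12, 0.105, 0.999],
        }
    )

    # Rows: the keys chosen for the index. Columns: metric x remaining key.
    pivot = df.pivot_table(
        index=["version_python", "model_name"],
        columns=["model_exporter"],
        values=["time_latency", "time_baseline"],
        aggfunc="first",
    )

    print(list(pivot.index.names))  # names of the row index levels
    print(pivot.columns.nlevels)    # number of column levels
    print(pivot.to_string())

Here ``model_exporter`` is the only key left out of the row index, so it ends
up as the second level of the column index, below the metric name.
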

CubeViewDef options
--------------------

+---------------------------+------------------------------------------------------------+
| Parameter                 | Purpose                                                    |
+===========================+============================================================+
| ``key_index``             | Key columns placed in the row index of the pivot table     |
+---------------------------+------------------------------------------------------------+
| ``values``                | Metric columns to include in the pivot                     |
+---------------------------+------------------------------------------------------------+
| ``ignore_unique``         | Drop key columns that have only one distinct value         |
+---------------------------+------------------------------------------------------------+
| ``key_agg``               | Keys to aggregate away before pivoting                     |
+---------------------------+------------------------------------------------------------+
| ``agg_args``              | Aggregation function(s) passed to ``groupby(...).agg()``   |
|                           | (can be a callable ``column_name → agg_func``)             |
+---------------------------+------------------------------------------------------------+
| ``agg_multi``             | Extra aggregations over multiple columns simultaneously    |
+---------------------------+------------------------------------------------------------+
| ``order``                 | Explicit ordering of column levels                         |
+---------------------------+------------------------------------------------------------+
| ``ignore_columns``        | Columns to exclude from the view                           |
+---------------------------+------------------------------------------------------------+
| ``keep_columns_in_index`` | Keep a column even if it has only one distinct value       |
+---------------------------+------------------------------------------------------------+
| ``transpose``             | Transpose rows and columns                                 |
+---------------------------+------------------------------------------------------------+
| ``plots``                 | Attach a :class:`CubePlot` to this view in Excel export    |
+---------------------------+------------------------------------------------------------+
| ``no_index``              | Reset the index, returning a flat DataFrame                |
+---------------------------+------------------------------------------------------------+

Aggregated views
----------------

Set ``key_agg`` to collapse one or more key dimensions before building the
pivot. This is useful for summarising across all models or all dates:

.. code-block:: python

    view = cube.view(
        CubeViewDef(
            key_index=["version_python"],
            values=["time_latency"],
            key_agg=["model_name", "date"],
            agg_args=lambda col: "mean",
            name="aggregated",
        )
    )

Describing the cube
===================

:meth:`CubeLogs.describe` returns a summary :class:`~pandas.DataFrame` with one
row per column, showing its role (time / keys / values / ignored), dtype,
missing-value count, and basic statistics:

.. runpython::
    :showcode:

    import io
    import textwrap
    import pandas
    from yobx.helpers.cube_helper import CubeLogs

    raw = pandas.read_csv(
        io.StringIO(
            textwrap.dedent(
                """
                date,version_python,model_name,model_exporter,time_latency,time_baseline
                2025/01/01,3.13,phi3,export,0.10,0.12
                2025/01/01,3.13,phi4,export,0.10,0.105
                2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
                """
            )
        )
    )

    cube = CubeLogs(
        raw,
        time="date",
        keys=["version_.*", "model_.*"],
        values=["time_.*"],
    ).load()

    print(cube.describe()[["kind", "dtype", "missing", "min", "max"]].to_string())

Exporting to Excel
==================

:meth:`CubeLogs.to_excel` writes an ``.xlsx`` workbook. Each view is placed on
its own sheet. A *raw* sheet with the full cube data and a *main* sheet with
per-column statistics can be added automatically.

.. code-block:: python

    cube.to_excel(
        "results.xlsx",
        views={
            "latency": CubeViewDef(
                key_index=["version_python", "model_name"],
                values=["time_latency"],
                ignore_columns=["date"],
                name="latency",
                plots=True,
            ),
            "baseline": "time_baseline",  # shorthand: auto-generated view
        },
        main="main",  # sheet with per-column statistics
        raw="raw",  # sheet with the complete cube data
    )

Passing a plain string as a view is a shorthand: the cube calls
:meth:`make_view_def` to produce a sensible default view for that metric.

CubeLogsPerformance
===================

:class:`CubeLogsPerformance` is a subclass of :class:`CubeLogs` with defaults
tuned for ML benchmarking logs. Its constructor pre-configures:

* **time** column: ``"DATE"`` (uppercase; contrast with the ``"date"`` default
  in the base :class:`CubeLogs`)
* **key** patterns: ``version_.*``, ``model_.*``, ``device``, ``exporter``,
  ``suite``, ``machine``, ``dtype``, ``architecture``, and several others.
* **value** patterns: ``time_.*``, ``disc.*``, ``ERR_.*``, ``onnx_.*``, and
  related prefixes.
* **formulas**: ``speedup``, ``ERR1``, ``n_models``, ``n_model_faster``, a full
  suite of ``n_node_*`` counts, and more, all computed automatically from the
  raw columns.
* ``recent=True``: only the most recent row per key tuple is kept.

Usage is identical to :class:`CubeLogs`:

.. code-block:: python

    from yobx.helpers.cube_helper import CubeLogsPerformance, CubeViewDef

    cube = CubeLogsPerformance(df).load()
    view = cube.view(CubeViewDef(...))

.. seealso::

    :class:`CubeLogs`: full API reference generated from docstrings.

    :class:`CubeLogsPerformance`: subclass pre-configured for ML performance
    logs.
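
The ``recent=True`` default listed above can be pictured with a few lines of
plain pandas. The sketch below is only an approximation of the described
behaviour (keep the row with the highest time value for each key tuple); it is
not the code the class runs, and the data frame and the ``keys`` list are
hypothetical.

.. code-block:: python

    import pandas

    # Hypothetical log rows; "DATE" mirrors the time column named above.
    logs = pandas.DataFrame(
        {
            "DATE": pandas.to_datetime(["2025-01-01", "2025-01-02", "2025-01-01"]),
            "model_name": ["phi3", "phi3", "phi4"],
            "exporter": ["export", "export", "export"],
            "time_latency": [0.10, 0.11, 0.10],
        }
    )
    keys = ["model_name", "exporter"]  # assumed key columns for this sketch

    # Sort by time, then keep the last (most recent) row of each key tuple.
    latest = (
        logs.sort_values("DATE", kind="stable")
        .drop_duplicates(subset=keys, keep="last")
        .reset_index(drop=True)
    )
    print(latest)

Only one ``phi3`` row survives, the one dated 2025-01-02, because it is the
latest entry for the key tuple ``("phi3", "export")``.
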