Statistics on Logs Through a Cube of Data

CubeLogs is a lightweight data-analysis layer built on top of pandas. It treats experiment logs as a cube — a structured table whose columns play one of four roles:

  • time — a single date/timestamp column (default "date"). It is optional but strongly recommended.

  • keys — categorical identifiers (model name, exporter, hardware device, …). Together with time, the key tuple must uniquely identify every row.

  • values — numerical or string measurements that can be aggregated (latency, memory, error counts, …).

  • ignored — columns explicitly excluded from all three roles above.

Any column that matches none of these roles is silently dropped after loading.

CubeLogsPerformance is a ready-made subclass whose defaults match the conventions used by benchmarking scripts in this project (columns prefixed time_, disc_, onnx_, etc.).

Loading data

Pass any of the following to the constructor, then call load:

  • a pandas.DataFrame

  • a list of pandas.DataFrame objects (concatenated automatically)

  • a list of dicts (each dict becomes one row; see the sketch after this list)

  • a list of file paths / directory paths to CSV files
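For instance, the list-of-dicts form could look like this (a minimal sketch; the executed example below uses a DataFrame instead):

from yobx.helpers.cube_helper import CubeLogs

# Each dict becomes one row of the cube.
cube = CubeLogs(
    [
        {"date": "2025/01/01", "model_name": "phi3", "time_latency": 0.10},
        {"date": "2025/01/02", "model_name": "phi3", "time_latency": 0.11},
    ],
    time="date",
    keys=["model_.*"],
    values=["time_.*"],
).load()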

<<<

import io
import textwrap
import pandas
from yobx.helpers.cube_helper import CubeLogs

raw = pandas.read_csv(io.StringIO(textwrap.dedent("""
            date,version_python,model_name,model_exporter,time_latency,time_baseline
            2025/01/01,3.13,phi3,export,0.10,0.10
            2025/01/02,3.13,phi3,export,0.11,0.10
            2025/01/01,3.13,phi4,export,0.10,0.105
            2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
            """)))

cube = CubeLogs(
    raw,
    time="date",
    keys=["version_.*", "model_.*"],
    values=["time_.*"],
    recent=True,  # keep only the most recent row per key tuple
).load()

print("shape :", cube.shape)
print("time  :", cube.time)
print("keys  :", cube.keys_no_time)
print("values:", cube.values)

>>>

    shape : (3, 6)
    time  : date
    keys  : ['model_exporter', 'model_name', 'version_python']
    values: ['time_baseline', 'time_latency']

When recent=True the cube keeps only the most recent row (highest time value) for each combination of key columns. This is useful when experiment logs are appended over time and you only care about the latest run.
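In plain pandas the same effect is a sort followed by a de-duplication; the sketch below illustrates the semantics (it is not the library's actual implementation):

import pandas

df = pandas.DataFrame(
    {
        "date": ["2025/01/01", "2025/01/02"],
        "model_name": ["phi3", "phi3"],
        "time_latency": [0.10, 0.11],
    }
)

# Keep the row with the highest time value for each key tuple.
latest = df.sort_values("date").drop_duplicates(subset=["model_name"], keep="last")
print(latest)  # only the 2025/01/02 row remains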

Column patterns

The keys, values, and ignored arguments accept plain column names or Python regular expressions. A string is treated as a regular expression when it contains any of the characters "^.*+{}":

CubeLogs(df, keys=["^version_.*", "model_name", "device"], ...)
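The matching itself is ordinary re usage; the following sketch shows how such a pattern list might be resolved against the actual columns (an illustration, not the library's code):

import re

columns = ["version_python", "model_name", "device", "time_latency"]
patterns = ["^version_.*", "model_name", "device"]

def expand(patterns, columns):
    # A string containing any of "^.*+{}" is treated as a regular
    # expression; anything else is taken as an exact column name.
    selected = []
    for p in patterns:
        if set(p) & set("^.*+{}"):
            selected.extend(c for c in columns if re.match(p, c))
        else:
            selected.append(p)
    return selected

print(expand(patterns, columns))
# ['version_python', 'model_name', 'device']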

Adding computed columns (formulas)

Pass a dictionary of {new_column_name: callable} as the formulas argument. Each callable receives the full DataFrame and must return a Series. Computed columns are appended to the values list.

<<<

import io
import textwrap
import pandas
from yobx.helpers.cube_helper import CubeLogs

raw = pandas.read_csv(io.StringIO(textwrap.dedent("""
            date,version_python,model_name,model_exporter,time_latency,time_baseline
            2025/01/01,3.13,phi3,export,0.10,0.12
            2025/01/01,3.13,phi4,export,0.10,0.105
            2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
            """)))

cube = CubeLogs(
    raw,
    time="date",
    keys=["version_.*", "model_.*"],
    values=["time_.*"],
    formulas={"speedup": lambda df: df["time_baseline"] / df["time_latency"]},
).load()

print("values:", cube.values)
print(cube.data[["model_name", "model_exporter", "speedup"]])

>>>

    values: ['time_baseline', 'time_latency', 'speedup']
      model_name model_exporter   speedup
    0       phi3         export  1.200000
    1       phi4         export  1.050000
    2       phi4    onnx-dynamo  7.135714

Pivot views

CubeLogs.view creates a pivot table from the cube. Rows and columns of the pivot are controlled by a CubeViewDef object.

<<<

import io
import textwrap
import pandas
from yobx.helpers.cube_helper import CubeLogs, CubeViewDef

raw = pandas.read_csv(io.StringIO(textwrap.dedent("""
            date,version_python,model_name,model_exporter,time_latency,time_baseline
            2025/01/01,3.13,phi3,export,0.10,0.12
            2025/01/01,3.13,phi4,export,0.10,0.105
            2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
            """)))

cube = CubeLogs(
    raw,
    time="date",
    keys=["version_.*", "model_.*"],
    values=["time_latency", "time_baseline"],
).load()

view = cube.view(
    CubeViewDef(
        key_index=["version_python", "model_name"],
        values=["time_latency", "time_baseline"],
        ignore_columns=["date"],
    )
)
print(view.to_string())

>>>

    METRICS                   time_baseline             time_latency            
    model_exporter                   export onnx-dynamo       export onnx-dynamo
    version_python model_name                                                   
    3.12           phi4                 NaN       0.999          NaN        0.14
    3.13           phi3               0.120         NaN          0.1         NaN
                   phi4               0.105         NaN          0.1         NaN

The result is a DataFrame with a MultiIndex on both the row index (key_index) and the columns (metrics × remaining key values).
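Since both axes are MultiIndexes, the usual pandas accessors apply. A small sketch based on the view produced above:

# Level names of the pivot (as printed above).
print(view.columns.names)   # ['METRICS', 'model_exporter']
print(view.index.names)     # ['version_python', 'model_name']

# Select one metric across all exporters.
print(view["time_latency"])

# Flatten everything back into ordinary columns if needed.
flat = view.reset_index()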

CubeViewDef options

Parameter              Purpose
---------------------  ------------------------------------------------------
key_index              Key columns placed in the row index of the pivot table
values                 Metric columns to include in the pivot
ignore_unique          Drop key columns that have only one distinct value
key_agg                Key columns to aggregate away before pivoting
agg_args               Aggregation function(s) passed to groupby(...).agg();
                       may also be a callable mapping a column name to an
                       aggregation function
agg_multi              Extra aggregations computed over multiple columns
                       simultaneously
order                  Explicit ordering of the column levels
ignore_columns         Columns to exclude from the view
keep_columns_in_index  Keep a key column even if it has only one distinct
                       value
transpose              Transpose the rows and columns of the pivot
plots                  Attach a CubePlot to this view in the Excel export
no_index               Reset the index, returning a flat DataFrame
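Several of these options compose naturally. A sketch reusing the cube from the pivot example, relying only on the parameters listed above (the exact layout of the result depends on the data):

view = cube.view(
    CubeViewDef(
        key_index=["model_name"],
        values=["time_latency"],
        ignore_columns=["date"],
        ignore_unique=True,  # drop key columns with a single distinct value
        transpose=True,      # metrics become rows, keys become columns
    )
)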

Aggregated views

Set key_agg to collapse one or more key dimensions before building the pivot. This is useful for summarising across all models or all dates:

view = cube.view(
    CubeViewDef(
        key_index=["version_python"],
        values=["time_latency"],
        key_agg=["model_name", "date"],
        agg_args=lambda col: "mean",
        name="aggregated",
    )
)
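Conceptually, key_agg applies a groupby over the remaining keys before the pivot is built. In plain pandas the view above corresponds roughly to the following sketch (not the actual implementation):

# Average time_latency over all models and dates, keeping version_python.
summary = (
    cube.data.groupby("version_python")["time_latency"]
    .mean()
    .reset_index()
)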

Describing the cube

CubeLogs.describe returns a summary DataFrame with one row per column, showing its role (time / keys / values / ignored), dtype, missing-value count, and basic statistics:

<<<

import io
import textwrap
import pandas
from yobx.helpers.cube_helper import CubeLogs

raw = pandas.read_csv(io.StringIO(textwrap.dedent("""
            date,version_python,model_name,model_exporter,time_latency,time_baseline
            2025/01/01,3.13,phi3,export,0.10,0.12
            2025/01/01,3.13,phi4,export,0.10,0.105
            2025/01/01,3.12,phi4,onnx-dynamo,0.14,0.999
            """)))

cube = CubeLogs(
    raw,
    time="date",
    keys=["version_.*", "model_.*"],
    values=["time_.*"],
).load()

print(cube.describe()[["kind", "dtype", "missing", "min", "max"]].to_string())

>>>

                      kind           dtype  missing                  min                  max
    name                                                                                     
    date              time  datetime64[us]        0  2025-01-01 00:00:00  2025-01-01 00:00:00
    version_python    keys         float64        0                 3.12                 3.13
    model_name        keys             str        0                  NaN                  NaN
    model_exporter    keys             str        0                  NaN                  NaN
    time_latency    values         float64        0                  0.1                 0.14
    time_baseline   values         float64        0                0.105                0.999

Exporting to Excel

CubeLogs.to_excel writes an .xlsx workbook. Each view is placed on its own sheet. A raw sheet with the full cube data and a main sheet with per-column statistics can be added automatically.

cube.to_excel(
    "results.xlsx",
    views={
        "latency": CubeViewDef(
            key_index=["version_python", "model_name"],
            values=["time_latency"],
            ignore_columns=["date"],
            name="latency",
            plots=True,
        ),
        "baseline": "time_baseline",   # shorthand: auto-generated view
    },
    main="main",    # sheet with per-column statistics
    raw="raw",      # sheet with the complete cube data
)

Passing a plain string as a view is a shorthand: the cube calls make_view_def to produce a sensible default view for that metric.
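If you prefer to be explicit, the shorthand can be expanded by hand. Assuming make_view_def accepts the metric name (see the API reference for its exact signature), the "baseline" entry above is roughly equivalent to:

cube.to_excel(
    "results.xlsx",
    views={"baseline": cube.make_view_def("time_baseline")},  # assumed signature
)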

CubeLogsPerformance

CubeLogsPerformance is a subclass of CubeLogs with defaults tuned for ML benchmarking logs. Its constructor pre-configures:

  • time column: "DATE" (uppercase; contrast with the "date" default in the base CubeLogs)

  • key patterns: version_.*, model_.*, device, exporter, suite, machine, dtype, architecture, and several others.

  • value patterns: time_.*, disc.*, ERR_.*, onnx_.*, and related prefixes.

  • formulas: speedup, ERR1, n_models, n_model_faster, a full suite of n_node_* counts, and more — all computed automatically from the raw columns.

  • recent=True: only the most recent row per key tuple is kept.

Usage is identical to CubeLogs:

from yobx.helpers.cube_helper import CubeLogsPerformance

cube = CubeLogsPerformance(df).load()
view = cube.view(CubeViewDef(...))

See also

CubeLogs API — full API reference generated from docstrings.

CubeLogsPerformance API — subclass pre-configured for ML performance logs.