python -m onnx_diagnostic sbs … runs a side-by-side torch/onnx comparison

Description

It compares the intermediate results of an exported program saved with torch.export.save() and an exported onnx model, on inputs saved with torch.save(). It assumes intermediate results share the same names.

    usage: side-by-side (sbs) [-h] -i INPUTS -e EP -m ONNX -o OUTPUT [--atol ATOL]
                              [--rtol RTOL] [-v VERBOSE] [-r RATIO]
                              [--first | --no-first]
                              [-2 | --second-run | --no-second-run]
                              [--reset RESET] [-s REPLAY_THRESHOLD]
                              [-n REPLAY_NAMES] [-t REPLAY_OP_TYPES]
                              [-f REPLAY_FOLDER]
    
    Compares the intermediate outputs of the exported program and the
    exported onnx model. It assumes some names are common. The exported
    program and the onnx model are executed in parallel, on the device
    used to store the model and the inputs. Where do discrepancies start?
    This function tries to answer that question.
    
    options:
      -h, --help            show this help message and exit
      -i INPUTS, --inputs INPUTS
                            model inputs saved with torch.save
      -e EP, --ep EP        exported program saved with torch.export.save
      -m ONNX, --onnx ONNX  exported model in onnx format
      -o OUTPUT, --output OUTPUT
                            output file storing what the command line
                            produces; it should be an Excel file
      --atol ATOL           absolute tolerance
      --rtol RTOL           relative tolerance
      -v VERBOSE, --verbose VERBOSE
                            verbosity
      -r RATIO, --ratio RATIO
                            Saves the result in an excel file every <ratio> nodes,
                            default is 100.
      --first, --no-first   First runs the whole model.
      -2, --second-run, --no-second-run
                            Tries to run all onnx nodes with the torch results
                            produced by the exported program, then measures the
                            discrepancies again. This helps distinguish kernels
                            that introduce discrepancies from those that merely
                            propagate them.
      --reset RESET         List of result names separated by a comma. For those
                            results, the side-by-side will take torch results
                            instead of onnx results to compute the rest of the
                            onnx model.
      -s REPLAY_THRESHOLD, --replay-threshold REPLAY_THRESHOLD
                            Triggers the replay if the discrepancies are higher
                            than this value.
      -n REPLAY_NAMES, --replay-names REPLAY_NAMES
                            Triggers the replay if a result name is in this set of
                            values (comma separated)
      -t REPLAY_OP_TYPES, --replay-op-types REPLAY_OP_TYPES
                            Triggers the replay if an onnx type is in this set of
                            values (comma separated)
      -f REPLAY_FOLDER, --replay-folder REPLAY_FOLDER
                            If the replay is triggered, this defines the folder
                            where everything is dumped.
    
    The command line expects the following files, saved with the following
    functions, where inputs is a dictionary of the model inputs:

    - torch.export.save(ep), ep being a torch.export.ExportedProgram
    - torch.save(inputs)
    - onnx.save(...)

    The Replay functionality is a way to investigate a part of a model. It
    saves the torch and onnx inputs, the torch outputs, and the minimal onnx
    model sharing its inputs with the exported program. This is used to
    investigate the discrepancies between the torch model (through the
    exported program) and its onnx conversion. It dumps everything it can to
    disk so that it can be replayed in a separate process.

CPU, CUDA

Inputs are saved with torch.save(). The execution runs on CUDA if the inputs are stored on a CUDA device, and on CPU otherwise.
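For example, re-saving the inputs on another device switches the device the comparison runs on. This is a minimal sketch with hypothetical file names:

```python
import torch

# Stands in for inputs = torch.load("model.inputs.pt"), a dictionary
# of tensors as expected by the -i argument.
inputs = {"x": torch.randn(2, 3)}

# Move every tensor to CUDA when available; the sbs command then runs
# on that device because it follows the device of the inputs.
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = {k: v.to(device) for k, v in inputs.items()}
torch.save(inputs, "model.inputs.%s.pt" % device)
```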

Example

python -m onnx_diagnostic sbs \
    -i qwen25_vli_visual.inputs.pt \
    --ep test_qwen25_vli_visual.cuda.float16.custom.graph.ep.pt2 \
    -m test_qwen25_vli_visual.cuda.float16.custom.onnx \
    -o results.dynamo.float16.xlsx \
    -v 1 --atol=0.1 --rtol=1 \
    --replay-names conv3d,rsqrt,to_4,mul_48,linear,linear_2,linear_84,linear_89,mul_172,linear_156,linear_159 \
    -2 --reset conv3d

A snippet of the table it produces:

ep_name         onnx_name       ep_target               onnx_op_type            onnx_id_output   ep_shape_type      onnx_shape_type    err_abs
transpose_18    transpose_18    aten.transpose.int      Transpose                           0    GT10s16x1292x80    GT10s16x1292x80    0.0083
unsqueeze_50    unsqueeze_50    aten.unsqueeze.default  Unsqueeze                           0    GT10s1x16x1292x80  GT10s1x16x1292x80  0.0083
eq_20           eq_20           aten.eq.Scalar          Equal                               0    GT9s1292x1292      GT9s1292x1292      0
unsqueeze_56    unsqueeze_56    aten.unsqueeze.default  Unsqueeze                           0    GT9s1x1x1292x1292  GT9s1x1x1292x1292  0
slice_29        slice_29        aten.slice.Tensor       Slice                               0    GT9s1x1x1292x1292  GT9s1x1x1292x1292  0
transpose_19    transpose_19    aten.transpose.int      Transpose                           0    GT10s1x1292x16x80  GT10s1x1292x16x80  0.0071
reshape_20      reshape_20      aten.reshape.default    Reshape                             0    GT10s1292x1280     GT10s1292x1280     0.0071
linear_21       linear_21       aten.linear.default     Gemm                                0    GT10s1292x1280     GT10s1292x1280     0.0015
mul_54          mul_54          aten.mul.Tensor         SkipSimplifiedLayerNormalization    0    GT10s1292x1280     GT10s1292x1280     0.0098
add_32          add_32          aten.add.Tensor         SkipSimplifiedLayerNormalization    3    GT10s1292x1280     GT10s1292x1280     0.0313
linear_22       linear_22       aten.linear.default     Gemm                                0    GT10s1292x3420     GT10s1292x3420     0.0078
silu_4          silu_4          aten.silu.default       QuickGelu                           0    GT10s1292x3420     GT10s1292x3420     0.0059

The available columns are described by RunAlignedRecord. It is possible to dump pieces of the model to study a particular input with ReplayConfiguration.
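As a sketch, the produced Excel file can be inspected with pandas to locate where discrepancies grow. The rows below mimic the snippet above; in practice one would load the -o output with pd.read_excel:

```python
import pandas as pd

# A few rows copied from the table above; replace with
# pd.read_excel("results.dynamo.float16.xlsx") on a real run.
df = pd.DataFrame({
    "ep_name": ["transpose_18", "linear_21", "add_32"],
    "onnx_op_type": ["Transpose", "Gemm", "SkipSimplifiedLayerNormalization"],
    "err_abs": [0.0083, 0.0015, 0.0313],
})

# Sort by absolute error to see which results diverge the most.
worst = df.sort_values("err_abs", ascending=False)
print(worst.iloc[0]["ep_name"])  # the node with the largest error
```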