-m onnx_diagnostic sbs … runs a side-by-side torch/onnx comparison¶
Description¶
It compares the intermediate results of an exported program saved with
torch.export.save() and an exported ONNX model, on inputs saved
with torch.save(). It assumes intermediate results share the same
names.
usage: side-by-side (sbs) [-h] -i INPUTS -e EP -m ONNX -o OUTPUT [--atol ATOL] [--rtol RTOL] [-v VERBOSE] [-r RATIO] [--first | --no-first] [--sbs | --no-sbs] [-2 | --second-run | --no-second-run]
[--reset RESET] [-s REPLAY_THRESHOLD] [-n REPLAY_NAMES] [-t REPLAY_OP_TYPES] [-f REPLAY_FOLDER] [-p | --replay-prefix-model | --no-replay-prefix-model]
Compares the intermediate outputs of the exported program and the exported onnx model. It assumes some names are common. The execution of the exported program and of the onnx model is done in
parallel. The device is the one used to store the model and the inputs. Where do discrepancies start? This function tries to answer that question.
options:
-h, --help show this help message and exit
-i INPUTS, --inputs INPUTS
model inputs saved with torch.save
  -e EP, --ep EP        exported program saved with torch.export.save
-m ONNX, --onnx ONNX exported model in onnx format
  -o OUTPUT, --output OUTPUT
                        output name used to store what the command line produces; it should be an Excel file
--atol ATOL absolute tolerance
--rtol RTOL relative tolerance
-v VERBOSE, --verbose VERBOSE
verbosity
-r RATIO, --ratio RATIO
Saves the result in an excel file every <ratio> nodes, default is 100.
--first, --no-first First runs the whole model (default is False).
--sbs, --no-sbs Runs the side-by-side (default is True).
  -2, --second-run, --no-second-run
                        Tries to run all onnx nodes with the torch results produced by the exported program. It then measures the discrepancies again. It can be used to distinguish kernels that introduce
                        discrepancies from those that merely propagate them.
--reset RESET List of result names separated by a comma. For those results, the side-by-side will take torch results instead of onnx results to compute the rest of the onnx model.
-s REPLAY_THRESHOLD, --replay-threshold REPLAY_THRESHOLD
Triggers the replay if the discrepancies are higher than this value.
-n REPLAY_NAMES, --replay-names REPLAY_NAMES
Triggers the replay if a result name is in this set of values (comma separated)
-t REPLAY_OP_TYPES, --replay-op-types REPLAY_OP_TYPES
Triggers the replay if an onnx type is in this set of values (comma separated)
-f REPLAY_FOLDER, --replay-folder REPLAY_FOLDER
If the replay is triggered, this defines the folder where everything is dumped.
  -p, --replay-prefix-model, --no-replay-prefix-model
                        There are two ways to recompute an intermediate output: the first one is to produce the minimal model between torch and onnx; the second one is to dump onnx models from the
                        inputs to the considered intermediate results. This flag enables the second one.
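The help text does not spell out how --atol and --rtol are combined into a single pass/fail criterion. A common convention (an assumption here, not documented behavior of the tool) is the numpy-style check, sketched below:

```python
def within_tolerance(ref: float, got: float, atol: float, rtol: float) -> bool:
    """Numpy-style tolerance check: |got - ref| <= atol + rtol * |ref|.

    Assumption: the sbs command combines --atol and --rtol this way;
    only the two flags themselves appear in the help text.
    """
    return abs(got - ref) <= atol + rtol * abs(ref)
```

With the values from the example below (--atol=0.1 --rtol=1), an absolute error of 0.05 on a reference value of 0.0 passes, while an error of 2.5 on a reference value of 1.0 does not.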
The command line expects the following files to be saved with the following functions, where inputs is a dictionary of the model inputs:
- torch.export.save(ep: torch.export.ExportedProgram)
- torch.save(inputs, ...)
- onnx.save(...)
The Replay functionality is just a way to investigate a part of a model. It saves the torch and onnx inputs, the torch outputs, and the minimal onnx model which shares
its inputs with the exported program. This is used to investigate the discrepancies between the torch model (through the exported program) and its onnx conversion. This functionality dumps everything it
can to disk so that it can be replayed in a separate process.
CPU, CUDA¶
Inputs are saved with torch.save(). The execution runs on CUDA
if the inputs are stored on a CUDA device; the same goes for CPU.
Example¶
python -m onnx_diagnostic sbs \
-i qwen25_vli_visual.inputs.pt \
--ep test_qwen25_vli_visual.cuda.float16.custom.graph.ep.pt2 \
-m test_qwen25_vli_visual.cuda.float16.custom.onnx \
-o results.dynamo.float16.xlsx \
-v 1 --atol=0.1 --rtol=1 \
--replay-names conv3d,rsqrt,to_4,mul_48,linear,linear_2,linear_84,linear_89,mul_172,linear_156,linear_159 \
-2 --reset conv3d
A snippet of the table it produces:
ep_name onnx_name ep_target onnx_op_type onnx_id_output ep_shape_type onnx_shape_type err_abs
transpose_18 transpose_18 aten.transpose.int Transpose 0 GT10s16x1292x80 GT10s16x1292x80 0.0083
unsqueeze_50 unsqueeze_50 aten.unsqueeze.default Unsqueeze 0 GT10s1x16x1292x80 GT10s1x16x1292x80 0.0083
eq_20 eq_20 aten.eq.Scalar Equal 0 GT9s1292x1292 GT9s1292x1292 0
unsqueeze_56 unsqueeze_56 aten.unsqueeze.default Unsqueeze 0 GT9s1x1x1292x1292 GT9s1x1x1292x1292 0
slice_29 slice_29 aten.slice.Tensor Slice 0 GT9s1x1x1292x1292 GT9s1x1x1292x1292 0
transpose_19 transpose_19 aten.transpose.int Transpose 0 GT10s1x1292x16x80 GT10s1x1292x16x80 0.0071
reshape_20 reshape_20 aten.reshape.default Reshape 0 GT10s1292x1280 GT10s1292x1280 0.0071
linear_21 linear_21 aten.linear.default Gemm 0 GT10s1292x1280 GT10s1292x1280 0.0015
mul_54 mul_54 aten.mul.Tensor SkipSimplifiedLayerNormalization 0 GT10s1292x1280 GT10s1292x1280 0.0098
add_32 add_32 aten.add.Tensor SkipSimplifiedLayerNormalization 3 GT10s1292x1280 GT10s1292x1280 0.0313
linear_22 linear_22 aten.linear.default Gemm 0 GT10s1292x3420 GT10s1292x3420 0.0078
silu_4 silu_4 aten.silu.default QuickGelu 0 GT10s1292x3420 GT10s1292x3420 0.0059
The available columns are described by
RunAlignedRecord.
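The ep_shape_type and onnx_shape_type cells pack a type and a shape into a compact code. Inferring from the table above (an assumption, not documented behavior), "GT&lt;n&gt;" looks like an ONNX element type (10 = FLOAT16, 9 = BOOL) followed by "s&lt;d0&gt;x&lt;d1&gt;…" for the shape. A small decoder sketch under that assumption:

```python
import re

def parse_shape_type(code: str) -> tuple[int, tuple[int, ...]]:
    """Decode a compact cell such as 'GT10s1292x1280'.

    Assumption inferred from the table: 'GT<n>' is the ONNX element
    type (10 = FLOAT16, 9 = BOOL, ...) and 's<d0>x<d1>...' the shape.
    """
    m = re.fullmatch(r"GT(\d+)s(.+)", code)
    if m is None:
        raise ValueError(f"unexpected shape/type code {code!r}")
    elem_type = int(m.group(1))
    shape = tuple(int(d) for d in m.group(2).split("x"))
    return elem_type, shape
```

For instance, the eq_20 row above decodes to element type 9 (BOOL) with shape (1292, 1292), consistent with the output of an Equal node.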
It is possible to dump pieces of the model to study a particular input
with ReplayConfiguration.