onnx_diagnostic.reference¶
ExtendedReferenceEvaluator¶
- class onnx_diagnostic.reference.ExtendedReferenceEvaluator(proto: Any, opsets: Dict[str, int] | None = None, functions: List[ReferenceEvaluator | FunctionProto] | None = None, verbose: int = 0, new_ops: List[type[OpRun]] | None = None, **kwargs)[source]¶
- This class replaces some of the Python reference implementations with custom ones. The evaluator makes it possible to test scenarios outside what an ONNX backend bound to the official ONNX operator definitions can do, such as optimization patterns involving onnxruntime contrib operators.

  from onnx_diagnostic.reference import ExtendedReferenceEvaluator

  ref = ExtendedReferenceEvaluator(...)

  The class overloads or adds the following operators by default:

  <<<

  import pprint
  from onnx_diagnostic.reference import ExtendedReferenceEvaluator

  pprint.pprint(ExtendedReferenceEvaluator.default_ops)

  >>>

  [<class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.AddAdd'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.AddMul'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.AddSharedInput'>,
   <class 'onnx_diagnostic.reference.ops.op_attention.Attention'>,
   <class 'onnx_diagnostic.reference.ops.op_average_pool_grad.AveragePoolGrad'>,
   <class 'onnx_diagnostic.reference.ops.op_bias_softmax.BiasSoftmax'>,
   <class 'onnx_diagnostic.reference.ops.op_concat.Concat'>,
   <class 'onnx_diagnostic.reference.ops.op_cast_like.CastLike_15'>,
   <class 'onnx_diagnostic.reference.ops.op_cast_like.CastLike_19'>,
   <class 'onnx_diagnostic.reference.ops.op_complex.ComplexModule'>,
   <class 'onnx_diagnostic.reference.ops.op_constant_of_shape.ConstantOfShape'>,
   <class 'onnx_diagnostic.reference.ops.op_fused_matmul.FusedMatMul'>,
   <class 'onnx_diagnostic.reference.ops.op_gather.Gather'>,
   <class 'onnx_diagnostic.reference.ops.op_gather_elements.GatherElements'>,
   <class 'onnx_diagnostic.reference.ops.op_gather_grad.GatherGrad'>,
   <class 'onnx_diagnostic.reference.ops.op_scatternd_of_shape.MaskedScatterNDOfShape'>,
   <class 'onnx_diagnostic.reference.ops.op_memcpy_host.MemcpyFromHost'>,
   <class 'onnx_diagnostic.reference.ops.op_memcpy_host.MemcpyToHost'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.MulAdd'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.MulMul'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.MulSharedInput'>,
   <class 'onnx_diagnostic.reference.ops.op_mul_sigmoid.MulSigmoid'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.MulSub'>,
   <class 'onnx_diagnostic.reference.ops.op_negxplus1.NegXplus1'>,
   <class 'onnx_diagnostic.reference.ops.op_qlinear_conv.QLinearConv'>,
   <class 'onnx_diagnostic.reference.ops.op_qlinear_average_pool.QLinearAveragePool'>,
   <class 'onnx_diagnostic.reference.ops.op_quick_gelu.QuickGelu'>,
   <class 'onnx_diagnostic.reference.ops.op_replace_zero.ReplaceZero'>,
   <class 'onnx_diagnostic.reference.ops.op_rotary.Rotary'>,
   <class 'onnx_diagnostic.reference.ops.op_scan.Scan'>,
   <class 'onnx_diagnostic.reference.ops.op_scatter_elements.ScatterElements'>,
   <class 'onnx_diagnostic.reference.ops.op_scatternd_of_shape.ScatterNDOfShape'>,
   <class 'onnx_diagnostic.reference.ops.op_simplified_layer_normalization.SimplifiedLayerNormalization'>,
   <class 'onnx_diagnostic.reference.ops.op_skip_layer_normalization.SkipLayerNormalization'>,
   <class 'onnx_diagnostic.reference.ops.op_slice.Slice_1'>,
   <class 'onnx_diagnostic.reference.ops.op_slice.Slice_10'>,
   <class 'onnx_diagnostic.reference.ops.op_add_add_mul_mul.SubMul'>,
   <class 'onnx_diagnostic.reference.ops.op_complex.ToComplex'>,
   <class 'onnx_diagnostic.reference.ops.op_transpose_cast.Transpose2DCastFP16'>,
   <class 'onnx_diagnostic.reference.ops.op_transpose_cast.Transpose2DCastFP32'>,
   <class 'onnx_diagnostic.reference.ops.op_tri_matrix.TriMatrix'>]
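  The following sketch shows one way to run a small model through the evaluator. The model and feeds are illustrative, and the sketch assumes the class follows the onnx.reference.ReferenceEvaluator run API:

  import numpy as np
  import onnx
  import onnx.helper as oh
  from onnx_diagnostic.reference import ExtendedReferenceEvaluator

  TFLOAT = onnx.TensorProto.FLOAT
  # A tiny illustrative model: Z = X + Y.
  proto = oh.make_model(
      oh.make_graph(
          [oh.make_node("Add", ["X", "Y"], ["Z"])],
          "demo",
          [
              oh.make_tensor_value_info("X", TFLOAT, ["a"]),
              oh.make_tensor_value_info("Y", TFLOAT, ["a"]),
          ],
          [oh.make_tensor_value_info("Z", TFLOAT, ["a"])],
      ),
      opset_imports=[oh.make_opsetid("", 18)],
  )
  ref = ExtendedReferenceEvaluator(proto)
  # None requests all outputs; feeds maps input names to numpy arrays.
  feeds = {"X": np.ones(4, dtype=np.float32), "Y": np.ones(4, dtype=np.float32)}
  print(ref.run(None, feeds))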
OnnxruntimeEvaluator¶
- class onnx_diagnostic.reference.OnnxruntimeEvaluator(proto: str | FunctionProto | ModelProto | GraphProto | NodeProto | OnnxruntimeEvaluator, session_options: SessionOptions | None = None, providers: str | List[str] | None = None, nvtx: bool = False, enable_profiling: bool = False, graph_optimization_level: GraphOptimizationLevel | bool = None, log_severity_level: int | None = None, log_verbosity_level: int | None = None, optimized_model_filepath: str | None = None, disable_aot_function_inlining: bool | None = None, use_training_api: bool = False, verbose: int = 0, local_functions: Dict[Tuple[str, str], FunctionProto | ModelProto | GraphProto | NodeProto | OnnxruntimeEvaluator] | None = None, ir_version: int = 10, opsets: int | Dict[str, int] | None = None, whole: bool = False, torch_or_numpy: bool | None = None)[source]¶
- This class loads an ONNX model and then executes its nodes one by one with onnxruntime. It is mostly meant for debugging. - Parameters:
- proto – proto or filename 
- session_options – options 
- providers – None, “CPU”, “CUDA”, or a list of execution providers 
- nvtx – enable nvidia NVTX events 
- graph_optimization_level – see - onnxruntime.SessionOptions
- log_severity_level – see - onnxruntime.SessionOptions
- log_verbosity_level – see - onnxruntime.SessionOptions
- optimized_model_filepath – see - onnxruntime.SessionOptions
- disable_aot_function_inlining – see - onnxruntime.SessionOptions
- use_training_api – use the onnxruntime-training API 
- verbose – verbosity 
- local_functions – additional local functions 
- ir_version – ir version to use when unknown 
- opsets – opsets to use when unknown 
- whole – if True, do not split node by node 
- torch_or_numpy – force the use of one of them, True for torch, False for numpy, None to let the class choose 
 
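A minimal sketch of the typical debugging workflow, under the assumption that a small hand-built model stands in for a real one (run() is documented below):

  import numpy as np
  import onnx
  import onnx.helper as oh
  from onnx_diagnostic.reference import OnnxruntimeEvaluator

  TFLOAT = onnx.TensorProto.FLOAT
  # Illustrative model: two chained nodes, executed one by one.
  proto = oh.make_model(
      oh.make_graph(
          [
              oh.make_node("Neg", ["X"], ["nx"]),
              oh.make_node("Exp", ["nx"], ["Y"]),
          ],
          "demo",
          [oh.make_tensor_value_info("X", TFLOAT, ["a"])],
          [oh.make_tensor_value_info("Y", TFLOAT, ["a"])],
      ),
      opset_imports=[oh.make_opsetid("", 18)],
  )
  sess = OnnxruntimeEvaluator(proto, providers="CPU", verbose=1)
  feeds = {"X": np.arange(4, dtype=np.float32)}
  # None requests the model outputs; intermediate=True would return
  # every intermediate result instead of only the final ones.
  print(sess.run(None, feeds))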
 - run(outputs: List[str] | None, feed_inputs: Dict[str, Any], intermediate: bool = False, report_cmp: ReportResultComparison | None = None) Dict[str, Any] | List[Any][source]¶
- Runs the model. It only works with numpy arrays. - Parameters:
- outputs – required outputs or None for all 
- feed_inputs – inputs 
- intermediate – returns all outputs instead of only the final ones 
- report_cmp – used as a reference: every intermediate result is compared to every tensor it holds; if given, it must be an instance of - onnx_diagnostic.reference.ReportResultComparison
 
- Returns:
- outputs, as a list if intermediate is False, as a dictionary if intermediate is True 
 
 
ReportResultComparison¶
- class onnx_diagnostic.reference.ReportResultComparison(tensors: Dict[str | Tuple[str, int, str], torch.Tensor])[source]¶
- Holds tensors a runtime can use as a reference to compare intermediate results. See - onnx_diagnostic.reference.TorchOnnxEvaluator.run().- Parameters:
- tensors – reference tensors to compare against 
 - key(tensor: torch.Tensor) Tuple[int, Tuple[int, ...]][source]¶
- Returns a key for a tensor, (onnx dtype, shape). 
 - report(outputs: Dict[str, torch.Tensor]) List[Tuple[Tuple[int, str], str | Tuple[str, int, str], Dict[str, float | str]]][source]¶
- For every tensor in outputs, compares it to every tensor held by this class that shares the same type and shape. The function returns the results of the comparison and also collects them into a dictionary the user can retrieve later. 
 
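A hedged sketch of the intended workflow; the tensor names and values below are illustrative:

  import torch
  from onnx_diagnostic.reference import ReportResultComparison

  # Reference tensors, e.g. intermediate results captured from a baseline runtime.
  cmp = ReportResultComparison({"ysy": torch.rand(4, 5), "final": torch.rand(4, 5)})

  # key() maps a tensor to (onnx dtype, shape); only tensors sharing
  # the same key are compared by report().
  print(cmp.key(torch.rand(4, 5)))

  # Compare new outputs against every held tensor with the same key.
  rows = cmp.report({"candidate": torch.rand(4, 5)})
  for row in rows:
      print(row)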
TorchOnnxEvaluator¶
- class onnx_diagnostic.reference.TorchOnnxEvaluator(proto: FunctionProto | GraphProto | ModelProto, providers: Tuple[str, ...] = ('CPUExecutionProvider',), opsets: Dict[str, int] | None = None, local_functions: Dict[Tuple[str, str], TorchOnnxEvaluator] | None = None, verbose: int = 0, custom_kernels: Dict[Tuple[str, str], type[OpRunKernel]] | None = None)[source]¶
- Torch evaluator for onnx models. The evaluator does not store the original proto it evaluates, to avoid unnecessary memory usage. - Parameters:
- proto – a proto 
- providers – where to run the model 
- opsets – needed if proto is a graph 
- local_functions – known local functions 
- verbose – verbosity level 
- custom_kernels – dictionary of kernels the user can define to override a specific implementation, e.g. - ("", "LayerNormalization"): CustomKernel
 
 - The class holds the following attributes: - providers: providers 
- default_device: default torch device 
- constants: all initializers or constants 
- kernels: kernels 
- runtime_info: produced by - first_used_last_used
- last_used: the list of intermediate results to remove after each node execution; this avoids letting memory grow too much 
 
- functions: local functions 
 - The class is not multithreaded. runtime_info gets updated by the class. The list of available kernels is returned by function - onnx_diagnostic.reference.torch_evaluator.get_kernels(). Example:

<<<

import onnx
import onnx.helper as oh
import torch
from onnx_diagnostic.helpers import string_type
from onnx_diagnostic.reference import TorchOnnxEvaluator

TFLOAT = onnx.TensorProto.FLOAT

proto = oh.make_model(
    oh.make_graph(
        [
            oh.make_node("Sigmoid", ["Y"], ["sy"]),
            oh.make_node("Mul", ["Y", "sy"], ["ysy"]),
            oh.make_node("Mul", ["X", "ysy"], ["final"]),
        ],
        "-nd-",
        [
            oh.make_tensor_value_info("X", TFLOAT, [1, "b", "c"]),
            oh.make_tensor_value_info("Y", TFLOAT, ["a", "b", "c"]),
        ],
        [oh.make_tensor_value_info("final", TFLOAT, ["a", "b", "c"])],
    ),
    opset_imports=[oh.make_opsetid("", 18)],
    ir_version=9,
)
sess = TorchOnnxEvaluator(proto)
feeds = dict(X=torch.rand((4, 5)), Y=torch.rand((4, 5)))
result = sess.run(None, feeds)
print(string_type(result, with_shape=True, with_min_max=True))

>>>

#1[T1s4x5[9.667049744166434e-05,0.6623387336730957:A0.19934171827771935]]

- With verbose=1, the class prints out every kernel run and every result deleted along the run. It shows when a result is not needed anymore; in that case, it is deleted to free the memory it takes.

<<<

import onnx
import onnx.helper as oh
import torch
from onnx_diagnostic.helpers import string_type
from onnx_diagnostic.reference import TorchOnnxEvaluator

TFLOAT = onnx.TensorProto.FLOAT

proto = oh.make_model(
    oh.make_graph(
        [
            oh.make_node("Sigmoid", ["Y"], ["sy"]),
            oh.make_node("Mul", ["Y", "sy"], ["ysy"]),
            oh.make_node("Mul", ["X", "ysy"], ["final"]),
        ],
        "-nd-",
        [
            oh.make_tensor_value_info("X", TFLOAT, [1, "b", "c"]),
            oh.make_tensor_value_info("Y", TFLOAT, ["a", "b", "c"]),
        ],
        [oh.make_tensor_value_info("final", TFLOAT, ["a", "b", "c"])],
    ),
    opset_imports=[oh.make_opsetid("", 18)],
    ir_version=9,
)
sess = TorchOnnxEvaluator(proto, verbose=1)
feeds = dict(X=torch.rand((4, 5)), Y=torch.rand((4, 5)))
result = sess.run(None, feeds)
print(string_type(result, with_shape=True, with_min_max=True))

>>>

+I X: RuntimeValue(name='X', kind=5, shape=(4, 5), value=CT1s4x5[0.060998737812042236,0.9662450551986694:A0.5427517205476761])
+I Y: RuntimeValue(name='Y', kind=5, shape=(4, 5), value=CT1s4x5[0.00012814998626708984,0.9490360617637634:A0.4726764440536499])
Sigmoid_6(Y) -> sy
+R sy: RuntimeValue(name='sy', kind=1, shape=(4, 5), is_shape=False, value=CT1s4x5[0.5000320672988892,0.7209212779998779:A0.6135843843221664])
Mul_1(Y, sy) -> ysy
+R ysy: RuntimeValue(name='ysy', kind=1, shape=(4, 5), is_shape=False, value=CT1s4x5[6.407910404959694e-05,0.6841803193092346:A0.31057898236504117])
 - clean Y
 - clean sy
Mul_1(X, ysy) -> final
+R final: RuntimeValue(name='final', kind=9, shape=(4, 5), is_shape=False, value=CT1s4x5[4.8135196266230196e-05,0.47582873702049255:A0.16478462278719236])
 - clean X
 - clean ysy
++ outputs final
 - clean X
 - clean Y
 - clean final
#1[T1s4x5[4.8135196266230196e-05,0.47582873702049255:A0.16478462278719236]]

- The runtime can also execute the ONNX model on CUDA. It follows the same logic as onnxruntime.InferenceSession: providers=["CUDAExecutionProvider"]. It is better in that case to move the inputs to CUDA as well. The class tries to move every weight to CUDA but keeps any tensor identified as a shape on CPU. Some bugs may remain since torch raises an exception when devices are expected to be the same. The runtime was validated with model arnir0/Tiny-LLM. 
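A minimal sketch of CUDA execution, assuming a CUDA-enabled torch build and the providers tuple shown in the constructor signature; the one-node model is illustrative:

  import onnx
  import onnx.helper as oh
  import torch
  from onnx_diagnostic.reference import TorchOnnxEvaluator

  TFLOAT = onnx.TensorProto.FLOAT
  # One-node illustrative model: Y = Sigmoid(X).
  proto = oh.make_model(
      oh.make_graph(
          [oh.make_node("Sigmoid", ["X"], ["Y"])],
          "demo",
          [oh.make_tensor_value_info("X", TFLOAT, ["a", "b"])],
          [oh.make_tensor_value_info("Y", TFLOAT, ["a", "b"])],
      ),
      opset_imports=[oh.make_opsetid("", 18)],
      ir_version=9,
  )
  sess = TorchOnnxEvaluator(proto, providers=("CUDAExecutionProvider",))
  # Inputs are best moved to CUDA too; weights follow the provider while
  # shape-like tensors stay on CPU.
  feeds = dict(X=torch.rand((4, 5), device="cuda"))
  result = sess.run(None, feeds)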
Next example shows how to replace a kernel with a different one based on onnxruntime.

<<<

import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime
import torch
from onnx_diagnostic.helpers import string_type
from onnx_diagnostic.helpers.torch_helper import onnx_dtype_to_torch_dtype
from onnx_diagnostic.reference import TorchOnnxEvaluator
from onnx_diagnostic.reference.torch_ops import OpRunKernel, OpRunTensor

TFLOAT16 = onnx.TensorProto.FLOAT16


class LayerNormalizationOrt(OpRunKernel):
    "LayerNormalization based on onnxruntime"

    def __init__(self, node: onnx.NodeProto, version=None, verbose=0):
        super().__init__(node, version, verbose=verbose)
        self.axis = self.get_attribute_int(node, "axis", -1)
        self.epsilon = self.get_attribute_float(node, "epsilon", 1e-5)
        self.stash_type = onnx_dtype_to_torch_dtype(
            self.get_attribute_int(node, "stash_type", onnx.TensorProto.FLOAT)
        )
        self.compute_std = len(node.output) > 1
        assert not self.compute_std, "The kernel only computes the first output."
        layer_model = oh.make_model(
            oh.make_graph(
                [
                    oh.make_node(
                        "LayerNormalization",
                        ["X", "W", "B"],
                        ["Z"],
                        axis=-1,
                        epsilon=9.999999974752427e-7,
                    )
                ],
                "dummy",
                [
                    oh.make_tensor_value_info("X", TFLOAT16, ["b", "c", "d"]),
                    oh.make_tensor_value_info("W", TFLOAT16, ["d"]),
                    oh.make_tensor_value_info("B", TFLOAT16, ["d"]),
                ],
                [oh.make_tensor_value_info("Z", TFLOAT16, ["b", "c", "d"])],
            ),
            ir_version=9,
            opset_imports=[oh.make_opsetid("", 17)],
        )
        self.ort_sess = onnxruntime.InferenceSession(
            layer_model.SerializeToString(), providers=["CUDAExecutionProvider"]
        )

    def run(self, x, scale, bias=None):
        print(f"-- running {self.__class__.__name__}")
        feeds = dict(X=x, W=scale)
        if bias is not None:
            feeds["B"] = bias
        feeds = {k: v.tensor.detach().cpu().numpy() for k, v in feeds.items()}
        got = self.ort_sess.run(None, feeds)[0]
        return OpRunTensor(torch.from_numpy(got).to(x.dtype).to(x.device))


# This kernel is tested on this model.
model = oh.make_model(
    oh.make_graph(
        [
            oh.make_node(
                "LayerNormalization",
                ["X", "W", "B"],
                ["ln"],
                axis=-1,
                epsilon=9.999999974752427e-7,
            ),
            oh.make_node("Add", ["ln", "W"], ["Z"]),
        ],
        "dummy",
        [
            oh.make_tensor_value_info("X", TFLOAT16, ["b", "c", "d"]),
            oh.make_tensor_value_info("W", TFLOAT16, ["d"]),
            oh.make_tensor_value_info("B", TFLOAT16, ["d"]),
        ],
        [oh.make_tensor_value_info("Z", TFLOAT16, ["b", "c", "d"])],
    ),
    ir_version=9,
    opset_imports=[oh.make_opsetid("", 17)],
)
torch_sess = TorchOnnxEvaluator(
    model,
    custom_kernels={("", "LayerNormalization"): LayerNormalizationOrt},
    verbose=1,
)
feeds = dict(
    zip(
        torch_sess.input_names,
        [
            torch.rand(3, 4, 5, dtype=torch.float16),
            torch.abs(torch.rand(5, dtype=torch.float16)),
            torch.rand(5, dtype=torch.float16),
        ],
    )
)
res = torch_sess.run(None, feeds)
print(string_type(res, with_shape=True, with_min_max=True))

>>>

+I X: RuntimeValue(name='X', kind=5, shape=(3, 4, 5), value=CT10s3x4x5[0.017578125,0.97705078125:A0.5531575520833333])
+I W: RuntimeValue(name='W', kind=5, shape=(5,), value=CT10s5[0.00341796875,0.79638671875:A0.40244140625])
+I B: RuntimeValue(name='B', kind=5, shape=(5,), value=CT10s5[0.02734375,0.833984375:A0.4279296875])
LayerNormalizationOrt(X, W, B) -> ln
-- running LayerNormalizationOrt
+R ln: RuntimeValue(name='ln', kind=1, shape=(3, 4, 5), is_shape=False, value=CT10s3x4x5[-0.78955078125,1.8408203125:A0.4383379618326823])
 - clean X
 - clean B
Add_1(ln, W) -> Z
+R Z: RuntimeValue(name='Z', kind=9, shape=(3, 4, 5), is_shape=False, value=CT10s3x4x5[-0.1162109375,2.63671875:A0.8408365885416667])
 - clean W
 - clean ln
++ outputs Z
 - clean X
 - clean W
 - clean B
 - clean Z
#1[T10s3x4x5[-0.1162109375,2.63671875:A0.8408365885416667]]

 - run(outputs: List[str] | None, feeds: Dict[str, Tensor] | Dict[str, ndarray], report_cmp: ReportResultComparison | None = None) List[Tensor | None] | List[ndarray | None][source]¶
- Runs the ONNX model. - Parameters:
- outputs – required outputs, or None for all 
- feeds – inputs 
- report_cmp – used as a reference: every intermediate result is compared to every tensor it holds; if given, it must be an instance of - onnx_diagnostic.reference.ReportResultComparison
 
- Returns:
- output tensors. 
 
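A hedged sketch of using report_cmp; the model and reference tensors below are illustrative:

  import onnx
  import onnx.helper as oh
  import torch
  from onnx_diagnostic.reference import ReportResultComparison, TorchOnnxEvaluator

  TFLOAT = onnx.TensorProto.FLOAT
  # Illustrative model producing an intermediate result "ysy".
  proto = oh.make_model(
      oh.make_graph(
          [
              oh.make_node("Sigmoid", ["Y"], ["sy"]),
              oh.make_node("Mul", ["Y", "sy"], ["ysy"]),
              oh.make_node("Mul", ["X", "ysy"], ["final"]),
          ],
          "-nd-",
          [
              oh.make_tensor_value_info("X", TFLOAT, ["a", "b"]),
              oh.make_tensor_value_info("Y", TFLOAT, ["a", "b"]),
          ],
          [oh.make_tensor_value_info("final", TFLOAT, ["a", "b"])],
      ),
      opset_imports=[oh.make_opsetid("", 18)],
      ir_version=9,
  )
  feeds = dict(X=torch.rand((4, 5)), Y=torch.rand((4, 5)))
  # Reference tensors, e.g. captured from another runtime.
  reference = ReportResultComparison({"ysy": feeds["Y"] * torch.sigmoid(feeds["Y"])})

  sess = TorchOnnxEvaluator(proto)
  # Every intermediate result is compared to every reference tensor sharing
  # the same dtype and shape; the comparisons accumulate in `reference`.
  result = sess.run(None, feeds, report_cmp=reference)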
 - run_with_values(*args: OpRunTensor | None, context: Dict[str, RuntimeValue] | None = None) OpRunValue | Tuple[OpRunValue, ...][source]¶
- Runs the ONNX model with a different signature than run(). This method is called by every kernel holding a subgraph. The local variables are stored in context. - Parameters:
- args – inputs 
- context – local context for the execution of subgraphs 
 
- Returns:
- output OpRunTensor