ONNX Benchmarks

Lists the benchmarks implemented in the Examples Gallery.

CPU

plot_bench_tfidf

See Measuring performance of TfIdfVectorizer.

This benchmark measures the computation time when the kernel outputs sparse tensors.
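Why sparse outputs matter can be sketched in a few lines of numpy (an illustration of the idea, not the benchmark's actual code): a TfIdfVectorizer output row is mostly zeros, so storing only the non-zero entries is much cheaper.

```python
import numpy as np

# A TfIdfVectorizer output row is mostly zeros: a vocabulary of 10,000
# terms but only a handful of them present in one document.
vocab_size = 10_000
indices = np.array([12, 873, 4051])        # positions of non-zero terms
values = np.array([0.4, 0.7, 0.59])        # their tf-idf weights

# dense output: one float per vocabulary entry
dense = np.zeros(vocab_size)
dense[indices] = values

# sparse output: only the non-zero entries are stored
sparse_bytes = indices.nbytes + values.nbytes
assert sparse_bytes < dense.nbytes
assert np.count_nonzero(dense) == len(indices)
```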

plot_op_tree_ensemble_optim

See TreeEnsemble optimization.

This package implements a custom kernel for TreeEnsembleRegressor and TreeEnsembleClassifier and lets the user choose the parallelization parameters. The script tries many values to select the best ones for trees trained with scikit-learn and a sklearn.ensemble.RandomForestRegressor.
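The tuning loop amounts to a grid search over parallelization parameters, keeping the fastest setting. A minimal sketch (the kernel callable and the parameter names below are hypothetical stand-ins for what the package exposes):

```python
import time
from itertools import product

def tune_parallelization(run_kernel, param_grid, repeat=5):
    """Grid-search parallelization parameters, keeping the fastest setting.

    run_kernel: callable taking keyword parameters and running the tree
    ensemble once (hypothetical stand-in for the custom kernel).
    """
    best = None
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid, combo))
        start = time.perf_counter()
        for _ in range(repeat):
            run_kernel(**params)
        elapsed = (time.perf_counter() - start) / repeat
        if best is None or elapsed < best[1]:
            best = (params, elapsed)
    return best

# Dummy kernel standing in for the TreeEnsembleRegressor custom kernel;
# the parameter names are illustrative only.
grid = {"parallel_tree": [40, 80, 160], "parallel_N": [50, 100]}
best_params, best_time = tune_parallelization(lambda **kw: sum(range(1000)), grid)
```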

plot_op_tree_ensemble_sparse

See TreeEnsemble, dense, and sparse.

This package implements a custom kernel for TreeEnsembleRegressor and TreeEnsembleClassifier and lets the user choose the parallelization parameters. The script tries many values to select the best ones for trees trained with scikit-learn and a sklearn.ensemble.RandomForestRegressor.

plot_op_tree_ensemble_implementations

Tests several implementations of TreeEnsemble in a simpler way, see Evaluate different implementation of TreeEnsemble.

plot_op_einsum

See Compares implementations of Einsum.

Function einsum can be decomposed into a matrix multiplication and transpose operators. Which decomposition is the best one?
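A small numpy check of such a decomposition (an illustration, not the benchmark's code): the equation `ij,kj->ik` is equivalent to a transpose followed by a matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((5, 4))

# einsum form: out[i, k] = sum_j a[i, j] * b[k, j]
out_einsum = np.einsum("ij,kj->ik", a, b)

# decomposed form: transpose of b, then a matrix multiplication
out_decomposed = a @ b.T

assert np.allclose(out_einsum, out_decomposed)
```

The benchmark compares the cost of the single einsum call against such equivalent sequences of simpler operators.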

CUDA

These tests only work if they are run on a computer with CUDA enabled.

plot_bench_gemm_f8

See Measuring Gemm performance with different input and output types.

The script checks the speed of cublasLtMatmul for various types and dimensions of square matrices. The code is implemented in C++ and does not involve onnxruntime. It checks the configurations implemented in cuda_gemm.cu. See function gemm_benchmark_test in onnx_extended.validation.cuda.cuda_example_py.
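The shape of such a measurement can be sketched in Python with numpy (the actual benchmark calls cublasLtMatmul from C++; this is only an illustrative CPU analogue over the same kind of grid of types and sizes):

```python
import time
import numpy as np

def bench_gemm(dtype, n=256, repeat=10):
    """Average time of one square matrix multiplication for a given dtype."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)).astype(dtype)
    b = rng.standard_normal((n, n)).astype(dtype)
    a @ b  # warm-up
    start = time.perf_counter()
    for _ in range(repeat):
        a @ b
    return (time.perf_counter() - start) / repeat

# the C++ benchmark also covers float8 types, which numpy does not expose
timings = {np.dtype(dt).name: bench_gemm(dt) for dt in (np.float32, np.float64)}
```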

plot_bench_gemm_ort

See Measuring performance about Gemm with onnxruntime.

The script checks the speed of cublasLtMatmul through a custom operator for onnxruntime implemented in custom_gemm.cu.

plot_profile_gemm_ort

See Profiles a simple onnx graph including a single Gemm.

The benchmark profiles the execution of Gemm for different types and configurations. That includes a custom operator, only available on CUDA, calling function cublasLtMatmul.

plot_op_gemm2_cuda

See Gemm Exploration with CUDA.

Is one big Gemm faster than two smaller Gemms?
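The two forms are mathematically equivalent, which is what makes the comparison meaningful; a numpy check (an illustration, not the benchmark's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w1 = rng.standard_normal((16, 32))
w2 = rng.standard_normal((16, 32))

# two smaller Gemms
y1, y2 = x @ w1, x @ w2

# one big Gemm over the concatenated weights
y_big = x @ np.concatenate([w1, w2], axis=1)

assert np.allclose(np.concatenate([y1, y2], axis=1), y_big)
```

The benchmark measures which form is faster on CUDA for various shapes.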

plot_op_mul_cuda

See Fusing multiplication operators on CUDA.

The benchmark compares the profiles of two consecutive Mul operators with their fusion into a single operator.
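Fusing the two Muls saves the materialization of an intermediate tensor. A numpy check of the equivalence the fusion relies on (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = (rng.standard_normal((4, 4)) for _ in range(3))

# two Mul nodes: an intermediate tensor t is written to memory
t = x * y
out_two_nodes = t * z

# fused: a single kernel computing x * y * z in one pass
out_fused = x * y * z

assert np.allclose(out_two_nodes, out_fused)
```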

plot_op_scatternd_cuda

See Optimizing ScatterND operator on CUDA.

The benchmark compares two implementations of ScatterND, with and without atomic operations.
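Atomics matter when indices repeat and updates must be accumulated. A numpy sketch of the difference (an illustration, not the CUDA kernels themselves):

```python
import numpy as np

data = np.zeros(5)
indices = np.array([1, 3, 1])              # index 1 appears twice
updates = np.array([10.0, 20.0, 30.0])

# naive scatter: the second write to index 1 overwrites the first
naive = data.copy()
naive[indices] = updates                   # index 1 ends up with 30, not 40

# scatter with addition: duplicate indices must be accumulated, which is
# why a parallel CUDA kernel needs atomicAdd (or another conflict-free
# scheme) to avoid losing updates
accumulated = data.copy()
np.add.at(accumulated, indices, updates)   # index 1 ends up with 10 + 30

assert accumulated[1] == 40.0
```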

plot_op_scatternd_mask_cuda

See Optimizing Masked ScatterND operator on CUDA.

The benchmark compares three implementations of ScatterND used to update a matrix.

plot_op_transpose2dcast_cuda

See Fuse Transpose and Cast on CUDA.

The benchmark looks into the fusion of Transpose and Cast.
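The fused kernel must produce the same tensor as the two separate nodes while reading and writing the data only once. A numpy check of that equivalence (illustrative only, not the CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64)).astype(np.float32)

# two nodes: Transpose, then Cast to float16 (an intermediate
# float32 transposed tensor is materialized in between)
out_two_nodes = np.ascontiguousarray(x.T).astype(np.float16)

# a fused Transpose+Cast kernel would emit this tensor directly,
# casting each element as it is moved
out_fused = x.T.astype(np.float16)

assert out_fused.shape == (64, 128)
assert np.array_equal(out_two_nodes, out_fused)
```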

No specific provider

plot_bench_cypy_ort

See Measuring onnxruntime performance against a cython binding.

The python package for onnxruntime is implemented with pybind11. It is less efficient than cython, which makes direct calls to the Python C API. The benchmark evaluates that cost.
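Per-call binding overhead can be measured with timeit; a minimal sketch (the session call in the comment uses hypothetical names, and a no-op stands in so the snippet is self-contained):

```python
import timeit

def call_overhead(fn, number=100_000):
    """Average per-call latency of fn in microseconds."""
    total = timeit.timeit(fn, number=number)
    return total / number * 1e6

# With a real InferenceSession, fn would be something like
# lambda: sess.run(None, feeds) (hypothetical names); comparing the
# pybind11 and cython bindings on the same tiny model isolates the
# cost of crossing the Python/C++ boundary.
overhead_us = call_overhead(lambda: None)
```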