ONNX Benchmarks#

This page lists the benchmarks implemented in the Examples Gallery.

CPU#

plot_bench_tfidf#

See Measuring performance of TfIdfVectorizer.

This benchmark measures the computation time when the kernel outputs sparse tensors.

plot_op_tree_ensemble_optim#

This package implements a custom kernel for TreeEnsembleRegressor and TreeEnsembleClassifier and lets users choose the parallelization parameters. This script tries many values to select the best ones for trees trained with scikit-learn, here a sklearn.ensemble.RandomForestRegressor.
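The search performed by this script can be pictured as a small grid search over timing measurements. The sketch below is only an illustration built on the standard library: `run_with_setting` is a hypothetical stand-in for a call to the kernel with a given parallelization parameter, and `select_best` and the candidate values are likewise assumptions, none of them taken from the package.

```python
import timeit


def run_with_setting(chunk_size, n_items=20000):
    # Hypothetical workload standing in for a kernel call:
    # it processes n_items elements in chunks of chunk_size.
    total = 0
    for start in range(0, n_items, chunk_size):
        total += sum(range(start, min(start + chunk_size, n_items)))
    return total


def select_best(candidates, repeat=3, number=5):
    # Time every candidate and keep its fastest measurement,
    # the minimum being less sensitive to noise than the mean.
    timings = {}
    for value in candidates:
        runs = timeit.repeat(
            lambda: run_with_setting(value), repeat=repeat, number=number
        )
        timings[value] = min(runs) / number
    # The best setting is the one with the lowest time per call.
    best = min(timings, key=timings.get)
    return best, timings


best, timings = select_best([1, 10, 100, 1000])
print(best, timings)
```

The measured times depend on the machine, so the selected value may differ from one run to another. The actual benchmark explores the parameters exposed by the custom kernel rather than a single chunk size.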

plot_op_tree_ensemble_sparse#

See TreeEnsemble, dense, and sparse.

plot_op_tree_ensemble_implementations#

Tests several implementations of TreeEnsemble in a simpler way, see Evaluate different implementation of TreeEnsemble.

plot_op_einsum#

See Compares implementations of Einsum.

Function einsum can be decomposed into matrix multiplications and transpose operators. What is the best decomposition?
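To make the idea of a decomposition concrete, here is a minimal sketch with numpy. The equation `bij,bkj->bik` and the shapes are arbitrary choices for illustration and are not taken from the benchmark, which may compare other equations and decompositions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3, 4))  # indices b, i, j
b = rng.standard_normal((2, 5, 4))  # indices b, k, j

# Reference result computed directly by einsum.
expected = np.einsum("bij,bkj->bik", a, b)

# Decomposition: transpose b so that the contracted axis j becomes
# the second-to-last axis, then apply a batched matrix multiplication.
b_t = np.transpose(b, (0, 2, 1))  # shape (2, 4, 5), indices b, j, k
decomposed = np.matmul(a, b_t)    # shape (2, 3, 5), indices b, i, k

print(np.allclose(expected, decomposed))
```

Several decompositions usually produce the same result, for example transposing the other operand or reshaping before the multiplication, and they do not necessarily run at the same speed.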

CUDA#

These tests only work if they are run on a computer with CUDA enabled.

plot_bench_gemm_f8#

See Measuring Gemm performance with different input and output tests.

The script checks the speed of cublasLtMatmul for various types and dimensions on square matrices. The code is implemented in C++ and does not involve onnxruntime. It checks the configurations implemented in cuda_gemm.cu. See function gemm_benchmark_test in onnx_extended.validation.cuda.cuda_example_py.

plot_bench_gemm_ort#

See Measuring performance about Gemm with onnxruntime.

The script checks the speed of cublasLtMatmul called through a custom operator for onnxruntime, implemented in custom_gemm.cu.

plot_profile_gemm_ort#

See Profiles a simple onnx graph including a single Gemm.

The benchmark profiles the execution of Gemm for different types and configurations. That includes a custom operator, only available on CUDA, which calls function cublasLtMatmul.

No specific provider#

plot_bench_cypy_ort#

See Measuring onnxruntime performance against a cython binding.

The Python package for onnxruntime is implemented with pybind11. It is less efficient than cython, which makes direct calls to the Python C API. The benchmark evaluates that cost.