ONNX Benchmarks
This page lists the benchmarks implemented in the Examples Gallery.
See Measuring performance of TfIdfVectorizer.
This benchmark measures the computation time when the kernel outputs sparse tensors.
See TreeEnsemble optimization.
This package implements a custom kernel for
TreeEnsembleRegressor and TreeEnsembleClassifier
and lets users choose the parallelization parameters.
This script tries many values to select the best one
for trees trained with scikit-learn and a
See TreeEnsemble, dense, and sparse.
This package implements a custom kernel for
TreeEnsembleRegressor and TreeEnsembleClassifier
and lets users choose the parallelization parameters.
This script tries many values to select the best one
for trees trained with scikit-learn and a
To test several implementations of TreeEnsemble in a simpler way, see Evaluate different implementation of TreeEnsemble.
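The parameter search described above can be pictured with a minimal sketch. It is not the script from the gallery: `toy_kernel` is a hypothetical stand-in for a tree ensemble kernel, and `n_workers` and `batch_size` are illustrative names, not the actual parameters of the custom kernel. The sketch only shows the general pattern: run the same computation for every combination of values and keep the fastest one.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


def toy_kernel(data, n_workers, batch_size):
    # Hypothetical stand-in for a parallel kernel: the data is split into
    # batches and every batch is reduced by a pool of worker threads.
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum, batches))


def measure(fct, repeat=3):
    # Returns the best wall-clock time over a few runs to reduce noise.
    best = float("inf")
    for _ in range(repeat):
        begin = time.perf_counter()
        fct()
        best = min(best, time.perf_counter() - begin)
    return best


def grid_search(kernel, data, grid):
    # Tries every combination of parameter values and records its duration.
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        duration = measure(lambda: kernel(data, **params))
        results.append((duration, params))
    # The best configuration is the one with the smallest duration.
    return min(results, key=lambda r: r[0]), results


data = list(range(20_000))
grid = {"n_workers": [1, 2], "batch_size": [500, 5_000]}
best, results = grid_search(toy_kernel, data, grid)
print("best parameters:", best[1])
```

The best combination usually depends on the machine and on the size of the data, which is why it is measured rather than fixed in advance.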
See Compares implementations of Einsum.
The function einsum can be decomposed into a matrix multiplication combined with transpose operators. What is the best decomposition?
These tests only work if they are run on a computer with CUDA enabled.
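As an illustration of what such a decomposition looks like, here is a minimal sketch with numpy. The equation `bij,bkj->bik` and the two functions are chosen for this example only and are not taken from the benchmark. Both functions return the same result as `np.einsum`, but they place the transpositions differently, and the timings may differ from one machine to another.

```python
import timeit

import numpy as np

EQUATION = "bij,bkj->bik"


def decomposition_1(a, b):
    # Transposes b to (b, j, k), then one batched matmul: (b,i,j) @ (b,j,k).
    return np.matmul(a, np.transpose(b, (0, 2, 1)))


def decomposition_2(a, b):
    # Transposes a instead, multiplies in the other order, then transposes
    # the result: (b,k,j) @ (b,j,i) -> (b,k,i) -> (b,i,k).
    return np.transpose(np.matmul(b, np.transpose(a, (0, 2, 1))), (0, 2, 1))


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 32, 16))
b = rng.standard_normal((8, 24, 16))
expected = np.einsum(EQUATION, a, b)

timings = {}
for fct in (decomposition_1, decomposition_2):
    # Every decomposition must agree with einsum before it is timed.
    assert np.allclose(expected, fct(a, b))
    timings[fct.__name__] = min(
        timeit.repeat(lambda: fct(a, b), number=50, repeat=3)
    )

for name, duration in timings.items():
    print(f"{name}: {duration:.5f}s for 50 calls")
```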
See Measuring Gemm performance with different input and output tests.
The script checks the speed of cublasLtMatmul for various types and dimensions on square matrices. The code is implemented in C++ and does not involve onnxruntime. It checks the configurations implemented in cuda_gemm.cu. See function gemm_benchmark_test in onnx_extended.validation.cuda.cuda_example_py.
See Measuring performance about Gemm with onnxruntime.
The script checks the speed of cublasLtMatmul called through a custom operator for onnxruntime, implemented in custom_gemm.cu.
See Profiles a simple onnx graph including a single Gemm.
The benchmark profiles the execution of Gemm for different types and configurations. That includes a custom operator, only available on CUDA, that calls the function cublasLtMatmul.
No specific provider
See Measuring onnxruntime performance against a cython binding.
The Python package for onnxruntime is implemented with pybind11. It is less efficient than cython, which makes direct calls to the Python C API. The benchmark evaluates that cost.
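The general idea behind this kind of measurement can be sketched without onnxruntime. The example below is only an analogy, and every name in it is hypothetical: `direct_call` stands for a thin binding and `wrapped_call` for a binding that adds a layer of argument handling. Calling both on a tiny input makes the fixed per-call cost dominate the computation itself.

```python
import math
import timeit


def direct_call(values):
    # Stand-in for a thin binding: delegates straight to the C function.
    return math.fsum(values)


def wrapped_call(values):
    # Stand-in for a heavier binding: validates and copies the arguments
    # before delegating, which adds a fixed cost to every call.
    if not isinstance(values, (list, tuple)):
        raise TypeError("values must be a list or a tuple")
    return direct_call(list(values))


def per_call_time(fct, values, number=20_000, repeat=5):
    # Best average time of one call over several repetitions.
    best = min(timeit.repeat(lambda: fct(values), number=number, repeat=repeat))
    return best / number


tiny = [1.0, 2.0, 3.0]
t_direct = per_call_time(direct_call, tiny)
t_wrapped = per_call_time(wrapped_call, tiny)
print(f"direct: {t_direct:.2e}s, wrapped: {t_wrapped:.2e}s")
print(f"estimated overhead per call: {t_wrapped - t_direct:.2e}s")
```

The overhead matters most for small inputs and short computations. With large inputs, the time spent in the kernel usually hides it.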