ONNX Benchmarks¶
Shows the list of benchmarks implemented in the Examples Gallery.
CPU¶
plot_bench_tfidf¶
See Measuring performance of TfIdfVectorizer.
This benchmark measures the computation time when the kernel outputs sparse tensors.
plot_op_tree_ensemble_optim¶
See TreeEnsemble optimization.
This package implements a custom kernel for
TreeEnsembleRegressor and TreeEnsembleClassifier
and lets the user choose the parallelization parameters.
The script tries many values to select the best one
for trees trained with scikit-learn and a
sklearn.ensemble.RandomForestRegressor.
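To see what the tuned kernel has to parallelize, here is a simplified, sequential numpy sketch of what a TreeEnsembleRegressor computes; the flat-array tree encoding and function names are illustrative, not the actual ONNX layout or this package's implementation:

```python
import numpy as np

# Simplified sketch of a TreeEnsembleRegressor: every tree is walked
# independently for every row, so the kernel can be parallelized over
# trees, over rows, or both -- the trade-off the benchmark explores.

def eval_tree(x, feature, threshold, left, right, value, node=0):
    # Walk one decision tree encoded as parallel arrays (-1 marks a leaf).
    while left[node] != -1:
        node = left[node] if x[feature[node]] <= threshold[node] else right[node]
    return value[node]

def tree_ensemble_predict(X, trees):
    # Sum the prediction of every tree for every row (regression).
    out = np.zeros(X.shape[0])
    for t in trees:
        for i in range(X.shape[0]):
            out[i] += eval_tree(X[i], *t)
    return out

# One stump: if x[0] <= 0.5 -> 1.0 else 2.0
stump = (np.array([0, 0, 0]), np.array([0.5, 0.0, 0.0]),
         np.array([1, -1, -1]), np.array([2, -1, -1]),
         np.array([0.0, 1.0, 2.0]))
X = np.array([[0.2], [0.8]])
print(tree_ensemble_predict(X, [stump]))  # [1. 2.]
```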
plot_op_tree_ensemble_sparse¶
See TreeEnsemble, dense, and sparse.
This package implements a custom kernel for
TreeEnsembleRegressor and TreeEnsembleClassifier
and lets the user choose the parallelization parameters.
The script tries many values to select the best one
for trees trained with scikit-learn and a
sklearn.ensemble.RandomForestRegressor.
plot_op_tree_ensemble_implementations¶
Tests several implementations of TreeEnsemble in a simpler way, see Evaluate different implementation of TreeEnsemble.
plot_op_einsum¶
See Compares implementations of Einsum.
Function einsum can be decomposed into a matrix multiplication and transpose operators. What is the best decomposition?
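As a minimal numpy illustration of such a decomposition (not the decomposition engine the benchmark actually compares), a batched einsum contraction can be rewritten as transposes plus one matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3, 4))  # bij
b = rng.standard_normal((2, 4, 5))  # bjk

# Direct einsum: contract axis j, keep batch axis b.
direct = np.einsum("bij,bjk->bik", a, b)

# Same contraction as a plain batched matrix multiplication.
decomposed = a @ b
assert np.allclose(direct, decomposed)

# A case that needs a transpose first: "bji,bjk->bik".
c = rng.standard_normal((2, 4, 3))  # bji
direct2 = np.einsum("bji,bjk->bik", c, b)
decomposed2 = c.transpose(0, 2, 1) @ b
assert np.allclose(direct2, decomposed2)
```

The general strategy is the same: transpose and reshape each input until the contracted axes are adjacent, then run a single matmul; the open question is which ordering is fastest.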
CUDA¶
These tests only work if they are run on a computer with CUDA enabled.
plot_bench_gemm_f8¶
See Measuring Gemm performance with different input and output tests.
The script checks the speed of cublasLtMatmul for various types and dimensions on square matrices. The code is implemented in C++ and does not involve onnxruntime. It checks configurations implemented in cuda_gemm.cu. See function gemm_benchmark_test in onnx_extended.validation.cuda.cuda_example_py.
plot_bench_gemm_ort¶
See Measuring performance about Gemm with onnxruntime.
The script checks the speed of cublasLtMatmul with a custom operator for onnxruntime implemented in custom_gemm.cu.
plot_profile_gemm_ort¶
See Profiles a simple onnx graph including a single Gemm.
The benchmark profiles the execution of Gemm for different types and configurations. That includes a custom operator, only available on CUDA, calling function cublasLtMatmul.
plot_op_gemm2_cuda¶
See Gemm Exploration with CUDA.
One big Gemm or two smaller Gemms?
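The equivalence behind that question can be sketched with numpy (shapes are illustrative): two Gemms sharing the same input can be folded into one Gemm on concatenated weights, at the cost of a split afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w1 = rng.standard_normal((16, 32))
w2 = rng.standard_normal((16, 32))

# Two smaller Gemms, one per weight matrix.
y1, y2 = x @ w1, x @ w2

# One big Gemm on the concatenated weights, split afterwards.
big = x @ np.concatenate([w1, w2], axis=1)
z1, z2 = big[:, :32], big[:, 32:]

assert np.allclose(y1, z1) and np.allclose(y2, z2)
```

Which form wins depends on kernel launch overhead and how well each shape saturates the GPU, which is what the benchmark measures.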
plot_op_mul_cuda¶
See Fusing multiplication operators on CUDA.
The benchmark compares the profiles of two consecutive Mul operators with their fusion into a single operator.
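The unfused graph runs two elementwise kernels and materializes an intermediate tensor; the fused version computes the same result in one pass. A numpy sketch of the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = (rng.standard_normal((4, 4)) for _ in range(3))

# Unfused: two Mul nodes, one intermediate tensor t written to memory.
t = x * y
unfused = t * z

# Fused: a single kernel would compute x*y*z without storing t.
fused = x * y * z

assert np.allclose(unfused, fused)
```

On CUDA the saving comes from removing one full read and write of the intermediate tensor plus one kernel launch.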
plot_op_scatternd_cuda¶
See Optimizing ScatterND operator on CUDA.
The benchmark compares two implementations of operator ScatterND, with and without atomic operations.
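ScatterND writes update slices into a tensor at the given indices; when indices repeat, a parallel CUDA kernel needs atomic adds to accumulate safely, which is the trade-off being measured. A sequential numpy reference for the additive variant (function name is illustrative):

```python
import numpy as np

def scatter_nd_add(data, indices, updates):
    # Sequential reference for ScatterND with reduction="add": each
    # index designates a slice of `data` receiving one update.  A CUDA
    # kernel handling the indices in parallel must use atomic adds when
    # indices collide, otherwise updates would be lost.
    out = data.copy()
    for idx, upd in zip(indices, updates):
        out[tuple(idx)] += upd
    return out

data = np.zeros((4, 2))
indices = np.array([[1], [3], [1]])  # row 1 appears twice
updates = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(scatter_nd_add(data, indices, updates))
# row 1 accumulates both updates: [4., 4.]
```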
plot_op_scatternd_mask_cuda¶
See Optimizing Masked ScatterND operator on CUDA.
The benchmark compares three implementations of operator ScatterND updating a matrix.
plot_op_transpose2dcast_cuda¶
See Fuse Transpose and Cast on CUDA.
The benchmark looks into the fusion of Transpose + Cast.
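In numpy terms, the fused operator produces the result of the two-step graph while skipping the intermediate tensor (one fewer full read/write of memory):

```python
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)

# Unfused: Transpose materializes a tensor, then Cast copies it again.
unfused = x.T.copy().astype(np.float16)

# Fused: one kernel reads float32 and writes transposed float16
# directly, never storing the float32 transpose.
fused = np.ascontiguousarray(x.T, dtype=np.float16)

assert unfused.dtype == fused.dtype == np.float16
assert np.array_equal(unfused, fused)
```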
No specific provider¶
plot_bench_cypy_ort¶
See Measuring onnxruntime performance against a cython binding.
The python package for onnxruntime is implemented with pybind11. It is less efficient than cython, which makes direct calls to the Python C API. The benchmark evaluates that cost.
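The cost in question is per-call binding overhead, which dominates when the model itself is tiny. It can be estimated with timeit on a no-op callable (a generic sketch, unrelated to onnxruntime's actual bindings):

```python
import timeit

def noop(*args):
    # Stand-in for a bound native call whose body costs nothing:
    # whatever time remains is pure Python call/dispatch overhead,
    # the part pybind11 vs cython bindings differ on.
    return None

n = 100_000
per_call_ns = timeit.timeit(lambda: noop(1, 2.0, "x"), number=n) / n * 1e9
print(f"~{per_call_ns:.0f} ns per call of Python-level overhead")
```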