Fused Kernel CUDA Custom Ops#
This page lists all custom ONNX Runtime operators in the fused-kernel CUDA family provided by yet-another-onnxruntime-extensions. The catalogue is generated dynamically at documentation-build time by parsing the C++ source files, so it always reflects the actual implementation without any manual maintenance.
These operators are registered under the
yaourt.ortops.fused_kernel.cuda domain and run on the
CUDAExecutionProvider. The shared library must be compiled from source
with a CUDA-enabled CMake build — see Getting Started for instructions.
Once built, it can be loaded via
FUSED_KERNEL_CUDA_LIB_PATH.
Note
CUDA operators require a GPU and a CUDA-capable build of ONNX Runtime. They are not included in the pre-built wheel.
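The loading step might look like the sketch below. It assumes the caller already has the path to the compiled shared library (for instance, whatever FUSED_KERNEL_CUDA_LIB_PATH resolves to). The helper name `make_session_options` is hypothetical; `SessionOptions.register_custom_ops_library` is the standard ONNX Runtime Python call for registering a custom-op library.

```python
import os


def make_session_options(lib_path):
    """Return ONNX Runtime SessionOptions with the custom-op library registered.

    `lib_path` is the path to the compiled shared library. How that path is
    obtained is left to the caller (assumption: it comes from
    FUSED_KERNEL_CUDA_LIB_PATH).
    """
    if not os.path.isfile(lib_path):
        raise FileNotFoundError(f"custom-op library not found: {lib_path}")
    # Imported lazily so the path check above works without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.register_custom_ops_library(lib_path)
    return options
```

A session would then typically be created with `onnxruntime.InferenceSession(model_path, options, providers=["CUDAExecutionProvider"])`.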
Operators#
AddAdd#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
Fused AddAdd / MulMul CUDA custom operator for three inputs.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
AddAdd (addition=true): $A + B + C$
MulMul (addition=false): $A \cdot B \cdot C$
When all inputs share the same shape the kernel uses a no-broadcast path for better performance.
Supported element types: float, half.
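As a minimal CPU reference for the semantics, the sketch below covers only the no-broadcast case (all three inputs are flat lists of equal length). The function names `add_add` and `mul_mul` are illustrative and not part of the library.

```python
def add_add(a, b, c):
    """Element-wise a + b + c on equal-length flat lists (no broadcasting)."""
    return [x + y + z for x, y, z in zip(a, b, c)]


def mul_mul(a, b, c):
    """Element-wise a * b * c on equal-length flat lists (no broadcasting)."""
    return [x * y * z for x, y, z in zip(a, b, c)]


a, b, c = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
print(add_add(a, b, c))  # [9.0, 12.0]
print(mul_mul(a, b, c))  # [15.0, 48.0]
```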
AddAddAdd#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 4
Outputs | 1
Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.
Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:
AddAddAdd (addition=true): $A + B + C + D$
MulMulMul (addition=false): $A \cdot B \cdot C \cdot D$
Supported element types: float, half.
AddMul#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
Fused AddMul / MulAdd CUDA custom operator.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
AddMul (addition=true): $(A + B) \cdot C$
MulAdd (addition=false): $(A \cdot B) + C$
Both variants support an optional kernel attribute transposeMiddle. When
set to true on a 4-D input the two middle axes of the output are
transposed, which avoids a separate Transpose node in common attention-kernel
patterns.
Supported element types: float, half.
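The combination of the fused arithmetic and the transposeMiddle option can be illustrated with a small CPU sketch. It assumes that AddMul computes (A + B) * C, that all inputs share one shape, and that "the two middle axes" of a 4-D tensor are axes 1 and 2. The function names are illustrative only.

```python
def add_mul(a, b, c):
    """(a + b) * c on nested lists of identical shape (assumed AddMul semantics)."""
    if isinstance(a, list):
        return [add_mul(x, y, z) for x, y, z in zip(a, b, c)]
    return (a + b) * c


def transpose_middle(t):
    """Swap axes 1 and 2 of a 4-D nested list: (d0, d1, d2, d3) -> (d0, d2, d1, d3)."""
    return [[list(group) for group in zip(*batch)] for batch in t]


# Shape (1, 2, 2, 1)
a = [[[[1.0], [2.0]], [[3.0], [4.0]]]]
b = [[[[1.0], [1.0]], [[1.0], [1.0]]]]
c = [[[[2.0], [2.0]], [[2.0], [2.0]]]]

fused = add_mul(a, b, c)
print(fused)                    # [[[[4.0], [6.0]], [[8.0], [10.0]]]]
print(transpose_middle(fused))  # [[[[4.0], [8.0]], [[6.0], [10.0]]]]
```

In an ONNX graph without the fused operator, the second step would be a separate Transpose node with perm=[0, 2, 1, 3].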
MaskedScatterNDOfShape#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
MaskedScatterNDOfShape CUDA custom operator.
Declares the kernel and operator classes for a masked variant of the
ScatterNDOfShape operation. Each scatter step is skipped when the
corresponding index value equals a configurable masked_value, allowing
padding indices to be ignored without pre-filtering:
    output = zeros(shape)
    for each i:
        if indices[i] != masked_value:
            output[indices[i]] op= updates[i]
Inputs:
0: shape - 1-D int64 tensor defining the output shape (CPU).
1: indices - integer indices tensor (CPU).
2: updates - data tensor of type T (GPU).
The reduction and masked_value behaviour is configured via kernel
attributes of the same names.
Supported element types: float, half.
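The pseudocode above can be turned into a small CPU reference. This sketch makes simplifying assumptions that are not taken from the source: the output is 2-D, each index selects one row, the reduction is addition, and the default `masked_value` of -1 is only an example.

```python
def masked_scatter_nd_of_shape(shape, indices, updates, masked_value=-1):
    """Scatter-add rows of `updates` into a zero matrix, skipping masked indices."""
    n_rows, n_cols = shape
    output = [[0.0] * n_cols for _ in range(n_rows)]
    for idx, row in zip(indices, updates):
        if idx == masked_value:
            continue  # padding index: this update is ignored entirely
        for j, value in enumerate(row):
            output[idx][j] += value
    return output


out = masked_scatter_nd_of_shape(
    shape=(3, 2),
    indices=[0, -1, 2, 0],
    updates=[[1.0, 1.0], [9.0, 9.0], [2.0, 2.0], [3.0, 3.0]],
)
print(out)  # [[4.0, 4.0], [0.0, 0.0], [2.0, 2.0]]
```

The update paired with index -1 never reaches the output, which is the behaviour that makes pre-filtering of padding indices unnecessary.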
MulAdd#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
Fused AddMul / MulAdd CUDA custom operator.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
AddMul (addition=true): $(A + B) \cdot C$
MulAdd (addition=false): $(A \cdot B) + C$
Both variants support an optional kernel attribute transposeMiddle. When
set to true on a 4-D input the two middle axes of the output are
transposed, which avoids a separate Transpose node in common attention-kernel
patterns.
Supported element types: float, half.
MulMul#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
Fused AddAdd / MulMul CUDA custom operator for three inputs.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
AddAdd (addition=true): $A + B + C$
MulMul (addition=false): $A \cdot B \cdot C$
When all inputs share the same shape the kernel uses a no-broadcast path for better performance.
Supported element types: float, half.
MulMulMul#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 4
Outputs | 1
Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.
Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:
AddAddAdd (addition=true): $A + B + C + D$
MulMulMul (addition=false): $A \cdot B \cdot C \cdot D$
Supported element types: float, half.
MulMulSigmoid#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 2
Outputs | 1
Fused MulMulSigmoid CUDA custom operator.
Declares the kernel and operator classes for the element-wise binary operation applied to two broadcastable input tensors x and y:
$\text{output} = x \cdot y \cdot \sigma(y)$
The sigmoid is applied only to y, making this a gated variant of the SiLU / Swish activation commonly found in gated linear units (GLUs) used in transformer feed-forward networks.
Supported element types: float, half.
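A scalar CPU reference makes the gating explicit. The sketch assumes the operator computes x * y * sigmoid(y) on equal-length inputs (no broadcasting); `sigmoid` and `mul_mul_sigmoid` are illustrative names.

```python
import math


def sigmoid(v):
    """Logistic function 1 / (1 + exp(-v)); adequate for moderate |v|."""
    return 1.0 / (1.0 + math.exp(-v))


def mul_mul_sigmoid(x, y):
    """Element-wise x * y * sigmoid(y): the sigmoid is applied to y only."""
    return [xi * yi * sigmoid(yi) for xi, yi in zip(x, y)]


# With y = 0 the gate y * sigmoid(y) is zero, so the output is zero for any x.
print(mul_mul_sigmoid([2.0, 3.0], [0.0, 0.0]))  # [0.0, 0.0]
print(mul_mul_sigmoid([2.0], [1.0]))            # about [1.4621]
```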
MulSigmoid#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 1
Outputs | 1
Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).
Declares the kernel and operator classes for the element-wise unary operation:
$\text{output} = x \cdot \sigma(x)$
This is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation function, which is commonly used in transformer feed-forward blocks.
Supported element types: float, half.
MulSub#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
Fused SubMul / MulSub CUDA custom operator.
Declares the kernel and operator classes for element-wise ternary operations
on three broadcastable input tensors A, B, C. The template parameter
addition selects whether subtraction precedes or follows multiplication,
and the optional kernel attribute negative inverts the sign of the
subtraction operand:
SubMul (addition=true, negative=false): $(A - B) \cdot C$
SubMul (addition=true, negative=true): $(B - A) \cdot C$
MulSub (addition=false, negative=false): $A \cdot B - C$
MulSub (addition=false, negative=true): $C - A \cdot B$
Supported element types: float, half.
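The four variants can be compared side by side in a CPU sketch. The operand order is an assumption inferred from the operator names and the description of `negative`: SubMul is taken to be (A - B) * C and MulSub to be A * B - C, with `negative` swapping the operands of the subtraction. Inputs are flat lists of equal length.

```python
def sub_mul(a, b, c, negative=False):
    """SubMul: (a - b) * c, or (b - a) * c when `negative` is set (assumed)."""
    if negative:
        return [(y - x) * z for x, y, z in zip(a, b, c)]
    return [(x - y) * z for x, y, z in zip(a, b, c)]


def mul_sub(a, b, c, negative=False):
    """MulSub: a * b - c, or c - a * b when `negative` is set (assumed)."""
    if negative:
        return [z - x * y for x, y, z in zip(a, b, c)]
    return [x * y - z for x, y, z in zip(a, b, c)]


a, b, c = [5.0], [2.0], [3.0]
print(sub_mul(a, b, c))                 # [9.0]
print(sub_mul(a, b, c, negative=True))  # [-9.0]
print(mul_sub(a, b, c))                 # [7.0]
print(mul_sub(a, b, c, negative=True))  # [-7.0]
```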
NegXplus1#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 1
Outputs | 1
NegXplus1 CUDA custom operator — element-wise complement (1 − x).
Declares the kernel and operator classes for the unary element-wise operation:
$\text{output} = 1 - x$
This is the arithmetic complement of the input, useful for computing probability complements (e.g. $1 - p$) without an extra Constant and Sub node in the graph.
Supported element types: float, half, int32_t.
ReplaceZero#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 1
Outputs | 1
ReplaceZero CUDA custom operator — substitute zero elements.
Declares the kernel and operator classes for the unary element-wise operation:
$\text{output}_i = \begin{cases} \text{by} & \text{if } x_i = 0 \\ x_i & \text{otherwise} \end{cases}$
The replacement scalar by is read from a kernel attribute of the same
name. The operator is useful for masking out padding tokens or avoiding
division by zero in subsequent operations.
Supported element types: float, half.
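A short CPU sketch shows the division-by-zero use case. The function name `replace_zero` is illustrative; the keyword `by` mirrors the kernel attribute named in the text.

```python
def replace_zero(x, by):
    """Return a copy of x in which every element equal to zero is replaced by `by`."""
    return [by if v == 0 else v for v in x]


denominators = [0.0, 2.0, 0.0, 4.0]
safe = replace_zero(denominators, by=1.0)
print(safe)                         # [1.0, 2.0, 1.0, 4.0]
print([1.0 / v for v in safe])      # [1.0, 0.5, 1.0, 0.25]
```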
Rotary#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 2
Outputs | 1
Rotary positional embedding CUDA custom operator.
Declares the kernel and operator classes that apply a rotary transformation to the last dimension of an input tensor. The operation implements rotary position encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.
Each pair of elements <math> in the last dimension is rotated depending on RotarySide:
LEFT side (rotary_side_ = LEFT): <math>
RIGHT side (rotary_side_ = RIGHT): <math>
The side is selected via the kernel attribute "side" (integer, 1 = LEFT,
2 = RIGHT). The first input provides the data tensor and is expected in
device memory; the second input (optional shape hint) is read from CPU
memory.
Supported element types: float, half.
ScatterNDOfShape#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
ScatterNDOfShape CUDA custom operator.
Declares the kernel and operator classes for scattering updates into a
zero-initialised output tensor whose shape is defined by the first input.
The operation is equivalent to:
    output = zeros(shape)
    output[indices[i]] op= updates[i]  for each i
where op is determined by the reduction kernel attribute
(Reduction::None overwrites, Add accumulates, etc.). A second
strategy attribute (Strategy::None or Optimize) lets the kernel choose
a shape-specific fast path at runtime.
Inputs:
0: shape - 1-D int64 tensor defining the output shape (CPU).
1: indices - integer indices tensor (CPU).
2: updates - data tensor of type T (GPU).
Supported element types: float, half.
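The effect of the reduction attribute can be seen in a small CPU reference. The sketch simplifies the operator to a 2-D output with one row index per update, and the string values "none" and "add" are illustrative labels for Reduction::None and Add, not the attribute's actual encoding.

```python
def scatter_nd_of_shape(shape, indices, updates, reduction="none"):
    """Scatter rows of `updates` into a zero matrix of the given 2-D shape."""
    n_rows, n_cols = shape
    output = [[0.0] * n_cols for _ in range(n_rows)]
    for idx, row in zip(indices, updates):
        for j, value in enumerate(row):
            if reduction == "add":
                output[idx][j] += value  # accumulate repeated indices
            elif reduction == "none":
                output[idx][j] = value   # overwrite (sequential: last write wins)
            else:
                raise ValueError(f"unsupported reduction: {reduction}")
    return output


indices = [1, 1]
updates = [[1.0, 2.0], [10.0, 20.0]]
print(scatter_nd_of_shape((2, 2), indices, updates, "add"))   # [[0.0, 0.0], [11.0, 22.0]]
print(scatter_nd_of_shape((2, 2), indices, updates, "none"))  # [[0.0, 0.0], [10.0, 20.0]]
```

The difference only shows up when an index repeats, as row 1 does here.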
SubMul#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 3
Outputs | 1
Fused SubMul / MulSub CUDA custom operator.
Declares the kernel and operator classes for element-wise ternary operations
on three broadcastable input tensors A, B, C. The template parameter
addition selects whether subtraction precedes or follows multiplication,
and the optional kernel attribute negative inverts the sign of the
subtraction operand:
SubMul (addition=true, negative=false): $(A - B) \cdot C$
SubMul (addition=true, negative=true): $(B - A) \cdot C$
MulSub (addition=false, negative=false): $A \cdot B - C$
MulSub (addition=false, negative=true): $C - A \cdot B$
Supported element types: float, half.
Transpose2DCastFP16#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 1
Outputs | 1
Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.
Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:
Transpose2DCastFP16: input float → output half.
Transpose2DCastFP32: input half → output float.
The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.
The input and output types are chosen at construction time and stored in the operator descriptor.
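The fused transpose-and-cast can be mimicked on the CPU in a single pass over the matrix. This sketch uses the `e` (IEEE 754 half precision) format of Python's `struct` module to emulate the cast. Python floats are doubles rather than float32, so it illustrates the rounding only and is not bit-exact with the kernel. The function names are illustrative.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    struct raises OverflowError for magnitudes beyond the half range.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def transpose2d_cast_fp16(matrix):
    """Transpose a 2-D list and round each element to half precision in one pass."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [[to_half(matrix[i][j]) for i in range(n_rows)] for j in range(n_cols)]


m = [[1.0, 2.0, 3.0], [4.0, 5.0, 0.1]]
t = transpose2d_cast_fp16(m)
print(len(t), len(t[0]))  # 3 2
print(t[0])               # [1.0, 4.0]
print(t[2][1] == 0.1)     # False: 0.1 is not exactly representable in half precision
```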
Transpose2DCastFP32#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 1
Outputs | 1
Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.
Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:
Transpose2DCastFP16: input float → output half.
Transpose2DCastFP32: input half → output float.
The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.
The input and output types are chosen at construction time and stored in the operator descriptor.
TriMatrix#
Domain | yaourt.ortops.fused_kernel.cuda
Execution provider | CUDAExecutionProvider
Inputs | 2
Outputs | 1
TriMatrix CUDA custom operator — triangular matrix generator.
Declares the kernel and operator classes that create a 2-D triangular matrix whose lower-triangle, diagonal, and upper-triangle elements are filled with user-supplied scalar values.
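For a matrix of size $n \times m$, element $(i, j)$ is set to:
$M_{ij} = \begin{cases} \text{lower} & \text{if } i > j \\ \text{diag} & \text{if } i = j \\ \text{upper} & \text{if } i < j \end{cases}$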
Inputs:
0: shape - 1-D int64 tensor $(n, m)$ (CPU).
1: values - 1-D tensor of type T with three elements $(\text{lower}, \text{diag}, \text{upper})$ (CPU).
Supported element types: float, half.
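A CPU reference for the fill rule takes only a few lines. It assumes the three values are ordered (lower, diagonal, upper), following the order in which the description lists them, and that "lower triangle" means row index greater than column index. The function name `tri_matrix` is illustrative.

```python
def tri_matrix(shape, values):
    """Build an n x m matrix filled by position relative to the main diagonal."""
    n_rows, n_cols = shape
    lower, diag, upper = values  # assumed order: lower, diagonal, upper
    return [
        [lower if i > j else diag if i == j else upper for j in range(n_cols)]
        for i in range(n_rows)
    ]


print(tri_matrix((3, 3), (1.0, 0.0, -1.0)))
# [[0.0, -1.0, -1.0], [1.0, 0.0, -1.0], [1.0, 1.0, 0.0]]
```

With values such as (0, 0, -inf), the same rule would produce a causal attention mask.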