Fused Kernel CUDA Custom Ops#

This page lists all custom ONNX Runtime operators in the fused-kernel CUDA family provided by yet-another-onnxruntime-extensions. The catalogue is generated at documentation-build time by parsing the C++ source files, so it reflects the implementation as of that build and needs no manual maintenance.

These operators are registered under the yaourt.ortops.fused_kernel.cuda domain and run on the CUDAExecutionProvider. The shared library must be compiled from source with a CUDA-enabled CMake build — see Getting Started for instructions. Once built, it can be loaded via FUSED_KERNEL_CUDA_LIB_PATH.

Note

CUDA operators require a GPU and a CUDA-capable build of ONNX Runtime. They are not included in the pre-built wheel.
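As a rough illustration, the sketch below registers the compiled library with an ONNX Runtime session. It assumes `lib_path` already holds the path that FUSED_KERNEL_CUDA_LIB_PATH refers to (how that name is exposed is not covered here) and that `model_path` points to a model using operators from this domain. The helper name `make_session` and the two constants are hypothetical.

```python
# Hypothetical helper: the names make_session, CUSTOM_DOMAIN and PROVIDERS are illustrative.
CUSTOM_DOMAIN = "yaourt.ortops.fused_kernel.cuda"
PROVIDERS = ["CUDAExecutionProvider"]


def make_session(model_path, lib_path):
    """Create a session able to resolve the custom operators listed on this page."""
    # Imported lazily so that defining the helper does not require ONNX Runtime.
    import onnxruntime as ort

    options = ort.SessionOptions()
    # Register the shared library produced by the CUDA-enabled CMake build.
    options.register_custom_ops_library(lib_path)
    return ort.InferenceSession(model_path, options, providers=PROVIDERS)
```

Nodes in the model must use the domain stored in `CUSTOM_DOMAIN`, otherwise ONNX Runtime will not match them to the registered kernels.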

Operators#

AddAdd#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

Fused AddAdd / MulMul CUDA custom operator for three inputs.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddAdd (addition = true): $A + B + C$

  • MulMul (addition = false): $A \cdot B \cdot C$

When all inputs share the same shape the kernel uses a no-broadcast path for better performance.

Supported element types: float, half.
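The expected result can be checked against a plain NumPy reference. The sketch below assumes the usual NumPy/ONNX broadcasting rules and reads the operator names literally (sum or product of the three inputs). The function names are illustrative and not part of the library.

```python
import numpy as np


def add_add(a, b, c):
    """Reference for AddAdd: element-wise sum of three broadcastable tensors."""
    return a + b + c


def mul_mul(a, b, c):
    """Reference for MulMul: element-wise product of three broadcastable tensors."""
    return a * b * c


a = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)  # shape (2, 2)
b = np.array([10.0, 20.0], dtype=np.float32)              # shape (2,), broadcast across rows
c = np.full((2, 1), 0.5, dtype=np.float32)                # shape (2, 1), broadcast across columns

sum_abc = add_add(a, b, c)
prod_abc = mul_mul(a, b, c)
print(sum_abc)
print(prod_abc)
```

Such a reference is convenient for comparing the CUDA output on both the broadcast path and the same-shape path.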

AddAddAdd#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 4

Outputs: 1

Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.

Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:

  • AddAddAdd (addition = true): $A + B + C + D$

  • MulMulMul (addition = false): $A \cdot B \cdot C \cdot D$

Supported element types: float, half.

AddMul#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

Fused AddMul / MulAdd CUDA custom operator.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddMul (addition = true): $(A + B) \cdot C$

  • MulAdd (addition = false): $(A \cdot B) + C$

Both variants support an optional kernel attribute transposeMiddle. When it is set to true on a 4-D input, the two middle axes (axes 1 and 2) of the output are transposed, which avoids a separate Transpose node in common attention-kernel patterns.

Supported element types: float, half.
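The effect of transposeMiddle can be illustrated with a NumPy reference. This is a sketch under two assumptions: AddMul computes (A + B) * C and MulAdd computes (A * B) + C, as the names suggest, and the attribute swaps axes 1 and 2 of a 4-D result. The Python names are illustrative.

```python
import numpy as np


def _maybe_transpose(out, transpose_middle):
    # Swap axes 1 and 2 of a 4-D result, mirroring the transposeMiddle attribute.
    if not transpose_middle:
        return out
    if out.ndim != 4:
        raise ValueError("transposeMiddle is only described for 4-D tensors")
    return out.transpose(0, 2, 1, 3)


def add_mul(a, b, c, transpose_middle=False):
    """Reference for AddMul, assumed to compute (A + B) * C."""
    return _maybe_transpose((a + b) * c, transpose_middle)


def mul_add(a, b, c, transpose_middle=False):
    """Reference for MulAdd, assumed to compute (A * B) + C."""
    return _maybe_transpose(a * b + c, transpose_middle)


a = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
b = np.ones((1, 3, 1, 5), dtype=np.float32)
c = np.full((4, 1), 2.0, dtype=np.float32)

fused = add_mul(a, b, c, transpose_middle=True)
# Unfused equivalent: element-wise ops followed by a separate Transpose node.
unfused = np.transpose((a + b) * c, (0, 2, 1, 3))
print(fused.shape)
```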

AddSharedInput#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 2

Fused AddSharedInput / MulSharedInput CUDA custom operator.

Declares the kernel and operator classes for an operation that applies the same first input A to two different second inputs B and C simultaneously, producing two outputs:

  • AddSharedInput (addition = true): $(A + B,\ A + C)$

  • MulSharedInput (addition = false): $(A \cdot B,\ A \cdot C)$

This fused form avoids reading A twice when computing two independent operations that share the same operand.

Supported element types: float, half.
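A NumPy reference for the two-output form might look as follows. It assumes the first output pairs A with B and the second pairs A with C, which is how the description reads. The function names are illustrative.

```python
import numpy as np


def add_shared_input(a, b, c):
    """Reference for AddSharedInput: returns (A + B, A + C)."""
    return a + b, a + c


def mul_shared_input(a, b, c):
    """Reference for MulSharedInput: returns (A * B, A * C)."""
    return a * b, a * c


a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0], dtype=np.float32)
c = np.array([0.5, 0.5, 0.5], dtype=np.float32)

sum_ab, sum_ac = add_shared_input(a, b, c)
prod_ab, prod_ac = mul_shared_input(a, b, c)
print(sum_ab, sum_ac)
print(prod_ab, prod_ac)
```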

MaskedScatterNDOfShape#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

MaskedScatterNDOfShape CUDA custom operator.

Declares the kernel and operator classes for a masked variant of the ScatterNDOfShape operation. Each scatter step is skipped when the corresponding index value equals a configurable masked_value, allowing padding indices to be ignored without pre-filtering:

    output = zeros(shape)
    for each i:
        if indices[i] != masked_value:
            output[indices[i]] op= updates[i]

Inputs:

  • 0: shape — 1-D int64 tensor defining the output shape (CPU).

  • 1: indices — integer indices tensor (CPU).

  • 2: updates — data tensor of type T (GPU).

The reduction mode and the masked value are configured via the kernel attributes reduction and masked_value.

Supported element types: float, half.
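The masked scatter can be mimicked in NumPy for testing purposes. The sketch below makes several assumptions that the description does not confirm: indices has a trailing dimension of 1 (each index selects a row along axis 0), the reduction is an addition, and masked_value is -1. The function name is illustrative.

```python
import numpy as np


def masked_scatter_nd_of_shape(shape, indices, updates, masked_value=-1):
    """Add-reduction reference; indices of shape (..., 1) select rows along axis 0."""
    dims = tuple(int(d) for d in shape)
    output = np.zeros(dims, dtype=updates.dtype)
    flat_idx = indices.reshape(-1)                # one row index per update
    flat_upd = updates.reshape((-1,) + dims[1:])  # one slice per index
    keep = flat_idx != masked_value               # drop padding indices
    # np.add.at accumulates correctly when the same row index appears more than once.
    np.add.at(output, flat_idx[keep], flat_upd[keep])
    return output


shape = np.array([4, 2], dtype=np.int64)
indices = np.array([[0], [-1], [2], [0]], dtype=np.int64)
updates = np.array([[1, 1], [5, 5], [2, 2], [3, 3]], dtype=np.float32)

result = masked_scatter_nd_of_shape(shape, indices, updates)
print(result)
```

The update paired with index -1 is ignored, and the two updates targeting row 0 are summed.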

MulAdd#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

Fused AddMul / MulAdd CUDA custom operator.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddMul (addition = true): $(A + B) \cdot C$

  • MulAdd (addition = false): $(A \cdot B) + C$

Both variants support an optional kernel attribute transposeMiddle. When it is set to true on a 4-D input, the two middle axes (axes 1 and 2) of the output are transposed, which avoids a separate Transpose node in common attention-kernel patterns.

Supported element types: float, half.

MulMul#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

Fused AddAdd / MulMul CUDA custom operator for three inputs.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddAdd (addition = true): $A + B + C$

  • MulMul (addition = false): $A \cdot B \cdot C$

When all inputs share the same shape the kernel uses a no-broadcast path for better performance.

Supported element types: float, half.

MulMulMul#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 4

Outputs: 1

Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.

Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:

  • AddAddAdd (addition = true): $A + B + C + D$

  • MulMulMul (addition = false): $A \cdot B \cdot C \cdot D$

Supported element types: float, half.

MulMulSigmoid#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 2

Outputs: 1

Fused MulMulSigmoid CUDA custom operator.

Declares the kernel and operator classes for the element-wise binary operation applied to two broadcastable input tensors x and y:

$$x \cdot y \cdot \sigma(y), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$

The sigmoid is applied only to y, making this a gated variant of the SiLU / Swish activation commonly found in gated linear units (GLUs) used in transformer feed-forward networks.

Supported element types: float, half.
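The gating pattern can be written as a short NumPy reference. The sketch assumes the operator computes x * y * sigmoid(y), that is, x multiplied by SiLU(y), which is how the description reads. The function names are illustrative.

```python
import numpy as np


def sigmoid(t):
    """Logistic sigmoid applied element-wise."""
    return 1.0 / (1.0 + np.exp(-t))


def mul_mul_sigmoid(x, y):
    """Assumed semantics: x * y * sigmoid(y), i.e. x gated by SiLU(y)."""
    return x * y * sigmoid(y)


x = np.array([2.0, -1.0, 0.5], dtype=np.float32)
y = np.array([0.0, 1.0, -3.0], dtype=np.float32)

gated = mul_mul_sigmoid(x, y)
silu_y = y * sigmoid(y)  # unfused form: a Sigmoid node followed by a Mul node
print(gated)
```

Multiplying `x` by `silu_y` gives the same values as the fused form. That is the three-node pattern (Sigmoid, Mul, Mul) the operator is meant to replace.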

MulSharedInput#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 2

Fused AddSharedInput / MulSharedInput CUDA custom operator.

Declares the kernel and operator classes for an operation that applies the same first input A to two different second inputs B and C simultaneously, producing two outputs:

  • AddSharedInput (addition = true): $(A + B,\ A + C)$

  • MulSharedInput (addition = false): $(A \cdot B,\ A \cdot C)$

This fused form avoids reading A twice when computing two independent operations that share the same operand.

Supported element types: float, half.

MulSigmoid#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 1

Outputs: 1

Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).

Declares the kernel and operator classes for the element-wise unary operation:

$$x \cdot \sigma(x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

This is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation function, which is commonly used in transformer feed-forward blocks.

Supported element types: float, half.
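For testing, SiLU can be written in one line of NumPy. The function name below is illustrative.

```python
import numpy as np


def mul_sigmoid(x):
    """SiLU / Swish: x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + np.exp(-x))


x = np.array([-2.0, 0.0, 2.0], dtype=np.float32)
y = mul_sigmoid(x)
print(y)
```

A handy sanity check is the identity SiLU(x) - SiLU(-x) = x, which follows from sigmoid(x) + sigmoid(-x) = 1.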

MulSub#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

Fused SubMul / MulSub CUDA custom operator.

Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter addition selects whether subtraction precedes or follows multiplication, and the optional kernel attribute negative inverts the sign of the subtraction operand:

  • SubMul (addition = true, negative = false): $(A - B) \cdot C$

  • SubMul (addition = true, negative = true): $(B - A) \cdot C$

  • MulSub (addition = false, negative = false): $A \cdot B - C$

  • MulSub (addition = false, negative = true): $C - A \cdot B$

Supported element types: float, half.

NegXplus1#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 1

Outputs: 1

NegXplus1 CUDA custom operator — element-wise complement (1 − x).

Declares the kernel and operator classes for the unary element-wise operation:

$$1 - x$$

This is the arithmetic complement of the input, useful for computing probability complements (e.g. $1 - p$) without an extra Constant and Sub node in the graph.

Supported element types: float, half, int32_t.
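A NumPy reference is a one-liner. The example below also checks that the element type is preserved for a half-precision and an int32 input. The function name is illustrative.

```python
import numpy as np


def neg_x_plus_1(x):
    """Element-wise complement 1 - x."""
    # The Python literal 1 adopts the array dtype, so no cast is introduced.
    return 1 - x


p = np.array([0.0, 0.25, 1.0], dtype=np.float16)
mask = np.array([0, 1, 1, 0], dtype=np.int32)

q = neg_x_plus_1(p)
inverted = neg_x_plus_1(mask)
print(q, inverted)
```

On a 0/1 integer mask the operation acts as a logical NOT, which is one plausible use of the int32_t variant.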

ReplaceZero#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 1

Outputs: 1

ReplaceZero CUDA custom operator — substitute zero elements.

Declares the kernel and operator classes for the unary element-wise operation:

$$y_i = \begin{cases} \text{by} & \text{if } x_i = 0 \\ x_i & \text{otherwise} \end{cases}$$

The replacement scalar by is read from a kernel attribute of the same name. The operator is useful for masking out padding tokens or avoiding division by zero in subsequent operations.

Supported element types: float, half.
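The division-by-zero use case can be sketched in NumPy as follows. The function name is illustrative; `by` plays the role of the kernel attribute of the same name.

```python
import numpy as np


def replace_zero(x, by):
    """Return x with every element equal to zero replaced by the scalar `by`."""
    # Cast `by` to the element type of x so the result keeps that type.
    return np.where(x == 0, x.dtype.type(by), x)


x = np.array([0.0, 2.0, 0.0, -4.0], dtype=np.float32)
safe = replace_zero(x, by=1.0)
ratio = np.float32(8.0) / safe  # no division by zero after the replacement
print(safe, ratio)
```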

Rotary#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 2

Outputs: 1

Rotary positional embedding CUDA custom operator.

Declares the kernel and operator classes that apply a rotary transformation to the last dimension of an input tensor. The operation implements rotary position encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.

Each pair of elements <math> in the last dimension is rotated depending on RotarySide:

  • LEFT side (rotary_side_ = LEFT): <math>

  • RIGHT side (rotary_side_ = RIGHT): <math>

The side is selected via the kernel attribute "side" (integer, 1 = LEFT, 2 = RIGHT). The first input provides the data tensor and is expected in device memory; the second input (optional shape hint) is read from CPU memory.

Supported element types: float, half.

ScatterNDOfShape#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

ScatterNDOfShape CUDA custom operator.

Declares the kernel and operator classes for scattering updates into a zero-initialised output tensor whose shape is defined by the first input. The operation is equivalent to:

    output = zeros(shape)
    for each i:
        output[indices[i]] op= updates[i]

where op is determined by the reduction kernel attribute (Reduction::None overwrites, Reduction::Add accumulates, and so on). A second attribute, strategy (Strategy::None or Strategy::Optimize), lets the kernel choose a shape-specific fast path at runtime.

Inputs:

  • 0: shape — 1-D int64 tensor defining the output shape (CPU).

  • 1: indices — integer indices tensor (CPU).

  • 2: updates — data tensor of type T (GPU).

Supported element types: float, half.
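The pseudocode above can be turned into a small NumPy reference for validating results. This is a sketch and not the kernel's algorithm: it assumes ONNX ScatterND-style indexing, where the last axis of indices holds k leading coordinates, and it models only two reduction modes through a hypothetical string argument (`"add"` and `"none"`).

```python
import numpy as np


def scatter_nd_of_shape(shape, indices, updates, reduction="add"):
    """Scatter updates into a zero tensor of the given shape (sequential reference)."""
    dims = tuple(int(d) for d in shape)
    output = np.zeros(dims, dtype=updates.dtype)
    k = indices.shape[-1]                         # number of leading axes addressed
    flat_idx = indices.reshape(-1, k)
    flat_upd = updates.reshape((-1,) + dims[k:])  # one slice per index tuple
    for idx, upd in zip(flat_idx, flat_upd):
        key = tuple(int(j) for j in idx)
        if reduction == "add":
            output[key] += upd                    # duplicates accumulate
        else:
            output[key] = upd                     # duplicates overwrite, last one wins here
    return output


shape = np.array([3, 2], dtype=np.int64)
indices = np.array([[0], [2], [0]], dtype=np.int64)
updates = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)

added = scatter_nd_of_shape(shape, indices, updates, reduction="add")
overwritten = scatter_nd_of_shape(shape, indices, updates, reduction="none")
print(added)
print(overwritten)
```

With duplicate indices and no reduction, the sequential loop keeps the last update. A parallel kernel may not guarantee that order, so comparisons against this reference are most meaningful with the add reduction or with unique indices.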

SubMul#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 3

Outputs: 1

Fused SubMul / MulSub CUDA custom operator.

Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter addition selects whether subtraction precedes or follows multiplication, and the optional kernel attribute negative inverts the sign of the subtraction operand:

  • SubMul (addition = true, negative = false): $(A - B) \cdot C$

  • SubMul (addition = true, negative = true): $(B - A) \cdot C$

  • MulSub (addition = false, negative = false): $A \cdot B - C$

  • MulSub (addition = false, negative = true): $C - A \cdot B$

Supported element types: float, half.

Transpose2DCastFP16#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 1

Outputs: 1

Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.

Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:

  • Transpose2DCastFP16 — input float → output half.

  • Transpose2DCastFP32 — input half → output float.

The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.

The input and output types are chosen at construction time and stored in the operator descriptor.
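Both registered variants can be emulated in NumPy by transposing and then casting. The function names below are illustrative.

```python
import numpy as np


def _transpose_cast(x, from_dtype, to_dtype):
    if x.ndim != 2:
        raise ValueError("expected a 2-D matrix")
    if x.dtype != from_dtype:
        raise TypeError(f"expected {np.dtype(from_dtype)}, got {x.dtype}")
    # Transpose first, then cast: the two steps the fused kernel combines.
    return x.T.astype(to_dtype)


def transpose_2d_cast_fp16(x):
    """Reference for Transpose2DCastFP16: float input, half output."""
    return _transpose_cast(x, np.float32, np.float16)


def transpose_2d_cast_fp32(x):
    """Reference for Transpose2DCastFP32: half input, float output."""
    return _transpose_cast(x, np.float16, np.float32)


x = np.arange(6, dtype=np.float32).reshape(2, 3)
h = transpose_2d_cast_fp16(x)     # shape (3, 2), dtype float16
back = transpose_2d_cast_fp32(h)  # shape (2, 3), dtype float32
print(h.shape, h.dtype, back.shape, back.dtype)
```

The round trip is exact here only because small integers are representable in half precision. Arbitrary float values lose precision in the FP16 direction.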

Transpose2DCastFP32#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 1

Outputs: 1

Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.

Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:

  • Transpose2DCastFP16 — input float → output half.

  • Transpose2DCastFP32 — input half → output float.

The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.

The input and output types are chosen at construction time and stored in the operator descriptor.

TriMatrix#

Domain: yaourt.ortops.fused_kernel.cuda

Execution provider: CUDAExecutionProvider

Inputs: 2

Outputs: 1

TriMatrix CUDA custom operator — triangular matrix generator.

Declares the kernel and operator classes that create a 2-D triangular matrix whose lower-triangle, diagonal, and upper-triangle elements are filled with user-supplied scalar values.

For a matrix of size <math>, element <math> is set to:

<math>

Inputs:

  • 0: shape — 1-D int64 tensor <math> (CPU).

  • 1: values — 1-D tensor of type T with three elements <math> (CPU).

Supported element types: float, half.
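A NumPy reference for the generator could look like the sketch below. Two points are assumptions not confirmed by the text above: the shape tensor holds exactly two entries (rows, columns), and the three values are ordered as (lower, diagonal, upper). The function name is illustrative.

```python
import numpy as np


def tri_matrix(shape, values):
    """Build a 2-D matrix filled by region; values assumed ordered (lower, diagonal, upper)."""
    n_rows, n_cols = (int(d) for d in shape)
    lower, diag, upper = values
    rows = np.arange(n_rows)[:, None]  # column vector of row indices
    cols = np.arange(n_cols)[None, :]  # row vector of column indices
    out = np.where(rows > cols, lower, np.where(rows == cols, diag, upper))
    return out.astype(values.dtype)


shape = np.array([3, 3], dtype=np.int64)
values = np.array([-1.0, 0.0, 1.0], dtype=np.float32)

m = tri_matrix(shape, values)
print(m)
```

Under these assumptions, a causal-attention style mask would be obtained by choosing the same value for the lower triangle and the diagonal and a large negative value for the upper triangle.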