yaourt.ortops.doc#

Documentation catalogue of custom ORT ops, derived from C++ source files.

Structural metadata (op name, domain, execution provider, input/output names and element types) is parsed directly from the C++ source files at import time so that the Python catalogue always stays in sync with the C++ implementation without any manual maintenance.

Human-readable documentation strings are parsed from Doxygen-style doc comments in the C++ header files, so prose descriptions live alongside the kernel declarations and are never duplicated in Python.

Supported C++ sources — sparse CPU (lite API)#

  • yaourt/ortops/sparse/cpu/ort_sparse_cpu2_lib.cc — provides the op domain and the CreateLiteCustomOp registrations (op name → kernel class + exec provider).

  • yaourt/ortops/sparse/cpu/ort_sparse_lite.h — provides the Compute method signatures and /// doc comments used to extract input/output argument names, element types, and prose descriptions.
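
The extraction described in the two bullets above can be sketched with regular expressions. This is only an illustration under stated assumptions: the header excerpt, the struct name `DenseToSparseKernel`, the helper names, and the C++-to-ONNX type table are hypothetical, and the real parser in yaourt.ortops.doc may work differently.

```python
import re

# Hypothetical header excerpt; the real ort_sparse_lite.h may look different.
HEADER = '''
/// Converts a 2-D dense float32 tensor into a flat sparse encoding.
/// Only non-zero elements are stored.
struct DenseToSparseKernel {
  void Compute(const Ort::Custom::Tensor<float>& X,
               Ort::Custom::Tensor<float>& Y);
};
'''

# Assumed mapping from C++ element types to ONNX element type names.
CPP_TO_ONNX = {"float": "float32", "double": "float64", "int64_t": "int64"}

# One Compute() argument: optional const, tensor element type, argument name.
ARG = re.compile(r"(const\s+)?Ort::Custom::Tensor<(\w+)>\s*&\s*(\w+)")


def extract_doc(text, struct_name):
    """Collect the run of `///` lines directly above `struct <name>`."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if re.match(rf"\s*struct\s+{re.escape(struct_name)}\b", line):
            doc = []
            j = i - 1
            while j >= 0 and lines[j].lstrip().startswith("///"):
                doc.append(lines[j].lstrip()[3:].strip())
                j -= 1
            return "\n".join(reversed(doc))
    return ""


def extract_args(text, struct_name):
    """Split Compute() arguments into inputs (const refs) and outputs."""
    body = text[text.index(f"struct {struct_name}"):]
    signature = re.search(r"Compute\s*\((.*?)\)", body, re.S).group(1)
    inputs, outputs = [], []
    for const, ctype, name in ARG.findall(signature):
        entry = (name, CPP_TO_ONNX.get(ctype, ctype))
        (inputs if const else outputs).append(entry)
    return inputs, outputs
```

Treating `const` references as inputs and non-const references as outputs is a convention chosen for this sketch; it is not confirmed by the source.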

Supported C++ sources — fused kernel CUDA (custom-op-base API)#

  • yaourt/ortops/fused_kernel/cuda/ort_fused_kernel_cuda_lib.cu — provides the op domain name.

  • yaourt/ortops/fused_kernel/cuda/*.cu (individual kernel files) — provide GetName(), GetInputTypeCount(), GetOutputTypeCount(), and GetExecutionProviderType() implementations.

  • yaourt/ortops/fused_kernel/cuda/*.h (individual header files) — provide /** @file @brief */ Doxygen doc blocks used as op descriptions.
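
For the custom-op-base API, the metadata comes from the literal values returned by the four methods listed above. A minimal sketch, assuming each method body is a single `return <literal>;` statement: the `SOURCE` excerpt, the class name `AddAddOp`, and the helper names are hypothetical, while the `input_<i>` / `output_<i>` naming follows the CUDA_OPS entries shown on this page.

```python
import re

# Hypothetical .cu excerpt; real kernel files may be laid out differently.
SOURCE = '''
const char* AddAddOp::GetName() const { return "AddAdd"; }
const char* AddAddOp::GetExecutionProviderType() const {
  return "CUDAExecutionProvider";
}
size_t AddAddOp::GetInputTypeCount() const { return 3; }
size_t AddAddOp::GetOutputTypeCount() const { return 1; }
'''


def returned_value(text, method):
    """Return the literal of the `return ...;` that opens `method`'s body."""
    pattern = rf"::{method}\s*\(\s*\)\s*const\s*\{{\s*return\s+([^;]+);"
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1).strip().strip('"')


def describe(text):
    """Build a plain dict with the structural metadata of one op."""
    n_in = int(returned_value(text, "GetInputTypeCount"))
    n_out = int(returned_value(text, "GetOutputTypeCount"))
    return {
        "name": returned_value(text, "GetName"),
        "execution_provider": returned_value(text, "GetExecutionProviderType"),
        "inputs": [f"input_{i}" for i in range(n_in)],
        "outputs": [f"output_{i}" for i in range(n_out)],
    }
```

Methods whose bodies branch or compute their return value would defeat this pattern; a sketch like this only covers the simple case.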

The print_cpu_ops() / print_cpu_ops_rst() and print_cuda_ops() / print_cuda_ops_rst() functions render the catalogues as plain text or RST and are intended to be called from .. runpython:: blocks in the Sphinx docs.

yaourt.ortops.doc.CPU_OPS: dict[str, OrtOpDesc] = {'DenseToSparse': OrtOpDesc(name='DenseToSparse', domain='yaourt.ortops.sparse.cpu', since_version=1, execution_provider='CPUExecutionProvider', inputs=[OrtOpInput(name='X', dtype='float32', description='2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.')], outputs=[OrtOpOutput(name='Y', dtype='float32', description='1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).')], doc='Converts a 2-D dense float32 tensor into a compact flat sparse encoding.\nOnly non-zero elements are stored. The 1-D output tensor encodes the\noriginal shape, the number of non-zero elements, their flat indices\n(stored as uint32), and their values (float32). The encoding is suitable\nas input to SparseToDense for a lossless round-trip.\n\nConstraints: input must be exactly 2-D; only float32 is supported.'), 'SparseToDense': OrtOpDesc(name='SparseToDense', domain='yaourt.ortops.sparse.cpu', since_version=1, execution_provider='CPUExecutionProvider', inputs=[OrtOpInput(name='X', dtype='float32', description='1-D float32 sparse encoding produced by DenseToSparse.')], outputs=[OrtOpOutput(name='Y', dtype='float32', description='Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.')], doc='Converts the compact sparse encoding produced by DenseToSparse back into\na 2-D dense float32 tensor. Positions that were zero in the original\ntensor are filled with 0.0. The output shape is recovered from the\nsparse header embedded in the input.\n\nConstraints: input must be 1-D with a valid sparse header; the encoded\nshape must be 2-D; only float32 is supported.')}#

All CPU custom ops provided by yet-another-onnxruntime-extensions, keyed by op name. Populated at import time by parsing the C++ source files.
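
The DenseToSparse / SparseToDense round-trip described in these entries can be illustrated in pure Python. The sketch below follows the documented layout (header | flat indices as uint32 | non-zero values as float32), but the header contents are an assumption: three little-endian uint32 words holding n_rows, n_cols, and the number of non-zero elements. It also works on a bytes buffer rather than a float32 tensor, so it is not the kernels' actual encoding.

```python
import struct


def dense_to_sparse(rows):
    """Encode a 2-D list of floats as header | uint32 indices | float32 values."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [value for row in rows for value in row]
    indices = [i for i, value in enumerate(flat) if value != 0.0]
    values = [flat[i] for i in indices]
    # Assumed header: (n_rows, n_cols, nnz) as three uint32 words.
    header = struct.pack("<3I", n_rows, n_cols, len(indices))
    return (header
            + struct.pack(f"<{len(indices)}I", *indices)
            + struct.pack(f"<{len(values)}f", *values))


def sparse_to_dense(buffer):
    """Rebuild the dense 2-D list; positions not listed are filled with 0.0."""
    n_rows, n_cols, nnz = struct.unpack_from("<3I", buffer, 0)
    indices = struct.unpack_from(f"<{nnz}I", buffer, 12)
    values = struct.unpack_from(f"<{nnz}f", buffer, 12 + 4 * nnz)
    flat = [0.0] * (n_rows * n_cols)
    for index, value in zip(indices, values):
        flat[index] = value
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


dense = [[0.0, 1.5, 0.0], [-2.0, 0.0, 0.0]]
encoded = dense_to_sparse(dense)
restored = sparse_to_dense(encoded)
```

With two non-zero elements the buffer holds 12 header bytes plus 8 bytes of indices and 8 bytes of values, which is smaller than the dense form only when the tensor is sufficiently sparse.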

yaourt.ortops.doc.CUDA_OPS: dict[str, OrtOpDesc] = {'AddAdd': OrtOpDesc(name='AddAdd', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAdd / MulMul CUDA custom operator for three inputs.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddAdd** (``addition`` = ``true`` ): <math>\n- **MulMul** (``addition`` = ``false):`` <math>\n\nWhen all inputs share the same shape the kernel uses a no-broadcast path for\nbetter performance.\n\nSupported element types: ``float,`` ``half.``'), 'AddAddAdd': OrtOpDesc(name='AddAddAdd', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description=''), OrtOpInput(name='input_3', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.\n\nDeclares the kernel and operator classes for two element-wise quaternary\noperations on four broadcastable input tensors A, B, C, D:\n\n- **AddAddAdd** (``addition`` = ``true`` ):\n  <math>\n- **MulMulMul** (``addition`` = ``false):``\n  <math>\n\nSupported element types: ``float,`` ``half.``'), 'AddMul': OrtOpDesc(name='AddMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', 
description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddMul / MulAdd CUDA custom operator.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddMul** (``addition`` = ``true`` ): <math>\n- **MulAdd** (``addition`` = ``false):`` <math>\n\nBoth variants support an optional kernel attribute ``transposeMiddle.``  When\nset to ``true`` on a 4-D input the two middle axes of the output are\ntransposed, which avoids a separate Transpose node in common attention-kernel\npatterns.\n\nSupported element types: ``float,`` ``half.``'), 'AddSharedInput': OrtOpDesc(name='AddSharedInput', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description=''), OrtOpOutput(name='output_1', dtype='T', description='')], doc='Fused AddSharedInput / MulSharedInput CUDA custom operator.\n\nDeclares the kernel and operator classes for an operation that applies the\nsame first input A to two different second inputs B and C simultaneously,\nproducing two outputs:\n\n- **AddSharedInput** (``addition`` = ``true`` ):\n  <math>\n- **MulSharedInput** (``addition`` = ``false):``\n  <math>\n\nThis fused form avoids reading A twice when computing two independent\noperations that share the same operand.\n\nSupported element types: ``float,`` ``half.``'), 'MaskedScatterNDOfShape': OrtOpDesc(name='MaskedScatterNDOfShape', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], 
outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='MaskedScatterNDOfShape CUDA custom operator.\n\nDeclares the kernel and operator classes for a masked variant of the\nScatterNDOfShape operation.  Each scatter step is skipped when the\ncorresponding index value equals a configurable ``masked_value,`` allowing\npadding indices to be ignored without pre-filtering:\n\noutput = zeros(shape)\nfor each i:\nif indices[i] != masked_value:\noutput[indices[i]] op= updates[i]\n\nInputs:\n\n- 0: ``shape``        1-D ``int64`` tensor defining the output shape (CPU).\n- 1: ``indices``      integer indices tensor (CPU).\n- 2: ``updates``      data tensor of type ``T`` (GPU).\n\nThe ``reduction`` and ``masked_value`` behaviour is configured via kernel\nattributes of the same names.\n\nSupported element types: ``float,`` ``half.``'), 'MulAdd': OrtOpDesc(name='MulAdd', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddMul / MulAdd CUDA custom operator.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddMul** (``addition`` = ``true`` ): <math>\n- **MulAdd** (``addition`` = ``false):`` <math>\n\nBoth variants support an optional kernel attribute ``transposeMiddle.``  When\nset to ``true`` on a 4-D input the two middle axes of the output are\ntransposed, which avoids a separate Transpose node in common attention-kernel\npatterns.\n\nSupported element types: ``float,`` ``half.``'), 'MulMul': OrtOpDesc(name='MulMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', 
description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAdd / MulMul CUDA custom operator for three inputs.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddAdd** (``addition`` = ``true`` ): <math>\n- **MulMul** (``addition`` = ``false):`` <math>\n\nWhen all inputs share the same shape the kernel uses a no-broadcast path for\nbetter performance.\n\nSupported element types: ``float,`` ``half.``'), 'MulMulMul': OrtOpDesc(name='MulMulMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description=''), OrtOpInput(name='input_3', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.\n\nDeclares the kernel and operator classes for two element-wise quaternary\noperations on four broadcastable input tensors A, B, C, D:\n\n- **AddAddAdd** (``addition`` = ``true`` ):\n  <math>\n- **MulMulMul** (``addition`` = ``false):``\n  <math>\n\nSupported element types: ``float,`` ``half.``'), 'MulMulSigmoid': OrtOpDesc(name='MulMulSigmoid', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused MulMulSigmoid CUDA custom operator.\n\nDeclares the kernel and operator classes for the element-wise binary\noperation applied to two broadcastable input tensors x and y:\n\n<math>\n\nThe 
sigmoid is applied only to y, making this a gated variant of the\nSiLU / Swish activation commonly found in gated linear units (GLUs) used in\ntransformer feed-forward networks.\n\nSupported element types: ``float,`` ``half.``'), 'MulSharedInput': OrtOpDesc(name='MulSharedInput', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description=''), OrtOpOutput(name='output_1', dtype='T', description='')], doc='Fused AddSharedInput / MulSharedInput CUDA custom operator.\n\nDeclares the kernel and operator classes for an operation that applies the\nsame first input A to two different second inputs B and C simultaneously,\nproducing two outputs:\n\n- **AddSharedInput** (``addition`` = ``true`` ):\n  <math>\n- **MulSharedInput** (``addition`` = ``false):``\n  <math>\n\nThis fused form avoids reading A twice when computing two independent\noperations that share the same operand.\n\nSupported element types: ``float,`` ``half.``'), 'MulSigmoid': OrtOpDesc(name='MulSigmoid', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).\n\nDeclares the kernel and operator classes for the element-wise unary\noperation:\n\n<math>\n\nThis is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation\nfunction, which is commonly used in transformer feed-forward blocks.\n\nSupported element types: ``float,`` ``half.``'), 'MulSub': OrtOpDesc(name='MulSub', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', 
inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused SubMul / MulSub CUDA custom operator.\n\nDeclares the kernel and operator classes for element-wise ternary operations\non three broadcastable input tensors A, B, C.  The template parameter\n``addition`` selects whether subtraction precedes or follows multiplication,\nand the optional kernel attribute ``negative`` inverts the sign of the\nsubtraction operand:\n\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``false):``\n  <math>\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``true):``\n  <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``false):``\n  <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``true):``\n  <math>\n\nSupported element types: ``float,`` ``half.``'), 'NegXplus1': OrtOpDesc(name='NegXplus1', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='NegXplus1 CUDA custom operator element-wise complement (1 x).\n\nDeclares the kernel and operator classes for the unary element-wise\noperation:\n\n<math>\n\nThis is the arithmetic complement of the input, useful for computing\nprobability complements (e.g. 
<math>) without an extra Constant and\nSub node in the graph.\n\nSupported element types: ``float,`` ``half,`` ``int32_t.``'), 'ReplaceZero': OrtOpDesc(name='ReplaceZero', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='ReplaceZero CUDA custom operator substitute zero elements.\n\nDeclares the kernel and operator classes for the unary element-wise\noperation:\n\n<math>\n\nThe replacement scalar ``by`` is read from a kernel attribute of the same\nname.  The operator is useful for masking out padding tokens or avoiding\ndivision by zero in subsequent operations.\n\nSupported element types: ``float,`` ``half.``'), 'Rotary': OrtOpDesc(name='Rotary', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Rotary positional embedding CUDA custom operator.\n\nDeclares the kernel and operator classes that apply a rotary transformation\nto the last dimension of an input tensor.  The operation implements rotary\nposition encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.\n\nEach pair of elements <math> in the last\ndimension is rotated depending on ``RotarySide:``\n\n- **LEFT** side (``rotary_side_`` = LEFT):\n  <math>\n- **RIGHT** side (``rotary_side_`` = RIGHT):\n  <math>\n\nThe side is selected via the kernel attribute ``"side"`` (integer, 1 = LEFT,\n2 = RIGHT).  
The first input provides the data tensor and is expected in\ndevice memory; the second input (optional shape hint) is read from CPU\nmemory.\n\nSupported element types: ``float,`` ``half.``'), 'ScatterNDOfShape': OrtOpDesc(name='ScatterNDOfShape', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='ScatterNDOfShape CUDA custom operator.\n\nDeclares the kernel and operator classes for scattering ``updates`` into a\nzero-initialised output tensor whose shape is defined by the first input.\nThe operation is equivalent to:\n\noutput = zeros(shape)\noutput[indices[i]] op= updates[i]  for each i\n\nwhere ``op`` is determined by the ``reduction`` kernel attribute\n(``Reduction::None`` overwrites, ``Add`` accumulates, etc.).  A second\nstrategy attribute (``Strategy::None`` or ``Optimize)`` lets the kernel choose\na shape-specific fast path at runtime.\n\nInputs:\n\n- 0: ``shape``   1-D ``int64`` tensor defining the output shape (CPU).\n- 1: ``indices`` integer indices tensor (CPU).\n- 2: ``updates`` data tensor of type ``T`` (GPU).\n\nSupported element types: ``float,`` ``half.``'), 'SubMul': OrtOpDesc(name='SubMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused SubMul / MulSub CUDA custom operator.\n\nDeclares the kernel and operator classes for element-wise ternary operations\non three broadcastable input tensors A, B, C.  
The template parameter\n``addition`` selects whether subtraction precedes or follows multiplication,\nand the optional kernel attribute ``negative`` inverts the sign of the\nsubtraction operand:\n\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``false):``\n  <math>\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``true):``\n  <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``false):``\n  <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``true):``\n  <math>\n\nSupported element types: ``float,`` ``half.``'), 'Transpose2DCastFP16': OrtOpDesc(name='Transpose2DCastFP16', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Transpose2DCast CUDA custom operator 2-D matrix transpose with type\nconversion.\n\nDeclares the kernel and operator classes for transposing a 2-D matrix and\nsimultaneously casting its elements to a different numeric type.  
Two\nregistered operator names are available depending on the output type:\n\n- **Transpose2DCastFP16** input ``float`` output ``half.``\n- **Transpose2DCastFP32** input ``half`` output ``float.``\n\nThe operator fuses the Transpose and Cast nodes into a single tiled CUDA\nkernel, avoiding a round-trip through global memory compared to executing\nthem separately.\n\nThe input and output types are chosen at construction time and stored in the\noperator descriptor.'), 'Transpose2DCastFP32': OrtOpDesc(name='Transpose2DCastFP32', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Transpose2DCast CUDA custom operator 2-D matrix transpose with type\nconversion.\n\nDeclares the kernel and operator classes for transposing a 2-D matrix and\nsimultaneously casting its elements to a different numeric type.  Two\nregistered operator names are available depending on the output type:\n\n- **Transpose2DCastFP16** input ``float`` output ``half.``\n- **Transpose2DCastFP32** input ``half`` output ``float.``\n\nThe operator fuses the Transpose and Cast nodes into a single tiled CUDA\nkernel, avoiding a round-trip through global memory compared to executing\nthem separately.\n\nThe input and output types are chosen at construction time and stored in the\noperator descriptor.'), 'TriMatrix': OrtOpDesc(name='TriMatrix', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='TriMatrix CUDA custom operator triangular matrix generator.\n\nDeclares the kernel and operator classes that create a 2-D triangular matrix\nwhose lower-triangle, diagonal, and upper-triangle 
elements are filled with\nuser-supplied scalar values.\n\nFor a matrix of size <math>, element\n<math> is set to:\n\n<math>\n\nInputs:\n\n- 0: ``shape``  1-D ``int64`` tensor <math> (CPU).\n- 1: ``values`` 1-D tensor of type ``T`` with three elements\n  <math> (CPU).\n\nSupported element types: ``float,`` ``half.``')}#

All fused-kernel CUDA custom ops provided by yet-another-onnxruntime-extensions, keyed by op name. Populated at import time by parsing the C++ source files.
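
The ScatterNDOfShape and MaskedScatterNDOfShape entries give their semantics as pseudocode. A minimal pure-Python reading of that pseudocode follows, assuming a 1-D output and scalar indices (the descriptions do not restrict the kernels this way) and only two reduction modes. The function name, the `"none"` / `"add"` spellings, and the use of `None` to disable masking are illustrative choices, not the library's API.

```python
def scatter_nd_of_shape(shape, indices, updates, reduction="add", masked_value=None):
    """Scatter `updates` into a zero-initialised 1-D output of length shape[0]."""
    (size,) = shape
    output = [0.0] * size
    for index, update in zip(indices, updates):
        # Masked variant: skip the scatter step for padding indices.
        if masked_value is not None and index == masked_value:
            continue
        if reduction == "add":
            output[index] += update      # accumulate
        elif reduction == "none":
            output[index] = update       # overwrite
        else:
            raise ValueError(f"unsupported reduction: {reduction!r}")
    return output


accumulated = scatter_nd_of_shape([4], [0, 2, 0], [1.0, 2.0, 3.0])
masked = scatter_nd_of_shape([4], [0, -1, 2], [1.0, 9.0, 2.0], masked_value=-1)
```

In the first call index 0 receives two updates that are summed; in the second call the update paired with index -1 is dropped.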

yaourt.ortops.doc.FUSED_KERNEL_CPU_OPS: dict[str, OrtOpDesc] = {'MulMul': OrtOpDesc(name='MulMul', domain='yaourt.ortops.fused_kernel.cpu', since_version=1, execution_provider='CPUExecutionProvider', inputs=[OrtOpInput(name='A', dtype='float32', description=''), OrtOpInput(name='B', dtype='float32', description=''), OrtOpInput(name='C', dtype='float32', description='')], outputs=[OrtOpOutput(name='output', dtype='float32', description='')], doc='Fused MulMul CPU custom operator.\n\nDeclares the kernel class for the element-wise ternary multiplication\napplied to three broadcastable input tensors A, B, C:\n\n<math>\n\nThe implementation uses AVX2 SIMD instructions for vectorised throughput\nand std::thread for multi-threaded parallelism.\n\nSupported element type: ``float.``')}#

All fused-kernel CPU custom ops provided by yet-another-onnxruntime-extensions, keyed by op name. Populated at import time by parsing the C++ source files.

class yaourt.ortops.doc.OrtOpDesc(name: str, domain: str, since_version: int, execution_provider: str, inputs: List[OrtOpInput] = <factory>, outputs: List[OrtOpOutput] = <factory>, doc: str = '')#

Describes a single custom ORT op.

Parameters:
  • name – op name as registered with ONNX Runtime

  • domain – ONNX domain the op belongs to

  • since_version – opset version in which the op was introduced

  • execution_provider – execution provider (e.g. "CPUExecutionProvider")

  • inputs – ordered list of input descriptors

  • outputs – ordered list of output descriptors

  • doc – longer plain-text description of the op’s semantics
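
The signatures on this page are consistent with plain dataclasses: `<factory>` defaults correspond to `field(default_factory=...)`, and `__hash__ = None` is what `@dataclass` produces with its default `eq=True`. The classes below are stand-ins written to mirror the documented signatures, not the library's source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrtOpInput:
    name: str
    dtype: str
    description: str


@dataclass
class OrtOpOutput:
    name: str
    dtype: str
    description: str


@dataclass
class OrtOpDesc:
    name: str
    domain: str
    since_version: int
    execution_provider: str
    # Mutable defaults need a factory so that instances do not share one list.
    inputs: List[OrtOpInput] = field(default_factory=list)
    outputs: List[OrtOpOutput] = field(default_factory=list)
    doc: str = ""


desc = OrtOpDesc(
    name="SparseToDense",
    domain="yaourt.ortops.sparse.cpu",
    since_version=1,
    execution_provider="CPUExecutionProvider",
    inputs=[OrtOpInput("X", "float32",
                       "1-D float32 sparse encoding produced by DenseToSparse.")],
    outputs=[OrtOpOutput("Y", "float32",
                         "Reconstructed 2-D dense float32 tensor.")],
)
```

Because instances compare by value and are unhashable, they can be checked with `==` but cannot be used as dict keys or set members.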

__eq__(other)#

Return self==value.

__hash__ = None#
__init__(name: str, domain: str, since_version: int, execution_provider: str, inputs: List[OrtOpInput] = <factory>, outputs: List[OrtOpOutput] = <factory>, doc: str = '') → None#
__repr__()#

Return repr(self).

__weakref__#

list of weak references to the object

class yaourt.ortops.doc.OrtOpInput(name: str, dtype: str, description: str)#

Describes one input of a custom ORT op.

Parameters:
  • name – argument name used in the op signature

  • dtype – ONNX element type (e.g. "float32")

  • description – human-readable description of what the input represents

__eq__(other)#

Return self==value.

__hash__ = None#
__init__(name: str, dtype: str, description: str) → None#
__repr__()#

Return repr(self).

__weakref__#

list of weak references to the object

class yaourt.ortops.doc.OrtOpOutput(name: str, dtype: str, description: str)#

Describes one output of a custom ORT op.

Parameters:
  • name – argument name used in the op signature

  • dtype – ONNX element type (e.g. "float32")

  • description – human-readable description of what the output represents

__eq__(other)#

Return self==value.

__hash__ = None#
__init__(name: str, dtype: str, description: str) → None#
__repr__()#

Return repr(self).

__weakref__#

list of weak references to the object

yaourt.ortops.doc.print_cpu_ops() → None#

Prints the CPU custom-op catalogue to stdout.

Renders CPU_OPS as plain text suitable for a .. runpython:: block in the Sphinx documentation, ensuring the rendered output is always derived from the C++ source files.

<<<

from yaourt.ortops.doc import print_cpu_ops

print_cpu_ops()

>>>

    DenseToSparse
      domain   : yaourt.ortops.sparse.cpu
      provider : CPUExecutionProvider
      version  : 1
      Converts a 2-D dense float32 tensor into a compact flat sparse encoding.
      Only non-zero elements are stored. The 1-D output tensor encodes the
      original shape, the number of non-zero elements, their flat indices
      (stored as uint32), and their values (float32). The encoding is suitable
      as input to SparseToDense for a lossless round-trip.
      
      Constraints: input must be exactly 2-D; only float32 is supported.
      inputs:
        X (float32) — 2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.
      outputs:
        Y (float32) — 1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).
    
    SparseToDense
      domain   : yaourt.ortops.sparse.cpu
      provider : CPUExecutionProvider
      version  : 1
      Converts the compact sparse encoding produced by DenseToSparse back into
      a 2-D dense float32 tensor. Positions that were zero in the original
      tensor are filled with 0.0. The output shape is recovered from the
      sparse header embedded in the input.
      
      Constraints: input must be 1-D with a valid sparse header; the encoded
      shape must be 2-D; only float32 is supported.
      inputs:
        X (float32) — 1-D float32 sparse encoding produced by DenseToSparse.
      outputs:
        Y (float32) — Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.
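
The plain-text layout shown above can be reproduced by a short renderer. This is a sketch that imitates the visible output format, not the implementation of print_cpu_ops(); it takes a plain dict instead of an OrtOpDesc, and the names `render_text`, `SEP`, and `OP` are hypothetical.

```python
# Separator between "name (dtype)" and the description, as seen in the output.
SEP = " \u2014 "

OP = {
    "name": "SparseToDense",
    "domain": "yaourt.ortops.sparse.cpu",
    "execution_provider": "CPUExecutionProvider",
    "since_version": 1,
    "doc": "Converts the compact sparse encoding back into\na 2-D dense float32 tensor.",
    "inputs": [("X", "float32", "1-D float32 sparse encoding.")],
    "outputs": [("Y", "float32", "Reconstructed 2-D dense float32 tensor.")],
}


def render_text(op):
    """Render one op as an indented plain-text entry."""
    lines = [
        op["name"],
        f"  domain   : {op['domain']}",
        f"  provider : {op['execution_provider']}",
        f"  version  : {op['since_version']}",
    ]
    # Doc lines keep their own line breaks and are indented under the name.
    lines += [f"  {line}" for line in op["doc"].splitlines()]
    for label in ("inputs", "outputs"):
        lines.append(f"  {label}:")
        for name, dtype, description in op[label]:
            lines.append(f"    {name} ({dtype}){SEP}{description}")
    return "\n".join(lines)


text = render_text(OP)
```

Printing `text` for every entry of a catalogue, separated by blank lines, gives a listing in the same shape as the one above.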
yaourt.ortops.doc.print_cpu_ops_rst() → None#

Renders the CPU custom-op catalogue as RST and writes it to stdout.

The output is valid reStructuredText, suitable for a .. runpython:: :rst: block in the Sphinx documentation. Each op becomes a sub-section with a list-table for its metadata and bulleted lists for its inputs and outputs. Because the text is generated from CPU_OPS, the rendered page is always derived from the C++ source files and needs no manual maintenance.

<<<

from yaourt.ortops.doc import print_cpu_ops_rst

print_cpu_ops_rst()

>>>

DenseToSparse#

Domain

yaourt.ortops.sparse.cpu

Execution provider

CPUExecutionProvider

Since version

1

Converts a 2-D dense float32 tensor into a compact flat sparse encoding. Only non-zero elements are stored. The 1-D output tensor encodes the original shape, the number of non-zero elements, their flat indices (stored as uint32), and their values (float32). The encoding is suitable as input to SparseToDense for a lossless round-trip.

Constraints: input must be exactly 2-D; only float32 is supported.

Inputs

  • X (float32) — 2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.

Outputs

  • Y (float32) — 1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).

SparseToDense#

Domain

yaourt.ortops.sparse.cpu

Execution provider

CPUExecutionProvider

Since version

1

Converts the compact sparse encoding produced by DenseToSparse back into a 2-D dense float32 tensor. Positions that were zero in the original tensor are filled with 0.0. The output shape is recovered from the sparse header embedded in the input.

Constraints: input must be 1-D with a valid sparse header; the encoded shape must be 2-D; only float32 is supported.

Inputs

  • X (float32) — 1-D float32 sparse encoding produced by DenseToSparse.

Outputs

  • Y (float32) — Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.
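
The RST structure described for print_cpu_ops_rst() (a sub-section per op, a list-table for the metadata, bulleted lists for inputs and outputs) could be generated along the following lines. This is a sketch under assumptions: the underline character, the bold "Inputs" / "Outputs" labels, and the names `render_rst` and `OP` are choices made for the example, and the actual RST emitted by the library is not shown on this page.

```python
OP = {
    "name": "DenseToSparse",
    "domain": "yaourt.ortops.sparse.cpu",
    "execution_provider": "CPUExecutionProvider",
    "since_version": 1,
    "doc": "Converts a 2-D dense float32 tensor into a compact flat sparse encoding.",
    "inputs": [("X", "float32", "2-D dense float32 input tensor.")],
    "outputs": [("Y", "float32", "1-D float32 sparse encoding of X.")],
}


def render_rst(op, underline="-"):
    """Render one op as an RST sub-section with a two-column list-table."""
    name = op["name"]
    lines = [name, underline * len(name), "", ".. list-table::", ""]
    for label, key in (("Domain", "domain"),
                       ("Execution provider", "execution_provider"),
                       ("Since version", "since_version")):
        # Each table row is a bullet holding a nested two-item list.
        lines.append(f"   * - {label}")
        lines.append(f"     - {op[key]}")
    lines += ["", op["doc"], ""]
    for title, key in (("Inputs", "inputs"), ("Outputs", "outputs")):
        lines += [f"**{title}**", ""]
        for arg_name, dtype, description in op[key]:
            lines.append(f"* ``{arg_name}`` ({dtype}): {description}")
        lines.append("")
    return "\n".join(lines)


rst = render_rst(OP)
```

The section underline must be at least as long as the title, which is why its length is computed from the op name rather than hard-coded.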

yaourt.ortops.doc.print_cuda_ops() → None#

Prints the fused-kernel CUDA custom-op catalogue to stdout.

Renders CUDA_OPS as plain text suitable for a .. runpython:: block in the Sphinx documentation, ensuring the rendered output is always derived from the C++ source files.

<<<

from yaourt.ortops.doc import print_cuda_ops

print_cuda_ops()

>>>

    AddAdd
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddAdd / MulMul CUDA custom operator for three inputs.
      
      Declares the kernel and operator classes for two element-wise ternary
      operations on three broadcastable input tensors A, B, C:
      
      - **AddAdd** (``addition`` = ``true`` ): <math>
      - **MulMul** (``addition`` = ``false):`` <math>
      
      When all inputs share the same shape the kernel uses a no-broadcast path for
      better performance.
      
      Supported element types: ``float,`` ``half.``
      inputs   : T x3
      outputs  : T x1
    
    AddAddAdd
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.
      
      Declares the kernel and operator classes for two element-wise quaternary
      operations on four broadcastable input tensors A, B, C, D:
      
      - **AddAddAdd** (``addition`` = ``true`` ):
        <math>
      - **MulMulMul** (``addition`` = ``false):``
        <math>
      
      Supported element types: ``float,`` ``half.``
      inputs   : T x4
      outputs  : T x1
    
    AddMul
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddMul / MulAdd CUDA custom operator.
      
      Declares the kernel and operator classes for two element-wise ternary
      operations on three broadcastable input tensors A, B, C:
      
      - **AddMul** (``addition`` = ``true`` ): <math>
      - **MulAdd** (``addition`` = ``false):`` <math>
      
      Both variants support an optional kernel attribute ``transposeMiddle.``  When
      set to ``true`` on a 4-D input the two middle axes of the output are
      transposed, which avoids a separate Transpose node in common attention-kernel
      patterns.
      
      Supported element types: ``float,`` ``half.``
      inputs   : T x3
      outputs  : T x1
    
    AddSharedInput
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddSharedInput / MulSharedInput CUDA custom operator.
      
      Declares the kernel and operator classes for an operation that applies the
      same first input A to two different second inputs B and C simultaneously,
      producing two outputs:
      
      - **AddSharedInput** (``addition`` = ``true``):
        :math:`(A + B,\ A + C)`
      - **MulSharedInput** (``addition`` = ``false``):
        :math:`(A \cdot B,\ A \cdot C)`
      
      This fused form avoids reading A twice when computing two independent
      operations that share the same operand.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x2
    
    MaskedScatterNDOfShape
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      MaskedScatterNDOfShape CUDA custom operator.
      
      Declares the kernel and operator classes for a masked variant of the
      ScatterNDOfShape operation.  Each scatter step is skipped when the
      corresponding index value equals a configurable ``masked_value``, allowing
      padding indices to be ignored without pre-filtering:
      
      output = zeros(shape)
      for each i:
          if indices[i] != masked_value:
              output[indices[i]] op= updates[i]
      
      Inputs:
      
      - 0: ``shape``        — 1-D ``int64`` tensor defining the output shape (CPU).
      - 1: ``indices``      — integer indices tensor (CPU).
      - 2: ``updates``      — data tensor of type ``T`` (GPU).
      
      The ``reduction`` and ``masked_value`` behaviour is configured via kernel
      attributes of the same names.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x1
    
    MulAdd
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddMul / MulAdd CUDA custom operator.
      
      Declares the kernel and operator classes for two element-wise ternary
      operations on three broadcastable input tensors A, B, C:
      
      - **AddMul** (``addition`` = ``true``): :math:`(A + B) \cdot C`
      - **MulAdd** (``addition`` = ``false``): :math:`A \cdot B + C`
      
      Both variants support an optional kernel attribute ``transposeMiddle``.  When
      set to ``true`` on a 4-D input the two middle axes of the output are
      transposed, which avoids a separate Transpose node in common attention-kernel
      patterns.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x1
    
    MulMul
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddAdd / MulMul CUDA custom operator for three inputs.
      
      Declares the kernel and operator classes for two element-wise ternary
      operations on three broadcastable input tensors A, B, C:
      
      - **AddAdd** (``addition`` = ``true``): :math:`A + B + C`
      - **MulMul** (``addition`` = ``false``): :math:`A \cdot B \cdot C`
      
      When all inputs share the same shape the kernel uses a no-broadcast path for
      better performance.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x1
    
    MulMulMul
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.
      
      Declares the kernel and operator classes for two element-wise quaternary
      operations on four broadcastable input tensors A, B, C, D:
      
      - **AddAddAdd** (``addition`` = ``true``):
        :math:`A + B + C + D`
      - **MulMulMul** (``addition`` = ``false``):
        :math:`A \cdot B \cdot C \cdot D`
      
      Supported element types: ``float``, ``half``.
      inputs   : T x4
      outputs  : T x1
    
    MulMulSigmoid
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused MulMulSigmoid CUDA custom operator.
      
      Declares the kernel and operator classes for the element-wise binary
      operation applied to two broadcastable input tensors x and y:
      
      :math:`x \cdot y \cdot \sigma(y)`
      
      The sigmoid is applied only to y, making this a gated variant of the
      SiLU / Swish activation commonly found in gated linear units (GLUs) used in
      transformer feed-forward networks.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x2
      outputs  : T x1
    
    MulSharedInput
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused AddSharedInput / MulSharedInput CUDA custom operator.
      
      Declares the kernel and operator classes for an operation that applies the
      same first input A to two different second inputs B and C simultaneously,
      producing two outputs:
      
      - **AddSharedInput** (``addition`` = ``true``):
        :math:`(A + B,\ A + C)`
      - **MulSharedInput** (``addition`` = ``false``):
        :math:`(A \cdot B,\ A \cdot C)`
      
      This fused form avoids reading A twice when computing two independent
      operations that share the same operand.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x2
    
    MulSigmoid
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).
      
      Declares the kernel and operator classes for the element-wise unary
      operation:
      
      :math:`x \cdot \sigma(x)`
      
      This is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation
      function, which is commonly used in transformer feed-forward blocks.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x1
      outputs  : T x1
    
    MulSub
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused SubMul / MulSub CUDA custom operator.
      
      Declares the kernel and operator classes for element-wise ternary operations
      on three broadcastable input tensors A, B, C.  The template parameter
      ``addition`` selects whether subtraction precedes or follows multiplication,
      and the optional kernel attribute ``negative`` inverts the sign of the
      subtraction operand:
      
      - **SubMul** (``addition`` = ``true``, ``negative`` = ``false``):
        <math>
      - **SubMul** (``addition`` = ``true``, ``negative`` = ``true``):
        <math>
      - **MulSub** (``addition`` = ``false``, ``negative`` = ``false``):
        <math>
      - **MulSub** (``addition`` = ``false``, ``negative`` = ``true``):
        <math>
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x1
    
    NegXplus1
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      NegXplus1 CUDA custom operator — element-wise complement (1 − x).
      
      Declares the kernel and operator classes for the unary element-wise
      operation:
      
      :math:`y = 1 - x`
      
      This is the arithmetic complement of the input, useful for computing
      probability complements (e.g. <math>) without an extra Constant and
      Sub node in the graph.
      
      Supported element types: ``float``, ``half``, ``int32_t``.
      inputs   : T x1
      outputs  : T x1
    
    ReplaceZero
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      ReplaceZero CUDA custom operator — substitute zero elements.
      
      Declares the kernel and operator classes for the unary element-wise
      operation:
      
      :math:`y_i = \begin{cases} \mathrm{by} & \text{if } x_i = 0 \\ x_i & \text{otherwise} \end{cases}`
      
      The replacement scalar ``by`` is read from a kernel attribute of the same
      name.  The operator is useful for masking out padding tokens or avoiding
      division by zero in subsequent operations.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x1
      outputs  : T x1
    
    Rotary
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Rotary positional embedding CUDA custom operator.
      
      Declares the kernel and operator classes that apply a rotary transformation
      to the last dimension of an input tensor.  The operation implements rotary
      position encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.
      
      Each pair of elements <math> in the last
      dimension is rotated depending on ``RotarySide``:
      
      - **LEFT** side (``rotary_side_`` = LEFT):
        <math>
      - **RIGHT** side (``rotary_side_`` = RIGHT):
        <math>
      
      The side is selected via the kernel attribute ``"side"`` (integer, 1 = LEFT,
      2 = RIGHT).  The first input provides the data tensor and is expected in
      device memory; the second input (optional shape hint) is read from CPU
      memory.
      
      Supported element types: ``float``, ``half``.
      inputs   : T x2
      outputs  : T x1
    
    ScatterNDOfShape
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      ScatterNDOfShape CUDA custom operator.
      
      Declares the kernel and operator classes for scattering ``updates`` into a
      zero-initialised output tensor whose shape is defined by the first input.
      The operation is equivalent to:
      
      output = zeros(shape)
      output[indices[i]] op= updates[i]  for each i
      
      where ``op`` is determined by the ``reduction`` kernel attribute
      (``Reduction::None`` overwrites, ``Add`` accumulates, etc.).  A second
      strategy attribute (``Strategy::None`` or ``Optimize``) lets the kernel choose
      a shape-specific fast path at runtime.
      
      Inputs:
      
      - 0: ``shape``   — 1-D ``int64`` tensor defining the output shape (CPU).
      - 1: ``indices`` — integer indices tensor (CPU).
      - 2: ``updates`` — data tensor of type ``T`` (GPU).
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x1
    
    SubMul
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Fused SubMul / MulSub CUDA custom operator.
      
      Declares the kernel and operator classes for element-wise ternary operations
      on three broadcastable input tensors A, B, C.  The template parameter
      ``addition`` selects whether subtraction precedes or follows multiplication,
      and the optional kernel attribute ``negative`` inverts the sign of the
      subtraction operand:
      
      - **SubMul** (``addition`` = ``true``, ``negative`` = ``false``):
        <math>
      - **SubMul** (``addition`` = ``true``, ``negative`` = ``true``):
        <math>
      - **MulSub** (``addition`` = ``false``, ``negative`` = ``false``):
        <math>
      - **MulSub** (``addition`` = ``false``, ``negative`` = ``true``):
        <math>
      
      Supported element types: ``float``, ``half``.
      inputs   : T x3
      outputs  : T x1
    
    Transpose2DCastFP16
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Transpose2DCast CUDA custom operator — 2-D matrix transpose with type
      conversion.
      
      Declares the kernel and operator classes for transposing a 2-D matrix and
      simultaneously casting its elements to a different numeric type.  Two
      registered operator names are available depending on the output type:
      
      - **Transpose2DCastFP16** — input ``float`` → output ``half``.
      - **Transpose2DCastFP32** — input ``half`` → output ``float``.
      
      The operator fuses the Transpose and Cast nodes into a single tiled CUDA
      kernel, avoiding a round-trip through global memory compared to executing
      them separately.
      
      The input and output types are chosen at construction time and stored in the
      operator descriptor.
      inputs   : T x1
      outputs  : T x1
    
    Transpose2DCastFP32
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      Transpose2DCast CUDA custom operator — 2-D matrix transpose with type
      conversion.
      
      Declares the kernel and operator classes for transposing a 2-D matrix and
      simultaneously casting its elements to a different numeric type.  Two
      registered operator names are available depending on the output type:
      
      - **Transpose2DCastFP16** — input ``float`` → output ``half``.
      - **Transpose2DCastFP32** — input ``half`` → output ``float``.
      
      The operator fuses the Transpose and Cast nodes into a single tiled CUDA
      kernel, avoiding a round-trip through global memory compared to executing
      them separately.
      
      The input and output types are chosen at construction time and stored in the
      operator descriptor.
      inputs   : T x1
      outputs  : T x1
    
    TriMatrix
      domain   : yaourt.ortops.fused_kernel.cuda
      provider : CUDAExecutionProvider
      version  : 1
      TriMatrix CUDA custom operator — triangular matrix generator.
      
      Declares the kernel and operator classes that create a 2-D triangular matrix
      whose lower-triangle, diagonal, and upper-triangle elements are filled with
      user-supplied scalar values.
      
      For a matrix of size <math>, element
      <math> is set to:
      
      <math>
      
      Inputs:
      
      - 0: ``shape``  — 1-D ``int64`` tensor <math> (CPU).
      - 1: ``values`` — 1-D tensor of type ``T`` with three elements
        <math> (CPU).
      
      Supported element types: ``float``, ``half``.
      inputs   : T x2
      outputs  : T x1
yaourt.ortops.doc.print_cuda_ops_rst() → None#

Renders the fused-kernel CUDA custom-op catalogue as RST and writes it to stdout.

Renders CUDA_OPS as valid reStructuredText suitable for a .. runpython:: :rst: block in the Sphinx documentation. Each op is rendered as a sub-section with a list-table for its metadata followed by its description, ensuring the rendered page is always derived from the C++ source files without manual maintenance.
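As a rough illustration of the layout described above, the sketch below renders one op as an RST sub-section with a list-table followed by its description. `OpInfo` and `render_op_rst` are hypothetical stand-ins written for this example; they are not part of yaourt.ortops.doc, and the real function may format its output differently.

```python
from dataclasses import dataclass


@dataclass
class OpInfo:
    """Hypothetical stand-in for a catalogue entry (not the real OrtOpDesc)."""
    name: str
    domain: str
    execution_provider: str
    n_inputs: int
    n_outputs: int
    doc: str


def render_op_rst(op: OpInfo) -> str:
    """Render one op as an RST sub-section: title, list-table, description."""
    rows = [
        ("Domain", op.domain),
        ("Execution provider", op.execution_provider),
        ("Inputs", str(op.n_inputs)),
        ("Outputs", str(op.n_outputs)),
    ]
    # Section title underlined with dashes, then a two-column list-table.
    lines = [op.name, "-" * len(op.name), "", ".. list-table::", ""]
    for key, value in rows:
        lines.append(f"   * - {key}")
        lines.append(f"     - {value}")
    lines += ["", op.doc, ""]
    return "\n".join(lines)


op = OpInfo(
    "AddAdd",
    "yaourt.ortops.fused_kernel.cuda",
    "CUDAExecutionProvider",
    3,
    1,
    "Fused AddAdd / MulMul CUDA custom operator for three inputs.",
)
print(render_op_rst(op))
```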

<<<

from yaourt.ortops.doc import print_cuda_ops_rst

print_cuda_ops_rst()

>>>

AddAdd#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

Fused AddAdd / MulMul CUDA custom operator for three inputs.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddAdd (addition = true): $A + B + C$

  • MulMul (addition = false): $A \cdot B \cdot C$

When all inputs share the same shape the kernel uses a no-broadcast path for better performance.

Supported element types: float, half.

AddAddAdd#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 4
Outputs: 1

Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.

Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:

  • AddAddAdd (addition = true): $A + B + C + D$

  • MulMulMul (addition = false): $A \cdot B \cdot C \cdot D$

Supported element types: float, half.

AddMul#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

Fused AddMul / MulAdd CUDA custom operator.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddMul (addition = true): $(A + B) \cdot C$

  • MulAdd (addition = false): $A \cdot B + C$

Both variants support an optional kernel attribute transposeMiddle. When set to true on a 4-D input the two middle axes of the output are transposed, which avoids a separate Transpose node in common attention-kernel patterns.

Supported element types: float, half.
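The effect of transposeMiddle can be sketched in plain Python on nested lists. This is only a reference sketch: it assumes AddMul computes (A + B) * C element-wise, as the operator name suggests, and that all three inputs already share the same 4-D shape (no broadcasting). The helper names `add_mul` and `transpose_middle` are hypothetical.

```python
def add_mul(a, b, c):
    """Element-wise (a + b) * c on same-shape 4-D nested lists (assumed formula)."""
    return [[[[(x + y) * z for x, y, z in zip(ra, rb, rc)]
              for ra, rb, rc in zip(ma, mb, mc)]
             for ma, mb, mc in zip(ta, tb, tc)]
            for ta, tb, tc in zip(a, b, c)]


def transpose_middle(t):
    """Swap axes 1 and 2 of a 4-D nested list: out[n][j][i] = t[n][i][j]."""
    return [[list(col) for col in zip(*block)] for block in t]


# Shape (1, 2, 3, 1): one batch, two "heads", three positions, one feature.
a = [[[[1], [2], [3]], [[4], [5], [6]]]]
b = [[[[1], [1], [1]], [[1], [1], [1]]]]
c = [[[[2], [2], [2]], [[2], [2], [2]]]]

# The fused op with transposeMiddle enabled would produce shape (1, 3, 2, 1).
fused = transpose_middle(add_mul(a, b, c))
print(fused)
```

Without the fused attribute, the same result would need a separate Transpose node after the arithmetic.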

AddSharedInput#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 2

Fused AddSharedInput / MulSharedInput CUDA custom operator.

Declares the kernel and operator classes for an operation that applies the same first input A to two different second inputs B and C simultaneously, producing two outputs:

  • AddSharedInput (addition = true): $(A + B,\ A + C)$

  • MulSharedInput (addition = false): $(A \cdot B,\ A \cdot C)$

This fused form avoids reading A twice when computing two independent operations that share the same operand.

Supported element types: float, half.
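A minimal sketch of the shared-operand idea, assuming the two outputs are A + B and A + C and that the inputs are flat sequences of equal length. The function name `add_shared_input` is hypothetical; the point is that each element of A is read once and used for both results.

```python
def add_shared_input(a, b, c):
    """Return (a + b, a + c), reading each element of `a` a single time."""
    out_b, out_c = [], []
    for x, y, z in zip(a, b, c):
        # x is loaded once and reused for both outputs.
        out_b.append(x + y)
        out_c.append(x + z)
    return out_b, out_c


sum_ab, sum_ac = add_shared_input([1.0, 2.0], [10.0, 20.0], [100.0, 200.0])
print(sum_ab, sum_ac)
```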

MaskedScatterNDOfShape#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

MaskedScatterNDOfShape CUDA custom operator.

Declares the kernel and operator classes for a masked variant of the ScatterNDOfShape operation. Each scatter step is skipped when the corresponding index value equals a configurable masked_value, allowing padding indices to be ignored without pre-filtering:

    output = zeros(shape)
    for each i:
        if indices[i] != masked_value:
            output[indices[i]] op= updates[i]

Inputs:

  • 0: shape — 1-D int64 tensor defining the output shape (CPU).

  • 1: indices — integer indices tensor (CPU).

  • 2: updates — data tensor of type T (GPU).

The reduction and masked_value behaviour is configured via kernel attributes of the same names.

Supported element types: float, half.
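The pseudo-code above can be turned into a small reference function. This sketch makes several simplifying assumptions that the source does not state: the output is 2-D, each index selects one output row, the reduction is an addition, and the function name is hypothetical.

```python
def masked_scatter_add(shape, indices, updates, masked_value):
    """Add rows of `updates` into a zero matrix, skipping masked indices."""
    n_rows, n_cols = shape
    output = [[0.0] * n_cols for _ in range(n_rows)]
    for idx, row in zip(indices, updates):
        if idx == masked_value:
            continue  # padding index: this scatter step is skipped
        for j, value in enumerate(row):
            output[idx][j] += value
    return output


result = masked_scatter_add(
    shape=(3, 2),
    indices=[0, -1, 0, 2],          # -1 marks a padding entry
    updates=[[1.0, 1.0], [9.0, 9.0], [2.0, 2.0], [5.0, 5.0]],
    masked_value=-1,
)
print(result)
```

The update paired with the masked index never reaches the output, so no pre-filtering of `indices` is needed.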

MulAdd#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

Fused AddMul / MulAdd CUDA custom operator.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddMul (addition = true): $(A + B) \cdot C$

  • MulAdd (addition = false): $A \cdot B + C$

Both variants support an optional kernel attribute transposeMiddle. When set to true on a 4-D input the two middle axes of the output are transposed, which avoids a separate Transpose node in common attention-kernel patterns.

Supported element types: float, half.

MulMul#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

Fused AddAdd / MulMul CUDA custom operator for three inputs.

Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:

  • AddAdd (addition = true): $A + B + C$

  • MulMul (addition = false): $A \cdot B \cdot C$

When all inputs share the same shape the kernel uses a no-broadcast path for better performance.

Supported element types: float, half.

MulMulMul#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 4
Outputs: 1

Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.

Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:

  • AddAddAdd (addition = true): $A + B + C + D$

  • MulMulMul (addition = false): $A \cdot B \cdot C \cdot D$

Supported element types: float, half.

MulMulSigmoid#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 2
Outputs: 1

Fused MulMulSigmoid CUDA custom operator.

Declares the kernel and operator classes for the element-wise binary operation applied to two broadcastable input tensors x and y:

$x \cdot y \cdot \sigma(y)$

The sigmoid is applied only to y, making this a gated variant of the SiLU / Swish activation commonly found in gated linear units (GLUs) used in transformer feed-forward networks.

Supported element types: float, half.
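The gating relation can be checked with a short reference sketch. It assumes the fused result is x * y * sigmoid(y), which is what "the sigmoid is applied only to y" suggests, and it works on flat same-length lists. The helper names are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function; adequate for moderate |v|."""
    return 1.0 / (1.0 + math.exp(-v))


def silu(v):
    """SiLU / Swish: v * sigmoid(v)."""
    return v * sigmoid(v)


def mul_mul_sigmoid(x, y):
    """Assumed fused form: x * y * sigmoid(y), element-wise."""
    return [a * b * sigmoid(b) for a, b in zip(x, y)]


x = [2.0, -1.0, 0.5]
y = [0.0, 1.0, -3.0]
fused = mul_mul_sigmoid(x, y)
# Unfused equivalent: gate x with SiLU(y).
unfused = [a * silu(b) for a, b in zip(x, y)]
print(fused)
```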

MulSharedInput#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 2

Fused AddSharedInput / MulSharedInput CUDA custom operator.

Declares the kernel and operator classes for an operation that applies the same first input A to two different second inputs B and C simultaneously, producing two outputs:

  • AddSharedInput (addition = true): $(A + B,\ A + C)$

  • MulSharedInput (addition = false): $(A \cdot B,\ A \cdot C)$

This fused form avoids reading A twice when computing two independent operations that share the same operand.

Supported element types: float, half.

MulSigmoid#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 1
Outputs: 1

Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).

Declares the kernel and operator classes for the element-wise unary operation:

$x \cdot \sigma(x)$

This is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation function, which is commonly used in transformer feed-forward blocks.

Supported element types: float, half.
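For reference, SiLU can be written in a few lines of Python. The sketch below uses a branch on the sign of the input so that `math.exp` is never called with a large positive argument; this is a common way to keep the sigmoid numerically safe and is not necessarily how the CUDA kernel does it. The function names are hypothetical.

```python
import math


def stable_sigmoid(v):
    """Sigmoid that avoids overflow in exp() for large negative inputs."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def mul_sigmoid(x):
    """SiLU / Swish applied element-wise: x * sigmoid(x)."""
    return [v * stable_sigmoid(v) for v in x]


values = mul_sigmoid([0.0, 1.0, -1000.0, 1000.0])
print(values)
```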

MulSub#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

Fused SubMul / MulSub CUDA custom operator.

Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter addition selects whether subtraction precedes or follows multiplication, and the optional kernel attribute negative inverts the sign of the subtraction operand:

  • SubMul (addition = true, negative = false): <math>

  • SubMul (addition = true, negative = true): <math>

  • MulSub (addition = false, negative = false): <math>

  • MulSub (addition = false, negative = true): <math>

Supported element types: float, half.

NegXplus1#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 1
Outputs: 1

NegXplus1 CUDA custom operator — element-wise complement (1 − x).

Declares the kernel and operator classes for the unary element-wise operation:

$y = 1 - x$

This is the arithmetic complement of the input, useful for computing probability complements (e.g. <math>) without an extra Constant and Sub node in the graph.

Supported element types: float, half, int32_t.

ReplaceZero#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 1
Outputs: 1

ReplaceZero CUDA custom operator — substitute zero elements.

Declares the kernel and operator classes for the unary element-wise operation:

$$y_i = \begin{cases} \mathrm{by} & \text{if } x_i = 0 \\ x_i & \text{otherwise} \end{cases}$$

The replacement scalar by is read from a kernel attribute of the same name. The operator is useful for masking out padding tokens or avoiding division by zero in subsequent operations.

Supported element types: float, half.
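The division-by-zero use case can be illustrated with a short sketch. It assumes the operator replaces every element equal to zero with the scalar `by` and leaves the others untouched; `replace_zero` is a hypothetical Python stand-in, not the kernel itself.

```python
def replace_zero(x, by):
    """Return a copy of x where every zero element is replaced with `by`."""
    return [by if v == 0 else v for v in x]


counts = [4.0, 0.0, 2.0]
safe_counts = replace_zero(counts, by=1.0)
# The following division can no longer hit a zero denominator.
ratios = [1.0 / v for v in safe_counts]
print(safe_counts, ratios)
```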

Rotary#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 2
Outputs: 1

Rotary positional embedding CUDA custom operator.

Declares the kernel and operator classes that apply a rotary transformation to the last dimension of an input tensor. The operation implements rotary position encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.

Each pair of elements <math> in the last dimension is rotated depending on RotarySide:

  • LEFT side (rotary_side_ = LEFT): <math>

  • RIGHT side (rotary_side_ = RIGHT): <math>

The side is selected via the kernel attribute "side" (integer, 1 = LEFT, 2 = RIGHT). The first input provides the data tensor and is expected in device memory; the second input (optional shape hint) is read from CPU memory.

Supported element types: float, half.

ScatterNDOfShape#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

ScatterNDOfShape CUDA custom operator.

Declares the kernel and operator classes for scattering updates into a zero-initialised output tensor whose shape is defined by the first input. The operation is equivalent to:

    output = zeros(shape)
    output[indices[i]] op= updates[i]  for each i

where op is determined by the reduction kernel attribute (Reduction::None overwrites, Add accumulates, etc.). A second strategy attribute (Strategy::None or Optimize) lets the kernel choose a shape-specific fast path at runtime.

Inputs:

  • 0: shape — 1-D int64 tensor defining the output shape (CPU).

  • 1: indices — integer indices tensor (CPU).

  • 2: updates — data tensor of type T (GPU).

Supported element types: float, half.
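The difference between the overwriting and accumulating reductions shows up as soon as an index repeats. The sketch below is a sequential reference under simplifying assumptions: a 2-D output, one row index per update, and the string values "none" and "add" standing in for the `Reduction` enum. The function name is hypothetical.

```python
def scatter_nd_of_shape(shape, indices, updates, reduction="add"):
    """Scatter rows of `updates` into a zero matrix of the given shape."""
    n_rows, n_cols = shape
    output = [[0.0] * n_cols for _ in range(n_rows)]
    for idx, row in zip(indices, updates):
        for j, value in enumerate(row):
            if reduction == "add":
                output[idx][j] += value   # accumulate
            elif reduction == "none":
                output[idx][j] = value    # overwrite
            else:
                raise ValueError(f"unsupported reduction: {reduction!r}")
    return output


indices = [1, 1]                          # the same row is targeted twice
updates = [[1.0, 2.0], [3.0, 4.0]]
added = scatter_nd_of_shape((2, 2), indices, updates, reduction="add")
overwritten = scatter_nd_of_shape((2, 2), indices, updates, reduction="none")
print(added, overwritten)
```

In this sequential sketch the last update wins when overwriting; a parallel kernel does not necessarily offer the same ordering guarantee for duplicate indices.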

SubMul#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 3
Outputs: 1

Fused SubMul / MulSub CUDA custom operator.

Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter addition selects whether subtraction precedes or follows multiplication, and the optional kernel attribute negative inverts the sign of the subtraction operand:

  • SubMul (addition = true, negative = false): <math>

  • SubMul (addition = true, negative = true): <math>

  • MulSub (addition = false, negative = false): <math>

  • MulSub (addition = false, negative = true): <math>

Supported element types: float, half.

Transpose2DCastFP16#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 1
Outputs: 1

Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.

Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:

  • Transpose2DCastFP16 — input float → output half.

  • Transpose2DCastFP32 — input half → output float.

The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.

The input and output types are chosen at construction time and stored in the operator descriptor.
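What the FP16 variant computes can be mimicked with the standard library alone: the `struct` format code `e` packs a value as IEEE 754 half precision, so a pack/unpack round trip rounds a Python float to the nearest representable half value. The function names below are hypothetical, and the sketch says nothing about the tiling used by the real kernel.

```python
import struct


def to_half(v):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", v))[0]


def transpose_2d_cast_fp16(matrix):
    """Transpose a 2-D list of floats and round every element to half."""
    return [[to_half(v) for v in column] for column in zip(*matrix)]


matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 0.1]]
result = transpose_2d_cast_fp16(matrix)
print(result)
```

Values such as 1.0 or 5.0 are exactly representable and survive unchanged, while 0.1 is rounded to the closest half-precision value.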

Transpose2DCastFP32#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 1
Outputs: 1

Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.

Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:

  • Transpose2DCastFP16 — input float → output half.

  • Transpose2DCastFP32 — input half → output float.

The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.

The input and output types are chosen at construction time and stored in the operator descriptor.

TriMatrix#

Domain: yaourt.ortops.fused_kernel.cuda
Execution provider: CUDAExecutionProvider
Inputs: 2
Outputs: 1

TriMatrix CUDA custom operator — triangular matrix generator.

Declares the kernel and operator classes that create a 2-D triangular matrix whose lower-triangle, diagonal, and upper-triangle elements are filled with user-supplied scalar values.

For a matrix of size <math>, element <math> is set to:

<math>

Inputs:

  • 0: shape — 1-D int64 tensor <math> (CPU).

  • 1: values — 1-D tensor of type T with three elements <math> (CPU).

Supported element types: float, half.
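A possible reading of this operator is sketched below. The order of the three scalars is an assumption: the sketch takes `values` as (lower, diagonal, upper), which the text above does not confirm, and `tri_matrix` is a hypothetical name.

```python
def tri_matrix(shape, values):
    """Build a 2-D matrix filled by region: below, on, and above the diagonal."""
    n_rows, n_cols = shape
    lower, diag, upper = values  # assumed order, not confirmed by the source
    return [
        [lower if j < i else diag if j == i else upper for j in range(n_cols)]
        for i in range(n_rows)
    ]


mask = tri_matrix((3, 3), (1.0, 0.0, -1.0))
for row in mask:
    print(row)
```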

yaourt.ortops.doc.print_fused_kernel_cpu_ops() → None#

Prints the fused-kernel CPU custom-op catalogue to stdout.

Renders FUSED_KERNEL_CPU_OPS as plain text suitable for a .. runpython:: block in the Sphinx documentation, ensuring the rendered output is always derived from the C++ source files.

<<<

from yaourt.ortops.doc import print_fused_kernel_cpu_ops

print_fused_kernel_cpu_ops()

>>>

    MulMul
      domain   : yaourt.ortops.fused_kernel.cpu
      provider : CPUExecutionProvider
      version  : 1
      Fused MulMul CPU custom operator.
      
      Declares the kernel class for the element-wise ternary multiplication
      applied to three broadcastable input tensors A, B, C:
      
      :math:`A \cdot B \cdot C`
      
      The implementation uses AVX2 SIMD instructions for vectorised throughput
      and std::thread for multi-threaded parallelism.
      
      Supported element type: ``float``.
      inputs   : float32 x3
      outputs  : float32 x1
yaourt.ortops.doc.print_fused_kernel_cpu_ops_rst() → None#

Outputs the fused-kernel CPU custom-op catalogue as RST to stdout.

Renders FUSED_KERNEL_CPU_OPS as valid reStructuredText suitable for a .. runpython:: :rst: block in the Sphinx documentation. Each op is rendered as a sub-section with a list-table for its metadata followed by its description, ensuring the rendered page is always derived from the C++ source files without manual maintenance.

<<<

from yaourt.ortops.doc import print_fused_kernel_cpu_ops_rst

print_fused_kernel_cpu_ops_rst()

>>>

MulMul#

Domain: yaourt.ortops.fused_kernel.cpu
Execution provider: CPUExecutionProvider
Inputs: 3
Outputs: 1

Fused MulMul CPU custom operator.

Declares the kernel class for the element-wise ternary multiplication applied to three broadcastable input tensors A, B, C:

$A \cdot B \cdot C$

The implementation uses AVX2 SIMD instructions for vectorised throughput and std::thread for multi-threaded parallelism.

Supported element type: float.
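The multi-threaded part of this design can be pictured as splitting the flat element range into contiguous chunks, one per worker. The sketch below shows only that partitioning idea in Python, assuming same-shape flat inputs; it does not reproduce the AVX2 vectorisation, and the function names and the `n_threads` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mul_mul_chunk(a, b, c, out, start, stop):
    """Compute out[i] = a[i] * b[i] * c[i] for start <= i < stop."""
    for i in range(start, stop):
        out[i] = a[i] * b[i] * c[i]


def mul_mul(a, b, c, n_threads=4):
    """Element-wise a * b * c, with the index range split across workers."""
    n = len(a)
    out = [0.0] * n
    # Ceil division so every element belongs to exactly one chunk.
    chunk = max(1, (n + n_threads - 1) // n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(mul_mul_chunk, a, b, c, out, s, min(s + chunk, n))
            for s in range(0, n, chunk)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return out


product = mul_mul([1.0, 2.0, 3.0, 4.0, 5.0], [2.0] * 5, [3.0] * 5, n_threads=2)
print(product)
```

Each worker writes to a disjoint slice of the output, so no locking is needed.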

CPU ops catalogue#

<<<

from yaourt.ortops.doc import print_cpu_ops

print_cpu_ops()

>>>

    DenseToSparse
      domain   : yaourt.ortops.sparse.cpu
      provider : CPUExecutionProvider
      version  : 1
      Converts a 2-D dense float32 tensor into a compact flat sparse encoding.
      Only non-zero elements are stored. The 1-D output tensor encodes the
      original shape, the number of non-zero elements, their flat indices
      (stored as uint32), and their values (float32). The encoding is suitable
      as input to SparseToDense for a lossless round-trip.
      
      Constraints: input must be exactly 2-D; only float32 is supported.
      inputs:
        X (float32) — 2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.
      outputs:
        Y (float32) — 1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).
    
    SparseToDense
      domain   : yaourt.ortops.sparse.cpu
      provider : CPUExecutionProvider
      version  : 1
      Converts the compact sparse encoding produced by DenseToSparse back into
      a 2-D dense float32 tensor. Positions that were zero in the original
      tensor are filled with 0.0. The output shape is recovered from the
      sparse header embedded in the input.
      
      Constraints: input must be 1-D with a valid sparse header; the encoded
      shape must be 2-D; only float32 is supported.
      inputs:
        X (float32) — 1-D float32 sparse encoding produced by DenseToSparse.
      outputs:
        Y (float32) — Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.
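The round trip between the two ops can be illustrated with a plain-Python sketch. The header layout used here, `[n_rows, n_cols, nnz]`, is hypothetical: the source only says that the header carries the original shape and the number of non-zero elements. The sketch also keeps indices as Python integers instead of storing uint32 values inside a float32 tensor, and the function names are stand-ins for the real kernels.

```python
def dense_to_sparse(dense):
    """Encode a 2-D list as header + flat indices + non-zero values."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    indices, values = [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                indices.append(i * n_cols + j)  # flat row-major index
                values.append(v)
    header = [n_rows, n_cols, len(values)]      # hypothetical header layout
    return header + indices + values


def sparse_to_dense(encoded):
    """Rebuild the dense 2-D list from the encoding above."""
    n_rows, n_cols, nnz = encoded[:3]
    indices = encoded[3:3 + nnz]
    values = encoded[3 + nnz:3 + 2 * nnz]
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for flat, v in zip(indices, values):
        dense[flat // n_cols][flat % n_cols] = v
    return dense


dense = [[0.0, 1.5, 0.0],
         [2.0, 0.0, 0.0]]
encoded = dense_to_sparse(dense)
restored = sparse_to_dense(encoded)
print(encoded)
```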