yaourt.ortops.doc#
Documentation catalogue of custom ORT ops, derived from C++ source files.
Structural metadata (op name, domain, execution provider, input/output names and element types) is parsed directly from the C++ source files at import time so that the Python catalogue always stays in sync with the C++ implementation without any manual maintenance.
Human-readable documentation strings are parsed from Doxygen-style doc comments in the C++ header files, so prose descriptions live alongside the kernel declarations and are never duplicated in Python.
Supported C++ sources — sparse CPU (lite API)#
- yaourt/ortops/sparse/cpu/ort_sparse_cpu2_lib.cc — provides the op domain and the CreateLiteCustomOp registrations (op name → kernel class + exec provider).
- yaourt/ortops/sparse/cpu/ort_sparse_lite.h — provides the Compute method signatures and /// doc comments used to extract input/output argument names, element types, and prose descriptions.
Supported C++ sources — fused kernel CUDA (custom-op-base API)#
- yaourt/ortops/fused_kernel/cuda/ort_fused_kernel_cuda_lib.cu — provides the op domain name.
- yaourt/ortops/fused_kernel/cuda/*.cu (individual kernel files) — provide GetName(), GetInputTypeCount(), GetOutputTypeCount(), and GetExecutionProviderType() implementations.
- yaourt/ortops/fused_kernel/cuda/*.h (individual header files) — provide /** @file … @brief … */ Doxygen doc blocks used as op descriptions.
The print_cpu_ops() / print_cpu_ops_rst() and
print_cuda_ops() / print_cuda_ops_rst() functions render the
catalogues as plain text or RST and are intended to be called from
.. runpython:: blocks in the Sphinx docs.
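As a rough illustration of how a description can be pulled out of a Doxygen block like the ones mentioned above, the sketch below extracts the text following @brief with a regular expression. The header content, the file name example_op.h, and the helper extract_brief are hypothetical; the parser actually used by yaourt.ortops.doc may work differently.

```python
import re

# Hypothetical header content; real yaourt headers may be laid out differently.
HEADER = """\
/**
 * @file example_op.h
 * @brief Fused Example CUDA custom operator.
 *
 * Longer description spanning
 * several lines.
 */
struct ExampleOp {};
"""

# First /** ... */ block in the file (non-greedy, spans newlines).
_BLOCK = re.compile(r"/\*\*(.*?)\*/", re.DOTALL)


def extract_brief(source: str) -> str:
    """Return the text that follows @brief in the first Doxygen block."""
    match = _BLOCK.search(source)
    if match is None:
        return ""
    # Drop the leading ' * ' decoration of every comment line.
    lines = [re.sub(r"^\s*\*\s?", "", ln) for ln in match.group(1).splitlines()]
    text = "\n".join(lines).strip()
    brief = re.search(r"@brief\s+(.*)", text, re.DOTALL)
    return brief.group(1).strip() if brief else ""


print(extract_brief(HEADER))
```

In this sketch everything after @brief, including the longer description, is kept as the op description, which is consistent with the multi-paragraph doc strings shown in the catalogues on this page.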
- yaourt.ortops.doc.CPU_OPS: dict[str, OrtOpDesc] = {'DenseToSparse': OrtOpDesc(name='DenseToSparse', domain='yaourt.ortops.sparse.cpu', since_version=1, execution_provider='CPUExecutionProvider', inputs=[OrtOpInput(name='X', dtype='float32', description='2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.')], outputs=[OrtOpOutput(name='Y', dtype='float32', description='1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).')], doc='Converts a 2-D dense float32 tensor into a compact flat sparse encoding.\nOnly non-zero elements are stored. The 1-D output tensor encodes the\noriginal shape, the number of non-zero elements, their flat indices\n(stored as uint32), and their values (float32). The encoding is suitable\nas input to SparseToDense for a lossless round-trip.\n\nConstraints: input must be exactly 2-D; only float32 is supported.'), 'SparseToDense': OrtOpDesc(name='SparseToDense', domain='yaourt.ortops.sparse.cpu', since_version=1, execution_provider='CPUExecutionProvider', inputs=[OrtOpInput(name='X', dtype='float32', description='1-D float32 sparse encoding produced by DenseToSparse.')], outputs=[OrtOpOutput(name='Y', dtype='float32', description='Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.')], doc='Converts the compact sparse encoding produced by DenseToSparse back into\na 2-D dense float32 tensor. Positions that were zero in the original\ntensor are filled with 0.0. The output shape is recovered from the\nsparse header embedded in the input.\n\nConstraints: input must be 1-D with a valid sparse header; the encoded\nshape must be 2-D; only float32 is supported.')}#
All CPU custom ops provided by yet-another-onnxruntime-extensions, keyed by op name. Populated at import time by parsing the C++ source files.
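The round-trip described for DenseToSparse and SparseToDense can be sketched in pure Python. The documentation only states the layout header | flat indices (uint32) | non-zero values (float32); the exact header used here (n_rows, n_cols, nnz as three little-endian uint32 words) is an assumption, and a bytes object stands in for the 1-D float32 tensor that the real kernels exchange.

```python
import struct
from typing import List


def dense_to_sparse(x: List[List[float]]) -> bytes:
    """Encode a 2-D matrix as header | flat indices | values (assumed layout)."""
    n_rows, n_cols = len(x), len(x[0])
    flat = [v for row in x for v in row]
    idx = [i for i, v in enumerate(flat) if v != 0.0]
    vals = [flat[i] for i in idx]
    # Assumed header: n_rows, n_cols, nnz as little-endian uint32.
    header = struct.pack("<3I", n_rows, n_cols, len(idx))
    body = struct.pack(f"<{len(idx)}I", *idx) + struct.pack(f"<{len(idx)}f", *vals)
    return header + body


def sparse_to_dense(buf: bytes) -> List[List[float]]:
    """Rebuild the dense matrix; missing positions are filled with 0.0."""
    n_rows, n_cols, nnz = struct.unpack_from("<3I", buf, 0)
    idx = struct.unpack_from(f"<{nnz}I", buf, 12)
    vals = struct.unpack_from(f"<{nnz}f", buf, 12 + 4 * nnz)
    flat = [0.0] * (n_rows * n_cols)
    for i, v in zip(idx, vals):
        flat[i] = v
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Values chosen to be exactly representable in float32.
x = [[0.0, 1.5, 0.0], [-2.0, 0.0, 0.25]]
encoded = dense_to_sparse(x)
print(len(encoded) // 4, "32-bit words")
print(sparse_to_dense(encoded) == x)
```

Every field is four bytes wide, so the whole buffer can be viewed as a sequence of 32-bit words, which is what makes a single 1-D float32 tensor a workable container for the encoding.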
- yaourt.ortops.doc.CUDA_OPS: dict[str, OrtOpDesc] = {'AddAdd': OrtOpDesc(name='AddAdd', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAdd / MulMul CUDA custom operator for three inputs.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddAdd** (``addition`` = ``true`` ): <math>\n- **MulMul** (``addition`` = ``false):`` <math>\n\nWhen all inputs share the same shape the kernel uses a no-broadcast path for\nbetter performance.\n\nSupported element types: ``float,`` ``half.``'), 'AddAddAdd': OrtOpDesc(name='AddAddAdd', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description=''), OrtOpInput(name='input_3', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.\n\nDeclares the kernel and operator classes for two element-wise quaternary\noperations on four broadcastable input tensors A, B, C, D:\n\n- **AddAddAdd** (``addition`` = ``true`` ):\n <math>\n- **MulMulMul** (``addition`` = ``false):``\n <math>\n\nSupported element types: ``float,`` ``half.``'), 'AddMul': OrtOpDesc(name='AddMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', 
description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddMul / MulAdd CUDA custom operator.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddMul** (``addition`` = ``true`` ): <math>\n- **MulAdd** (``addition`` = ``false):`` <math>\n\nBoth variants support an optional kernel attribute ``transposeMiddle.`` When\nset to ``true`` on a 4-D input the two middle axes of the output are\ntransposed, which avoids a separate Transpose node in common attention-kernel\npatterns.\n\nSupported element types: ``float,`` ``half.``'), 'AddSharedInput': OrtOpDesc(name='AddSharedInput', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description=''), OrtOpOutput(name='output_1', dtype='T', description='')], doc='Fused AddSharedInput / MulSharedInput CUDA custom operator.\n\nDeclares the kernel and operator classes for an operation that applies the\nsame first input A to two different second inputs B and C simultaneously,\nproducing two outputs:\n\n- **AddSharedInput** (``addition`` = ``true`` ):\n <math>\n- **MulSharedInput** (``addition`` = ``false):``\n <math>\n\nThis fused form avoids reading A twice when computing two independent\noperations that share the same operand.\n\nSupported element types: ``float,`` ``half.``'), 'MaskedScatterNDOfShape': OrtOpDesc(name='MaskedScatterNDOfShape', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], 
outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='MaskedScatterNDOfShape CUDA custom operator.\n\nDeclares the kernel and operator classes for a masked variant of the\nScatterNDOfShape operation. Each scatter step is skipped when the\ncorresponding index value equals a configurable ``masked_value,`` allowing\npadding indices to be ignored without pre-filtering:\n\noutput = zeros(shape)\nfor each i:\nif indices[i] != masked_value:\noutput[indices[i]] op= updates[i]\n\nInputs:\n\n- 0: ``shape`` — 1-D ``int64`` tensor defining the output shape (CPU).\n- 1: ``indices`` — integer indices tensor (CPU).\n- 2: ``updates`` — data tensor of type ``T`` (GPU).\n\nThe ``reduction`` and ``masked_value`` behaviour is configured via kernel\nattributes of the same names.\n\nSupported element types: ``float,`` ``half.``'), 'MulAdd': OrtOpDesc(name='MulAdd', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddMul / MulAdd CUDA custom operator.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddMul** (``addition`` = ``true`` ): <math>\n- **MulAdd** (``addition`` = ``false):`` <math>\n\nBoth variants support an optional kernel attribute ``transposeMiddle.`` When\nset to ``true`` on a 4-D input the two middle axes of the output are\ntransposed, which avoids a separate Transpose node in common attention-kernel\npatterns.\n\nSupported element types: ``float,`` ``half.``'), 'MulMul': OrtOpDesc(name='MulMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), 
OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAdd / MulMul CUDA custom operator for three inputs.\n\nDeclares the kernel and operator classes for two element-wise ternary\noperations on three broadcastable input tensors A, B, C:\n\n- **AddAdd** (``addition`` = ``true`` ): <math>\n- **MulMul** (``addition`` = ``false):`` <math>\n\nWhen all inputs share the same shape the kernel uses a no-broadcast path for\nbetter performance.\n\nSupported element types: ``float,`` ``half.``'), 'MulMulMul': OrtOpDesc(name='MulMulMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description=''), OrtOpInput(name='input_3', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.\n\nDeclares the kernel and operator classes for two element-wise quaternary\noperations on four broadcastable input tensors A, B, C, D:\n\n- **AddAddAdd** (``addition`` = ``true`` ):\n <math>\n- **MulMulMul** (``addition`` = ``false):``\n <math>\n\nSupported element types: ``float,`` ``half.``'), 'MulMulSigmoid': OrtOpDesc(name='MulMulSigmoid', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused MulMulSigmoid CUDA custom operator.\n\nDeclares the kernel and operator classes for the element-wise binary\noperation applied to two broadcastable input tensors x and y:\n\n<math>\n\nThe sigmoid is applied only 
to y, making this a gated variant of the\nSiLU / Swish activation commonly found in gated linear units (GLUs) used in\ntransformer feed-forward networks.\n\nSupported element types: ``float,`` ``half.``'), 'MulSharedInput': OrtOpDesc(name='MulSharedInput', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description=''), OrtOpOutput(name='output_1', dtype='T', description='')], doc='Fused AddSharedInput / MulSharedInput CUDA custom operator.\n\nDeclares the kernel and operator classes for an operation that applies the\nsame first input A to two different second inputs B and C simultaneously,\nproducing two outputs:\n\n- **AddSharedInput** (``addition`` = ``true`` ):\n <math>\n- **MulSharedInput** (``addition`` = ``false):``\n <math>\n\nThis fused form avoids reading A twice when computing two independent\noperations that share the same operand.\n\nSupported element types: ``float,`` ``half.``'), 'MulSigmoid': OrtOpDesc(name='MulSigmoid', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).\n\nDeclares the kernel and operator classes for the element-wise unary\noperation:\n\n<math>\n\nThis is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation\nfunction, which is commonly used in transformer feed-forward blocks.\n\nSupported element types: ``float,`` ``half.``'), 'MulSub': OrtOpDesc(name='MulSub', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', 
dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused SubMul / MulSub CUDA custom operator.\n\nDeclares the kernel and operator classes for element-wise ternary operations\non three broadcastable input tensors A, B, C. The template parameter\n``addition`` selects whether subtraction precedes or follows multiplication,\nand the optional kernel attribute ``negative`` inverts the sign of the\nsubtraction operand:\n\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``false):``\n <math>\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``true):``\n <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``false):``\n <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``true):``\n <math>\n\nSupported element types: ``float,`` ``half.``'), 'NegXplus1': OrtOpDesc(name='NegXplus1', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='NegXplus1 CUDA custom operator — element-wise complement (1 − x).\n\nDeclares the kernel and operator classes for the unary element-wise\noperation:\n\n<math>\n\nThis is the arithmetic complement of the input, useful for computing\nprobability complements (e.g. 
<math>) without an extra Constant and\nSub node in the graph.\n\nSupported element types: ``float,`` ``half,`` ``int32_t.``'), 'ReplaceZero': OrtOpDesc(name='ReplaceZero', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='ReplaceZero CUDA custom operator — substitute zero elements.\n\nDeclares the kernel and operator classes for the unary element-wise\noperation:\n\n<math>\n\nThe replacement scalar ``by`` is read from a kernel attribute of the same\nname. The operator is useful for masking out padding tokens or avoiding\ndivision by zero in subsequent operations.\n\nSupported element types: ``float,`` ``half.``'), 'Rotary': OrtOpDesc(name='Rotary', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Rotary positional embedding CUDA custom operator.\n\nDeclares the kernel and operator classes that apply a rotary transformation\nto the last dimension of an input tensor. The operation implements rotary\nposition encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.\n\nEach pair of elements <math> in the last\ndimension is rotated depending on ``RotarySide:``\n\n- **LEFT** side (``rotary_side_`` = LEFT):\n <math>\n- **RIGHT** side (``rotary_side_`` = RIGHT):\n <math>\n\nThe side is selected via the kernel attribute ``"side"`` (integer, 1 = LEFT,\n2 = RIGHT). 
The first input provides the data tensor and is expected in\ndevice memory; the second input (optional shape hint) is read from CPU\nmemory.\n\nSupported element types: ``float,`` ``half.``'), 'ScatterNDOfShape': OrtOpDesc(name='ScatterNDOfShape', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='ScatterNDOfShape CUDA custom operator.\n\nDeclares the kernel and operator classes for scattering ``updates`` into a\nzero-initialised output tensor whose shape is defined by the first input.\nThe operation is equivalent to:\n\noutput = zeros(shape)\noutput[indices[i]] op= updates[i] for each i\n\nwhere ``op`` is determined by the ``reduction`` kernel attribute\n(``Reduction::None`` overwrites, ``Add`` accumulates, etc.). A second\nstrategy attribute (``Strategy::None`` or ``Optimize)`` lets the kernel choose\na shape-specific fast path at runtime.\n\nInputs:\n\n- 0: ``shape`` — 1-D ``int64`` tensor defining the output shape (CPU).\n- 1: ``indices`` — integer indices tensor (CPU).\n- 2: ``updates`` — data tensor of type ``T`` (GPU).\n\nSupported element types: ``float,`` ``half.``'), 'SubMul': OrtOpDesc(name='SubMul', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description=''), OrtOpInput(name='input_2', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Fused SubMul / MulSub CUDA custom operator.\n\nDeclares the kernel and operator classes for element-wise ternary operations\non three broadcastable input tensors A, B, C. 
The template parameter\n``addition`` selects whether subtraction precedes or follows multiplication,\nand the optional kernel attribute ``negative`` inverts the sign of the\nsubtraction operand:\n\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``false):``\n <math>\n- **SubMul** (``addition`` = ``true,`` ``negative`` = ``true):``\n <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``false):``\n <math>\n- **MulSub** (``addition`` = ``false,`` ``negative`` = ``true):``\n <math>\n\nSupported element types: ``float,`` ``half.``'), 'Transpose2DCastFP16': OrtOpDesc(name='Transpose2DCastFP16', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Transpose2DCast CUDA custom operator — 2-D matrix transpose with type\nconversion.\n\nDeclares the kernel and operator classes for transposing a 2-D matrix and\nsimultaneously casting its elements to a different numeric type. 
Two\nregistered operator names are available depending on the output type:\n\n- **Transpose2DCastFP16** — input ``float`` → output ``half.``\n- **Transpose2DCastFP32** — input ``half`` → output ``float.``\n\nThe operator fuses the Transpose and Cast nodes into a single tiled CUDA\nkernel, avoiding a round-trip through global memory compared to executing\nthem separately.\n\nThe input and output types are chosen at construction time and stored in the\noperator descriptor.'), 'Transpose2DCastFP32': OrtOpDesc(name='Transpose2DCastFP32', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='Transpose2DCast CUDA custom operator — 2-D matrix transpose with type\nconversion.\n\nDeclares the kernel and operator classes for transposing a 2-D matrix and\nsimultaneously casting its elements to a different numeric type. 
Two\nregistered operator names are available depending on the output type:\n\n- **Transpose2DCastFP16** — input ``float`` → output ``half.``\n- **Transpose2DCastFP32** — input ``half`` → output ``float.``\n\nThe operator fuses the Transpose and Cast nodes into a single tiled CUDA\nkernel, avoiding a round-trip through global memory compared to executing\nthem separately.\n\nThe input and output types are chosen at construction time and stored in the\noperator descriptor.'), 'TriMatrix': OrtOpDesc(name='TriMatrix', domain='yaourt.ortops.fused_kernel.cuda', since_version=1, execution_provider='CUDAExecutionProvider', inputs=[OrtOpInput(name='input_0', dtype='T', description=''), OrtOpInput(name='input_1', dtype='T', description='')], outputs=[OrtOpOutput(name='output_0', dtype='T', description='')], doc='TriMatrix CUDA custom operator — triangular matrix generator.\n\nDeclares the kernel and operator classes that create a 2-D triangular matrix\nwhose lower-triangle, diagonal, and upper-triangle elements are filled with\nuser-supplied scalar values.\n\nFor a matrix of size <math>, element\n<math> is set to:\n\n<math>\n\nInputs:\n\n- 0: ``shape`` — 1-D ``int64`` tensor <math> (CPU).\n- 1: ``values`` — 1-D tensor of type ``T`` with three elements\n <math> (CPU).\n\nSupported element types: ``float,`` ``half.``')}#
All fused-kernel CUDA custom ops provided by yet-another-onnxruntime-extensions, keyed by op name. Populated at import time by parsing the C++ source files.
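The formulas in the CUDA doc strings above were rendered as <math> placeholders, so the following scalar reference functions are only a sketch of what the prose describes. MulSigmoid is stated to be SiLU and NegXplus1 to be 1 − x; the MulMulSigmoid formula x · y · sigmoid(y) is an assumption inferred from the op name and the remark that the sigmoid applies only to y. The scatter helper is simplified to a 1-D output with additive reduction, and all function names are hypothetical.

```python
import math
from typing import List


def sigmoid(v: float) -> float:
    # Plain definition; not hardened against overflow for very negative v.
    return 1.0 / (1.0 + math.exp(-v))


def mul_sigmoid(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


def mul_mul_sigmoid(x: float, y: float) -> float:
    """Assumed formula: x * y * sigmoid(y)."""
    return x * y * sigmoid(y)


def neg_x_plus_1(x: float) -> float:
    """Element-wise complement 1 - x."""
    return 1.0 - x


def replace_zero(x: float, by: float) -> float:
    """Substitute zero elements with the scalar attribute `by`."""
    return by if x == 0 else x


def masked_scatter_add(size: int, indices: List[int], updates: List[float],
                       masked_value: int) -> List[float]:
    """1-D simplification of MaskedScatterNDOfShape with reduction = add."""
    output = [0.0] * size
    for i, u in zip(indices, updates):
        if i != masked_value:  # skip padding indices
            output[i] += u
    return output


print(neg_x_plus_1(0.25))
print(replace_zero(0.0, 7.0))
print(masked_scatter_add(4, [0, 2, -1, 2], [1.0, 2.0, 5.0, 3.0], -1))
```

Dropping the masked_value test from the loop gives the unmasked ScatterNDOfShape behaviour described in the same catalogue.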
- yaourt.ortops.doc.FUSED_KERNEL_CPU_OPS: dict[str, OrtOpDesc] = {'MulMul': OrtOpDesc(name='MulMul', domain='yaourt.ortops.fused_kernel.cpu', since_version=1, execution_provider='CPUExecutionProvider', inputs=[OrtOpInput(name='A', dtype='float32', description=''), OrtOpInput(name='B', dtype='float32', description=''), OrtOpInput(name='C', dtype='float32', description='')], outputs=[OrtOpOutput(name='output', dtype='float32', description='')], doc='Fused MulMul CPU custom operator.\n\nDeclares the kernel class for the element-wise ternary multiplication\napplied to three broadcastable input tensors A, B, C:\n\n<math>\n\nThe implementation uses AVX2 SIMD instructions for vectorised throughput\nand std::thread for multi-threaded parallelism.\n\nSupported element type: ``float.``')}#
All fused-kernel CPU custom ops provided by yet-another-onnxruntime-extensions, keyed by op name. Populated at import time by parsing the C++ source files.
- class yaourt.ortops.doc.OrtOpDesc(name: str, domain: str, since_version: int, execution_provider: str, inputs: List[OrtOpInput] = <factory>, outputs: List[OrtOpOutput] = <factory>, doc: str = '')#
Describes a single custom ORT op.
- Parameters:
name – op name as registered with OrtRuntime
domain – ONNX domain the op belongs to
since_version – opset version in which the op was introduced
execution_provider – execution provider (e.g. "CPUExecutionProvider")
inputs – ordered list of input descriptors
outputs – ordered list of output descriptors
doc – longer plain-text description of the op’s semantics
- __eq__(other)#
Return self==value.
- __hash__ = None#
- __init__(name: str, domain: str, since_version: int, execution_provider: str, inputs: List[OrtOpInput] = <factory>, outputs: List[OrtOpOutput] = <factory>, doc: str = '') None#
- __repr__()#
Return repr(self).
- __weakref__#
list of weak references to the object
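The documented signature (default factories for inputs and outputs, a generated __eq__, and __hash__ = None) is what a standard mutable dataclass produces. The stand-in classes below mirror that signature for illustration only; they are not imported from yaourt, and the real definitions may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrtOpInput:  # stand-in mirroring the documented fields
    name: str
    dtype: str
    description: str


@dataclass
class OrtOpOutput:  # stand-in mirroring the documented fields
    name: str
    dtype: str
    description: str


@dataclass
class OrtOpDesc:  # stand-in mirroring the documented signature
    name: str
    domain: str
    since_version: int
    execution_provider: str
    inputs: List[OrtOpInput] = field(default_factory=list)
    outputs: List[OrtOpOutput] = field(default_factory=list)
    doc: str = ""


def make_desc() -> OrtOpDesc:
    return OrtOpDesc(
        name="NegXplus1",
        domain="yaourt.ortops.fused_kernel.cuda",
        since_version=1,
        execution_provider="CUDAExecutionProvider",
        inputs=[OrtOpInput("input_0", "T", "")],
        outputs=[OrtOpOutput("output_0", "T", "")],
    )


a, b = make_desc(), make_desc()
print(a == b)                      # field-by-field comparison
print(OrtOpDesc.__hash__ is None)  # eq=True without frozen=True disables hashing
```

Because such descriptors are unhashable, they can be stored as dictionary values (as in the catalogues, which are keyed by op name) but not used as dictionary keys or set members.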
- class yaourt.ortops.doc.OrtOpInput(name: str, dtype: str, description: str)#
Describes one input of a custom ORT op.
- Parameters:
name – argument name used in the op signature
dtype – ONNX element type (e.g. "float32")
description – human-readable description of what the input represents
- __eq__(other)#
Return self==value.
- __hash__ = None#
- __repr__()#
Return repr(self).
- __weakref__#
list of weak references to the object
- class yaourt.ortops.doc.OrtOpOutput(name: str, dtype: str, description: str)#
Describes one output of a custom ORT op.
- Parameters:
name – argument name used in the op signature
dtype – ONNX element type (e.g. "float32")
description – human-readable description of what the output represents
- __eq__(other)#
Return self==value.
- __hash__ = None#
- __repr__()#
Return repr(self).
- __weakref__#
list of weak references to the object
- yaourt.ortops.doc.print_cpu_ops() None#
Prints the CPU custom-op catalogue to stdout.
Renders CPU_OPS as plain text suitable for a .. runpython:: block in the Sphinx documentation, ensuring the rendered output is always derived from the C++ source files.
<<<
from yaourt.ortops.doc import print_cpu_ops
print_cpu_ops()
>>>
DenseToSparse
  domain : yaourt.ortops.sparse.cpu
  provider : CPUExecutionProvider
  version : 1

  Converts a 2-D dense float32 tensor into a compact flat sparse encoding.
  Only non-zero elements are stored. The 1-D output tensor encodes the
  original shape, the number of non-zero elements, their flat indices
  (stored as uint32), and their values (float32). The encoding is suitable
  as input to SparseToDense for a lossless round-trip.

  Constraints: input must be exactly 2-D; only float32 is supported.

  inputs:
    X (float32) — 2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.
  outputs:
    Y (float32) — 1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).

SparseToDense
  domain : yaourt.ortops.sparse.cpu
  provider : CPUExecutionProvider
  version : 1

  Converts the compact sparse encoding produced by DenseToSparse back into
  a 2-D dense float32 tensor. Positions that were zero in the original
  tensor are filled with 0.0. The output shape is recovered from the
  sparse header embedded in the input.

  Constraints: input must be 1-D with a valid sparse header; the encoded
  shape must be 2-D; only float32 is supported.

  inputs:
    X (float32) — 1-D float32 sparse encoding produced by DenseToSparse.
  outputs:
    Y (float32) — Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.
- yaourt.ortops.doc.print_cpu_ops_rst() None#
Renders the CPU custom-op catalogue as RST and writes it to stdout.
Renders CPU_OPS as valid reStructuredText suitable for a .. runpython:: :rst: block in the Sphinx documentation. Each op is rendered as a sub-section with a list-table for its metadata, and bulleted lists for its inputs and outputs, ensuring the rendered page is always derived from the C++ source files without manual maintenance.
<<<
from yaourt.ortops.doc import print_cpu_ops_rst
print_cpu_ops_rst()
>>>
DenseToSparse#
Domain: yaourt.ortops.sparse.cpu
Execution provider: CPUExecutionProvider
Since version: 1
Converts a 2-D dense float32 tensor into a compact flat sparse encoding. Only non-zero elements are stored. The 1-D output tensor encodes the original shape, the number of non-zero elements, their flat indices (stored as uint32), and their values (float32). The encoding is suitable as input to SparseToDense for a lossless round-trip.
Constraints: input must be exactly 2-D; only float32 is supported.
Inputs
X (float32) — 2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.
Outputs
Y (float32) — 1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).
SparseToDense#
Domain: yaourt.ortops.sparse.cpu
Execution provider: CPUExecutionProvider
Since version: 1
Converts the compact sparse encoding produced by DenseToSparse back into a 2-D dense float32 tensor. Positions that were zero in the original tensor are filled with 0.0. The output shape is recovered from the sparse header embedded in the input.
Constraints: input must be 1-D with a valid sparse header; the encoded shape must be 2-D; only float32 is supported.
Inputs
X (float32) — 1-D float32 sparse encoding produced by DenseToSparse.
Outputs
Y (float32) — Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.
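The RST layout described for print_cpu_ops_rst() (a sub-section per op, a list-table for metadata, bulleted inputs and outputs) could be produced along the following lines. This is a minimal sketch, not the library's implementation: render_op_rst is a hypothetical helper, the section underline character is an arbitrary choice, and a plain dict with shortened descriptions stands in for an OrtOpDesc entry.

```python
from typing import Dict, List

# Stand-in entry shaped like one CPU_OPS value; descriptions are shortened.
OP = {
    "name": "DenseToSparse",
    "domain": "yaourt.ortops.sparse.cpu",
    "execution_provider": "CPUExecutionProvider",
    "since_version": 1,
    "doc": "Converts a 2-D dense float32 tensor into a compact flat sparse encoding.",
    "inputs": [("X", "float32", "2-D dense float32 input tensor of shape [n_rows, n_cols].")],
    "outputs": [("Y", "float32", "1-D float32 tensor containing the sparse encoding of X.")],
}


def render_op_rst(op: Dict) -> str:
    """Render one op as an RST sub-section (hypothetical helper)."""
    # Section title with an underline of matching length.
    lines: List[str] = [op["name"], "+" * len(op["name"]), ""]
    # Two-column metadata table.
    lines += [".. list-table::", "    :widths: 30 70", ""]
    rows = [("Domain", "domain"),
            ("Execution provider", "execution_provider"),
            ("Since version", "since_version")]
    for label, key in rows:
        lines += [f"    * - {label}", f"      - ``{op[key]}``"]
    lines += ["", op["doc"], ""]
    # Bulleted inputs and outputs.
    for title, key in [("Inputs", "inputs"), ("Outputs", "outputs")]:
        lines += [f"**{title}**", ""]
        lines += [f"* ``{name}`` ({dtype}): {desc}" for name, dtype, desc in op[key]]
        lines.append("")
    return "\n".join(lines)


print(render_op_rst(OP))
```

Printing the text to stdout is what lets a .. runpython:: :rst: block splice it into the page at build time.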
- yaourt.ortops.doc.print_cuda_ops() None#
Prints the fused-kernel CUDA custom-op catalogue to stdout.
Renders CUDA_OPS as plain text suitable for a .. runpython:: block in the Sphinx documentation, ensuring the rendered output is always derived from the C++ source files.
<<<
from yaourt.ortops.doc import print_cuda_ops
print_cuda_ops()
>>>
AddAdd
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddAdd / MulMul CUDA custom operator for three inputs.

  Declares the kernel and operator classes for two element-wise ternary
  operations on three broadcastable input tensors A, B, C:

  - **AddAdd** (``addition`` = ``true`` ): <math>
  - **MulMul** (``addition`` = ``false):`` <math>

  When all inputs share the same shape the kernel uses a no-broadcast path for
  better performance.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x1

AddAddAdd
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.

  Declares the kernel and operator classes for two element-wise quaternary
  operations on four broadcastable input tensors A, B, C, D:

  - **AddAddAdd** (``addition`` = ``true`` ):
    <math>
  - **MulMulMul** (``addition`` = ``false):``
    <math>

  Supported element types: ``float,`` ``half.``

  inputs : T x4
  outputs : T x1

AddMul
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddMul / MulAdd CUDA custom operator.

  Declares the kernel and operator classes for two element-wise ternary
  operations on three broadcastable input tensors A, B, C:

  - **AddMul** (``addition`` = ``true`` ): <math>
  - **MulAdd** (``addition`` = ``false):`` <math>

  Both variants support an optional kernel attribute ``transposeMiddle.`` When
  set to ``true`` on a 4-D input the two middle axes of the output are
  transposed, which avoids a separate Transpose node in common attention-kernel
  patterns.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x1

AddSharedInput
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddSharedInput / MulSharedInput CUDA custom operator.

  Declares the kernel and operator classes for an operation that applies the
  same first input A to two different second inputs B and C simultaneously,
  producing two outputs:

  - **AddSharedInput** (``addition`` = ``true`` ):
    <math>
  - **MulSharedInput** (``addition`` = ``false):``
    <math>

  This fused form avoids reading A twice when computing two independent
  operations that share the same operand.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x2

MaskedScatterNDOfShape
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  MaskedScatterNDOfShape CUDA custom operator.

  Declares the kernel and operator classes for a masked variant of the
  ScatterNDOfShape operation. Each scatter step is skipped when the
  corresponding index value equals a configurable ``masked_value,`` allowing
  padding indices to be ignored without pre-filtering:

  output = zeros(shape)
  for each i:
      if indices[i] != masked_value:
          output[indices[i]] op= updates[i]

  Inputs:

  - 0: ``shape`` — 1-D ``int64`` tensor defining the output shape (CPU).
  - 1: ``indices`` — integer indices tensor (CPU).
  - 2: ``updates`` — data tensor of type ``T`` (GPU).

  The ``reduction`` and ``masked_value`` behaviour is configured via kernel
  attributes of the same names.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x1

MulAdd
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddMul / MulAdd CUDA custom operator.

  Declares the kernel and operator classes for two element-wise ternary
  operations on three broadcastable input tensors A, B, C:

  - **AddMul** (``addition`` = ``true`` ): <math>
  - **MulAdd** (``addition`` = ``false):`` <math>

  Both variants support an optional kernel attribute ``transposeMiddle.`` When
  set to ``true`` on a 4-D input the two middle axes of the output are
  transposed, which avoids a separate Transpose node in common attention-kernel
  patterns.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x1

MulMul
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddAdd / MulMul CUDA custom operator for three inputs.

  Declares the kernel and operator classes for two element-wise ternary
  operations on three broadcastable input tensors A, B, C:

  - **AddAdd** (``addition`` = ``true`` ): <math>
  - **MulMul** (``addition`` = ``false):`` <math>

  When all inputs share the same shape the kernel uses a no-broadcast path for
  better performance.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x1

MulMulMul
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.

  Declares the kernel and operator classes for two element-wise quaternary
  operations on four broadcastable input tensors A, B, C, D:

  - **AddAddAdd** (``addition`` = ``true`` ):
    <math>
  - **MulMulMul** (``addition`` = ``false):``
    <math>

  Supported element types: ``float,`` ``half.``

  inputs : T x4
  outputs : T x1

MulMulSigmoid
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused MulMulSigmoid CUDA custom operator.

  Declares the kernel and operator classes for the element-wise binary
  operation applied to two broadcastable input tensors x and y:

  <math>

  The sigmoid is applied only to y, making this a gated variant of the
  SiLU / Swish activation commonly found in gated linear units (GLUs) used in
  transformer feed-forward networks.

  Supported element types: ``float,`` ``half.``

  inputs : T x2
  outputs : T x1

MulSharedInput
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused AddSharedInput / MulSharedInput CUDA custom operator.

  Declares the kernel and operator classes for an operation that applies the
  same first input A to two different second inputs B and C simultaneously,
  producing two outputs:

  - **AddSharedInput** (``addition`` = ``true`` ):
    <math>
  - **MulSharedInput** (``addition`` = ``false):``
    <math>

  This fused form avoids reading A twice when computing two independent
  operations that share the same operand.

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x2

MulSigmoid
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).

  Declares the kernel and operator classes for the element-wise unary
  operation:

  <math>

  This is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation
  function, which is commonly used in transformer feed-forward blocks.

  Supported element types: ``float,`` ``half.``

  inputs : T x1
  outputs : T x1

MulSub
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  Fused SubMul / MulSub CUDA custom operator.

  Declares the kernel and operator classes for element-wise ternary operations
  on three broadcastable input tensors A, B, C. The template parameter
  ``addition`` selects whether subtraction precedes or follows multiplication,
  and the optional kernel attribute ``negative`` inverts the sign of the
  subtraction operand:

  - **SubMul** (``addition`` = ``true,`` ``negative`` = ``false):``
    <math>
  - **SubMul** (``addition`` = ``true,`` ``negative`` = ``true):``
    <math>
  - **MulSub** (``addition`` = ``false,`` ``negative`` = ``false):``
    <math>
  - **MulSub** (``addition`` = ``false,`` ``negative`` = ``true):``
    <math>

  Supported element types: ``float,`` ``half.``

  inputs : T x3
  outputs : T x1

NegXplus1
  domain : yaourt.ortops.fused_kernel.cuda
  provider : CUDAExecutionProvider
  version : 1

  NegXplus1 CUDA custom operator — element-wise complement (1 − x).
Declares the kernel and operator classes for the unary element-wise operation: <math> This is the arithmetic complement of the input, useful for computing probability complements (e.g. <math>) without an extra Constant and Sub node in the graph. Supported element types: ``float,`` ``half,`` ``int32_t.`` inputs : T x1 outputs : T x1 ReplaceZero domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 ReplaceZero CUDA custom operator — substitute zero elements. Declares the kernel and operator classes for the unary element-wise operation: <math> The replacement scalar ``by`` is read from a kernel attribute of the same name. The operator is useful for masking out padding tokens or avoiding division by zero in subsequent operations. Supported element types: ``float,`` ``half.`` inputs : T x1 outputs : T x1 Rotary domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 Rotary positional embedding CUDA custom operator. Declares the kernel and operator classes that apply a rotary transformation to the last dimension of an input tensor. The operation implements rotary position encodings (RoPE) as used in models such as LLaMA and GPT-NeoX. Each pair of elements <math> in the last dimension is rotated depending on ``RotarySide:`` - **LEFT** side (``rotary_side_`` = LEFT): <math> - **RIGHT** side (``rotary_side_`` = RIGHT): <math> The side is selected via the kernel attribute ``"side"`` (integer, 1 = LEFT, 2 = RIGHT). The first input provides the data tensor and is expected in device memory; the second input (optional shape hint) is read from CPU memory. Supported element types: ``float,`` ``half.`` inputs : T x2 outputs : T x1 ScatterNDOfShape domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 ScatterNDOfShape CUDA custom operator. Declares the kernel and operator classes for scattering ``updates`` into a zero-initialised output tensor whose shape is defined by the first input. 
The operation is equivalent to: output = zeros(shape) output[indices[i]] op= updates[i] for each i where ``op`` is determined by the ``reduction`` kernel attribute (``Reduction::None`` overwrites, ``Add`` accumulates, etc.). A second strategy attribute (``Strategy::None`` or ``Optimize)`` lets the kernel choose a shape-specific fast path at runtime. Inputs: - 0: ``shape`` — 1-D ``int64`` tensor defining the output shape (CPU). - 1: ``indices`` — integer indices tensor (CPU). - 2: ``updates`` — data tensor of type ``T`` (GPU). Supported element types: ``float,`` ``half.`` inputs : T x3 outputs : T x1 SubMul domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 Fused SubMul / MulSub CUDA custom operator. Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter ``addition`` selects whether subtraction precedes or follows multiplication, and the optional kernel attribute ``negative`` inverts the sign of the subtraction operand: - **SubMul** (``addition`` = ``true,`` ``negative`` = ``false):`` <math> - **SubMul** (``addition`` = ``true,`` ``negative`` = ``true):`` <math> - **MulSub** (``addition`` = ``false,`` ``negative`` = ``false):`` <math> - **MulSub** (``addition`` = ``false,`` ``negative`` = ``true):`` <math> Supported element types: ``float,`` ``half.`` inputs : T x3 outputs : T x1 Transpose2DCastFP16 domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion. Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. 
Two registered operator names are available depending on the output type: - **Transpose2DCastFP16** — input ``float`` → output ``half.`` - **Transpose2DCastFP32** — input ``half`` → output ``float.`` The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately. The input and output types are chosen at construction time and stored in the operator descriptor. inputs : T x1 outputs : T x1 Transpose2DCastFP32 domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion. Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type: - **Transpose2DCastFP16** — input ``float`` → output ``half.`` - **Transpose2DCastFP32** — input ``half`` → output ``float.`` The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately. The input and output types are chosen at construction time and stored in the operator descriptor. inputs : T x1 outputs : T x1 TriMatrix domain : yaourt.ortops.fused_kernel.cuda provider : CUDAExecutionProvider version : 1 TriMatrix CUDA custom operator — triangular matrix generator. Declares the kernel and operator classes that create a 2-D triangular matrix whose lower-triangle, diagonal, and upper-triangle elements are filled with user-supplied scalar values. For a matrix of size <math>, element <math> is set to: <math> Inputs: - 0: ``shape`` — 1-D ``int64`` tensor <math> (CPU). - 1: ``values`` — 1-D tensor of type ``T`` with three elements <math> (CPU). Supported element types: ``float,`` ``half.`` inputs : T x2 outputs : T x1
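The plain-text layout printed above (op name, domain, provider, version, description, then input and output summaries) can be sketched in a few lines of Python. This is an illustration only, not the yaourt implementation: OpDesc and render_plain are hypothetical stand-ins for the real catalogue entries, and the input/output summary is reduced to a type name and a count.

```python
from dataclasses import dataclass


@dataclass
class OpDesc:
    """Hypothetical stand-in for one entry of the op catalogue."""
    name: str
    domain: str
    execution_provider: str
    since_version: int
    doc: str
    n_inputs: int
    n_outputs: int
    dtype: str = "T"


def render_plain(ops: dict[str, OpDesc]) -> str:
    """Render a catalogue as plain text, one block per op, sorted by name."""
    blocks = []
    for name in sorted(ops):
        op = ops[name]
        blocks.append("\n".join([
            op.name,
            f"domain : {op.domain}",
            f"provider : {op.execution_provider}",
            f"version : {op.since_version}",
            op.doc,
            f"inputs : {op.dtype} x{op.n_inputs}",
            f"outputs : {op.dtype} x{op.n_outputs}",
        ]))
    # A blank line separates consecutive ops.
    return "\n\n".join(blocks)


catalogue = {
    "MulSigmoid": OpDesc(
        name="MulSigmoid",
        domain="yaourt.ortops.fused_kernel.cuda",
        execution_provider="CUDAExecutionProvider",
        since_version=1,
        doc="Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).",
        n_inputs=1,
        n_outputs=1,
    ),
}
text = render_plain(catalogue)
print(text)
```

Sorting by name keeps the output stable across runs, which matters when the text is embedded in generated documentation.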
- yaourt.ortops.doc.print_cuda_ops_rst() → None#
Renders the fused-kernel CUDA custom-op catalogue as RST and writes it to stdout.
The output is valid reStructuredText built from CUDA_OPS, suitable for a .. runpython:: :rst: block in the Sphinx documentation. Each op is rendered as a sub-section with a list-table for its metadata followed by its description, ensuring the rendered page is always derived from the C++ source files without manual maintenance.
<<<
from yaourt.ortops.doc import print_cuda_ops_rst
print_cuda_ops_rst()
>>>
AddAdd#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
Fused AddAdd / MulMul CUDA custom operator for three inputs.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
- AddAdd (addition = true): <math>
- MulMul (addition = false): <math>
When all inputs share the same shape the kernel uses a no-broadcast path for better performance.
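The formulas for these two ops are not reproduced on this page, so the NumPy sketch below assumes the semantics suggested by the op names: AddAdd as A + B + C and MulMul as A * B * C, with NumPy-style broadcasting. The helper same_shape is hypothetical and only illustrates the condition under which a no-broadcast path could apply.

```python
import numpy as np


def add_add(a, b, c):
    """Assumed semantics of AddAdd: element-wise a + b + c with broadcasting."""
    return a + b + c


def mul_mul(a, b, c):
    """Assumed semantics of MulMul: element-wise a * b * c with broadcasting."""
    return a * b * c


def same_shape(*tensors):
    """True when all tensors share one shape, i.e. no broadcasting is needed."""
    return all(t.shape == tensors[0].shape for t in tensors)


a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.ones((1, 3), dtype=np.float32)        # broadcast along rows
c = np.full((2, 1), 2.0, dtype=np.float32)   # broadcast along columns

y_add = add_add(a, b, c)
y_mul = mul_mul(a, b, c)
```

Such a reference is mainly useful for checking the output of the fused node against two separate Add or Mul nodes.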
Supported element types: float, half.
AddAddAdd#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
4
Outputs
1
Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.
Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:
- AddAddAdd (addition = true): <math>
- MulMulMul (addition = false): <math>
Supported element types: float, half.
AddMul#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
Fused AddMul / MulAdd CUDA custom operator.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
- AddMul (addition = true): <math>
- MulAdd (addition = false): <math>
Both variants support an optional kernel attribute transposeMiddle. When set to true on a 4-D input, the two middle axes of the output are transposed, which avoids a separate Transpose node in common attention-kernel patterns.
Supported element types: float, half.
MaskedScatterNDOfShape#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
MaskedScatterNDOfShape CUDA custom operator.
Declares the kernel and operator classes for a masked variant of the ScatterNDOfShape operation. Each scatter step is skipped when the corresponding index value equals a configurable masked_value, allowing padding indices to be ignored without pre-filtering:
    output = zeros(shape)
    for each i:
        if indices[i] != masked_value:
            output[indices[i]] op= updates[i]
Inputs:
- 0: shape — 1-D int64 tensor defining the output shape (CPU).
- 1: indices — integer indices tensor (CPU).
- 2: updates — data tensor of type T (GPU).
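The masked scatter described for this op can be written as a short NumPy reference. This is a sketch under assumptions, not the CUDA kernel: indices is assumed to have shape (n, 1) so that each entry selects one row of the output, the reduction is assumed to be an addition, and the default masked_value of -1 is chosen here purely for illustration.

```python
import numpy as np


def masked_scatter_nd_of_shape(shape, indices, updates, masked_value=-1):
    """Scatter-add rows of `updates` into zeros(shape), skipping masked indices.

    Assumptions: `indices` has shape (n, 1) and the reduction is "add".
    """
    output = np.zeros(tuple(int(d) for d in shape), dtype=updates.dtype)
    rows = indices[:, 0]
    keep = rows != masked_value
    # np.add.at accumulates correctly when the same row appears several times.
    np.add.at(output, rows[keep], updates[keep])
    return output


shape = np.array([4, 2], dtype=np.int64)
indices = np.array([[0], [-1], [2], [0]], dtype=np.int64)
updates = np.array([[1, 1], [9, 9], [3, 3], [5, 5]], dtype=np.float32)
result = masked_scatter_nd_of_shape(shape, indices, updates)
```

The second update is dropped because its index equals masked_value, and the two updates aimed at row 0 are summed.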
The reduction and masked_value behaviour is configured via kernel attributes of the same names.
Supported element types: float, half.
MulAdd#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
Fused AddMul / MulAdd CUDA custom operator.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
- AddMul (addition = true): <math>
- MulAdd (addition = false): <math>
Both variants support an optional kernel attribute transposeMiddle. When set to true on a 4-D input, the two middle axes of the output are transposed, which avoids a separate Transpose node in common attention-kernel patterns.
Supported element types: float, half.
MulMul#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
Fused AddAdd / MulMul CUDA custom operator for three inputs.
Declares the kernel and operator classes for two element-wise ternary operations on three broadcastable input tensors A, B, C:
- AddAdd (addition = true): <math>
- MulMul (addition = false): <math>
When all inputs share the same shape the kernel uses a no-broadcast path for better performance.
Supported element types: float, half.
MulMulMul#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
4
Outputs
1
Fused AddAddAdd / MulMulMul CUDA custom operator for four inputs.
Declares the kernel and operator classes for two element-wise quaternary operations on four broadcastable input tensors A, B, C, D:
- AddAddAdd (addition = true): <math>
- MulMulMul (addition = false): <math>
Supported element types: float, half.
MulMulSigmoid#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
2
Outputs
1
Fused MulMulSigmoid CUDA custom operator.
Declares the kernel and operator classes for the element-wise binary operation applied to two broadcastable input tensors x and y:
<math>
The sigmoid is applied only to y, making this a gated variant of the SiLU / Swish activation commonly found in gated linear units (GLUs) used in transformer feed-forward networks.
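A NumPy reference helps make the gating explicit. The formula is not reproduced on this page, so the sketch below assumes, from the op name and the sentence above, that the result is x * y * sigmoid(y), that is, x multiplied by SiLU(y). The function names are hypothetical.

```python
import numpy as np


def sigmoid(v):
    """Logistic function, element-wise."""
    return 1.0 / (1.0 + np.exp(-v))


def mul_mul_sigmoid(x, y):
    """Assumed semantics: x * y * sigmoid(y), i.e. x gated by SiLU(y)."""
    return x * y * sigmoid(y)


x = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
y = np.array([0.0, 1.0], dtype=np.float32)   # broadcast over the rows of x
out = mul_mul_sigmoid(x, y)
```

Under this assumption, a gate value of 0 in y zeroes the corresponding column of the output, whatever x contains.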
Supported element types: float, half.
MulSigmoid#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
1
Outputs
1
Fused MulSigmoid CUDA custom operator (SiLU / Swish activation).
Declares the kernel and operator classes for the element-wise unary operation:
<math>
This is equivalent to the SiLU (Sigmoid Linear Unit) / Swish activation function, which is commonly used in transformer feed-forward blocks.
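Since the op is stated to be equivalent to SiLU, a NumPy reference follows directly from the definition x * sigmoid(x). The function name mul_sigmoid is a hypothetical helper for this sketch, which can serve as a baseline when validating the fused kernel.

```python
import numpy as np


def mul_sigmoid(x):
    """SiLU / Swish: x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + np.exp(-x))


x = np.array([-2.0, 0.0, 2.0], dtype=np.float32)
y = mul_sigmoid(x)
```

A quick sanity check is the identity SiLU(x) - SiLU(-x) = x, which holds because sigmoid(-x) = 1 - sigmoid(x).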
Supported element types: float, half.
MulSub#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
Fused SubMul / MulSub CUDA custom operator.
Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter addition selects whether subtraction precedes or follows multiplication, and the optional kernel attribute negative inverts the sign of the subtraction operand:
- SubMul (addition = true, negative = false): <math>
- SubMul (addition = true, negative = true): <math>
- MulSub (addition = false, negative = false): <math>
- MulSub (addition = false, negative = true): <math>
Supported element types: float, half.
NegXplus1#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
1
Outputs
1
NegXplus1 CUDA custom operator — element-wise complement (1 − x).
Declares the kernel and operator classes for the unary element-wise operation:
<math>
This is the arithmetic complement of the input, useful for computing probability complements (e.g. <math>) without an extra Constant and Sub node in the graph.
Supported element types: float, half, int32_t.
ReplaceZero#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
1
Outputs
1
ReplaceZero CUDA custom operator — substitute zero elements.
Declares the kernel and operator classes for the unary element-wise operation:
<math>
The replacement scalar by is read from a kernel attribute of the same name. The operator is useful for masking out padding tokens or avoiding division by zero in subsequent operations.
Supported element types: float, half.
Rotary#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
2
Outputs
1
Rotary positional embedding CUDA custom operator.
Declares the kernel and operator classes that apply a rotary transformation to the last dimension of an input tensor. The operation implements rotary position encodings (RoPE) as used in models such as LLaMA and GPT-NeoX.
Each pair of elements <math> in the last dimension is rotated depending on RotarySide:
- LEFT side (rotary_side_ = LEFT): <math>
- RIGHT side (rotary_side_ = RIGHT): <math>
The side is selected via the kernel attribute "side" (integer, 1 = LEFT, 2 = RIGHT). The first input provides the data tensor and is expected in device memory; the second input (optional shape hint) is read from CPU memory.
Supported element types: float, half.
ScatterNDOfShape#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
ScatterNDOfShape CUDA custom operator.
Declares the kernel and operator classes for scattering updates into a zero-initialised output tensor whose shape is defined by the first input. The operation is equivalent to:
    output = zeros(shape)
    output[indices[i]] op= updates[i]   for each i
where op is determined by the reduction kernel attribute (Reduction::None overwrites, Add accumulates, etc.). A second strategy attribute (Strategy::None or Optimize) lets the kernel choose a shape-specific fast path at runtime.
Inputs:
- 0: shape — 1-D int64 tensor defining the output shape (CPU).
- 1: indices — integer indices tensor (CPU).
- 2: updates — data tensor of type T (GPU).
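The equivalence stated for this op can be expressed as a NumPy reference loop. This is a sketch, not the kernel: it covers only the overwrite and accumulate reductions, spelled here as the strings "none" and "add" (an assumption about naming), and it assumes indices has shape (n, k), each row addressing the first k axes of the output.

```python
import numpy as np


def scatter_nd_of_shape(shape, indices, updates, reduction="add"):
    """Scatter `updates` into a zero tensor of the given shape.

    Each row of `indices` (shape (n, k)) addresses the first k output axes.
    """
    output = np.zeros(tuple(int(d) for d in shape), dtype=updates.dtype)
    for idx, upd in zip(indices, updates):
        target = tuple(int(j) for j in idx)
        if reduction == "add":
            output[target] += upd      # accumulate repeated indices
        elif reduction == "none":
            output[target] = upd       # the last write wins
        else:
            raise ValueError(f"unsupported reduction: {reduction!r}")
    return output


shape = np.array([3, 2], dtype=np.int64)
indices = np.array([[0], [2], [0]], dtype=np.int64)
updates = np.array([[1, 2], [3, 4], [10, 20]], dtype=np.float32)

accumulated = scatter_nd_of_shape(shape, indices, updates, reduction="add")
overwritten = scatter_nd_of_shape(shape, indices, updates, reduction="none")
```

Row 0 is targeted twice, so the two reductions give different results for that row only.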
Supported element types: float, half.
SubMul#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
3
Outputs
1
Fused SubMul / MulSub CUDA custom operator.
Declares the kernel and operator classes for element-wise ternary operations on three broadcastable input tensors A, B, C. The template parameter addition selects whether subtraction precedes or follows multiplication, and the optional kernel attribute negative inverts the sign of the subtraction operand:
- SubMul (addition = true, negative = false): <math>
- SubMul (addition = true, negative = true): <math>
- MulSub (addition = false, negative = false): <math>
- MulSub (addition = false, negative = true): <math>
Supported element types: float, half.
Transpose2DCastFP16#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
1
Outputs
1
Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.
Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:
- Transpose2DCastFP16 — input float → output half.
- Transpose2DCastFP32 — input half → output float.
The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.
The input and output types are chosen at construction time and stored in the operator descriptor.
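As a reference for checking results, the fused operation is equivalent to a transpose followed by a cast. The NumPy sketch below says nothing about the tiled kernel itself; transpose_2d_cast is a hypothetical helper name, and NumPy's float16 stands in for the CUDA half type.

```python
import numpy as np


def transpose_2d_cast(x, to):
    """Transpose a 2-D array and cast it, returning a contiguous result."""
    if x.ndim != 2:
        raise ValueError("input must be 2-D")
    return np.ascontiguousarray(x.T, dtype=to)


x32 = np.arange(6, dtype=np.float32).reshape(2, 3)
y16 = transpose_2d_cast(x32, np.float16)      # Transpose2DCastFP16 direction
x_back = transpose_2d_cast(y16, np.float32)   # Transpose2DCastFP32 direction
```

The round trip is exact here only because small integers are representable in half precision; in general the cast to half loses precision.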
Transpose2DCastFP32#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
1
Outputs
1
Transpose2DCast CUDA custom operator — 2-D matrix transpose with type conversion.
Declares the kernel and operator classes for transposing a 2-D matrix and simultaneously casting its elements to a different numeric type. Two registered operator names are available depending on the output type:
- Transpose2DCastFP16 — input float → output half.
- Transpose2DCastFP32 — input half → output float.
The operator fuses the Transpose and Cast nodes into a single tiled CUDA kernel, avoiding a round-trip through global memory compared to executing them separately.
The input and output types are chosen at construction time and stored in the operator descriptor.
TriMatrix#
Domain
yaourt.ortops.fused_kernel.cuda
Execution provider
CUDAExecutionProvider
Inputs
2
Outputs
1
TriMatrix CUDA custom operator — triangular matrix generator.
Declares the kernel and operator classes that create a 2-D triangular matrix whose lower-triangle, diagonal, and upper-triangle elements are filled with user-supplied scalar values.
For a matrix of size <math>, element <math> is set to:
<math>
Inputs:
- 0: shape — 1-D int64 tensor <math> (CPU).
- 1: values — 1-D tensor of type T with three elements <math> (CPU).
Supported element types: float, half.
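The region-based fill can be sketched with NumPy index grids. The element formula is not reproduced on this page, so two assumptions are made and should be checked against the C++ header: the three values are ordered (lower, diagonal, upper), and "lower" means row index greater than column index.

```python
import numpy as np


def tri_matrix(shape, values):
    """Fill a 2-D matrix by region; `values` is assumed to be (lower, diagonal, upper)."""
    n_rows, n_cols = (int(d) for d in shape)
    lower, diag, upper = values
    i = np.arange(n_rows)[:, None]   # column vector of row indices
    j = np.arange(n_cols)[None, :]   # row vector of column indices
    out = np.full((n_rows, n_cols), upper, dtype=values.dtype)
    out[i > j] = lower
    out[i == j] = diag
    return out


m = tri_matrix(np.array([3, 3], dtype=np.int64),
               np.array([-1.0, 0.0, 1.0], dtype=np.float32))
```

With values such as (0, 0, -inf) the same construction would produce a causal attention mask, which is one plausible use for such an op.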
- yaourt.ortops.doc.print_fused_kernel_cpu_ops() → None#
Prints the fused-kernel CPU custom-op catalogue to stdout.
Renders FUSED_KERNEL_CPU_OPS as plain text suitable for a .. runpython:: block in the Sphinx documentation, ensuring the rendered output is always derived from the C++ source files.
<<<
from yaourt.ortops.doc import print_fused_kernel_cpu_ops
print_fused_kernel_cpu_ops()
>>>
MulMul
domain : yaourt.ortops.fused_kernel.cpu
provider : CPUExecutionProvider
version : 1
Fused MulMul CPU custom operator.
Declares the kernel class for the element-wise ternary multiplication applied to three broadcastable input tensors A, B, C:
<math>
The implementation uses AVX2 SIMD instructions for vectorised throughput and std::thread for multi-threaded parallelism.
Supported element type: ``float``.
inputs : float32 x3
outputs : float32 x1
- yaourt.ortops.doc.print_fused_kernel_cpu_ops_rst() → None#
Outputs the fused-kernel CPU custom-op catalogue as RST to stdout.
Renders FUSED_KERNEL_CPU_OPS as valid reStructuredText suitable for a .. runpython:: :rst: block in the Sphinx documentation. Each op is rendered as a sub-section with a list-table for its metadata followed by its description, ensuring the rendered page is always derived from the C++ source files without manual maintenance.
<<<
from yaourt.ortops.doc import print_fused_kernel_cpu_ops_rst
print_fused_kernel_cpu_ops_rst()
>>>
MulMul#
Domain
yaourt.ortops.fused_kernel.cpu
Execution provider
CPUExecutionProvider
Inputs
3
Outputs
1
Fused MulMul CPU custom operator.
Declares the kernel class for the element-wise ternary multiplication applied to three broadcastable input tensors A, B, C:
<math>
The implementation uses AVX2 SIMD instructions for vectorised throughput and std::thread for multi-threaded parallelism.
Supported element type: float.
CPU ops catalogue#
<<<
from yaourt.ortops.doc import print_cpu_ops
print_cpu_ops()
>>>
DenseToSparse
domain : yaourt.ortops.sparse.cpu
provider : CPUExecutionProvider
version : 1
Converts a 2-D dense float32 tensor into a compact flat sparse encoding.
Only non-zero elements are stored. The 1-D output tensor encodes the
original shape, the number of non-zero elements, their flat indices
(stored as uint32), and their values (float32). The encoding is suitable
as input to SparseToDense for a lossless round-trip.
Constraints: input must be exactly 2-D; only float32 is supported.
inputs:
X (float32) — 2-D dense float32 input tensor of shape [n_rows, n_cols]. Zero elements are not stored in the sparse encoding.
outputs:
Y (float32) — 1-D float32 tensor containing the sparse encoding of X. Layout: header | flat indices (uint32) | non-zero values (float32).
SparseToDense
domain : yaourt.ortops.sparse.cpu
provider : CPUExecutionProvider
version : 1
Converts the compact sparse encoding produced by DenseToSparse back into
a 2-D dense float32 tensor. Positions that were zero in the original
tensor are filled with 0.0. The output shape is recovered from the
sparse header embedded in the input.
Constraints: input must be 1-D with a valid sparse header; the encoded
shape must be 2-D; only float32 is supported.
inputs:
X (float32) — 1-D float32 sparse encoding produced by DenseToSparse.
outputs:
Y (float32) — Reconstructed 2-D dense float32 tensor. The shape is recovered from the sparse header embedded in X.
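The round trip between the two ops can be illustrated in NumPy. The catalogue only states the overall layout (header, then uint32 flat indices, then float32 values), so the three-word header (n_rows, n_cols, nnz) used below is a hypothetical choice for this sketch and may differ from the C++ kernel. The uint32 words are stored bit for bit inside the float32 buffer.

```python
import numpy as np


def dense_to_sparse(x):
    """Pack a 2-D float32 array as [header | flat indices | values].

    The header (n_rows, n_cols, nnz) is a hypothetical layout for this sketch.
    """
    if x.ndim != 2 or x.dtype != np.float32:
        raise ValueError("expected a 2-D float32 array")
    flat = x.ravel()
    idx = np.flatnonzero(flat).astype(np.uint32)
    header = np.array([x.shape[0], x.shape[1], idx.size], dtype=np.uint32)
    # Reinterpret the uint32 words as float32 without changing their bits.
    return np.concatenate([header.view(np.float32),
                           idx.view(np.float32),
                           flat[idx]])


def sparse_to_dense(y):
    """Invert dense_to_sparse using the same hypothetical header."""
    n_rows, n_cols, nnz = (int(v) for v in y[:3].view(np.uint32))
    idx = y[3:3 + nnz].view(np.uint32)
    out = np.zeros(n_rows * n_cols, dtype=np.float32)
    out[idx] = y[3 + nnz:3 + 2 * nnz]
    return out.reshape(n_rows, n_cols)


x = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]], dtype=np.float32)
y = dense_to_sparse(x)
x_back = sparse_to_dense(y)
```

With this layout the encoding takes 3 + 2 * nnz float32 words, so it is smaller than the dense tensor only when fewer than roughly half of the elements are non-zero.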