yobx.xoptim.patterns_ort.relative_position_bias#
- class yobx.xoptim.patterns_ort.relative_position_bias.GatedRelativePositionBiasPattern(verbose: int = 0, priority: int = 2)[source]#
Implements the fusion of gated relative position bias computation (DeBERTa-v2/v3 style) into
com.microsoft.GatedRelativePositionBias. The fused pattern corresponds to the DeBERTa disentangled self-attention gating computation, which applies a learned sigmoid gate to modulate a pre-computed relative position bias tensor.
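The gating computation matched by this pattern can be sketched as a plain NumPy reference (a hypothetical re-implementation of the subgraph below, not the library's code; shapes and intermediate names follow the diagram):

```python
import numpy as np

def gated_relative_position_bias(query_layer, query_bias, rel_pos,
                                 gate_weight, gate_bias, eco_a, num_heads):
    """Reference sketch of the matched subgraph: Y = (2 + gate_u * (gate_r * eco_a - 1)) * rel_pos."""
    batch, seq_len, hidden = query_layer.shape
    head_size = hidden // num_heads
    # Add(query_layer, query_bias) -> Reshape -> Transpose(perm=[0,2,1,3])
    q = (query_layer + query_bias).reshape(batch, seq_len, num_heads, head_size)
    q = q.transpose(0, 2, 1, 3)                      # (batch, num_heads, seq_len, head_size)
    # MatMul(gate_weight) -> Add(gate_bias) -> Reshape(..., 2, D//2) -> ReduceSum(axis=-1)
    g = q @ gate_weight + gate_bias                  # (batch, num_heads, seq_len, D)
    D = g.shape[-1]
    g = g.reshape(batch, num_heads, seq_len, 2, D // 2).sum(axis=-1)
    g = 1.0 / (1.0 + np.exp(-g))                     # Sigmoid
    gate_u, gate_r = np.split(g, 2, axis=-1)         # Split(axis=-1)
    gate = gate_u * (gate_r * eco_a - 1.0) + 2.0     # Mul, Sub, Mul, Add chain
    return gate * rel_pos                            # broadcasts to (batch, num_heads, seq_len, seq_len)
```

With zero gate weights the sigmoid outputs 0.5 everywhere, so the result reduces to a constant scaling of `rel_pos`, which makes the formula easy to check by hand.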
Model with nodes to be fused:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_query_layer(["query_layer FLOAT(batch, seq_len, num_heads*head_size)"])
  I_rel_pos(["rel_pos FLOAT(1, num_heads, seq_len, seq_len)"])
  i_query_bias["query_bias FLOAT(num_heads*head_size)"]
  i_gate_weight["gate_weight FLOAT(head_size, D)"]
  i_gate_bias["gate_bias FLOAT(D)"]
  i_eco_a["eco_a FLOAT(1, num_heads, 1, 1)"]
  Add_0[["Add(query_layer, query_bias)"]]
  Reshape_1[["Reshape(., [batch, seq_len, num_heads, head_size])"]]
  Transpose_2[["Transpose(., perm=[0,2,1,3])"]]
  MatMul_3[["MatMul(., gate_weight)"]]
  Add_4[["Add(., gate_bias)"]]
  Reshape_5[["Reshape(., [batch, num_heads, seq_len, 2, D//2])"]]
  ReduceSum_6[["ReduceSum(., axis=-1, keepdims=0)"]]
  Sigmoid_7[["Sigmoid(.)"]]
  Split_8[["Split(., axis=-1)"]]
  Mul_9[["Mul(gate_r, eco_a)"]]
  Sub_10[["Sub(., 1.0)"]]
  Mul_11[["Mul(gate_u, .)"]]
  Add_12[["Add(., 2.0)"]]
  Mul_13[["Mul(gate_u_1, rel_pos)"]]
  I_query_layer -->|"FLOAT(batch, seq_len, num_heads*head_size)"| Add_0
  i_query_bias -->|"FLOAT(num_heads*head_size)"| Add_0
  Add_0 --> Reshape_1
  Reshape_1 --> Transpose_2
  Transpose_2 --> MatMul_3
  i_gate_weight -->|"FLOAT(head_size, D)"| MatMul_3
  MatMul_3 --> Add_4
  i_gate_bias -->|"FLOAT(D)"| Add_4
  Add_4 --> Reshape_5
  Reshape_5 --> ReduceSum_6
  ReduceSum_6 --> Sigmoid_7
  Sigmoid_7 --> Split_8
  Split_8 -->|"gate_r"| Mul_9
  i_eco_a -->|"FLOAT(1, num_heads, 1, 1)"| Mul_9
  Mul_9 --> Sub_10
  Split_8 -->|"gate_u"| Mul_11
  Sub_10 --> Mul_11
  Mul_11 --> Add_12
  Add_12 --> Mul_13
  I_rel_pos -->|"FLOAT(1, num_heads, seq_len, seq_len)"| Mul_13
  O_Y(["Y FLOAT(batch, num_heads, seq_len, seq_len)"])
  Mul_13 --> O_Y
  class I_query_layer,I_rel_pos,O_Y ioNode
  class i_query_bias,i_gate_weight,i_gate_bias,i_eco_a initNode
  class Add_0,Reshape_1,Transpose_2,MatMul_3,Add_4 opNode
  class Reshape_5,ReduceSum_6,Sigmoid_7,Split_8 opNode
  class Mul_9,Sub_10,Mul_11,Add_12,Mul_13 opNode
Outcome of the fusion:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_query_layer(["query_layer FLOAT(batch, seq_len, num_heads*head_size)"])
  I_rel_pos(["rel_pos FLOAT(1, num_heads, seq_len, seq_len)"])
  i_query_bias["query_bias FLOAT(num_heads*head_size)"]
  i_gate_weight["gate_weight FLOAT(head_size, D)"]
  i_gate_bias["gate_bias FLOAT(D)"]
  i_eco_a["eco_a FLOAT(1, num_heads, 1, 1)"]
  GatedRPB_0[["com.microsoft.GatedRelativePositionBias(., ., ., ., ., .)"]]
  I_query_layer -->|"FLOAT(batch, seq_len, num_heads*head_size)"| GatedRPB_0
  i_query_bias -->|"FLOAT(num_heads*head_size)"| GatedRPB_0
  I_rel_pos -->|"FLOAT(1, num_heads, seq_len, seq_len)"| GatedRPB_0
  i_gate_weight -->|"FLOAT(head_size, D)"| GatedRPB_0
  i_gate_bias -->|"FLOAT(D)"| GatedRPB_0
  i_eco_a -->|"FLOAT(1, num_heads, 1, 1)"| GatedRPB_0
  O_Y(["Y FLOAT(batch, num_heads, seq_len, seq_len)"])
  GatedRPB_0 --> O_Y
  class I_query_layer,I_rel_pos,O_Y ioNode
  class i_query_bias,i_gate_weight,i_gate_bias,i_eco_a initNode
  class GatedRPB_0 opNode
- apply(g: GraphBuilder, add_query_bias_node: NodeProto, reshape_1_node: NodeProto, transpose_node: NodeProto, matmul_node: NodeProto, add_gate_bias_node: NodeProto, reshape_2_node: NodeProto, reduce_sum_node: NodeProto, sigmoid_node: NodeProto, split_node: NodeProto, mul_eco_a_node: NodeProto, sub_node: NodeProto, mul_gate_u_node: NodeProto, add_two_node: NodeProto, final_mul_node: NodeProto) List[NodeProto][source]#
The method does the rewriting. It assumes it can happen. It takes the list of nodes impacted by the rewriting and assumes no other pattern optimizer will modify them. It receives the list of nodes returned by method match. Since it is a list of arguments, method match can include None values. The method returns the new nodes. The optimizer considers that any node given to this function is removed from the graph, and any node returned by it is added. If a received node must be kept, it must be added to the list of returned nodes.
- Parameters:
nodes – nodes returned by method match; they are then removed
- Returns:
nodes to add to the graph.
- match(g: GraphBuilderPatternOptimization, node: NodeProto, matched: List[MatchResult]) MatchResult | None[source]#
Determines nodes around node which can be rewritten.
- Parameters:
g – a
GraphBuilderPatternOptimization; it holds all the existing nodes and can return any information about types, shapes, and the nodes before or after another one.
node – the matching must determine whether some nodes around this one are part of the set of nodes this pattern optimizer can rewrite. From there, the function explores wherever it needs, checking any condition it needs.
matched – usually unused; it contains the nodes already matching a pattern
The method must not modify the graph. The method returns None if no match is found or an instance of class
MatchResult. It must contain:
- a list of nodes involved in the rewriting; it does not mean all of them will be removed, but all of them are needed to do the rewriting and must not be impacted by another pattern optimizer,
- a function doing the rewriting (usually method apply of the pattern class),
- an existing node where the rewritten nodes can be inserted; knowing it makes the rewriting faster. If not specified, the optimizer will automatically determine the position of the new nodes.
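The match/apply contract described above can be illustrated with a self-contained toy example. Everything here is a simplified stand-in (the real GraphBuilder, MatchResult, and NodeProto classes have richer interfaces); the toy pattern fuses an Add followed by a Mul into a single hypothetical FusedAddMul node:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    """Stand-in for NodeProto: an operator with named inputs and outputs."""
    op: str
    inputs: List[str]
    outputs: List[str]

@dataclass
class MatchResult:
    """Stand-in for the real MatchResult: nodes involved + the rewriting function."""
    nodes: List[Node]
    apply: Callable[..., List[Node]]

class ToyAddMulPattern:
    """Fuses Add -> Mul into one FusedAddMul node (illustrative only)."""

    def match(self, graph: List[Node], node: Node) -> Optional[MatchResult]:
        # match only inspects the graph; it must not modify it.
        if node.op != "Add":
            return None
        consumers = [n for n in graph if node.outputs[0] in n.inputs]
        if len(consumers) != 1 or consumers[0].op != "Mul":
            return None
        return MatchResult([node, consumers[0]], self.apply)

    def apply(self, add_node: Node, mul_node: Node) -> List[Node]:
        # Matched nodes are removed from the graph; returned nodes are added.
        other = [i for i in mul_node.inputs if i != add_node.outputs[0]]
        return [Node("FusedAddMul", add_node.inputs + other, mul_node.outputs)]
```

The Add node is only fused when it has exactly one consumer: if its output fed another node as well, removing it would break that consumer, which is why match checks the consumer count before returning a MatchResult.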
- class yobx.xoptim.patterns_ort.relative_position_bias.RelativePositionBiasPattern(verbose: int = 0, priority: int = 2)[source]#
Fuses the relative position bias computation (T5-style, encoder) into
com.microsoft.RelativePositionBias. The fused pattern corresponds to the T5 bidirectional relative attention bias computation, recognizable by a
Gather node reading from a learnable bias table, whose indices are computed through a bucketing function of absolute relative positions.
Model with nodes to be fused:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_seq_len(["seq_len INT64()"])
  i_bias_table["bias_table FLOAT(num_buckets, num_heads)"]
  i_zero["zero INT64()"]
  i_one["one INT64()"]
  i_max_exact_int["max_exact INT64()"]
  i_max_exact_float["max_exact FLOAT()"]
  i_log_max["log_max FLOAT()"]
  i_scale["scale FLOAT()"]
  i_clamp["clamp_val INT64()"]
  Range_0[["Range(zero, seq_len, one)"]]
  Unsqueeze_1[["Unsqueeze(., [0])"]]
  Unsqueeze_2[["Unsqueeze(., [1])"]]
  Sub_3[["Sub(., .)"]]
  Abs_4[["Abs(.)"]]
  CastFloat_5[["Cast(., to=FLOAT)"]]
  Div_6[["Div(., max_exact_float)"]]
  Log_7[["Log(.)"]]
  Div_8[["Div(., log_max)"]]
  Mul_9[["Mul(., scale)"]]
  CastInt_10[["Cast(., to=INT64)"]]
  Add_11[["Add(., max_exact_int)"]]
  Shape_12[["Shape(.)"]]
  ConstantOfShape_13[["ConstantOfShape(., clamp_val)"]]
  Min_14[["Min(., .)"]]
  CastInt2_15[["Cast(., to=INT64)"]]
  Where_16[["Where(., ., .)"]]
  Gather_17[["Gather(bias_table, .)"]]
  Transpose_18[["Transpose(., perm=[2,0,1])"]]
  Unsqueeze_19[["Unsqueeze(., [0])"]]
  I_seq_len -->|"INT64()"| Range_0
  Range_0 -->|"INT64(seq_len)"| Unsqueeze_1
  Range_0 -->|"INT64(seq_len)"| Unsqueeze_2
  Unsqueeze_1 -->|"INT64(1, seq_len)"| Sub_3
  Unsqueeze_2 -->|"INT64(seq_len, 1)"| Sub_3
  Sub_3 -->|"INT64(seq_len, seq_len)"| Abs_4
  Abs_4 -->|"INT64(seq_len, seq_len)"| CastFloat_5
  Abs_4 -->|"INT64(seq_len, seq_len)"| CastInt2_15
  CastFloat_5 -->|"FLOAT(seq_len, seq_len)"| Div_6
  Div_6 -->|"FLOAT(seq_len, seq_len)"| Log_7
  Log_7 -->|"FLOAT(seq_len, seq_len)"| Div_8
  Div_8 -->|"FLOAT(seq_len, seq_len)"| Mul_9
  Mul_9 -->|"FLOAT(seq_len, seq_len)"| CastInt_10
  CastInt_10 -->|"INT64(seq_len, seq_len)"| Add_11
  Add_11 -->|"INT64(seq_len, seq_len)"| Shape_12
  Add_11 -->|"INT64(seq_len, seq_len)"| Min_14
  Shape_12 -->|"INT64(2)"| ConstantOfShape_13
  ConstantOfShape_13 -->|"INT64(seq_len, seq_len)"| Min_14
  CastInt2_15 -->|"INT64(seq_len, seq_len)"| Where_16
  Min_14 -->|"INT64(seq_len, seq_len)"| Where_16
  i_bias_table -->|"FLOAT(num_buckets, num_heads)"| Gather_17
  Where_16 -->|"INT64(seq_len, seq_len)"| Gather_17
  Gather_17 -->|"FLOAT(seq_len, seq_len, num_heads)"| Transpose_18
  Transpose_18 -->|"FLOAT(num_heads, seq_len, seq_len)"| Unsqueeze_19
  O_Y(["Y FLOAT(1, num_heads, seq_len, seq_len)"])
  Unsqueeze_19 --> O_Y
  class I_seq_len,O_Y ioNode
  class i_bias_table,i_zero,i_one,i_max_exact_int,i_max_exact_float initNode
  class i_log_max,i_scale,i_clamp initNode
  class Range_0,Unsqueeze_1,Unsqueeze_2,Sub_3,Abs_4 opNode
  class CastFloat_5,Div_6,Log_7,Div_8,Mul_9,CastInt_10 opNode
  class Add_11,Shape_12,ConstantOfShape_13,Min_14 opNode
  class CastInt2_15,Where_16,Gather_17,Transpose_18,Unsqueeze_19 opNode
Outcome of the fusion:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_seq_len(["seq_len INT64()"])
  i_bias_table_t["bias_table_T FLOAT(num_heads, num_buckets)"]
  RelativePositionBias_0[["com.microsoft.RelativePositionBias(., ., .)"]]
  i_bias_table_t -->|"FLOAT(num_heads, num_buckets)"| RelativePositionBias_0
  I_seq_len -->|"INT64()"| RelativePositionBias_0
  I_seq_len -->|"INT64()"| RelativePositionBias_0
  O_Y(["Y FLOAT(1, num_heads, seq_len, seq_len)"])
  RelativePositionBias_0 --> O_Y
  class I_seq_len,O_Y ioNode
  class i_bias_table_t initNode
  class RelativePositionBias_0 opNode
- apply(g: GraphBuilder, cast_float_node: NodeProto, div_pos_node: NodeProto, log_node: NodeProto, div_log_node: NodeProto, mul_node: NodeProto, cast_int_bucket_node: NodeProto, add_bucket_node: NodeProto, shape_node: NodeProto, const_of_shape_node: NodeProto, min_node: NodeProto, where_node: NodeProto, gather_node: NodeProto, transpose_node: NodeProto, unsqueeze_batch_node: NodeProto) List[NodeProto][source]#
The method does the rewriting. It assumes it can happen. It takes the list of nodes impacted by the rewriting and assumes no other pattern optimizer will modify them. It receives the list of nodes returned by method match. Since it is a list of arguments, method match can include None values. The method returns the new nodes. The optimizer considers that any node given to this function is removed from the graph, and any node returned by it is added. If a received node must be kept, it must be added to the list of returned nodes.
- Parameters:
nodes – nodes returned by method match; they are then removed
- Returns:
nodes to add to the graph.
- match(g: GraphBuilderPatternOptimization, node: NodeProto, matched: List[MatchResult]) MatchResult | None[source]#
Determines nodes around node which can be rewritten.
- Parameters:
g – a
GraphBuilderPatternOptimization; it holds all the existing nodes and can return any information about types, shapes, and the nodes before or after another one.
node – the matching must determine whether some nodes around this one are part of the set of nodes this pattern optimizer can rewrite. From there, the function explores wherever it needs, checking any condition it needs.
matched – usually unused; it contains the nodes already matching a pattern
The method must not modify the graph. The method returns None if no match is found or an instance of class
MatchResult. It must contain:
- a list of nodes involved in the rewriting; it does not mean all of them will be removed, but all of them are needed to do the rewriting and must not be impacted by another pattern optimizer,
- a function doing the rewriting (usually method apply of the pattern class),
- an existing node where the rewritten nodes can be inserted; knowing it makes the rewriting faster. If not specified, the optimizer will automatically determine the position of the new nodes.
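The bucketing computation that RelativePositionBiasPattern recognizes can be sketched in NumPy. This is a hypothetical reference following the standard T5 bidirectional bucketing over absolute relative positions; the constants (max_exact, max_distance, the clamp value) are baked into the graph as initializers, so their names here are illustrative:

```python
import numpy as np

def t5_relative_position_bias(bias_table, seq_len, max_exact, max_distance):
    """Sketch of the matched subgraph; bias_table is FLOAT(num_buckets, num_heads)."""
    num_buckets, num_heads = bias_table.shape
    pos = np.arange(seq_len)
    # Range -> Unsqueeze([0]) / Unsqueeze([1]) -> Sub -> Abs
    rel = np.abs(pos[None, :] - pos[:, None])        # (seq_len, seq_len)
    # Logarithmic bucket index for positions beyond max_exact
    large = max_exact + (
        np.log(rel.clip(min=1) / max_exact)
        / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(np.int64)
    large = np.minimum(large, num_buckets - 1)       # Min with the clamp value
    buckets = np.where(rel < max_exact, rel, large)  # exact buckets for small distances
    bias = bias_table[buckets]                       # Gather -> (seq_len, seq_len, num_heads)
    # Transpose(perm=[2,0,1]) + Unsqueeze([0]) -> (1, num_heads, seq_len, seq_len)
    return bias.transpose(2, 0, 1)[None]
```

Distances below max_exact map to their own bucket one-to-one; larger distances share logarithmically spaced buckets, clamped to num_buckets - 1, which is what the Log/Div/Mul/Min chain in the matched graph computes.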