yobx.xoptim.patterns_ort.relative_position_bias#

class yobx.xoptim.patterns_ort.relative_position_bias.GatedRelativePositionBiasPattern(verbose: int = 0, priority: int = 2)[source]#

Implements the fusion of the gated relative position bias computation (DeBERTa-v2/v3 style) into com.microsoft.GatedRelativePositionBias.

The fused pattern corresponds to the DeBERTa disentangled self-attention gating computation, which applies a learned sigmoid gate to modulate a pre-computed relative position bias tensor.
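The computation being fused can be sketched in NumPy. This is a minimal sketch, not the library's implementation; argument names follow the diagram below, and the shapes are the ones annotated on the graph edges.

```python
import numpy as np

def gated_relative_position_bias(query_layer, rel_pos, query_bias,
                                 gate_weight, gate_bias, eco_a,
                                 num_heads, head_size):
    """NumPy sketch of the subgraph fused by GatedRelativePositionBiasPattern.

    Shapes (as in the diagram):
      query_layer: (batch, seq_len, num_heads*head_size)
      rel_pos:     (1, num_heads, seq_len, seq_len)
      query_bias:  (num_heads*head_size,)
      gate_weight: (head_size, D), D even
      gate_bias:   (D,)
      eco_a:       (1, num_heads, 1, 1)
    """
    batch, seq_len, _ = query_layer.shape
    D = gate_weight.shape[1]
    # Add -> Reshape -> Transpose: split heads, (batch, num_heads, seq_len, head_size)
    q = (query_layer + query_bias).reshape(batch, seq_len, num_heads, head_size)
    q = q.transpose(0, 2, 1, 3)
    # MatMul -> Add -> Reshape -> ReduceSum -> Sigmoid:
    # project, sum D//2-sized pairs, squash: (batch, num_heads, seq_len, 2)
    gate = (q @ gate_weight + gate_bias).reshape(batch, num_heads, seq_len, 2, D // 2)
    gate = 1.0 / (1.0 + np.exp(-gate.sum(axis=-1)))
    # Split on the last axis into the two gates, each (..., 1)
    gate_u, gate_r = np.split(gate, 2, axis=-1)
    # Mul/Sub/Mul/Add/Mul chain: (gate_u * (gate_r * eco_a - 1) + 2) * rel_pos
    return (gate_u * (gate_r * eco_a - 1.0) + 2.0) * rel_pos
```

Broadcasting against `eco_a` and `rel_pos` produces the output shape FLOAT(batch, num_heads, seq_len, seq_len) shown for Y below.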

Model with nodes to be fused:

        graph TD

    classDef ioNode fill:#dfd,stroke:#333,color:#333
    classDef initNode fill:#cccc00,stroke:#333,color:#333
    classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
    classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333

    I_query_layer(["query_layer FLOAT(batch, seq_len, num_heads*head_size)"])
    I_rel_pos(["rel_pos FLOAT(1, num_heads, seq_len, seq_len)"])
    i_query_bias["query_bias FLOAT(num_heads*head_size)"]
    i_gate_weight["gate_weight FLOAT(head_size, D)"]
    i_gate_bias["gate_bias FLOAT(D)"]
    i_eco_a["eco_a FLOAT(1, num_heads, 1, 1)"]

    Add_0[["Add(query_layer, query_bias)"]]
    Reshape_1[["Reshape(., [batch, seq_len, num_heads, head_size])"]]
    Transpose_2[["Transpose(., perm=[0,2,1,3])"]]
    MatMul_3[["MatMul(., gate_weight)"]]
    Add_4[["Add(., gate_bias)"]]
    Reshape_5[["Reshape(., [batch, num_heads, seq_len, 2, D//2])"]]
    ReduceSum_6[["ReduceSum(., axis=-1, keepdims=0)"]]
    Sigmoid_7[["Sigmoid(.)"]]
    Split_8[["Split(., axis=-1)"]]
    Mul_9[["Mul(gate_r, eco_a)"]]
    Sub_10[["Sub(., 1.0)"]]
    Mul_11[["Mul(gate_u, .)"]]
    Add_12[["Add(., 2.0)"]]
    Mul_13[["Mul(gate_u_1, rel_pos)"]]

    I_query_layer -->|"FLOAT(batch, seq_len, num_heads*head_size)"| Add_0
    i_query_bias -->|"FLOAT(num_heads*head_size)"| Add_0
    Add_0 --> Reshape_1
    Reshape_1 --> Transpose_2
    Transpose_2 --> MatMul_3
    i_gate_weight -->|"FLOAT(head_size, D)"| MatMul_3
    MatMul_3 --> Add_4
    i_gate_bias -->|"FLOAT(D)"| Add_4
    Add_4 --> Reshape_5
    Reshape_5 --> ReduceSum_6
    ReduceSum_6 --> Sigmoid_7
    Sigmoid_7 --> Split_8
    Split_8 -->|"gate_r"| Mul_9
    i_eco_a -->|"FLOAT(1, num_heads, 1, 1)"| Mul_9
    Mul_9 --> Sub_10
    Split_8 -->|"gate_u"| Mul_11
    Sub_10 --> Mul_11
    Mul_11 --> Add_12
    Add_12 --> Mul_13
    I_rel_pos -->|"FLOAT(1, num_heads, seq_len, seq_len)"| Mul_13

    O_Y(["Y FLOAT(batch, num_heads, seq_len, seq_len)"])
    Mul_13 --> O_Y

    class I_query_layer,I_rel_pos,O_Y ioNode
    class i_query_bias,i_gate_weight,i_gate_bias,i_eco_a initNode
    class Add_0,Reshape_1,Transpose_2,MatMul_3,Add_4 opNode
    class Reshape_5,ReduceSum_6,Sigmoid_7,Split_8 opNode
    class Mul_9,Sub_10,Mul_11,Add_12,Mul_13 opNode
    

Outcome of the fusion:

        graph TD

    classDef ioNode fill:#dfd,stroke:#333,color:#333
    classDef initNode fill:#cccc00,stroke:#333,color:#333
    classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
    classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333

    I_query_layer(["query_layer FLOAT(batch, seq_len, num_heads*head_size)"])
    I_rel_pos(["rel_pos FLOAT(1, num_heads, seq_len, seq_len)"])
    i_query_bias["query_bias FLOAT(num_heads*head_size)"]
    i_gate_weight["gate_weight FLOAT(head_size, D)"]
    i_gate_bias["gate_bias FLOAT(D)"]
    i_eco_a["eco_a FLOAT(1, num_heads, 1, 1)"]

    GatedRPB_0[["com.microsoft.GatedRelativePositionBias(., ., ., ., ., .)"]]

    I_query_layer -->|"FLOAT(batch, seq_len, num_heads*head_size)"| GatedRPB_0
    i_query_bias -->|"FLOAT(num_heads*head_size)"| GatedRPB_0
    I_rel_pos -->|"FLOAT(1, num_heads, seq_len, seq_len)"| GatedRPB_0
    i_gate_weight -->|"FLOAT(head_size, D)"| GatedRPB_0
    i_gate_bias -->|"FLOAT(D)"| GatedRPB_0
    i_eco_a -->|"FLOAT(1, num_heads, 1, 1)"| GatedRPB_0

    O_Y(["Y FLOAT(batch, num_heads, seq_len, seq_len)"])
    GatedRPB_0 --> O_Y

    class I_query_layer,I_rel_pos,O_Y ioNode
    class i_query_bias,i_gate_weight,i_gate_bias,i_eco_a initNode
    class GatedRPB_0 opNode
    
apply(g: GraphBuilder, add_query_bias_node: NodeProto, reshape_1_node: NodeProto, transpose_node: NodeProto, matmul_node: NodeProto, add_gate_bias_node: NodeProto, reshape_2_node: NodeProto, reduce_sum_node: NodeProto, sigmoid_node: NodeProto, split_node: NodeProto, mul_eco_a_node: NodeProto, sub_node: NodeProto, mul_gate_u_node: NodeProto, add_two_node: NodeProto, final_mul_node: NodeProto) List[NodeProto][source]#

The method does the rewriting and assumes it can happen. It takes the list of nodes impacted by the rewriting and assumes no other pattern optimizer will modify them. It receives the list of nodes returned by method match. Since this is a list of arguments, method match can include None values in it. The method returns the new nodes. The optimizer considers that any node given to this function is removed from the graph, and any node returned by it is added. If a received node must be kept, it must be included in the list of returned nodes.

Parameters:

nodes – the nodes returned by method match; they are removed afterwards

Returns:

the nodes to add to the graph.

match(g: GraphBuilderPatternOptimization, node: NodeProto, matched: List[MatchResult]) MatchResult | None[source]#

Determines which nodes around node can be rewritten.

Parameters:
  • g – a GraphBuilderPatternOptimization; it holds all the existing nodes and can return any information about the type or shape of a result, or about the nodes before and after another one.

  • node – the matching must determine whether some nodes around this one are part of a set of nodes this pattern optimizer can rewrite. From there, the function explores wherever it needs, checking any condition it needs.

  • matched – usually unused, it holds the list of nodes already matching a pattern

The method must not modify the graph. It returns None if no match is found, or an instance of class MatchResult, which must contain:

  • a list of nodes involved in the rewriting. Not all of them will necessarily be removed, but all of them are needed to do the rewriting and must not be impacted by another pattern optimizer.

  • A function doing the rewriting (usually method apply of the pattern class).

  • an existing node where the rewritten nodes can be inserted. Knowing it makes the rewriting faster. If not specified, the optimizer automatically determines the position of the new nodes.
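The match/apply contract described above can be illustrated with a small standalone sketch. The classes below are simplified stand-ins, not the library's actual API: `Node` replaces NodeProto, `MatchResultSketch` replaces MatchResult, and the graph traversal that a real match performs is stubbed out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    """Stand-in for an ONNX NodeProto: just an operator type and a name."""
    op_type: str
    name: str

@dataclass
class MatchResultSketch:
    """Stand-in for MatchResult: the matched nodes, the rewriting
    function, and an optional insertion point for the new nodes."""
    nodes: List[Optional[Node]]          # may contain None placeholders
    apply: Callable[..., List[Node]]
    insert_at: Optional[Node] = None

def apply_sketch(sigmoid: Node, mul: Node) -> List[Node]:
    # Contract: every node received here is removed from the graph,
    # every node returned is added. To keep a received node, return it too.
    fused = Node("com.microsoft.FusedOp", "fused_0")  # hypothetical fused op
    return [fused]

def match_sketch(node: Node) -> Optional[MatchResultSketch]:
    # A real match() walks the graph around `node` without modifying it,
    # returning None as soon as any structural condition fails.
    if node.op_type != "Sigmoid":
        return None
    mul = Node("Mul", "mul_0")  # found by graph traversal in the real code
    return MatchResultSketch(nodes=[node, mul], apply=apply_sketch)
```

The matched nodes are passed back to apply as positional arguments, which is why None placeholders in the list are allowed: they simply arrive as None arguments.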

class yobx.xoptim.patterns_ort.relative_position_bias.RelativePositionBiasPattern(verbose: int = 0, priority: int = 2)[source]#

Fuses the relative position bias computation (T5-style, encoder) into com.microsoft.RelativePositionBias.

The fused pattern corresponds to the T5 bidirectional relative attention bias computation, recognizable by a Gather node reading from a learnable bias table, whose indices are computed through a bucketing function of absolute relative positions.
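The bucketing computation can be sketched in NumPy. This is a minimal sketch under the usual T5 conventions: the constant names (max_exact, log_max, scale, clamp_val) mirror the diagram below, but the concrete values (`num_buckets // 2`, `log(max_distance / max_exact)`, `num_buckets - max_exact`, `num_buckets - 1`) are assumptions here, taken from the standard T5 bidirectional bucketing rather than read off the graph.

```python
import numpy as np

def t5_relative_position_bias(bias_table, seq_len, num_buckets=32, max_distance=128):
    """NumPy sketch of the T5-style subgraph fused by RelativePositionBiasPattern.
    `bias_table` has shape (num_buckets, num_heads), as in the diagram."""
    pos = np.arange(seq_len)
    # Range/Unsqueeze/Sub/Abs: absolute relative positions, (seq_len, seq_len)
    rel = np.abs(pos[None, :] - pos[:, None])
    max_exact = num_buckets // 2
    log_max = np.log(max_distance / max_exact)
    scale = num_buckets - max_exact
    # Cast/Div/Log/Div/Mul/Cast/Add: logarithmic buckets for large distances;
    # maximum(rel, 1) only avoids log(0) warnings, the Where below hides
    # that branch for small distances anyway
    large = max_exact + (np.log(np.maximum(rel, 1) / max_exact)
                         / log_max * scale).astype(np.int64)
    # Min with clamp_val: never exceed the last bucket
    large = np.minimum(large, num_buckets - 1)
    # Where: exact buckets below max_exact, logarithmic buckets above
    bucket = np.where(rel < max_exact, rel, large)
    # Gather from the learnable table: (seq_len, seq_len, num_heads)
    bias = bias_table[bucket]
    # Transpose/Unsqueeze: (1, num_heads, seq_len, seq_len)
    return bias.transpose(2, 0, 1)[None]
```

Note that the fused com.microsoft.RelativePositionBias shown below takes the transposed table, FLOAT(num_heads, num_buckets), as its first input.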

Model with nodes to be fused:

        graph TD

    classDef ioNode fill:#dfd,stroke:#333,color:#333
    classDef initNode fill:#cccc00,stroke:#333,color:#333
    classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
    classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333

    I_seq_len(["seq_len INT64()"])
    i_bias_table["bias_table FLOAT(num_buckets, num_heads)"]
    i_zero["zero INT64()"]
    i_one["one INT64()"]
    i_max_exact_int["max_exact INT64()"]
    i_max_exact_float["max_exact FLOAT()"]
    i_log_max["log_max FLOAT()"]
    i_scale["scale FLOAT()"]
    i_clamp["clamp_val INT64()"]

    Range_0[["Range(zero, seq_len, one)"]]
    Unsqueeze_1[["Unsqueeze(., [0])"]]
    Unsqueeze_2[["Unsqueeze(., [1])"]]
    Sub_3[["Sub(., .)"]]
    Abs_4[["Abs(.)"]]
    CastFloat_5[["Cast(., to=FLOAT)"]]
    Div_6[["Div(., max_exact_float)"]]
    Log_7[["Log(.)"]]
    Div_8[["Div(., log_max)"]]
    Mul_9[["Mul(., scale)"]]
    CastInt_10[["Cast(., to=INT64)"]]
    Add_11[["Add(., max_exact_int)"]]
    Shape_12[["Shape(.)"]]
    ConstantOfShape_13[["ConstantOfShape(., clamp_val)"]]
    Min_14[["Min(., .)"]]
    CastInt2_15[["Cast(., to=INT64)"]]
    Where_16[["Where(., ., .)"]]
    Gather_17[["Gather(bias_table, .)"]]
    Transpose_18[["Transpose(., perm=[2,0,1])"]]
    Unsqueeze_19[["Unsqueeze(., [0])"]]

    I_seq_len -->|"INT64()"| Range_0
    Range_0 -->|"INT64(seq_len)"| Unsqueeze_1
    Range_0 -->|"INT64(seq_len)"| Unsqueeze_2
    Unsqueeze_1 -->|"INT64(1, seq_len)"| Sub_3
    Unsqueeze_2 -->|"INT64(seq_len, 1)"| Sub_3
    Sub_3 -->|"INT64(seq_len, seq_len)"| Abs_4
    Abs_4 -->|"INT64(seq_len, seq_len)"| CastFloat_5
    Abs_4 -->|"INT64(seq_len, seq_len)"| CastInt2_15
    CastFloat_5 -->|"FLOAT(seq_len, seq_len)"| Div_6
    Div_6 -->|"FLOAT(seq_len, seq_len)"| Log_7
    Log_7 -->|"FLOAT(seq_len, seq_len)"| Div_8
    Div_8 -->|"FLOAT(seq_len, seq_len)"| Mul_9
    Mul_9 -->|"FLOAT(seq_len, seq_len)"| CastInt_10
    CastInt_10 -->|"INT64(seq_len, seq_len)"| Add_11
    Add_11 -->|"INT64(seq_len, seq_len)"| Shape_12
    Add_11 -->|"INT64(seq_len, seq_len)"| Min_14
    Shape_12 -->|"INT64(2)"| ConstantOfShape_13
    ConstantOfShape_13 -->|"INT64(seq_len, seq_len)"| Min_14
    CastInt2_15 -->|"INT64(seq_len, seq_len)"| Where_16
    Min_14 -->|"INT64(seq_len, seq_len)"| Where_16
    i_bias_table -->|"FLOAT(num_buckets, num_heads)"| Gather_17
    Where_16 -->|"INT64(seq_len, seq_len)"| Gather_17
    Gather_17 -->|"FLOAT(seq_len, seq_len, num_heads)"| Transpose_18
    Transpose_18 -->|"FLOAT(num_heads, seq_len, seq_len)"| Unsqueeze_19

    O_Y(["Y FLOAT(1, num_heads, seq_len, seq_len)"])
    Unsqueeze_19 --> O_Y

    class I_seq_len,O_Y ioNode
    class i_bias_table,i_zero,i_one,i_max_exact_int,i_max_exact_float initNode
    class i_log_max,i_scale,i_clamp initNode
    class Range_0,Unsqueeze_1,Unsqueeze_2,Sub_3,Abs_4 opNode
    class CastFloat_5,Div_6,Log_7,Div_8,Mul_9,CastInt_10 opNode
    class Add_11,Shape_12,ConstantOfShape_13,Min_14 opNode
    class CastInt2_15,Where_16,Gather_17,Transpose_18,Unsqueeze_19 opNode
    

Outcome of the fusion:

        graph TD

    classDef ioNode fill:#dfd,stroke:#333,color:#333
    classDef initNode fill:#cccc00,stroke:#333,color:#333
    classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
    classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333

    I_seq_len(["seq_len INT64()"])
    i_bias_table_t["bias_table_T FLOAT(num_heads, num_buckets)"]

    RelativePositionBias_0[["com.microsoft.RelativePositionBias(., ., .)"]]

    i_bias_table_t -->|"FLOAT(num_heads, num_buckets)"| RelativePositionBias_0
    I_seq_len -->|"INT64()"| RelativePositionBias_0
    I_seq_len -->|"INT64()"| RelativePositionBias_0

    O_Y(["Y FLOAT(1, num_heads, seq_len, seq_len)"])
    RelativePositionBias_0 --> O_Y

    class I_seq_len,O_Y ioNode
    class i_bias_table_t initNode
    class RelativePositionBias_0 opNode
    
apply(g: GraphBuilder, cast_float_node: NodeProto, div_pos_node: NodeProto, log_node: NodeProto, div_log_node: NodeProto, mul_node: NodeProto, cast_int_bucket_node: NodeProto, add_bucket_node: NodeProto, shape_node: NodeProto, const_of_shape_node: NodeProto, min_node: NodeProto, where_node: NodeProto, gather_node: NodeProto, transpose_node: NodeProto, unsqueeze_batch_node: NodeProto) List[NodeProto][source]#

The method does the rewriting and assumes it can happen. It takes the list of nodes impacted by the rewriting and assumes no other pattern optimizer will modify them. It receives the list of nodes returned by method match. Since this is a list of arguments, method match can include None values in it. The method returns the new nodes. The optimizer considers that any node given to this function is removed from the graph, and any node returned by it is added. If a received node must be kept, it must be included in the list of returned nodes.

Parameters:

nodes – the nodes returned by method match; they are removed afterwards

Returns:

the nodes to add to the graph.

match(g: GraphBuilderPatternOptimization, node: NodeProto, matched: List[MatchResult]) MatchResult | None[source]#

Determines which nodes around node can be rewritten.

Parameters:
  • g – a GraphBuilderPatternOptimization; it holds all the existing nodes and can return any information about the type or shape of a result, or about the nodes before and after another one.

  • node – the matching must determine whether some nodes around this one are part of a set of nodes this pattern optimizer can rewrite. From there, the function explores wherever it needs, checking any condition it needs.

  • matched – usually unused, it holds the list of nodes already matching a pattern

The method must not modify the graph. It returns None if no match is found, or an instance of class MatchResult, which must contain:

  • a list of nodes involved in the rewriting. Not all of them will necessarily be removed, but all of them are needed to do the rewriting and must not be impacted by another pattern optimizer.

  • A function doing the rewriting (usually method apply of the pattern class).

  • an existing node where the rewritten nodes can be inserted. Knowing it makes the rewriting faster. If not specified, the optimizer automatically determines the position of the new nodes.