yobx.xoptim.patterns_ort.relative_position_bias#
- class yobx.xoptim.patterns_ort.relative_position_bias.GatedRelativePositionBiasPattern(verbose: int = 0, priority: int = 2)[source]#
Implements the fusion of gated relative position bias computation (DeBERTa-v2/v3 style) into
com.microsoft.GatedRelativePositionBias. The fused pattern corresponds to the DeBERTa disentangled self-attention gating computation, which applies a learned sigmoid gate to modulate a pre-computed relative position bias tensor.
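The gating computation matched by this pattern can be sketched as a plain NumPy reference (a hypothetical re-implementation of the subgraph below, not the library's code; shapes and intermediate names follow the diagram):

```python
import numpy as np

def gated_relative_position_bias(query_layer, query_bias, rel_pos,
                                 gate_weight, gate_bias, eco_a, num_heads):
    """Reference sketch of the matched subgraph: Y = (2 + gate_u * (gate_r * eco_a - 1)) * rel_pos."""
    batch, seq_len, hidden = query_layer.shape
    head_size = hidden // num_heads
    # Add(query_layer, query_bias) -> Reshape -> Transpose(perm=[0,2,1,3])
    q = (query_layer + query_bias).reshape(batch, seq_len, num_heads, head_size)
    q = q.transpose(0, 2, 1, 3)                      # (batch, num_heads, seq_len, head_size)
    # MatMul(gate_weight) -> Add(gate_bias) -> Reshape(..., 2, D//2) -> ReduceSum(axis=-1)
    g = q @ gate_weight + gate_bias                  # (batch, num_heads, seq_len, D)
    D = g.shape[-1]
    g = g.reshape(batch, num_heads, seq_len, 2, D // 2).sum(axis=-1)
    g = 1.0 / (1.0 + np.exp(-g))                     # Sigmoid
    gate_u, gate_r = np.split(g, 2, axis=-1)         # Split(axis=-1)
    gate = gate_u * (gate_r * eco_a - 1.0) + 2.0     # Mul, Sub, Mul, Add chain
    return gate * rel_pos                            # broadcasts to (batch, num_heads, seq_len, seq_len)
```

With zero gate weights the sigmoid outputs 0.5 everywhere, so the result reduces to a constant scaling of `rel_pos`, which makes the formula easy to check by hand.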
Model with nodes to be fused:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_query_layer(["query_layer FLOAT(batch, seq_len, num_heads*head_size)"])
  I_rel_pos(["rel_pos FLOAT(1, num_heads, seq_len, seq_len)"])
  i_query_bias["query_bias FLOAT(num_heads*head_size)"]
  i_gate_weight["gate_weight FLOAT(head_size, D)"]
  i_gate_bias["gate_bias FLOAT(D)"]
  i_eco_a["eco_a FLOAT(1, num_heads, 1, 1)"]
  Add_0[["Add(query_layer, query_bias)"]]
  Reshape_1[["Reshape(., [batch, seq_len, num_heads, head_size])"]]
  Transpose_2[["Transpose(., perm=[0,2,1,3])"]]
  MatMul_3[["MatMul(., gate_weight)"]]
  Add_4[["Add(., gate_bias)"]]
  Reshape_5[["Reshape(., [batch, num_heads, seq_len, 2, D//2])"]]
  ReduceSum_6[["ReduceSum(., axis=-1, keepdims=0)"]]
  Sigmoid_7[["Sigmoid(.)"]]
  Split_8[["Split(., axis=-1)"]]
  Mul_9[["Mul(gate_r, eco_a)"]]
  Sub_10[["Sub(., 1.0)"]]
  Mul_11[["Mul(gate_u, .)"]]
  Add_12[["Add(., 2.0)"]]
  Mul_13[["Mul(gate_u_1, rel_pos)"]]
  I_query_layer -->|"FLOAT(batch, seq_len, num_heads*head_size)"| Add_0
  i_query_bias -->|"FLOAT(num_heads*head_size)"| Add_0
  Add_0 --> Reshape_1
  Reshape_1 --> Transpose_2
  Transpose_2 --> MatMul_3
  i_gate_weight -->|"FLOAT(head_size, D)"| MatMul_3
  MatMul_3 --> Add_4
  i_gate_bias -->|"FLOAT(D)"| Add_4
  Add_4 --> Reshape_5
  Reshape_5 --> ReduceSum_6
  ReduceSum_6 --> Sigmoid_7
  Sigmoid_7 --> Split_8
  Split_8 -->|"gate_r"| Mul_9
  i_eco_a -->|"FLOAT(1, num_heads, 1, 1)"| Mul_9
  Mul_9 --> Sub_10
  Split_8 -->|"gate_u"| Mul_11
  Sub_10 --> Mul_11
  Mul_11 --> Add_12
  Add_12 --> Mul_13
  I_rel_pos -->|"FLOAT(1, num_heads, seq_len, seq_len)"| Mul_13
  O_Y(["Y FLOAT(batch, num_heads, seq_len, seq_len)"])
  Mul_13 --> O_Y
  class I_query_layer,I_rel_pos,O_Y ioNode
  class i_query_bias,i_gate_weight,i_gate_bias,i_eco_a initNode
  class Add_0,Reshape_1,Transpose_2,MatMul_3,Add_4 opNode
  class Reshape_5,ReduceSum_6,Sigmoid_7,Split_8 opNode
  class Mul_9,Sub_10,Mul_11,Add_12,Mul_13 opNode
Outcome of the fusion:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_query_layer(["query_layer FLOAT(batch, seq_len, num_heads*head_size)"])
  I_rel_pos(["rel_pos FLOAT(1, num_heads, seq_len, seq_len)"])
  i_query_bias["query_bias FLOAT(num_heads*head_size)"]
  i_gate_weight["gate_weight FLOAT(head_size, D)"]
  i_gate_bias["gate_bias FLOAT(D)"]
  i_eco_a["eco_a FLOAT(1, num_heads, 1, 1)"]
  GatedRPB_0[["com.microsoft.GatedRelativePositionBias(., ., ., ., ., .)"]]
  I_query_layer -->|"FLOAT(batch, seq_len, num_heads*head_size)"| GatedRPB_0
  i_query_bias -->|"FLOAT(num_heads*head_size)"| GatedRPB_0
  I_rel_pos -->|"FLOAT(1, num_heads, seq_len, seq_len)"| GatedRPB_0
  i_gate_weight -->|"FLOAT(head_size, D)"| GatedRPB_0
  i_gate_bias -->|"FLOAT(D)"| GatedRPB_0
  i_eco_a -->|"FLOAT(1, num_heads, 1, 1)"| GatedRPB_0
  O_Y(["Y FLOAT(batch, num_heads, seq_len, seq_len)"])
  GatedRPB_0 --> O_Y
  class I_query_layer,I_rel_pos,O_Y ioNode
  class i_query_bias,i_gate_weight,i_gate_bias,i_eco_a initNode
  class GatedRPB_0 opNode
- apply(g: GraphBuilder, add_query_bias_node: NodeProto, reshape_1_node: NodeProto, transpose_node: NodeProto, matmul_node: NodeProto, add_gate_bias_node: NodeProto, reshape_2_node: NodeProto, reduce_sum_node: NodeProto, sigmoid_node: NodeProto, split_node: NodeProto, mul_eco_a_node: NodeProto, sub_node: NodeProto, mul_gate_u_node: NodeProto, add_two_node: NodeProto, final_mul_node: NodeProto) List[NodeProto][source]#
The method does the rewriting. It assumes it can happen. It takes the list of nodes impacted by the rewriting and assumes no other pattern optimizer will modify them. It receives the list of nodes returned by method match. Since it is a list of arguments, method match can include None values. The method returns the new nodes. The optimizer considers that any node given to this function is removed from the graph, and any node returned by it is added. If a received node must be kept, it must be added to the list of returned nodes.
- Parameters:
nodes – nodes returned by method match; they are then removed
- Returns:
nodes to add to the graph.
- match(g: GraphBuilderPatternOptimization, node: NodeProto, matched: List[MatchResult]) MatchResult | None[source]#
Determines nodes around node which can be rewritten.
- Parameters:
g – a
GraphBuilderPatternOptimization; it holds all the existing nodes and can return any information about types, shapes, and the nodes before or after another one.
node – the matching must determine whether some nodes around this one are part of the set of nodes this pattern optimizer can rewrite. From there, the function explores wherever it needs, checking any condition it needs.
matched – usually unused; it contains the nodes already matching a pattern
The method must not modify the graph. The method returns None if no match is found or an instance of class
MatchResult. It must contain:
- a list of nodes involved in the rewriting; it does not mean all of them will be removed, but all of them are needed to do the rewriting and must not be impacted by another pattern optimizer,
- a function doing the rewriting (usually method apply of the pattern class),
- an existing node where the rewritten nodes can be inserted; knowing it makes the rewriting faster. If not specified, the optimizer will automatically determine the position of the new nodes.
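The match/apply contract described above can be illustrated with a self-contained toy example. Everything here is a simplified stand-in (the real GraphBuilder, MatchResult, and NodeProto classes have richer interfaces); the toy pattern fuses an Add followed by a Mul into a single hypothetical FusedAddMul node:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    """Stand-in for NodeProto: an operator with named inputs and outputs."""
    op: str
    inputs: List[str]
    outputs: List[str]

@dataclass
class MatchResult:
    """Stand-in for the real MatchResult: nodes involved + the rewriting function."""
    nodes: List[Node]
    apply: Callable[..., List[Node]]

class ToyAddMulPattern:
    """Fuses Add -> Mul into one FusedAddMul node (illustrative only)."""

    def match(self, graph: List[Node], node: Node) -> Optional[MatchResult]:
        # match only inspects the graph; it must not modify it.
        if node.op != "Add":
            return None
        consumers = [n for n in graph if node.outputs[0] in n.inputs]
        if len(consumers) != 1 or consumers[0].op != "Mul":
            return None
        return MatchResult([node, consumers[0]], self.apply)

    def apply(self, add_node: Node, mul_node: Node) -> List[Node]:
        # Matched nodes are removed from the graph; returned nodes are added.
        other = [i for i in mul_node.inputs if i != add_node.outputs[0]]
        return [Node("FusedAddMul", add_node.inputs + other, mul_node.outputs)]
```

The Add node is only fused when it has exactly one consumer: if its output fed another node as well, removing it would break that consumer, which is why match checks the consumer count before returning a MatchResult.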
- class yobx.xoptim.patterns_ort.relative_position_bias.RelativePositionBiasPattern(verbose: int = 0, priority: int = 2)[source]#
Fuses the relative position bias computation (T5-style, encoder) into
com.microsoft.RelativePositionBias. The fused pattern corresponds to the T5 bidirectional relative attention bias computation, recognizable by a
Gather node reading from a learnable bias table, whose indices are computed through a bucketing function of absolute relative positions.
Model with nodes to be fused:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_seq_len(["seq_len INT64()"])
  i_bias_table["bias_table FLOAT(num_buckets, num_heads)"]
  i_zero["zero INT64()"]
  i_one["one INT64()"]
  i_max_exact_int["max_exact INT64()"]
  i_max_exact_float["max_exact FLOAT()"]
  i_log_max["log_max FLOAT()"]
  i_scale["scale FLOAT()"]
  i_clamp["clamp_val INT64()"]
  Range_0[["Range(zero, seq_len, one)"]]
  Unsqueeze_1[["Unsqueeze(., [0])"]]
  Unsqueeze_2[["Unsqueeze(., [1])"]]
  Sub_3[["Sub(., .)"]]
  Abs_4[["Abs(.)"]]
  CastFloat_5[["Cast(., to=FLOAT)"]]
  Div_6[["Div(., max_exact_float)"]]
  Log_7[["Log(.)"]]
  Div_8[["Div(., log_max)"]]
  Mul_9[["Mul(., scale)"]]
  CastInt_10[["Cast(., to=INT64)"]]
  Add_11[["Add(., max_exact_int)"]]
  Shape_12[["Shape(.)"]]
  ConstantOfShape_13[["ConstantOfShape(., clamp_val)"]]
  Min_14[["Min(., .)"]]
  CastInt2_15[["Cast(., to=INT64)"]]
  Where_16[["Where(., ., .)"]]
  Gather_17[["Gather(bias_table, .)"]]
  Transpose_18[["Transpose(., perm=[2,0,1])"]]
  Unsqueeze_19[["Unsqueeze(., [0])"]]
  I_seq_len -->|"INT64()"| Range_0
  Range_0 -->|"INT64(seq_len)"| Unsqueeze_1
  Range_0 -->|"INT64(seq_len)"| Unsqueeze_2
  Unsqueeze_1 -->|"INT64(1, seq_len)"| Sub_3
  Unsqueeze_2 -->|"INT64(seq_len, 1)"| Sub_3
  Sub_3 -->|"INT64(seq_len, seq_len)"| Abs_4
  Abs_4 -->|"INT64(seq_len, seq_len)"| CastFloat_5
  Abs_4 -->|"INT64(seq_len, seq_len)"| CastInt2_15
  CastFloat_5 -->|"FLOAT(seq_len, seq_len)"| Div_6
  Div_6 -->|"FLOAT(seq_len, seq_len)"| Log_7
  Log_7 -->|"FLOAT(seq_len, seq_len)"| Div_8
  Div_8 -->|"FLOAT(seq_len, seq_len)"| Mul_9
  Mul_9 -->|"FLOAT(seq_len, seq_len)"| CastInt_10
  CastInt_10 -->|"INT64(seq_len, seq_len)"| Add_11
  Add_11 -->|"INT64(seq_len, seq_len)"| Shape_12
  Add_11 -->|"INT64(seq_len, seq_len)"| Min_14
  Shape_12 -->|"INT64(2)"| ConstantOfShape_13
  ConstantOfShape_13 -->|"INT64(seq_len, seq_len)"| Min_14
  CastInt2_15 -->|"INT64(seq_len, seq_len)"| Where_16
  Min_14 -->|"INT64(seq_len, seq_len)"| Where_16
  i_bias_table -->|"FLOAT(num_buckets, num_heads)"| Gather_17
  Where_16 -->|"INT64(seq_len, seq_len)"| Gather_17
  Gather_17 -->|"FLOAT(seq_len, seq_len, num_heads)"| Transpose_18
  Transpose_18 -->|"FLOAT(num_heads, seq_len, seq_len)"| Unsqueeze_19
  O_Y(["Y FLOAT(1, num_heads, seq_len, seq_len)"])
  Unsqueeze_19 --> O_Y
  class I_seq_len,O_Y ioNode
  class i_bias_table,i_zero,i_one,i_max_exact_int,i_max_exact_float initNode
  class i_log_max,i_scale,i_clamp initNode
  class Range_0,Unsqueeze_1,Unsqueeze_2,Sub_3,Abs_4 opNode
  class CastFloat_5,Div_6,Log_7,Div_8,Mul_9,CastInt_10 opNode
  class Add_11,Shape_12,ConstantOfShape_13,Min_14 opNode
  class CastInt2_15,Where_16,Gather_17,Transpose_18,Unsqueeze_19 opNode
Outcome of the fusion:
graph TD
  classDef ioNode fill:#dfd,stroke:#333,color:#333
  classDef initNode fill:#cccc00,stroke:#333,color:#333
  classDef constNode fill:#f9f,stroke:#333,stroke-width:2px,color:#333
  classDef opNode fill:#bbf,stroke:#333,stroke-width:2px,color:#333
  I_seq_len(["seq_len INT64()"])
  i_bias_table_t["bias_table_T FLOAT(num_heads, num_buckets)"]
  RelativePositionBias_0[["com.microsoft.RelativePositionBias(., ., .)"]]
  i_bias_table_t -->|"FLOAT(num_heads, num_buckets)"| RelativePositionBias_0
  I_seq_len -->|"INT64()"| RelativePositionBias_0
  I_seq_len -->|"INT64()"| RelativePositionBias_0
  O_Y(["Y FLOAT(1, num_heads, seq_len, seq_len)"])
  RelativePositionBias_0 --> O_Y
  class I_seq_len,O_Y ioNode
  class i_bias_table_t initNode
  class RelativePositionBias_0 opNode
- apply(g: GraphBuilder, cast_float_node: NodeProto, div_pos_node: NodeProto, log_node: NodeProto, div_log_node: NodeProto, mul_node: NodeProto, cast_int_bucket_node: NodeProto, add_bucket_node: NodeProto, shape_node: NodeProto, const_of_shape_node: NodeProto, min_node: NodeProto, where_node: NodeProto, gather_node: NodeProto, transpose_node: NodeProto, unsqueeze_batch_node: NodeProto) List[NodeProto][source]#
The method does the rewriting. It assumes it can happen. It takes the list of nodes impacted by the rewriting and assumes no other pattern optimizer will modify them. It receives the list of nodes returned by method match. Since it is a list of arguments, method match can include None values. The method returns the new nodes. The optimizer considers that any node given to this function is removed from the graph, and any node returned by it is added. If a received node must be kept, it must be added to the list of returned nodes.
- Parameters:
nodes – nodes returned by method match; they are then removed
- Returns:
nodes to add to the graph.
- match(g: GraphBuilderPatternOptimization, node: NodeProto, matched: List[MatchResult]) MatchResult | None[source]#
Determines nodes around node which can be rewritten.
- Parameters:
g – a
GraphBuilderPatternOptimization; it holds all the existing nodes and can return any information about types, shapes, and the nodes before or after another one.
node – the matching must determine whether some nodes around this one are part of the set of nodes this pattern optimizer can rewrite. From there, the function explores wherever it needs, checking any condition it needs.
matched – usually unused; it contains the nodes already matching a pattern
The method must not modify the graph. The method returns None if no match is found or an instance of class
MatchResult. It must contain:
- a list of nodes involved in the rewriting; it does not mean all of them will be removed, but all of them are needed to do the rewriting and must not be impacted by another pattern optimizer,
- a function doing the rewriting (usually method apply of the pattern class),
- an existing node where the rewritten nodes can be inserted; knowing it makes the rewriting faster. If not specified, the optimizer will automatically determine the position of the new nodes.
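The bucketing computation that RelativePositionBiasPattern recognizes can be sketched in NumPy. This is a hypothetical reference following the standard T5 bidirectional bucketing over absolute relative positions; the constants (max_exact, max_distance, the clamp value) are baked into the graph as initializers, so their names here are illustrative:

```python
import numpy as np

def t5_relative_position_bias(bias_table, seq_len, max_exact, max_distance):
    """Sketch of the matched subgraph; bias_table is FLOAT(num_buckets, num_heads)."""
    num_buckets, num_heads = bias_table.shape
    pos = np.arange(seq_len)
    # Range -> Unsqueeze([0]) / Unsqueeze([1]) -> Sub -> Abs
    rel = np.abs(pos[None, :] - pos[:, None])        # (seq_len, seq_len)
    # Logarithmic bucket index for positions beyond max_exact
    large = max_exact + (
        np.log(rel.clip(min=1) / max_exact)
        / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(np.int64)
    large = np.minimum(large, num_buckets - 1)       # Min with the clamp value
    buckets = np.where(rel < max_exact, rel, large)  # exact buckets for small distances
    bias = bias_table[buckets]                       # Gather -> (seq_len, seq_len, num_heads)
    # Transpose(perm=[2,0,1]) + Unsqueeze([0]) -> (1, num_heads, seq_len, seq_len)
    return bias.transpose(2, 0, 1)[None]
```

Distances below max_exact map to their own bucket one-to-one; larger distances share logarithmically spaced buckets, clamped to num_buckets - 1, which is what the Log/Div/Mul/Min chain in the matched graph computes.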