onnx_diagnostic.torch_export_patches.patches.patch_transformers¶
- class onnx_diagnostic.torch_export_patches.patches.patch_transformers.patched_AttentionMaskConverter[source]¶
Patches transformers.modeling_attn_mask_utils.AttentionMaskConverter._make_causal_mask.
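The patched helper builds the causal (lower-triangular) attention mask. Below is a minimal sketch of what such a mask computes, not the patched implementation itself; the sequence length and fill value are illustrative.

    import torch

    # Minimal causal-mask sketch: position i may attend only to positions <= i.
    # Future positions receive a large negative value so softmax zeroes them out.
    seq_len = 4
    mask = torch.full((seq_len, seq_len), torch.finfo(torch.float32).min)
    mask = torch.triu(mask, diagonal=1)  # keep only the strict upper triangle masked
    # mask[i, j] == 0 for j <= i, and a large negative number for j > i.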
- class onnx_diagnostic.torch_export_patches.patches.patch_transformers.patched_DynamicCache[source]¶
Applies modifications implemented in PR transformers/#36652.
- crop(max_length: int)[source]¶
Crop the past key values up to a new max_length in terms of tokens. max_length can also be negative to remove max_length tokens. This is used in assisted decoding and contrastive search.
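A usage sketch, assuming the stock transformers.cache_utils.DynamicCache API; the tensor shapes (batch=1, heads=2, seq=8, head_dim=4) are illustrative.

    import torch
    from transformers.cache_utils import DynamicCache

    cache = DynamicCache()
    cache.update(torch.randn(1, 2, 8, 4), torch.randn(1, 2, 8, 4), layer_idx=0)

    cache.crop(5)                  # keep only the first 5 cached tokens
    print(cache.get_seq_length())  # 5
    cache.crop(-2)                 # negative value: drop the last 2 tokens instead
    print(cache.get_seq_length())  # 3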
- classmethod from_batch_splits(splits: List[DynamicCache]) → DynamicCache[source]¶
This is the inverse of the batch_split() method: it merges a list of per-split caches back into a single cache. It is used by stack_model_outputs in generation.utils.
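A round-trip sketch, assuming batch_split(full_batch_size, split_size) from the same transformers version (its exact signature varies across releases):

    import torch
    from transformers.cache_utils import DynamicCache

    cache = DynamicCache()
    cache.update(torch.randn(4, 2, 6, 8), torch.randn(4, 2, 6, 8), layer_idx=0)

    # Split the batch of 4 sequences into two caches of batch size 2 ...
    splits = cache.batch_split(4, 2)
    # ... and merge them back into one cache along the batch dimension.
    merged = DynamicCache.from_batch_splits(splits)
    print(merged.get_seq_length())  # 6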
- get_seq_length(layer_idx: int | None = 0) → int[source]¶
Returns the sequence length of the cached states. A layer index can be optionally passed.
- reorder_cache(beam_idx: LongTensor)[source]¶
Reorders the cache for beam search, given the selected beam indices.
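A sketch, assuming the batch dimension of the cached states indexes beams; reorder_cache gathers along that dimension.

    import torch
    from transformers.cache_utils import DynamicCache

    cache = DynamicCache()
    cache.update(torch.randn(3, 2, 5, 8), torch.randn(3, 2, 5, 8), layer_idx=0)

    # Beam search kept beam 2 twice and beam 0 once; reorder the cache to match.
    beam_idx = torch.tensor([2, 2, 0])
    cache.reorder_cache(beam_idx)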
- update(key_states: Tensor, value_states: Tensor, layer_idx: int, cache_kwargs: Dict[str, Any] | None = None) → Tuple[Tensor, Tensor][source]¶
Updates the cache with the new key_states and value_states for layer layer_idx (a usage sketch follows the parameter list).
- Parameters:
- key_states (torch.Tensor):
The new key states to cache.
- value_states (torch.Tensor):
The new value states to cache.
- layer_idx (int):
The index of the layer to cache the states for.
- cache_kwargs (Dict[str, Any], optional):
Additional arguments for the cache subclass. No additional arguments are used in DynamicCache.
- Returns:
A tuple containing the updated key and value states.
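A usage sketch, assuming the stock DynamicCache API; the (batch, heads, seq, head_dim) shapes are illustrative.

    import torch
    from transformers.cache_utils import DynamicCache

    cache = DynamicCache()

    # Prefill: cache 6 tokens for layer 0; update returns the full cached states.
    keys, values = cache.update(torch.randn(1, 2, 6, 8), torch.randn(1, 2, 6, 8), layer_idx=0)
    print(keys.shape)              # torch.Size([1, 2, 6, 8])

    # Decode step: append one token; the new states are concatenated to the cache.
    keys, values = cache.update(torch.randn(1, 2, 1, 8), torch.randn(1, 2, 1, 8), layer_idx=0)
    print(keys.shape)              # torch.Size([1, 2, 7, 8])
    print(cache.get_seq_length())  # 7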
- class onnx_diagnostic.torch_export_patches.patches.patch_transformers.patched_GenerationMixin[source]¶
Applies modifications implemented in PR transformers/#36652.
- prepare_inputs_for_generation(input_ids: LongTensor, past_key_values: Cache | None = None, attention_mask: LongTensor | None = None, inputs_embeds: FloatTensor | None = None, cache_position: LongTensor | None = None, **kwargs)[source]¶
Prepare the model inputs for generation. It includes operations like computing the 4D attention mask or slicing inputs given the existing cache.
See the forward pass in the model documentation for expected arguments (different models might have different requirements for e.g. past_key_values). This function should work as is for most LLMs.
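A rough sketch of how generation exercises this method. The model name is only an example, and the exact slicing behavior (driven by cache_position) varies across transformers versions.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")  # example tiny causal LM
    model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

    # Prefill once to obtain a cache.
    enc = tok("hello world", return_tensors="pt")
    past = model(**enc, use_cache=True).past_key_values

    # Simulate one decode step: append a (hypothetical) next token id.
    next_token = torch.tensor([[42]])
    input_ids = torch.cat([enc["input_ids"], next_token], dim=-1)
    attention_mask = torch.cat([enc["attention_mask"], torch.ones_like(next_token)], dim=-1)
    cache_position = torch.tensor([enc["input_ids"].shape[1]])  # index of the new token

    model_inputs = model.prepare_inputs_for_generation(
        input_ids=input_ids,
        past_key_values=past,
        attention_mask=attention_mask,
        cache_position=cache_position,
    )
    # With a cache present, only the new token is kept; the rest comes from the cache.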