onnx_diagnostic.ci_models.export_qwen25_vl

Export visual and embedding parts of Qwen/Qwen2.5-VL-7B-Instruct

Requirements

git+https://github.com/sdpython/experimental-experiment.git  # optional
huggingface_hub>=1.2.1
onnx-diagnostic>=0.8.6
onnxruntime>=1.23
torch>=2.9  # weekly is better
tqdm
transformers>=4.57

Examples

python -m onnx_diagnostic.ci_models.export_qwen25_vl \
    -m Qwen/Qwen2.5-VL-7B-Instruct \
    --device cpu --dtype float32 --exporter onnx-dynamo --pretrained --second-input --zip

To choose a specific Attention schema:

QWEN25ATTENTION=LOOPMHA python -m onnx_diagnostic.ci_models.export_qwen25_vl \
    -m Qwen/Qwen2.5-VL-7B-Instruct \
    --device cpu --dtype float32 --exporter onnx-dynamo --pretrained --second-input --zip

Cheat sheet for tar commands. To create an archive:

tar -czvf model.tar.gz model.onnx model.data

To extract it:

tar -xzvf model.tar.gz

Rewritings

Attention

The attention is implemented either with MultiHeadAttention in a loop or with PackedMultiHeadAttention. The choice is made based on the device. It is possible to override this by setting the environment variable QWEN25ATTENTION to:

  • PACKED: PackedMultiHeadAttention

  • LOOPMHA: Loop over MultiHeadAttention

  • LOOPA23: Loop over Attention(23), needs opset 23+.
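The selection logic above can be sketched as follows. This is a minimal illustration, not the package's actual code: `pick_attention_schema` is a hypothetical helper, and the device-based default (packed kernel on CUDA, loop on CPU) is an assumption.

```python
import os

# Hypothetical sketch of the attention-schema selection described above.
# Assumption: CUDA devices default to PACKED, other devices to LOOPMHA.
def pick_attention_schema(device: str = "cpu") -> str:
    override = os.environ.get("QWEN25ATTENTION", "")
    if override in {"PACKED", "LOOPMHA", "LOOPA23"}:
        return override  # explicit override via the environment variable
    return "PACKED" if device.startswith("cuda") else "LOOPMHA"
```

Setting QWEN25ATTENTION=LOOPA23 would then take precedence over the device-based default, as in the command-line example above.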

onnx_diagnostic.ci_models.export_qwen25_vl.get_untrained_model(model_id: str, second_input: bool, verbose: int) → Dict[str, Any][source]

Returns an untrained model.

Parameters:
  • model_id – model id

  • second_input – if True, generates a second input set

  • verbose – verbosity

Returns:

model and data

onnx_diagnostic.ci_models.export_qwen25_vl.main(model_id: str = 'Qwen/Qwen2.5-VL-7B-Instruct', device: str = 'cpu', dtype: str = 'float32', exporter: str = 'onnx-dynamo', pretrained: bool = True, second_input: bool = True, make_zip: bool = False, output_folder: str = 'dump_models', existing_onnx: str | None = None, part: str = 'visual', atol: float = 0.01, mismatch01: float = 0.1, profile_exporter: bool = False)[source]

Exports the model Qwen/Qwen2.5-VL-7B-Instruct or pieces of it. The script also applies to other models based on the same architecture.

The function saves everything on disk. It does not generate new inputs on the second run but reuses the saved ones. The same goes for the expected outputs, which are also saved on disk.

Parameters:
  • model_id – model id

  • device – device

  • dtype – dtype

  • exporter – exporter to use

  • pretrained – loads pretrained weights; pretrained=False is usually used for testing

  • second_input – checks discrepancies on more examples

  • make_zip – creates a zip at the end

  • output_folder – output folder

  • part – "" to export the whole model, "visual" for the visual part, "embedding" for the embedding part

  • atol – raises an exception if the maximum absolute discrepancy is above this threshold

  • mismatch01 – raises an exception if the ratio of mismatches is above that threshold

  • profile_exporter – profiles the exporter
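The save-once, reuse-later behaviour described for main (inputs and expected outputs are generated on the first run and read back from disk afterwards) can be sketched like this. `load_or_create` is a hypothetical helper, not the package's actual implementation:

```python
import os
import pickle

# Hypothetical sketch: reuse data saved on disk by a previous run
# instead of generating it again, as main() does for inputs and
# expected outputs.
def load_or_create(path, create):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # second run: reuse the saved data
    data = create()
    with open(path, "wb") as f:
        pickle.dump(data, f)  # first run: save for subsequent runs
    return data
```

With this pattern, discrepancy checks on a second run compare against exactly the same inputs and expected outputs as the first one.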