This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## 1. Model Collections
For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
## 3. Run Tests

Below are two methods for running the tests: using the Speed Benchmark tool, or using scripts.
### Method 1: Testing with the Speed Benchmark Tool

Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope) for testing. It supports automatic downloading of models from ModelScope and outputs the test results. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).

### Method 2: Testing with Scripts

#### 3.1 Inference using HuggingFace Transformers

- Using the HuggingFace hub
Parameters:

`--model_id_or_path`: Model ID or local path; for valid values, see the `Model Collections` section.
`--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Speed Benchmark` report for specifics.
`--generate_length`: Number of tokens to generate; default is 2048.
`--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`.
`--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace.
`--outputs_dir`: Output directory; default is `outputs/transformers`.
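As a sketch only (the repository's actual benchmark script may differ), the options above map onto an `argparse` interface like this; the function name and defaults not stated in the list are assumptions:

```python
import argparse


def build_transformers_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI mirroring the documented Transformers benchmark options."""
    parser = argparse.ArgumentParser(description="Qwen2.5 speed benchmark (Transformers)")
    parser.add_argument("--model_id_or_path", required=True,
                        help="Model ID or local path; see the Model Collections section")
    parser.add_argument("--context_length", type=int,
                        choices=[1, 6144, 14336, 30720, 63488, 129024],
                        help="Input length in tokens")
    parser.add_argument("--generate_length", type=int, default=2048,
                        help="Number of tokens to generate")
    parser.add_argument("--gpus", type=str, default="0",
                        help="Sets CUDA_VISIBLE_DEVICES, e.g. '0,1,2,3'")
    parser.add_argument("--use_modelscope", action="store_true",
                        help="Load the model from ModelScope instead of HuggingFace")
    parser.add_argument("--outputs_dir", type=str, default="outputs/transformers",
                        help="Directory for test results")
    return parser
```

This makes the documented defaults (`generate_length=2048`, `outputs_dir=outputs/transformers`) and the restricted `context_length` choices explicit.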
#### 3.2 Inference using vLLM

- Using the HuggingFace hub

Parameters:

`--model_id_or_path`: The model ID on ModelScope or the HuggingFace hub, or a local path.
`--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Speed Benchmark` report for specifics.
`--generate_length`: Output length in tokens; default is 2048.
`--max_model_len`: Maximum model length in tokens; default is 32768. Optional values are 4096, 8192, 32768, 65536, 131072.
`--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`.
`--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace.
`--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
`--outputs_dir`: Output directory; default is `outputs/vllm`.
`--enforce_eager`: Whether to enforce eager mode; default is False.
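The vLLM options extend the Transformers ones with a length cap and a memory fraction that must lie in (0, 1]. A sketch of how that validation could look (function names are assumptions, not the repository's actual code):

```python
import argparse


def gpu_fraction(value: str) -> float:
    """Validate --gpu_memory_utilization: must lie in the half-open range (0, 1]."""
    f = float(value)
    if not 0.0 < f <= 1.0:
        raise argparse.ArgumentTypeError("gpu_memory_utilization must be in (0, 1]")
    return f


def build_vllm_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI mirroring the documented vLLM benchmark options."""
    parser = argparse.ArgumentParser(description="Qwen2.5 speed benchmark (vLLM)")
    parser.add_argument("--model_id_or_path", required=True)
    parser.add_argument("--context_length", type=int,
                        choices=[1, 6144, 14336, 30720, 63488, 129024])
    parser.add_argument("--generate_length", type=int, default=2048)
    parser.add_argument("--max_model_len", type=int, default=32768,
                        choices=[4096, 8192, 32768, 65536, 131072])
    parser.add_argument("--gpus", type=str, default="0")
    parser.add_argument("--use_modelscope", action="store_true")
    parser.add_argument("--gpu_memory_utilization", type=gpu_fraction, default=0.9)
    parser.add_argument("--outputs_dir", type=str, default="outputs/vllm")
    parser.add_argument("--enforce_eager", action="store_true")
    return parser
```

With this shape, an out-of-range value such as `--gpu_memory_utilization 1.5` is rejected at parse time rather than failing later inside vLLM.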
## 4. Test Results
Test results can be found in the `outputs` directory, which by default contains two subfolders, `transformers` and `vllm`, storing the results for HuggingFace Transformers and vLLM respectively.
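A small helper for gathering the per-backend result files after a run might look like this (the file format inside each folder is not specified here, so this only lists filenames; the function name is an assumption):

```python
from pathlib import Path


def collect_result_files(outputs_dir: str = "outputs") -> dict:
    """List result files in the default `transformers/` and `vllm/` subfolders.

    Returns a mapping of backend name -> sorted list of filenames;
    a missing subfolder yields an empty list.
    """
    results = {}
    for backend in ("transformers", "vllm"):
        d = Path(outputs_dir) / backend
        if d.is_dir():
            results[backend] = sorted(p.name for p in d.iterdir() if p.is_file())
        else:
            results[backend] = []
    return results
```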
## Notes
1. Run each test multiple times and take the average; 3 runs is typical.
2. Ensure the GPU is idle before testing to avoid interference from other tasks.
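The averaging in note 1 is a simple arithmetic mean over the measured throughput of the repeated runs, e.g.:

```python
def mean_speed(tokens_per_second: list) -> float:
    """Average tokens/sec across repeated benchmark runs (3 runs is typical)."""
    if not tokens_per_second:
        raise ValueError("need at least one measurement")
    return sum(tokens_per_second) / len(tokens_per_second)
```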