# Speed Benchmark

This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## 1. Model Collections

For models hosted on HuggingFace, refer to [Qwen2.5 Model - HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, refer to [Qwen2.5 Model - ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).

## 2. Environment Setup

For inference using HuggingFace transformers:
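A minimal sketch of the transformers environment setup, assuming a dedicated conda environment and a `requirements-perf-transformers.txt` file analogous to the vLLM one (the environment name, Python version, and requirements file name here are assumptions for illustration):

```shell
# Create and activate a dedicated environment for the transformers backend
# (environment name and Python version are assumed, not taken from this document)
conda create -n qwen_perf_transformers python=3.10 -y
conda activate qwen_perf_transformers

# Install the benchmark dependencies
# (file name assumed by analogy with requirements-perf-vllm.txt)
pip install -r requirements-perf-transformers.txt
```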
For inference using vLLM:

```shell
conda activate qwen_perf_vllm
pip install -r requirements-perf-vllm.txt
```
## 3. Execute Tests
Below are two ways to run the tests: with the Speed Benchmark tool (Method 1) or with the provided scripts (Method 2).

### Method 1: Testing with the Speed Benchmark Tool

Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope). It supports automatic downloading of models from ModelScope and reports the test results; refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html) for details.
**Install Dependencies**
```shell
pip install 'evalscope[perf]' -U
```
#### HuggingFace Transformers Inference

Execute the command as follows:

```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--attn-implementation flash_attention_2 \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local \
--dataset speed_benchmark
```
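Since the benchmark also covers quantized models, the same command can point at a quantized checkpoint simply by changing the model ID. A sketch follows; the GPTQ-Int4 model ID is an assumed example based on the Qwen2.5 collection's naming, not a value taken from this document:

```shell
# Same local transformers backend, but benchmarking a quantized checkpoint
# (the exact model ID below is an assumed example)
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark
```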
#### vLLM Inference
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--log-every-n-query 1 \
--connect-timeout 60000 \
--read-timeout 60000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local_vllm \
--dataset speed_benchmark
```

#### Parameter Explanation
- `--parallel`: number of worker threads for concurrent requests; keep this fixed at 1.
- `--model`: model file path or model ID; models can be downloaded automatically from ModelScope, e.g. `Qwen/Qwen2.5-0.5B-Instruct`.
- `--attn-implementation`: attention implementation; one of `flash_attention_2`, `eager`, or `sdpa`.
- `--log-every-n-query`: log progress every n requests.
- `--connect-timeout`: connection timeout in seconds.
- `--read-timeout`: read timeout in seconds.
- `--max-tokens`: maximum output length in tokens.
- `--min-tokens`: minimum output length in tokens; setting both to 2048 forces a fixed output length of 2048 tokens.
- `--api`: inference interface; for local inference, choose `local` (transformers) or `local_vllm` (vLLM).
- `--dataset`: test dataset; choose `speed_benchmark` or `speed_benchmark_long` (see the long-context example after this list).
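For reference, a long-context run only swaps the dataset. A minimal sketch, reusing the transformers command above with the `speed_benchmark_long` dataset named in the parameter list:

```shell
# Same local transformers backend as above, but with the long-context dataset
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark_long
```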
#### Test Results

Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
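To take a quick look at a result file, something like the following works; the timestamped path below is illustrative, so substitute the directory that the run actually created:

```shell
# List result files produced by benchmark runs
ls outputs/*/*/speed_benchmark.json

# Pretty-print one result file for inspection (path is an assumed example)
python -m json.tool outputs/Qwen2.5-0.5B-Instruct/20240101_000000/speed_benchmark.json
```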
### Method 2: Testing with Scripts

#### HuggingFace Transformers Inference

- Using HuggingFace Hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers

# Specify HF_ENDPOINT
HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Using ModelScope Hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```
Parameter Explanation (a combined example follows the list):

- `--model_id_or_path`: Model ID or local path; for available models, see the `Model Collections` section.
- `--context_length`: Input length in tokens; valid values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark` report linked above for details.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
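As an illustration of combining several of these flags, a longer-context, multi-GPU run might look like the sketch below; the model choice, GPU count, and context length are example values, not recommendations from this document:

```shell
# 6144-token input, 2048-token output, ModelScope download, two GPUs
# (model ID and GPU count are illustrative assumptions)
python speed_benchmark_transformers.py \
  --model_id_or_path Qwen/Qwen2.5-7B-Instruct \
  --context_length 6144 \
  --generate_length 2048 \
  --gpus 0,1 \
  --use_modelscope \
  --outputs_dir outputs/transformers
```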
#### vLLM Inference

- Using HuggingFace Hub
```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```
- Using ModelScope Hub
```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```
Parameter Explanation (a combined example follows the list):

- `--model_id_or_path`: Model ID or local path; for available models, see the `Model Collections` section.
- `--context_length`: Input length in tokens; valid values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark` report linked above for details.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--max_model_len`: Maximum model context length in tokens; default is 32768; typical values are 4096, 8192, 32768, 65536, and 131072.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization, in the range (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is `outputs/vllm`.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
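By way of illustration, a long-context vLLM run combining several of these flags might look like the sketch below; the model choice, two-GPU setup, and 30720-token input are example values, not prescribed by this document:

```shell
# 30720-token input with a 65536-token context window on two GPUs
# (model ID, GPU count, and max_model_len choice are illustrative assumptions)
python speed_benchmark_vllm.py \
  --model_id_or_path Qwen/Qwen2.5-7B-Instruct \
  --context_length 30720 \
  --generate_length 2048 \
  --max_model_len 65536 \
  --gpus 0,1 \
  --gpu_memory_utilization 0.9 \
  --outputs_dir outputs/vllm
```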
#### Test Results

Test results can be found in the `outputs` directory, which by default contains two folders, `transformers` and `vllm`, storing the test results for HuggingFace transformers and vLLM respectively.

## Notes

1. Run each test multiple times and take the average; three runs is a typical choice (see the sketch below).
2. Ensure the GPU is idle before testing to avoid interference from other tasks.
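A small sketch of how the repetition in note 1 might be automated, assuming the Method 2 transformers script and an idle GPU (checked with `nvidia-smi`):

```shell
# Confirm the GPU is idle before starting (note 2)
nvidia-smi

# Repeat the same configuration three times; average the reported numbers afterwards
for i in 1 2 3; do
  python speed_benchmark_transformers.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --gpus 0 \
    --outputs_dir outputs/transformers
done
```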