diff --git a/examples/speed-benchmark/README.md b/examples/speed-benchmark/README.md
index 84987b7b2a85988622394fcf0179b84034008c9d..56fd0039cd65348c67b5043d78f71b37baad1b31 100644
--- a/examples/speed-benchmark/README.md
+++ b/examples/speed-benchmark/README.md
@@ -1,14 +1,14 @@
-## Speed Benchmark
+# Speed Benchmark
 
 This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
 
-### 1. Model Collections
+## 1. Model Collections
 
-For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
+For models hosted on HuggingFace, refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
 
-For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
+For models hosted on ModelScope, refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
 
-### 2. Environment Installation
+## 2. Environment Setup
 
 
 For inference using HuggingFace transformers:
@@ -38,69 +38,123 @@ conda activate qwen_perf_vllm
 pip install -r requirements-perf-vllm.txt
 ```
 
+## 3. Execute Tests
 
-### 3. Run Experiments
+Two testing methods are described below: using the Speed Benchmark tool or using the provided scripts.
 
-#### 3.1 Inference using HuggingFace Transformers
+### Method 1: Testing with the Speed Benchmark Tool
 
-- Use HuggingFace hub
+Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope) for testing. It supports automatically downloading models from ModelScope and reports the test results. For details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).
 
+**Install Dependencies**
 ```shell
-python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+pip install 'evalscope[perf]' -U
 ```
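+
+Optionally, verify that the package is installed before running the benchmark (a simple sanity check using pip itself):
+
+```shell
+pip show evalscope
+```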
 
-- Use ModelScope hub
+#### HuggingFace Transformers Inference
 
+Run the following command:
 ```shell
-python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+ --parallel 1 \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --attn-implementation flash_attention_2 \
+ --log-every-n-query 5 \
+ --connect-timeout 6000 \
+ --read-timeout 6000 \
+ --max-tokens 2048 \
+ --min-tokens 2048 \
+ --api local \
+ --dataset speed_benchmark
+```
+
+#### vLLM Inference
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+ --parallel 1 \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --log-every-n-query 1 \
+ --connect-timeout 60000 \
+ --read-timeout 60000 \
+ --max-tokens 2048 \
+ --min-tokens 2048 \
+ --api local_vllm \
+ --dataset speed_benchmark
 ```
 
-Parameters:
+#### Parameter Explanation
+- `--parallel`: number of workers for concurrent requests; keep this fixed at 1 for the speed benchmark.
+- `--model`: model file path or model ID; models can be downloaded automatically from ModelScope, e.g., Qwen/Qwen2.5-0.5B-Instruct.
+- `--attn-implementation`: attention implementation; optional values are flash_attention_2|eager|sdpa.
+- `--log-every-n-query`: log progress every n requests.
+- `--connect-timeout`: connection timeout in seconds.
+- `--read-timeout`: read timeout in seconds.
+- `--max-tokens`: maximum output length in tokens.
+- `--min-tokens`: minimum output length in tokens; setting both to 2048 fixes the output length at 2048 tokens.
+- `--api`: inference interface; for local inference the options are local|local_vllm.
+- `--dataset`: test dataset; options are speed_benchmark|speed_benchmark_long (see the combined example below).
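+
+For example, the long-context dataset can be benchmarked with the vLLM backend by reusing the flags above. This is only a sketch; the timeout values are illustrative and every flag is one documented above.
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+ --parallel 1 \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --log-every-n-query 1 \
+ --connect-timeout 60000 \
+ --read-timeout 60000 \
+ --max-tokens 2048 \
+ --min-tokens 2048 \
+ --api local_vllm \
+ --dataset speed_benchmark_long
+```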
 
-    `--model_id_or_path`: The model path or id on ModelScope or HuggingFace hub
-    `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; Refer to the `Qwen2.5 SpeedBenchmark`.  
-    `--generate_length`: Output length in tokens; default is 2048.
-    `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES.  e.g. `0,1,2,3`, `4,5`  
-    `--use_modelscope`: Use ModelScope when set this flag. Otherwise, use HuggingFace.  
-    `--outputs_dir`: Output directory; default is outputs/transformers.  
+#### Test Results
 
+Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
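+
+To take a quick look at a run, the JSON file can be pretty-printed with Python's standard library. The path below is only illustrative; the actual `{model_name}` and `{timestamp}` directories are created at run time.
+
+```shell
+# Pretty-print the benchmark results; replace the path with the one produced by your run.
+python -m json.tool outputs/Qwen2.5-0.5B-Instruct/20250101_120000/speed_benchmark.json
+```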
 
-#### 3.2 Inference using vLLM
+### Method 2: Testing with Scripts
 
-- Use HuggingFace hub
+#### HuggingFace Transformers Inference
+
+- Using HuggingFace Hub
 
 ```shell
-python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
 ```
 
-
-- Use ModelScope hub
+- Using ModelScope Hub
 
 ```shell
-python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
 ```
 
+Parameter Explanation (a combined example follows the list):
+
+    `--model_id_or_path`: Model ID or local path; see the `Model Collections` section above for available models  
+    `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Speed Benchmark` report for details  
+    `--generate_length`: Number of tokens to generate; default is 2048  
+    `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`  
+    `--use_modelscope`: If set, load the model from ModelScope; otherwise, use HuggingFace  
+    `--outputs_dir`: Output directory; default is `outputs/transformers`  
+
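+For example, the parameters above can be combined to benchmark a longer input on two GPUs. This is only a sketch; the model ID, context length, and GPU list are illustrative.
+
+```shell
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-7B-Instruct --context_length 6144 --gpus 0,1 --generate_length 2048 --outputs_dir outputs/transformers
+```
+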
+#### vLLM Inference
 
-Parameters:
+- Using HuggingFace Hub
+
+```shell
+python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+```
 
-    `--model_id_or_path`: The model id on ModelScope or HuggingFace hub.
-    `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; Refer to the `Qwen2.5 SpeedBenchmark`.  
-    `--generate_length`: Output length in tokens; default is 2048.
-    `--max_model_len`: Maximum model length in tokens; default is 32768. Optional values are 4096, 8192, 32768, 65536, 131072.
-    `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES.  e.g. `0,1,2,3`, `4,5`  
-    `--use_modelscope`: Use ModelScope when set this flag. Otherwise, use HuggingFace.  
-    `--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.  
-    `--outputs_dir`: Output directory; default is outputs/vllm.  
-    `--enforce_eager`: Whether to enforce eager mode; default is False.  
+- Using ModelScope Hub
 
+```shell
+python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+```
 
+Parameter Explanation (see the combined example after the list):
 
-#### 3.3 Tips
+    `--model_id_or_path`: Model ID or local path; see the `Model Collections` section above for available models  
+    `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Speed Benchmark` report for details  
+    `--generate_length`: Number of tokens to generate; default is 2048  
+    `--max_model_len`: Maximum model length in tokens; default is 32768; optional values are 4096, 8192, 32768, 65536, 131072  
+    `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`  
+    `--use_modelscope`: If set, load the model from ModelScope; otherwise, use HuggingFace  
+    `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9  
+    `--outputs_dir`: Output directory; default is `outputs/vllm`  
+    `--enforce_eager`: Whether to enforce eager mode; default is False  
 
-- Run multiple experiments and compute the average result; a typical number is 3 times.
-- Make sure the GPU is idle before running experiments.
+#### Test Results
 
+Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
 
-### 4. Results
+## Notes
 
-Please check the `outputs` directory, which includes two directories by default: `transformers` and `vllm`, containing the experiments results for HuggingFace transformers and vLLM, respectively.
+1. Run each test multiple times and take the average; three runs is typical.
+2. Ensure the GPU is idle before testing so that other tasks do not affect the results.
\ No newline at end of file
diff --git a/examples/speed-benchmark/README_zh.md b/examples/speed-benchmark/README_zh.md
index 4e966876c1dee7ff2c2d3db1967241bfdaf3e05e..da43266e7e7a864c912a5f91841101aae889f6f2 100644
--- a/examples/speed-benchmark/README_zh.md
+++ b/examples/speed-benchmark/README_zh.md
@@ -1,16 +1,15 @@
-## 效率评估
+# 效率评估
 
 本文介绍Qwen2.5系列模型(原始模型和量化模型)的效率测试流程,详细报告可参考 [Qwen2.5模型效率评估报告](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html)。
 
-### 1. 模型资源
+## 1. 模型资源
 
 对于托管在HuggingFace上的模型,可参考 [Qwen2.5模型-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e)。
 
 对于托管在ModelScope上的模型,可参考 [Qwen2.5模型-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768)。
 
 
-### 2. 环境安装
-
+## 2. 环境安装
 
 使用HuggingFace transformers推理,安装环境如下:
 
@@ -41,9 +40,70 @@ pip install -r requirements-perf-vllm.txt
 ```
 
 
-### 3. 执行测试
+## 3. 执行测试
+
+下面介绍两种执行测试的方法,分别是使用脚本测试和使用Speed Benchmark工具进行测试。
+
+### 方法1:使用Speed Benchmark工具测试
+
+使用[EvalScope](https://github.com/modelscope/evalscope)开发的Speed Benchmark工具进行测试,支持自动从ModelScope下载模型并输出测试结果,参考[📖使用指南](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/speed_benchmark.html)。
+
+**安装依赖**
+```shell
+pip install 'evalscope[perf]' -U
+```
+
+#### HuggingFace transformers推理
+
+执行命令如下:
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+ --parallel 1 \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --attn-implementation flash_attention_2 \
+ --log-every-n-query 5 \
+ --connect-timeout 6000 \
+ --read-timeout 6000 \
+ --max-tokens 2048 \
+ --min-tokens 2048 \
+ --api local \
+ --dataset speed_benchmark
+```
+
+#### vLLM推理
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+ --parallel 1 \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --log-every-n-query 1 \
+ --connect-timeout 60000 \
+ --read-timeout 60000 \
+ --max-tokens 2048 \
+ --min-tokens 2048 \
+ --api local_vllm \
+ --dataset speed_benchmark
+```
+
+#### 参数说明
+- `--parallel` 设置并发请求的worker数量,需固定为1。
+- `--model` 测试模型文件路径,也可为模型ID,支持自动从modelscope下载模型,例如Qwen/Qwen2.5-0.5B-Instruct。
+- `--attn-implementation` 设置attention实现方式,可选值为flash_attention_2|eager|sdpa。
+- `--log-every-n-query`: 设置每n个请求打印一次日志。
+- `--connect-timeout`: 设置连接超时时间,单位为秒。
+- `--read-timeout`: 设置读取超时时间,单位为秒。
+- `--max-tokens`: 设置最大输出长度,单位为token。
+- `--min-tokens`: 设置最小输出长度,单位为token;两个参数同时设置为2048则模型固定输出长度为2048。
+- `--api`: 设置推理接口,本地推理可选值为local|local_vllm。
+- `--dataset`: 设置测试数据集,可选值为speed_benchmark|speed_benchmark_long。
+
+#### 测试结果
 
-#### 3.1 使用HuggingFace transformers推理
+测试结果详见`outputs/{model_name}/{timestamp}/speed_benchmark.json`文件,其中包含所有请求结果和测试参数。
+
+### 方法2:使用脚本测试
+
+#### HuggingFace transformers推理
 
 - 使用HuggingFace hub
 
@@ -70,7 +130,7 @@ python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Inst
     `--outputs_dir`: 输出目录, 默认为`outputs/transformers`  
 
 
-#### 3.2 使用vLLM推理
+#### vLLM推理
 
 - 使用HuggingFace hub
 
@@ -99,12 +159,13 @@ python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --c
     `--outputs_dir`: 输出目录, 默认为`outputs/vllm`  
     `--enforce_eager`: 是否强制使用eager模式;默认为False  
 
+#### 测试结果
 
-#### 3.3 注意事项
+测试结果详见`outputs`目录下的文件,默认包括`transformers`和`vllm`两个目录,分别存放HuggingFace transformers和vLLM的测试结果。
+
+## 注意事项
 
 1. 多次测试,取平均值,典型值为3次
 2. 测试前请确保GPU处于空闲状态,避免其他任务影响测试结果
 
-### 4. 测试结果
 
-测试结果详见`outputs`目录下的文件,默认包括`transformers`和`vllm`两个目录,分别存放HuggingFace transformers和vLLM的测试结果。