diff --git a/examples/speed-benchmark/README.md b/examples/speed-benchmark/README.md
index 84987b7b2a85988622394fcf0179b84034008c9d..3c521ffc04033ef431675d70c0d67d4a57996386 100644
--- a/examples/speed-benchmark/README.md
+++ b/examples/speed-benchmark/README.md
@@ -1,14 +1,14 @@
-## Speed Benchmark
+# Speed Benchmark
 
 This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
 
-### 1. Model Collections
+## 1. Model Collections
 
-For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
+For models hosted on HuggingFace, refer to [Qwen2.5 Model - HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
 
-For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
+For models hosted on ModelScope, refer to [Qwen2.5 Model - ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
 
-### 2. Environment Installation
+## 2. Environment Setup
 
 For inference using HuggingFace transformers:
 
@@ -38,69 +38,126 @@ conda activate qwen_perf_vllm
 pip install -r requirements-perf-vllm.txt
 ```
 
+## 3. Execute Tests
-### 3. Run Experiments
+
+Below are two ways to run the tests: with the Speed Benchmark tool or with the provided scripts.
 
-#### 3.1 Inference using HuggingFace Transformers
+### Method 1: Testing with the Speed Benchmark Tool
 
-- Use HuggingFace hub
+Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope). It automatically downloads models from ModelScope and outputs the test results; see the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).
 
+**Install Dependencies**
 ```shell
-python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+pip install 'evalscope[perf]' -U
 ```
 
-- Use ModelScope hub
+#### HuggingFace Transformers Inference
 
+Run the following command:
 ```shell
-python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --attn-implementation flash_attention_2 \
+  --log-every-n-query 5 \
+  --connect-timeout 6000 \
+  --read-timeout 6000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local \
+  --dataset speed_benchmark
+```
+
+#### vLLM Inference
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --log-every-n-query 1 \
+  --connect-timeout 60000 \
+  --read-timeout 60000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local_vllm \
+  --dataset speed_benchmark
 ```
 
-Parameters:
+#### Parameter Explanation
+- `--parallel`: number of worker threads for concurrent requests; keep this fixed at 1.
+- `--model`: model file path, or a model ID such as Qwen/Qwen2.5-0.5B-Instruct (downloaded automatically from ModelScope).
+- `--attn-implementation`: attention implementation; optional values: flash_attention_2|eager|sdpa.
+- `--log-every-n-query`: log once every n requests.
+- `--connect-timeout`: connection timeout in seconds.
+- `--read-timeout`: read timeout in seconds.
+- `--max-tokens`: maximum output length in tokens.
+- `--min-tokens`: minimum output length in tokens; setting both to 2048 forces the model to output exactly 2048 tokens.
+- `--api`: inference interface; for local inference the options are local|local_vllm.
+- `--dataset`: test dataset; options are speed_benchmark|speed_benchmark_long.
- `--model_id_or_path`: The model path or id on ModelScope or HuggingFace hub
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; Refer to the `Qwen2.5 SpeedBenchmark`.
- `--generate_length`: Output length in tokens; default is 2048.
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES. e.g. `0,1,2,3`, `4,5`
- `--use_modelscope`: Use ModelScope when set this flag. Otherwise, use HuggingFace.
- `--outputs_dir`: Output directory; default is outputs/transformers.
+
+#### Test Results
+Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
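+
+For example, to locate and pretty-print the most recent result file, something like the following works (a sketch; it assumes the default output layout described above):
+
+```shell
+# Pick the newest speed_benchmark.json under outputs/ and pretty-print it
+latest=$(ls -t outputs/*/*/speed_benchmark.json | head -n 1)
+python -m json.tool "$latest"
+```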
-#### 3.2 Inference using vLLM
+### Method 2: Testing with Scripts
 
-- Use HuggingFace hub
+#### HuggingFace Transformers Inference
+
+- Using HuggingFace Hub
 
 ```shell
-python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
-```
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+# Optionally specify HF_ENDPOINT to download via a mirror
+HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+```
 
-- Use ModelScope hub
+- Using ModelScope Hub
 
 ```shell
-python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
 ```
 
+Parameter Explanation:
+
+- `--model_id_or_path`: Model ID or local path; see the `Model Collections` section for available models
+- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024 (see the sweep example below); refer to the `Qwen2.5 Speed Benchmark` report for details
+- `--generate_length`: Number of tokens to generate; default is 2048
+- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
+- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace
+- `--outputs_dir`: Output directory; default is `outputs/transformers`
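+
+For example, to sweep several of the supported input lengths in one go, the script can be run in a loop (a sketch; adjust the model, GPU, and lengths to your setup):
+
+```shell
+# Benchmark the same model at several input lengths; each run writes its own results
+for len in 1 6144 14336 30720; do
+    python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length "$len" --gpus 0 --outputs_dir outputs/transformers
+done
+```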
-Parameters:
+#### vLLM Inference
 
- `--model_id_or_path`: The model id on ModelScope or HuggingFace hub.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; Refer to the `Qwen2.5 SpeedBenchmark`.
- `--generate_length`: Output length in tokens; default is 2048.
- `--max_model_len`: Maximum model length in tokens; default is 32768. Optional values are 4096, 8192, 32768, 65536, 131072.
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES. e.g. `0,1,2,3`, `4,5`
- `--use_modelscope`: Use ModelScope when set this flag. Otherwise, use HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is outputs/vllm.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
+- Using HuggingFace Hub
+```shell
+python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+```
+
+- Using ModelScope Hub
+
+```shell
+python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+```
+
+Parameter Explanation:
-#### 3.3 Tips
+
+- `--model_id_or_path`: Model ID or local path; see the `Model Collections` section for available models
+- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Speed Benchmark` report for details
+- `--generate_length`: Number of tokens to generate; default is 2048
+- `--max_model_len`: Maximum model length in tokens; default is 32768
+- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
+- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace
+- `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9
+- `--outputs_dir`: Output directory; default is `outputs/vllm`
+- `--enforce_eager`: Whether to enforce eager mode; default is False
-- Run multiple experiments and compute the average result; a typical number is 3 times.
-- Make sure the GPU is idle before running experiments.
 
+#### Test Results
-### 4. Results
+Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
-Please check the `outputs` directory, which includes two directories by default: `transformers` and `vllm`, containing the experiments results for HuggingFace transformers and vLLM, respectively.
+
+## Notes
+
+1. Run each test multiple times and take the average; three runs is a typical choice (see the example below).
+2. Ensure the GPU is idle before testing to avoid interference from other tasks.
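+
+For example, note 1 can be followed by wrapping the script-based benchmark in a small loop and averaging the reported speeds afterwards (a sketch; adjust the arguments to your setup):
+
+```shell
+# Repeat the same configuration three times; each run writes its results under outputs/transformers
+for run in 1 2 3; do
+    python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+done
+```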
diff --git a/examples/speed-benchmark/README_zh.md b/examples/speed-benchmark/README_zh.md
index 4e966876c1dee7ff2c2d3db1967241bfdaf3e05e..da43266e7e7a864c912a5f91841101aae889f6f2 100644
--- a/examples/speed-benchmark/README_zh.md
+++ b/examples/speed-benchmark/README_zh.md
@@ -1,16 +1,15 @@
-## Efficiency Evaluation
+# Efficiency Evaluation
 
 This document describes the speed test procedure for the Qwen2.5 series models (original and quantized). For the detailed report, see the [Qwen2.5 Model Efficiency Evaluation Report](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
 
-### 1. Model Resources
+## 1. Model Resources
 
 For models hosted on HuggingFace, see [Qwen2.5 Models - HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
 
 For models hosted on ModelScope, see [Qwen2.5 Models - ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
 
-### 2. Environment Setup
-
+## 2. Environment Setup
 
 For inference with HuggingFace transformers, set up the environment as follows:
@@ -41,9 +40,70 @@ pip install -r requirements-perf-vllm.txt
 ```
 
-### 3. Execute Tests
+## 3. Execute Tests
+
+Below are two ways to run the tests: with the Speed Benchmark tool or with the provided scripts.
+
+### Method 1: Testing with the Speed Benchmark Tool
+
+Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope). It automatically downloads models from ModelScope and outputs the test results; see the [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/speed_benchmark.html).
+
+**Install Dependencies**
+```shell
+pip install 'evalscope[perf]' -U
+```
+
+#### HuggingFace Transformers Inference
+
+Run the following command:
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --attn-implementation flash_attention_2 \
+  --log-every-n-query 5 \
+  --connect-timeout 6000 \
+  --read-timeout 6000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local \
+  --dataset speed_benchmark
+```
+
+#### vLLM Inference
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --log-every-n-query 1 \
+  --connect-timeout 60000 \
+  --read-timeout 60000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local_vllm \
+  --dataset speed_benchmark
+```
+
+#### Parameter Explanation
+- `--parallel`: number of worker threads for concurrent requests; keep this fixed at 1.
+- `--model`: model file path, or a model ID such as Qwen/Qwen2.5-0.5B-Instruct (downloaded automatically from ModelScope).
+- `--attn-implementation`: attention implementation; optional values: flash_attention_2|eager|sdpa.
+- `--log-every-n-query`: log once every n requests.
+- `--connect-timeout`: connection timeout in seconds.
+- `--read-timeout`: read timeout in seconds.
+- `--max-tokens`: maximum output length in tokens.
+- `--min-tokens`: minimum output length in tokens; setting both to 2048 forces the model to output exactly 2048 tokens.
+- `--api`: inference interface; for local inference the options are local|local_vllm.
+- `--dataset`: test dataset; options are speed_benchmark|speed_benchmark_long.
+
+#### Test Results
-#### 3.1 Inference with HuggingFace transformers
+Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
+
+### Method 2: Testing with Scripts
+
+#### HuggingFace Transformers Inference
 
 - Use the HuggingFace hub
@@ -70,7 +130,7 @@ python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Inst
 `--outputs_dir`: output directory; default is `outputs/transformers`
 
-#### 3.2 Inference with vLLM
+#### vLLM Inference
 
 - Use the HuggingFace hub
@@ -99,12 +159,13 @@ python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --c
 `--outputs_dir`: output directory; default is `outputs/vllm`
 `--enforce_eager`: whether to enforce eager mode; default is False
 
+#### Test Results
-#### 3.3 Notes
+Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
+
+## Notes
 
 1. Run each test multiple times and take the average; three runs is a typical choice.
 2. Make sure the GPU is idle before testing so that other tasks do not affect the results.
 
-### 4. Test Results
-Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
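+
+For note 2 above, a quick way to confirm that the GPU is idle before starting a run is to check utilization and memory with `nvidia-smi` (a sketch; requires the NVIDIA driver tools to be installed):
+
+```shell
+# Both utilization and memory use should be near zero before benchmarking
+nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
+```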