NPU#
⚠️ If you encounter problems running vime on Ascend NPU, feel free to open an issue on vllm-project/vime.
Introduction#
This tutorial explains how to set up the runtime environment for vime on Ascend NPU and walks through an end-to-end GRPO training example. The example uses the Megatron training backend together with the vLLM Ascend rollout backend, and synchronizes actor weights to vLLM through the native HCCL weight-sync path.
The current NPU support targets Ascend Atlas A2 / A3 (aarch64) hosts with the
Ascend driver and CANN 9.0.0 (Toolkit, Kernels, and NNAL/ATB) installed.
Only Python 3.12 is supported.
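The requirements above can be checked on a host before pulling the image. The following is a minimal sketch, not part of vime: the helper names `is_supported_python` and `is_supported_arch` are hypothetical, and the checks cover only the Python version and CPU architecture, not the driver or CANN installation.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of vime.

# Succeeds when the given version string is a 3.12.x release.
is_supported_python() {
    case "$1" in
        3.12|3.12.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when the given machine name (as printed by `uname -m`) is aarch64.
is_supported_arch() {
    [ "$1" = "aarch64" ]
}

# Report the host state without aborting, so the snippet is safe to paste.
py_version="$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || echo unknown)"
if is_supported_python "${py_version}"; then
    echo "python ${py_version}: ok"
else
    echo "python ${py_version}: expected 3.12.x"
fi

arch="$(uname -m)"
if is_supported_arch "${arch}"; then
    echo "arch ${arch}: ok"
else
    echo "arch ${arch}: expected aarch64"
fi
```

Inside the published image the Python check should be redundant; it is mainly useful for source builds.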
Docker#
The recommended path for validation is the published vime NPU image.
export IMAGE=quay.io/ascend/vime:vime-latest
# A2: export IMAGE=quay.io/ascend/vime:vime-a2-latest
docker pull "${IMAGE}"
For source builds and dependency debugging, the patch list and pinned commits are
documented in docker/npu_patch/README.md.
Quick Start#
Environment Setup#
Start the container, mounting the Ascend devices and driver files. Device names and driver mount paths vary by host; reuse the mounts from a known working vLLM Ascend container if the layout differs.
docker run -d --name vime-npu -it --net=host --shm-size=1024g \
--privileged=true \
--cap-add=SYS_PTRACE \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /mnt:/mnt \
-v /tmp:/tmp \
-v /data:/data \
-v /path/to:/path/to \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
"${IMAGE}"
docker exec -it vime-npu bash
Inside the container, initialize the CANN environment before training:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
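If the CANN layout inside a container differs from the paths above, sourcing a missing file fails with little context. The sketch below wraps the two `source` calls in a hypothetical helper, `source_if_present`, that reports which script is missing, and then runs `npu-smi info` when that tool is on the PATH. It is an illustration under those assumptions, not a vime utility.

```shell
#!/usr/bin/env bash
# Sketch only: source_if_present is a hypothetical helper, not part of vime or CANN.

# Source an environment script if it exists; otherwise report it and return 1.
source_if_present() {
    if [ -f "$1" ]; then
        # shellcheck disable=SC1090
        . "$1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# The two CANN environment scripts named in this section.
for env_script in \
    /usr/local/Ascend/ascend-toolkit/set_env.sh \
    /usr/local/Ascend/nnal/atb/set_env.sh; do
    source_if_present "${env_script}" || echo "skipped ${env_script}"
done

# npu-smi is mounted from the host by the docker run command; list devices if it is available.
if command -v npu-smi >/dev/null 2>&1; then
    npu-smi info
fi
```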
Prepare Model and Data#
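Set MODEL_ROOT to a host-visible directory that will hold both the checkpoint
and the dataset, then download the Qwen3-4B checkpoint and the DAPO Math 17K
dataset. The example below uses /root, which the docker run command above does
not mount from the host; to keep the files on the host, point MODEL_ROOT at a
mounted path such as one under /data.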
export MODEL_ROOT=/root
mkdir -p "${MODEL_ROOT}/models" "${MODEL_ROOT}/datasets"
# Hugging Face checkpoint
hf download Qwen/Qwen3-4B \
--local-dir "${MODEL_ROOT}/models/Qwen3-4B"
# Training data
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir "${MODEL_ROOT}/datasets/dapo-math-17k"
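Before launching training, it can help to confirm that the downloads landed where the training script expects them. The sketch below uses a hypothetical helper, `report_missing`, and checks two paths: the dataset file named in the training command, and `config.json`, which is assumed to be present because Hugging Face checkpoints normally ship one.

```shell
#!/usr/bin/env bash
# Sketch only: report_missing is a hypothetical helper, not part of vime.

# Print every expected path that is missing; succeed only if none are.
report_missing() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "${path}" ]; then
            echo "missing: ${path}"
            missing=1
        fi
    done
    return "${missing}"
}

MODEL_ROOT="${MODEL_ROOT:-/root}"

# config.json is an assumption about the checkpoint layout;
# the .jsonl path is the one passed to --prompt-data in the training script.
report_missing \
    "${MODEL_ROOT}/models/Qwen3-4B/config.json" \
    "${MODEL_ROOT}/datasets/dapo-math-17k/dapo-math-17k.jsonl" \
    || echo "re-run the downloads before training"
```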
Example: Qwen3-4B#
We provide an example that runs GRPO training with Qwen3-4B on 8 NPUs (4 for the actor, 4 for rollout); see scripts/models/qwen3-4B_npu.sh. To run it:
cd /root/vime
# Source these explicitly if not already initialized by the image.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
MODEL_ROOT=/root bash scripts/models/qwen3-4B_npu.sh
The full log is written to /root/vime/train_qwen3_4b_vllm.log.
⚠️ Note: The main difference between the NPU training script and the NVIDIA one
is the Ascend-specific environment variables: ASCEND_RT_VISIBLE_DEVICES
selects the NPUs, and RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 keeps
Ray from overriding that selection in the processes it launches, so they are
scheduled correctly. The reference target is an Atlas A3 host with 16 visible
NPUs, of which this example uses 8; on an 8-NPU host, set
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7.
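Rather than typing the device list by hand, it can be generated from the NPU count. This is a minimal sketch: `npu_device_list` and `NPU_COUNT` are hypothetical names introduced here, and the sketch assumes the NPUs are numbered contiguously from 0.

```shell
#!/usr/bin/env bash
# Sketch only: npu_device_list and NPU_COUNT are hypothetical names, not part of vime.

# Print a comma-separated list of device indices 0..N-1.
npu_device_list() {
    local count="$1" list="" i=0
    while [ "${i}" -lt "${count}" ]; do
        # Prepend a comma separator only once the list is non-empty.
        list="${list:+${list},}${i}"
        i=$((i + 1))
    done
    echo "${list}"
}

# Default to an 8-NPU host; set NPU_COUNT=16 for the 16-NPU reference target.
NPU_COUNT="${NPU_COUNT:-8}"
export ASCEND_RT_VISIBLE_DEVICES="$(npu_device_list "${NPU_COUNT}")"
echo "ASCEND_RT_VISIBLE_DEVICES=${ASCEND_RT_VISIBLE_DEVICES}"
```

With `NPU_COUNT=8` this yields `0,1,2,3,4,5,6,7`, the value given in the note.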
We show the training script below:
export SLIME_SCRIPT_TRAIN_BACKEND=megatron
export PYTHONPATH="/root/Megatron-Bridge/src:/root/Megatron-LM/:$PYTHONPATH"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export CUDA_DEVICE_MAX_CONNECTIONS=1
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
export HYDRA_FULL_ERROR=1
export MASTER_PORT=$(shuf -i 20000-65000 -n 1) # or any free port
export DISABLE_L2_CACHE=1
export VLLM_ASCEND_ENABLE_NZ=0
SCRIPT_DIR="/root/vime/scripts/"
source "${SCRIPT_DIR}/models/qwen3-4B.sh"
LOG_FILE="/root/vime/train_qwen3_4b_vllm.log"
MODEL_ROOT="${MODEL_ROOT:-/root}"
python /root/vime/train.py \
--train-backend megatron \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4 \
--rollout-num-gpus-per-engine 4 \
"${MODEL_ARGS[@]}" \
\
--hf-checkpoint ${MODEL_ROOT}/models/Qwen3-4B/ \
\
--prompt-data ${MODEL_ROOT}/datasets/dapo-math-17k/dapo-math-17k.jsonl \
--input-key prompt \
--label-key label \
--apply-chat-template \
--rollout-shuffle \
--rm-type math \
\
--rollout-backend vllm \
--vllm-weight-sync-mode native \
--vllm-gpu-memory-utilization 0.6 \
--vllm-enable-sleep-mode \
--vllm-max-model-len 4096 \
\
--num-rollout 200 \
--rollout-batch-size 32 \
--n-samples-per-prompt 8 \
--rollout-max-response-len 2048 \
--rollout-temperature 1.0 \
--global-batch-size 256 \
--balance-data \
\
--advantage-estimator grpo \
--kl-loss-coef 0.0 \
--kl-loss-type low_var_kl \
--kl-coef 0.00 \
--entropy-coef 0.0 \
--eps-clip 0.2 \
--eps-clip-high 0.28 \
\
--optimizer adam \
--lr 1e-6 \
--lr-decay-style constant \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.98 \
\
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 1 \
--context-parallel-size 1 \
--expert-model-parallel-size 1 \
--expert-tensor-parallel-size 1 \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--use-dynamic-batch-size \
--max-tokens-per-gpu 8192 \
--load ${MODEL_ROOT}/models/Qwen3-4B \
--megatron-to-hf-mode bridge \
\
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--accumulate-allreduce-grads-in-fp32 \
--attention-softmax-in-fp32 \
--attention-backend flash \
--micro-batch-size 1 \
--use-flash-attn \
\
--train-memory-margin-bytes 2147483648 \
2>&1 | tee -a "$LOG_FILE"
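The arithmetic behind a few of these flags can be checked before launching. The sketch below copies the values from the command above and verifies two things: that the actor and rollout NPUs together fit within the visible device list, and how many samples one rollout produces. The helper `count_devices` is hypothetical. That the 256 samples per rollout match `--global-batch-size 256` suggests one optimizer step per rollout, but that reading is an assumption rather than something this page states.

```shell
#!/usr/bin/env bash
# Sketch only: count_devices is a hypothetical helper, not part of vime.

# Values copied from the training command.
actor_nodes=1
actor_npus_per_node=4
rollout_npus=4
rollout_batch_size=32
n_samples_per_prompt=8
global_batch_size=256

# Count the entries in a comma-separated device list.
count_devices() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Fall back to an 8-NPU list if the variable is not set in this shell.
visible="${ASCEND_RT_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}"
needed=$((actor_nodes * actor_npus_per_node + rollout_npus))
available="$(count_devices "${visible}")"

if [ "${needed}" -le "${available}" ]; then
    echo "NPUs: need ${needed}, have ${available}"
else
    echo "NPUs: need ${needed} but only ${available} are visible"
fi

# Prompts per rollout times samples per prompt.
samples=$((rollout_batch_size * n_samples_per_prompt))
echo "samples per rollout: ${samples} (global batch size: ${global_batch_size})"
```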