Description
Describe the bug
KTO training (LoRA) of Qwen2.5-14B-Instruct with DeepSpeed ZeRO-3 and sequence_parallel_size 4 crashes as soon as the trainer runs evaluation: the eval-time forward in trl's KTOTrainer raises `RuntimeError: still have inflight params` from DeepSpeed's partitioned parameter coordinator on every rank. Reproduce with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift rlhf --model_type qwen2_5 --rlhf_type kto --model /data/changjianhou/Qwen2_5-14B-Instruct --train_type lora --dataset /data/changjianhou/DATA/got/ppt_online/all_json_0324_v7_kto.json --torch_dtype bfloat16 --num_train_epochs 2 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --learning_rate 1e-4 --lora_rank 48 --lora_alpha 32 --eval_steps 5000 --save_steps 5000 --save_total_limit 5 --logging_steps 5 --max_length 20000 --output_dir output --warmup_ratio 0.05 --dataloader_num_workers 4 --output_dir /data/changjianhou/DATA/model_output/ppt_kto --deepspeed zero3 --sequence_parallel_size 4
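For readability, the same command wrapped one option per line (note that `--output_dir` appears twice, exactly as in the original command):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift rlhf \
    --model_type qwen2_5 \
    --rlhf_type kto \
    --model /data/changjianhou/Qwen2_5-14B-Instruct \
    --train_type lora \
    --dataset /data/changjianhou/DATA/got/ppt_online/all_json_0324_v7_kto.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-4 \
    --lora_rank 48 \
    --lora_alpha 32 \
    --eval_steps 5000 \
    --save_steps 5000 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 20000 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --output_dir /data/changjianhou/DATA/model_output/ppt_kto \
    --deepspeed zero3 \
    --sequence_parallel_size 4
```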
Your hardware and system info
CUDA: 12.2
torch: 2.5.1
GPU: H88 * 4
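The versions of ms-swift, transformers, trl, and deepspeed that appear in the traceback below are not listed above; a quick way to collect them from the same conda environment is sketched here (assuming the usual PyPI package names):

```shell
# Report the torch build, its CUDA version, and the GPU name,
# plus the versions of the libraries involved in the traceback.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
pip list 2>/dev/null | grep -Ei "^(ms-swift|transformers|trl|deepspeed|peft|accelerate) "
```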
Additional context
The error log is as follows:
Invalidate trace cache @ step 0 and module 0: cache has only 0 modules
Invalidate trace cache @ step 10: expected module 11, but got module 19
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/cli/rlhf.py", line 5, in
[rank3]: rlhf_main()
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 96, in rlhf_main
[rank3]: return SwiftRLHF(args).main()
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/llm/base.py", line 46, in main
[rank3]: result = self.run()
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/llm/train/sft.py", line 143, in run
[rank3]: return self.train(trainer)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/llm/train/sft.py", line 202, in train
[rank3]: trainer.train(trainer.args.resume_from_checkpoint)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/trainers/mixin.py", line 266, in train
[rank3]: res = super().train(*args, **kwargs)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
[rank3]: return inner_training_loop(
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 2591, in _inner_training_loop
[rank3]: self._maybe_log_save_evaluate(
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/trainers/mixin.py", line 321, in _maybe_log_save_evaluate
[rank3]: super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 3049, in _maybe_log_save_evaluate
[rank3]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 3003, in _evaluate
[rank3]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 4050, in evaluate
[rank3]: output = eval_loop(
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/trl/trainer/kto_trainer.py", line 1457, in evaluation_loop
[rank3]: initial_output = super().evaluation_loop(
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 4244, in evaluation_loop
[rank3]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/trl/trainer/kto_trainer.py", line 1385, in prediction_step
[rank3]: loss, metrics = self.get_batch_loss_metrics(model, inputs)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/trl/trainer/kto_trainer.py", line 1237, in get_batch_loss_metrics
[rank3]: ) = self.forward(self.model, batch)[:5]
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/kto_trainer.py", line 69, in forward
[rank3]: return super().forward(model, batch)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/trl/trainer/kto_trainer.py", line 1064, in forward
[rank3]: KL_logits = model(
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank3]: return inner()
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1803, in inner
[rank3]: hook_result = hook(self, args, result)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 229, in _end_of_forward_hook
[rank3]: self.get_param_coordinator(training=False).reset_step()
[rank3]: File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 208, in reset_step
[rank3]: raise RuntimeError(f"still have inflight params "
[rank3]: RuntimeError: still have inflight params [{'id': 580, 'status': 'AVAILABLE', 'numel': 245760, 'ds_numel': 245760, 'shape': (5120, 48), 'ds_shape': (5120, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([61440])}, {'id': 582, 'status': 'AVAILABLE', 'numel': 49152, 'ds_numel': 49152, 'shape': (1024, 48), 'ds_shape': (1024, 48), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([12288])}, {'id': 584, 'status': 'AVAILABLE', 'numel': 49152, 'ds_numel': 49152, 'shape': (1024, 48), 'ds_shape': (1024, 48), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([12288])}, {'id': 586, 'status': 'AVAILABLE', 'numel': 245760, 'ds_numel': 245760, 'shape': (5120, 48), 'ds_shape': (5120, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([61440])}, {'id': 588, 'status': 'AVAILABLE', 'numel': 663552, 'ds_numel': 663552, 'shape': (13824, 48), 'ds_shape': (13824, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([165888])}, {'id': 590, 'status': 'AVAILABLE', 'numel': 663552, 'ds_numel': 663552, 'shape': (13824, 48), 'ds_shape': (13824, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([165888])}, {'id': 592, 'status': 'AVAILABLE', 'numel': 245760, 'ds_numel': 245760, 'shape': (5120, 48), 'ds_shape': (5120, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([61440])}, {'id': 594, 'status': 'AVAILABLE', 'numel': 245760, 'ds_numel': 245760, 'shape': (5120, 48), 'ds_shape': (5120, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([61440])}, {'id': 596, 'status': 'AVAILABLE', 'numel': 49152, 'ds_numel': 49152, 'shape': (1024, 48), 'ds_shape': (1024, 48), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([12288])}, {'id': 598, 'status': 'AVAILABLE', 'numel': 49152, 'ds_numel': 49152, 'shape': (1024, 48), 'ds_shape': (1024, 48), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([12288])}, {'id': 600, 'status': 'AVAILABLE', 'numel': 245760, 'ds_numel': 245760, 'shape': (5120, 48), 'ds_shape': (5120, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([61440])}, {'id': 602, 'status': 'AVAILABLE', 'numel': 663552, 'ds_numel': 663552, 'shape': (13824, 48), 'ds_shape': (13824, 48), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([165888])}]
Ranks 1, 0, and 2 fail with the identical traceback and the same `RuntimeError: still have inflight params` (same parameter ids 580-602, listed in a slightly different order); their tracebacks are omitted here.
Train: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1618/1618 [2:43:07<00:00, 6.05s/it]
[rank1]:[W325 19:43:14.176026856 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Ranks 0, 3, and 2 print the same ProcessGroupNCCL warning.
W0325 19:43:15.582000 3327892 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3328123 closing signal SIGTERM
W0325 19:43:15.584000 3327892 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3328124 closing signal SIGTERM
W0325 19:43:15.584000 3327892 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3328125 closing signal SIGTERM
E0325 19:43:16.149000 3327892 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 3328122) of binary: /data/aobozhang/miniconda3/envs/new_vllm/bin/python
Traceback (most recent call last):
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in
main()
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/data/aobozhang/miniconda3/envs/new_vllm/lib/python3.10/site-packages/swift/cli/rlhf.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-03-25_19:43:15
host : 984098
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3328122)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
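The crash comes from the eval-time KL forward in trl's KTOTrainer under DeepSpeed ZeRO-3, right after the two "Invalidate trace cache" warnings, which suggests the ZeRO-3 parameter prefetcher loses track of the module execution order during evaluation. For reference only (untested here), DeepSpeed issues sometimes recommend disabling stage-3 prefetching for this class of error; a sketch of such a config, assuming `--deepspeed` also accepts a path to a custom JSON file and an arbitrary file name, is:

```shell
# Untested workaround sketch: a ZeRO-3 config with parameter prefetching
# disabled (stage3_prefetch_bucket_size = 0). The "auto" values are filled
# in by the transformers DeepSpeed integration at launch time.
cat > zero3_no_prefetch.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_prefetch_bucket_size": 0,
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF
# Then relaunch the command above with:
#   --deepspeed zero3_no_prefetch.json
```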