Insights: microsoft/DeepSpeed
September 5, 2024 – September 12, 2024
9 Pull requests merged by 7 people

- Add conditional on torch version for scaled_dot_product_attention (#6517, merged Sep 12, 2024)
- Avoid security issues of subprocess shell (#6498, merged Sep 11, 2024)
- Wrap include of cuda_bf16.h with ifdef BF16_AVAILABLE (#6520, merged Sep 10, 2024)
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" (#6508, merged Sep 9, 2024)
- Fix environment variable export bug for MultiNodeRunner (#5878, merged Sep 8, 2024)
- Fix the broken URL link (#6500, merged Sep 6, 2024)
- Fix pipeline eval_batch micro_batches argument for schedule (#6484, merged Sep 5, 2024)
- Op_builder->is_compatible: quiet warning (#6093, merged Sep 5, 2024)
- HPU: add required ENV vars to accelerator init (#6495, merged Sep 5, 2024)
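The subprocess fix above (#6498) follows a standard hardening pattern: pass an argv list instead of a shell string. A minimal sketch of that pattern (`run_safely` is an illustrative name, not DeepSpeed's API):

```python
import subprocess

# An argv list with the default shell=False never goes through /bin/sh,
# so shell metacharacters in arguments cannot be interpreted as commands.
def run_safely(executable, *args):
    # check_output raises CalledProcessError on a non-zero exit status.
    return subprocess.check_output([executable, *args], text=True)

# The injection attempt below stays a literal argument to echo.
print(run_safely("echo", "hello", "; rm -rf /"))
# → hello ; rm -rf /
```

The same call written as `subprocess.check_output("echo " + user_input, shell=True)` would hand `user_input` to the shell for interpretation, which is the class of issue the PR title names.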
3 Pull requests opened by 2 people

- Handle when `backend` is also in compile_kwargs (#6502, opened Sep 7, 2024)
- Fix dynamo issue in llama (#6527, opened Sep 12, 2024)
- Add bfloat16 to inference support dtypes (#6528, opened Sep 12, 2024)
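The `backend` collision above (#6502) is the classic duplicate-keyword problem when forwarding user kwargs to `torch.compile`. A hedged sketch of the usual resolution (`merge_compile_kwargs` is a hypothetical helper, not the PR's actual code):

```python
def merge_compile_kwargs(backend, compile_kwargs):
    # If the caller also put `backend` inside compile_kwargs, passing both
    # through would raise "got multiple values for keyword argument
    # 'backend'"; prefer the explicit argument and drop the duplicate.
    kwargs = dict(compile_kwargs)      # don't mutate the caller's dict
    kwargs.pop("backend", None)
    return {"backend": backend, **kwargs}

print(merge_compile_kwargs("inductor", {"backend": "eager", "mode": "default"}))
# → {'backend': 'inductor', 'mode': 'default'}
```

Whether the real PR prefers the explicit argument or the one in `compile_kwargs` is a design choice the sketch does not settle; the point is that exactly one value must survive.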
25 Issues closed by 9 people

- Can TP + ZeRO-3 train an LLM? How to do this? (#6523, closed Sep 12, 2024)
- [BUG] AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention' (#5534, closed Sep 12, 2024)
- nv-nightly CI test failure (#6518, closed Sep 11, 2024)
- [BUG] torch.cat(): expected a non-empty list of Tensors with DeepSpeed ZeRO-3 with offload (#4176, closed Sep 10, 2024)
- [BUG] DeepSpeed hangs during evaluation under multi-GPU (#5394, closed Sep 9, 2024)
- Why can't cpu_checkpointing work? (#522, closed Sep 9, 2024)
- Why is CPU checkpointing only available with partitioned activations? (#541, closed Sep 9, 2024)
- Why does ZeRO not support float32 training? (#555, closed Sep 9, 2024)
- RuntimeError: Optimizer lost track of step count! (#562, closed Sep 9, 2024)
- [BUG] AssertionError: Unable to pre-compile ops without torch installed. (#3329, closed Sep 9, 2024)
- [BUG] [Win] Installation from source fails for `transformer_inference_op` (#1631, closed Sep 9, 2024)
- [REQUEST] Please spend more time on the usability of the project, especially the docs. (#3220, closed Sep 9, 2024)
- deepspeed.ops.op_builder.async_io.AsyncIOBuilder assert (#1037, closed Sep 9, 2024)
- [BUG] pip install doesn't work. Please help. (#2137, closed Sep 9, 2024)
- Install fails on Windows (#1189, closed Sep 9, 2024)
- Problems installing with Windows (#1121, closed Sep 9, 2024)
- How to do inference on multiple nodes with multiple GPUs using DeepSpeed? (#6483, closed Sep 9, 2024)
- [BUG] Why is LoRA much slower than Freeze? (#6507, closed Sep 9, 2024)
- [BUG] Universal checkpoint incompatibility with HF Trainer (#6470, closed Sep 7, 2024)
- Inference acceleration doesn't work (#6044, closed Sep 7, 2024)
- Install issue in Windows 10 (#435, closed Sep 6, 2024)
- Build fail (#6499, closed Sep 6, 2024)
- [REQUEST] Can we load a DeepSpeed ckpt without DeepSpeed? (#5895, closed Sep 5, 2024)
- [REQUEST] Add documentation on how to run fast inference of `transformers` models with ZeRO-3 (#5498, closed Sep 5, 2024)
- [BUG] DeepSpeed loads the whole model onto every GPU instead of partitioning (#5592, closed Sep 5, 2024)
10 Issues opened by 10 people

- nv-nightly CI test failure (#6529, opened Sep 12, 2024)
- [REQUEST] Parallelize zero_to_fp32.py to use multiple CPU cores and threads (#6526, opened Sep 11, 2024)
- [BUG] pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig (#6525, opened Sep 11, 2024)
- [BUG] Distributed training randomly stuck in training loop (#6524, opened Sep 11, 2024)
- [BUG] Error: past_key, past_value = layer_past; how to solve this? (#6522, opened Sep 11, 2024)
- [BUG] RuntimeError: Error building extension 'inference_core_ops' (#6519, opened Sep 10, 2024)
- [BUG] No way to simply obtain optimizer state after training, and w['optimizer'] is blank (#6506, opened Sep 8, 2024)
- [REQUEST] ZeRO-3 doc: support for wrapping model sub-components separately for training (#6505, opened Sep 8, 2024)
- [BUG] Config mesh_device None (#6501, opened Sep 6, 2024)
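The zero_to_fp32.py parallelization request (#6526) amounts to fanning per-shard loading out over a worker pool. A minimal stdlib sketch under that assumption, where `load_shard` is a hypothetical stand-in for the real per-file work and the filenames are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def load_shard(path):
    # Hypothetical stand-in for the per-rank shard deserialization
    # that zero_to_fp32.py performs serially today.
    return {"path": path, "tensors": {}}

def load_all_shards(paths, max_workers=4):
    # One task per shard file; pool.map preserves the input order,
    # so downstream consolidation sees shards in rank order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_shard, paths))

shards = load_all_shards([f"zero_pp_rank_{i}.pt" for i in range(4)])
```

Threads suffice when the per-shard work releases the GIL (I/O, torch tensor ops); a `ProcessPoolExecutor` with the same `map` interface would be the drop-in choice for GIL-bound work.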
32 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [BUG] error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file (#3207, commented on Sep 5, 2024; 0 new comments)
- [BUG] When using ZeRO-Infinity, Assertion `n_completes >= min_completes' failed (#4888, commented on Sep 5, 2024; 0 new comments)
- [BUG] Use Pipeline Parallelism and get stuck in the middle (#5568, commented on Sep 6, 2024; 0 new comments)
- [BUG] assert all_groups_norm > 0 | Error related to BF16 optimizer, it seems (#5223, commented on Sep 7, 2024; 0 new comments)
- [BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there" (#5776, commented on Sep 9, 2024; 0 new comments)
- [REQUEST] MiCS vs ZeRO++ hpZ for hybrid FSDP (#6467, commented on Sep 9, 2024; 0 new comments)
- [BUG] `reduce_bucket_size` influences training convergence of ZeRO-2 (#6351, commented on Sep 9, 2024; 0 new comments)
- [BUG] Gradient accumulation causing training loss differences in DeepSpeed vs FSDP (#5898, commented on Sep 9, 2024; 0 new comments)
- [BUG] deepspeed.utils.safe_get_full_grad returns all-NaN values (#5883, commented on Sep 9, 2024; 0 new comments)
- [BUG] Grad_norm is NaN and loss is 0 (#5347, commented on Sep 9, 2024; 0 new comments)
- [BUG] Training time regression with ZeRO-3 after upgrade to torch 2.3.1 and CUDA 12.1 (#5844, commented on Sep 9, 2024; 0 new comments)
- [BUG] Tensor (hidden states) missing across GPUs in Pipeline Parallelism training (#5696, commented on Sep 10, 2024; 0 new comments)
- [BUG] ZeRO-3 hangs during inference; need to detach part of the computational graph, but .detach()/torch.no_grad do not work (#6438, commented on Sep 10, 2024; 0 new comments)
- [BUG] DeepSpeed tries to call "hostname -I", which is not a valid flag for hostname; it should be "hostname -i" (#6497, commented on Sep 10, 2024; 0 new comments)
- [REQUEST] Asynchronous Checkpointing (#5721, commented on Sep 10, 2024; 0 new comments)
- [BUG] Universal checkpoint conversion failed (#5822, commented on Sep 11, 2024; 0 new comments)
- [BUG] Pipeline Dataloader Sampler: `shuffle=False` (#5619, commented on Sep 11, 2024; 0 new comments)
- [REQUEST] Moving a trainable model with an optimiser between GPU and CPU (#5620, commented on Sep 11, 2024; 0 new comments)
- [BUG] `max_in_cpu` seems to be ignored? (#4221, commented on Sep 12, 2024; 0 new comments)
- [BUG] (#5241, commented on Sep 12, 2024; 0 new comments)
- [BUG] AVX2 support for AdamCPU with DeepSpeed 0.14.2 (#6363, commented on Sep 12, 2024; 0 new comments)
- Add Cache to Comm Group (#4849, commented on Sep 9, 2024; 0 new comments)
- Fix initial error in "WarmupCosineLR" scheduler at steps 0 and 1 (#5287, commented on Sep 5, 2024; 0 new comments)
- Update names of CPU Adam/Adagrad/Lion params to better match torch/GPU ops (#5382, commented on Sep 6, 2024; 0 new comments)
- Rearrange inference OPS and stop using builder.load (#5490, commented on Sep 11, 2024; 0 new comments)
- inference: remove unused _validate_args function (#5505, commented on Sep 5, 2024; 0 new comments)
- state_dict_factory: llama checkpoint - support SwiGLU (#5601, commented on Sep 6, 2024; 0 new comments)
- Add APIs to offload states of model, optimizer, and engine (#6011, commented on Sep 9, 2024; 0 new comments)
- Add weights_only=True in torch.load (#6094, commented on Sep 6, 2024; 0 new comments)
- Add the new FPDT feature (#6462, commented on Sep 6, 2024; 0 new comments)
- [Accelerator] Cambricon MLU support (#6472, commented on Sep 11, 2024; 0 new comments)
- Add option to disable logger while compiling to avoid graph breaks (#6496, commented on Sep 6, 2024; 0 new comments)
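The `hostname -I` portability complaint (#6497) can be sidestepped entirely by resolving the address in-process rather than shelling out. A sketch of one shell-free alternative (`resolve_host_ip` is illustrative, not DeepSpeed's actual fix):

```python
import socket

# `hostname -I` is a GNU/Linux-only flag; `hostname -i` resolves the
# hostname through the system resolver. Doing the lookup in Python
# avoids depending on either flag being available.
def resolve_host_ip(name=None):
    name = name or socket.gethostname()
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        # Unresolvable hostname: fall back to loopback rather than fail,
        # which is what `hostname -i` reports on many misconfigured hosts.
        return "127.0.0.1"

print(resolve_host_ip("localhost"))
```

Note that DNS-based resolution can return a loopback address on hosts whose `/etc/hosts` maps the hostname to 127.0.0.1, which is one reason launchers often prefer interface enumeration instead.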

