Insights: microsoft/DeepSpeed
September 5, 2024 – September 12, 2024
9 Pull requests merged by 7 people

- Add conditional on torch version for scaled_dot_product_attention (#6517, merged Sep 12, 2024)
- Avoid security issues of subprocess shell (#6498, merged Sep 11, 2024)
- Wrap include of cuda_bf16.h with ifdef BF16_AVAILABLE (#6520, merged Sep 10, 2024)
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" (#6508, merged Sep 9, 2024)
- Fix environment variable export bug for MultiNodeRunner (#5878, merged Sep 8, 2024)
- Fix the broken URL link (#6500, merged Sep 6, 2024)
- Fix pipeline eval_batch micro_batches argument for schedule (#6484, merged Sep 5, 2024)
- Op_builder->is_compatible: quiet warning (#6093, merged Sep 5, 2024)
- HPU: add required ENV vars to accelerator init (#6495, merged Sep 5, 2024)
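The subprocess fix above (#6498) follows a standard hardening pattern: pass an argv list instead of a shell string. A minimal sketch of that pattern (`run_safely` is an illustrative name, not DeepSpeed's API):

```python
import subprocess

# An argv list with the default shell=False never goes through /bin/sh,
# so shell metacharacters in arguments cannot be interpreted as commands.
def run_safely(executable, *args):
    # check_output raises CalledProcessError on a non-zero exit status.
    return subprocess.check_output([executable, *args], text=True)

# The injection attempt below stays a literal argument to echo.
print(run_safely("echo", "hello", "; rm -rf /"))
# → hello ; rm -rf /
```

The same call written as `subprocess.check_output("echo " + user_input, shell=True)` would hand `user_input` to the shell for interpretation, which is the class of issue the PR title names.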
3 Pull requests opened by 2 people

- Handle when `backend` is also in compile_kwargs (#6502, opened Sep 7, 2024)
- Fix dynamo issue in llama (#6527, opened Sep 12, 2024)
- Add bfloat16 to inference support dtypes (#6528, opened Sep 12, 2024)
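The `backend` collision above (#6502) is the classic duplicate-keyword problem when forwarding user kwargs to `torch.compile`. A hedged sketch of the usual resolution (`merge_compile_kwargs` is a hypothetical helper, not the PR's actual code):

```python
def merge_compile_kwargs(backend, compile_kwargs):
    # If the caller also put `backend` inside compile_kwargs, passing both
    # through would raise "got multiple values for keyword argument
    # 'backend'"; prefer the explicit argument and drop the duplicate.
    kwargs = dict(compile_kwargs)      # don't mutate the caller's dict
    kwargs.pop("backend", None)
    return {"backend": backend, **kwargs}

print(merge_compile_kwargs("inductor", {"backend": "eager", "mode": "default"}))
# → {'backend': 'inductor', 'mode': 'default'}
```

Whether the real PR prefers the explicit argument or the one in `compile_kwargs` is a design choice the sketch does not settle; the point is that exactly one value must survive.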
25 Issues closed by 9 people

- Can TP + ZeRO-3 train an LLM? How to do this? (#6523, closed Sep 12, 2024)
- [BUG] AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention' (#5534, closed Sep 12, 2024)
- nv-nightly CI test failure (#6518, closed Sep 11, 2024)
- [BUG] torch.cat(): expected a non-empty list of Tensors with DeepSpeed ZeRO-3 with offload (#4176, closed Sep 10, 2024)
- [BUG] DeepSpeed hangs during evaluation under multi-GPU (#5394, closed Sep 9, 2024)
- Why can't cpu_checkpointing work? (#522, closed Sep 9, 2024)
- Why is CPU checkpointing only available with partitioned activations? (#541, closed Sep 9, 2024)
- Why does ZeRO not support float32 training? (#555, closed Sep 9, 2024)
- RuntimeError: Optimizer lost track of step count! (#562, closed Sep 9, 2024)
- [BUG] AssertionError: Unable to pre-compile ops without torch installed. (#3329, closed Sep 9, 2024)
- [BUG] [Win] Installation from source fails for `transformer_inference_op` (#1631, closed Sep 9, 2024)
- [REQUEST] Please spend more time on the usability of the project, especially the docs. (#3220, closed Sep 9, 2024)
- deepspeed.ops.op_builder.async_io.AsyncIOBuilder assert (#1037, closed Sep 9, 2024)
- [BUG] pip install doesn't work. Please help. (#2137, closed Sep 9, 2024)
- Install fails on Windows (#1189, closed Sep 9, 2024)
- Problems installing with Windows (#1121, closed Sep 9, 2024)
- How to do inference on multiple nodes with multiple GPUs using DeepSpeed? (#6483, closed Sep 9, 2024)
- [BUG] Why is LoRA much slower than Freeze? (#6507, closed Sep 9, 2024)
- [BUG] Universal checkpoint incompatibility with HF Trainer (#6470, closed Sep 7, 2024)
- Inference acceleration doesn't work (#6044, closed Sep 7, 2024)
- Install issue in Windows 10 (#435, closed Sep 6, 2024)
- Build fail (#6499, closed Sep 6, 2024)
- [REQUEST] Can we load a DeepSpeed ckpt without DeepSpeed? (#5895, closed Sep 5, 2024)
- [REQUEST] Add documentation on how to run fast inference of `transformers` models with ZeRO-3 (#5498, closed Sep 5, 2024)
- [BUG] DeepSpeed loads the whole model onto every GPU instead of partitioning (#5592, closed Sep 5, 2024)
10 Issues opened by 10 people

- nv-nightly CI test failure (#6529, opened Sep 12, 2024)
- [REQUEST] Parallelize zero_to_fp32.py to use multiple CPU cores and threads (#6526, opened Sep 11, 2024)
- [BUG] pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig (#6525, opened Sep 11, 2024)
- [BUG] Distributed training randomly stuck in training loop (#6524, opened Sep 11, 2024)
- [BUG] Error: past_key, past_value = layer_past; how to solve this? (#6522, opened Sep 11, 2024)
- [BUG] RuntimeError: Error building extension 'inference_core_ops' (#6519, opened Sep 10, 2024)
- [BUG] No way to simply obtain optimizer state after training, and w['optimizer'] is blank (#6506, opened Sep 8, 2024)
- [REQUEST] ZeRO-3 doc: support for wrapping model sub-components separately for training (#6505, opened Sep 8, 2024)
- [BUG] Config mesh_device None (#6501, opened Sep 6, 2024)
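The zero_to_fp32.py parallelization request (#6526) amounts to fanning per-shard loading out over a worker pool. A minimal stdlib sketch under that assumption, where `load_shard` is a hypothetical stand-in for the real per-file work and the filenames are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def load_shard(path):
    # Hypothetical stand-in for the per-rank shard deserialization
    # that zero_to_fp32.py performs serially today.
    return {"path": path, "tensors": {}}

def load_all_shards(paths, max_workers=4):
    # One task per shard file; pool.map preserves the input order,
    # so downstream consolidation sees shards in rank order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_shard, paths))

shards = load_all_shards([f"zero_pp_rank_{i}.pt" for i in range(4)])
```

Threads suffice when the per-shard work releases the GIL (I/O, torch tensor ops); a `ProcessPoolExecutor` with the same `map` interface would be the drop-in choice for GIL-bound work.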
32 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [BUG] error: can't copy 'deepspeed/accelerator': doesn't exist or not a regular file (#3207, commented on Sep 5, 2024; 0 new comments)
- [BUG] When using ZeRO-Infinity, Assertion `n_completes >= min_completes' failed (#4888, commented on Sep 5, 2024; 0 new comments)
- [BUG] Use Pipeline Parallelism and get stuck in the middle (#5568, commented on Sep 6, 2024; 0 new comments)
- [BUG] assert all_groups_norm > 0 | Error related to BF16 optimizer, it seems (#5223, commented on Sep 7, 2024; 0 new comments)
- [BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there" (#5776, commented on Sep 9, 2024; 0 new comments)
- [REQUEST] MiCS vs ZeRO++ hpZ for hybrid FSDP (#6467, commented on Sep 9, 2024; 0 new comments)
- [BUG] `reduce_bucket_size` influences training convergence of ZeRO-2 (#6351, commented on Sep 9, 2024; 0 new comments)
- [BUG] Gradient accumulation causing training loss differences in DeepSpeed vs FSDP (#5898, commented on Sep 9, 2024; 0 new comments)
- [BUG] deepspeed.utils.safe_get_full_grad returns all-NaN values (#5883, commented on Sep 9, 2024; 0 new comments)
- [BUG] Grad_norm is NaN and loss is 0 (#5347, commented on Sep 9, 2024; 0 new comments)
- [BUG] Training time regression with ZeRO-3 after upgrade to torch 2.3.1 and CUDA 12.1 (#5844, commented on Sep 9, 2024; 0 new comments)
- [BUG] Tensor (hidden states) missing across GPUs in Pipeline Parallelism training (#5696, commented on Sep 10, 2024; 0 new comments)
- [BUG] ZeRO-3 hangs during inference; need to detach part of the computational graph, but .detach()/torch.no_grad do not work (#6438, commented on Sep 10, 2024; 0 new comments)
- [BUG] DeepSpeed tries to call "hostname -I", which is not a valid flag for hostname; it should be "hostname -i" (#6497, commented on Sep 10, 2024; 0 new comments)
- [REQUEST] Asynchronous Checkpointing (#5721, commented on Sep 10, 2024; 0 new comments)
- [BUG] Universal checkpoint conversion failed (#5822, commented on Sep 11, 2024; 0 new comments)
- [BUG] Pipeline Dataloader Sampler: `shuffle=False` (#5619, commented on Sep 11, 2024; 0 new comments)
- [REQUEST] Moving a trainable model with an optimiser between GPU and CPU (#5620, commented on Sep 11, 2024; 0 new comments)
- [BUG] `max_in_cpu` seems to be ignored? (#4221, commented on Sep 12, 2024; 0 new comments)
- [BUG] (#5241, commented on Sep 12, 2024; 0 new comments)
- [BUG] AVX2 support for AdamCPU with DeepSpeed 0.14.2 (#6363, commented on Sep 12, 2024; 0 new comments)
- Add Cache to Comm Group (#4849, commented on Sep 9, 2024; 0 new comments)
- Fix initial error in "WarmupCosineLR" scheduler at steps 0 and 1 (#5287, commented on Sep 5, 2024; 0 new comments)
- Update names of CPU Adam/Adagrad/Lion params to better match torch/GPU ops (#5382, commented on Sep 6, 2024; 0 new comments)
- Rearrange inference OPS and stop using builder.load (#5490, commented on Sep 11, 2024; 0 new comments)
- inference: remove unused _validate_args function (#5505, commented on Sep 5, 2024; 0 new comments)
- state_dict_factory: llama checkpoint - support SwiGLU (#5601, commented on Sep 6, 2024; 0 new comments)
- Add APIs to offload states of model, optimizer, and engine (#6011, commented on Sep 9, 2024; 0 new comments)
- Add weights_only=True in torch.load (#6094, commented on Sep 6, 2024; 0 new comments)
- Add the new FPDT feature (#6462, commented on Sep 6, 2024; 0 new comments)
- [Accelerator] Cambricon MLU support (#6472, commented on Sep 11, 2024; 0 new comments)
- Add option to disable logger while compiling to avoid graph breaks (#6496, commented on Sep 6, 2024; 0 new comments)
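The `hostname -I` portability complaint (#6497) can be sidestepped entirely by resolving the address in-process rather than shelling out. A sketch of one shell-free alternative (`resolve_host_ip` is illustrative, not DeepSpeed's actual fix):

```python
import socket

# `hostname -I` is a GNU/Linux-only flag; `hostname -i` resolves the
# hostname through the system resolver. Doing the lookup in Python
# avoids depending on either flag being available.
def resolve_host_ip(name=None):
    name = name or socket.gethostname()
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        # Unresolvable hostname: fall back to loopback rather than fail,
        # which is what `hostname -i` reports on many misconfigured hosts.
        return "127.0.0.1"

print(resolve_host_ip("localhost"))
```

Note that DNS-based resolution can return a loopback address on hosts whose `/etc/hosts` maps the hostname to 127.0.0.1, which is one reason launchers often prefer interface enumeration instead.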

