distributed-training

Here are 83 public repositories matching this topic...

jankrynauw commented Jun 6, 2019

We would like to forward a particular 'key' column, which is part of the features, so that it appears alongside the predictions. This would let us identify which set of features a particular prediction belongs to. Here is an example of the predictions output using tensorflow.contrib.estimator.multi_class_head:

{"classes": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
 "scores": [0.068196
enhancement help wanted good first issue
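
The request above can be pictured as a small wrapper that copies the key from each input example into the matching prediction. The sketch below is plain Python and does not use the TensorFlow estimator API; `attach_keys`, the `key` field name, and the sample records are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def attach_keys(
    features: Iterable[Dict[str, Any]],
    predictions: Iterable[Dict[str, Any]],
    key_name: str = "key",
) -> Iterator[Dict[str, Any]]:
    """Yield each prediction together with the key of the example that produced it.

    Assumes features and predictions arrive in the same order and have equal length.
    """
    for example, prediction in zip(features, predictions):
        merged = dict(prediction)  # copy so the original prediction is left untouched
        merged[key_name] = example[key_name]
        yield merged


# Hypothetical inputs: two examples and made-up predictions for them.
examples = [
    {"key": "row-17", "pixels": [0, 1]},
    {"key": "row-42", "pixels": [1, 1]},
]
preds = [
    {"classes": ["0", "1"], "scores": [0.9, 0.1]},
    {"classes": ["0", "1"], "scores": [0.2, 0.8]},
]

keyed = list(attach_keys(examples, preds))
```

Each entry of `keyed` then carries both the prediction fields and the originating key, so a consumer can join predictions back to their inputs without relying on ordering.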
wizard1203 commented Nov 7, 2020

It seems that the number of participating clients (not the number of computing clients) is fixed in fedml_api/data_preprocessing/**/data_loader and cannot be changed, except for the CIFAR10 dataset.

In other words, the total number of clients appears to be determined by the dataset rather than by the input from run_fedavg_distributed_pytorch.sh.

https://github.com/FedML-AI/FedML/blob/3d9fda8d149c95f25ec4898e31df76f035a33b5d/fed

good first issue
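
To illustrate the behaviour this issue asks for, the following sketch splits sample indices across a caller-supplied number of clients instead of a count fixed by the dataset. It is not FedML's data loader; `partition_indices` and its parameters are hypothetical names used only for this example.

```python
import random
from typing import Dict, List


def partition_indices(num_samples: int, client_num: int, seed: int = 0) -> Dict[int, List[int]]:
    """Shuffle sample indices and deal them out to `client_num` clients."""
    if client_num <= 0:
        raise ValueError("client_num must be positive")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    # Round-robin slicing keeps shard sizes within one sample of each other.
    return {cid: indices[cid::client_num] for cid in range(client_num)}


# The client count comes from the caller (for example a script argument),
# not from the dataset itself.
shards = partition_indices(num_samples=10, client_num=3)
```

This produces an IID-style split; a non-IID partition would need a different assignment rule, but the point here is only that `client_num` is an input.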
borzunov commented Sep 21, 2021

Simple mistakes trigger unclear error messages in the ALBERT example, namely:

  • Missing unpacked data for the trainer (currently triggers requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer)
  • Running all peers in --client_mode (currently triggers AllReduce failed: could not find a group)

It would be great to

good first issue help wanted
merrymercy commented May 7, 2022

Background

Currently, Alpa uses cupy as the Python API binding for nccl. This causes two problems:

  • We need to convert between cupy tensors and xla tensors. Although we can achieve zero-copy through dlpack, this part of the code is error-prone and hacky.
  • There can be conflicts between the nccl used by cupy and the nccl used by XLA. cupy.nccl and [xla/nccl_utils](https://github.com/alp
good first issue
adaptdl
aurickq commented Sep 6, 2020

torchtext (as of 0.4.0) adopts torch.utils.data.DataLoader, and the older iterator interface is deprecated. Ensure AdaptDL's AdaptiveDataLoader supports this new torchtext interface for data loading, and port the example transformer code to the new interface. Then, adaptdl.data.iterator can be deprecated/removed.

enhancement good first issue
