distributed-training

We would like to forward a particular 'key' column, which is part of the features, so that it appears alongside the predictions. This lets us identify which set of features a particular prediction belongs to. Here is an example of prediction output using tensorflow.contrib.estimator.multi_class_head:

{"classes": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
 "scores": [0.068196
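The request above, carrying a key next to each prediction, can be sketched in plain Python without any TensorFlow dependency. This is only an illustration of the idea, not the estimator API: the helper name `attach_keys`, the default column name `"key"`, and the assumption that features and predictions arrive in the same order are all hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def attach_keys(
    features: Iterable[Dict[str, Any]],
    predictions: Iterable[Dict[str, Any]],
    key_name: str = "key",
) -> Iterator[Dict[str, Any]]:
    """Yield each prediction dict with the matching feature key copied in.

    Assumes one prediction per feature row, produced in the same order.
    If the two inputs differ in length, zip stops at the shorter one.
    """
    for row, pred in zip(features, predictions):
        merged = dict(pred)  # shallow copy so the caller's dict is untouched
        merged[key_name] = row[key_name]
        yield merged
```

Each yielded dictionary then holds the original prediction fields plus the key, so a downstream consumer can join predictions back to their inputs.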

I have the same hardware environment and the same network, but I could not reproduce your result; I get almost half of what you reported. Are there any best practices or experience you can share? Thanks very much! For BytePS with 1 instance and 8 GPUs, I get a similar test result.

It seems that the number of joining clients (not the number of computing clients) is fixed in fedml_api/data_preprocessing/**/data_loader and cannot be changed, except for the CIFAR10 dataset.

By this I mean that the total number of clients seems to be decided by the dataset rather than by the input from run_fedavg_distributed_pytorch.sh.

https://github.com/FedML-AI/FedML/blob/3d9fda8d149c95f25ec4898e31df76f035a33b5d/fed

Simple mistakes trigger unclear error messages in the ALBERT example, that is:

Absence of the unpacked data for the trainer (currently triggers requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer)
Running all peers in --client_mode (currently triggers AllReduce failed: could not find a group)
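One way the first mistake could be surfaced earlier is a pre-flight check that runs before anything tries to load the data. The sketch below is a guess at what such a check might look like, not code from the ALBERT example: the helper name `require_unpacked_data` and the wording of the message are assumptions.

```python
from pathlib import Path


def require_unpacked_data(data_dir: str) -> Path:
    """Return data_dir as a Path, or raise a descriptive error if it is missing.

    Hypothetical helper: the name and the message are illustrative only.
    """
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(
            f"Expected unpacked training data in '{path}', but the directory "
            "does not exist. Unpack the dataset before starting the trainer."
        )
    return path
```

Calling such a check at startup would replace the HTTP 404 with a message that names the missing directory and the fix.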

It would be great to

Background

Currently, Alpa uses cupy as the Python API binding for NCCL. This causes two problems:

We need to convert between cupy tensors and XLA tensors. Although we can achieve zero-copy through dlpack, this part of the code is error-prone and hacky.
There can be conflicts between the nccl used by cupy and the nccl used by XLA. cupy.nccl and [xla/nccl_utils](https://github.com/alp

torchtext (as of 0.4.0) adopts torch.utils.data.DataLoader, and the older iterator interface is deprecated. Ensure AdaptDL's AdaptiveDataLoader supports this new torchtext interface for data loading, and port the example transformer code to the new interface. Then, adaptdl.data.iterator can be deprecated/removed.

Here are 83 public repositories matching this topic...

rwightman / pytorch-image-models

PaddlePaddle / Paddle

tensorflow / adanet

bytedance / byteps

determined-ai / determined

FedML-AI / FedML

tensorlayer / hyperpose

learning-at-home / hivemind

alibaba / DeepRec

alpa-projects / alpa

petuum / adaptdl

awslabs / deeplearning-cfn

lsds / KungFu

dougsouza / pytorch-sync-batchnorm-example

DeNA / HandyRL

maudzung / YOLO3D-YOLOv4-PyTorch

DataCanvasIO / HyperGBM

wenwei202 / terngrad

synxlin / deep-gradient-compression

pytorch / torchx

alibaba / EasyParallelLibrary

ZJU-OpenKS / OpenKS

PaddlePaddle / PLSC

INET-RC / GeoMX

PKU-DAIR / Hetu

richardkxu / distributed-pytorch

Oneflow-Inc / libai

bindog / pytorch-model-parallel

bytedance / ps-lite

bryanyzhu / Video-Tutorial-CVPR2020
