Newest 'distributed-computing' Questions

0 votes

1 answer

28 views

Upsert! Operation Throws "A table can't contain duplicate column names" Error

I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...

RORO

1

asked Oct 24 at 9:52

0 votes

0 answers

98 views

vLLM + Ray multi-node tensor-parallel deployment completely blocked by pending placement groups and raylet heartbeat failures

Environment: Ray version: 2.x; vLLM version: 0.9.2; Python version: 3.9; OS / Container base: Linux (CentOS-based UBI8 in Kubernetes); Cloud / Infrastructure: AWS-based Kubernetes cluster (pods scheduled ...

NullUser

9

asked Aug 5 at 17:38

3 votes

1 answer

129 views

In Apache Ignite the Replication mode and Partition mode does not work all together

I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SqlFieldsQuery API. Recently, I modified the cache configuration for some tables to use ...

kushal Baldev

799

asked Jul 29 at 17:31

1 vote

0 answers

32 views

Spark DSv2 Options vs Properties

I'm playing around with making a DSv2 data source, and I'm a bit confused about what the differences between the "options" and "properties" args passed to some of the TableProvider ...

William

141

asked Jun 27 at 23:01

0 votes

0 answers

60 views

Get two different nodes to access and distribute the same SQL table in Apache spark?

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as the master. I ran the ...

Rick C. Ferreira

1

asked Jun 16 at 19:25

3 votes

2 answers

200 views

How Ray async actors handle calls to sync methods

I'm working with Ray async actors and I want to understand exactly what happens—at a deep technical level—when a synchronous method is called on such an actor. I know that calling a synchronous method ...

hegash

893

asked May 26 at 11:00

0 votes

0 answers

46 views

How to best partition my data with a 32 core EMR instance and make sure I max out the parallelize feature?

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...

Bilal Jamil

27

asked Apr 30 at 2:51

0 votes

0 answers

132 views

How to set up MS-MPI multi-machine communication between two Windows 11 systems?

I'm trying to set up a multi-machine communication environment using MS-MPI on two Windows 11 laptops, but I'm encountering some issues. Here are the details of my setup: Environment Details: ...

user29094781

1

asked Apr 5 at 6:29

1 vote

1 answer

96 views

Distributed REST API Calls using SPARK with maintaining consistency

I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...

uds0128

53

asked Mar 2 at 18:42

0 votes

0 answers

18 views

MLP Speed-Up in PySpark fluctuates with more cores – possible cache memory issue?

I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...

Syahel Razaba

1

asked Feb 16 at 22:23

0 votes

0 answers

316 views

PyTorch DDP Multi-Node Training: ncclInternalError: Internal check failed. Bootstrap : no socket interface found

I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...

yunjeong

1

asked Jan 31 at 7:19

0 votes

1 answer

773 views

Clearing Cached Data on Databricks Cluster

The problem I am facing is that my "used" memory is only around 16GB, however the cached memory takes up so much space, that I am forced to use a compute with higher memory (64GB). So I ...

Manav Karthikeyan

53

asked Jan 17 at 14:31

1 vote

0 answers

89 views

Segmentation Fault During Validation with MirroredStrategy on Multiple GPUs

I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...

TGD

56

asked Jan 13 at 7:42

0 votes

1 answer

94 views

I want to use the distributed package in PyTorch for point-to-point communication between two ranks, but I get an error when running it

def runTpoly(rank, size, pp, cs, pkArithmetics_evals, pkSelectors_evals, domain): init_process(rank, size) group2 = torch.distributed.new_group([1,2]) if rank == 0: device ...

wynne yin

1

asked Jan 7 at 10:59

0 votes

0 answers

74 views

Vertex AI Reduction Server returning 500 Internal Error

I am looking to finetune a pre-trained deberta model on Vertex AI with pytorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server. I'm following this notebook: ...

purpleFudge

1

asked Jan 1 at 14:59
