0 votes
1 answer
28 views

Upsert! Operation Throws "A table can't contain duplicate column names" Error

I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...
RORO's user avatar
  • 1
0 votes
0 answers
97 views

vLLM + Ray multi-node tensor-parallel deployment completely blocked by pending placement groups and raylet heartbeat failures

Environment: Ray version: 2.x; vLLM version: 0.9.2; Python version: 3.9; OS / Container base: Linux (CentOS-based UBI8 in Kubernetes); Cloud / Infrastructure: AWS-based Kubernetes cluster (pods scheduled ...
NullUser's user avatar
3 votes
1 answer
129 views

In Apache Ignite, the Replication mode and Partition mode do not work together

I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SQLFieldsQuery API. Recently, I modified the cache configuration for some tables to use ...
kushal Baldev's user avatar
1 vote
0 answers
32 views

Spark DSv2 Options vs Properties

I'm playing around with making a DSv2 data source, and I'm a bit confused about what the differences between the "options" and "properties" args passed to some of the TableProvider ...
William's user avatar
  • 141
0 votes
0 answers
60 views

Get two different nodes to access and distribute the same SQL table in Apache Spark?

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and it got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as the Master. I ran the ...
Rick C. Ferreira's user avatar
3 votes
2 answers
198 views

How Ray async actors handle calls to sync methods

I'm working with Ray async actors and I want to understand exactly what happens—at a deep technical level—when a synchronous method is called on such an actor. I know that calling a synchronous method ...
hegash's user avatar
  • 893
0 votes
0 answers
46 views

How to best partition my data on a 32-core EMR instance and make sure I maximize parallelism?

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...
Bilal Jamil's user avatar
0 votes
0 answers
132 views

How to set up MS-MPI multi-machine communication between two Windows 11 systems?

I'm trying to set up a multi-machine communication environment using MS-MPI on two Windows 11 laptops, but I'm encountering some issues. Here are the details of my setup: Environment Details: ...
user29094781's user avatar
1 vote
1 answer
96 views

Distributed REST API Calls Using Spark While Maintaining Consistency

I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...
uds0128's user avatar
  • 53
0 votes
0 answers
18 views

MLP Speed-Up in PySpark fluctuates with more cores – possible cache memory issue?

I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...
Syahel Razaba's user avatar
0 votes
0 answers
316 views

PyTorch DDP Multi-Node Training: ncclInternalError: Internal check failed. Bootstrap : no socket interface found

I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...
yunjeong's user avatar
0 votes
1 answer
773 views

Clearing Cached Data on Databricks Cluster

The problem I am facing is that my "used" memory is only around 16GB; however, the cached memory takes up so much space that I am forced to use a compute with higher memory (64GB). So I ...
Manav Karthikeyan's user avatar
1 vote
0 answers
89 views

Segmentation Fault During Validation with MirroredStrategy on Multiple GPUs

I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...
TGD's user avatar
  • 56
0 votes
1 answer
94 views

I want to use the distributed package in PyTorch for point-to-point communication between two ranks, but I get an error when running it

def runTpoly(rank, size, pp, cs, pkArithmetics_evals, pkSelectors_evals, domain): init_process(rank, size) group2 = torch.distributed.new_group([1,2]) if rank == 0: device ...
wynne yin's user avatar
0 votes
0 answers
74 views

Vertex AI Reduction Server returning 500 Internal Error

I am looking to finetune a pre-trained deberta model on Vertex AI with pytorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server. I'm following this notebook: ...
purpleFudge's user avatar
