2,859 questions
0 votes, 1 answer, 28 views
Upsert! Operation Throws "A table can't contain duplicate column names" Error
I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...
0 votes, 0 answers, 98 views
vLLM + Ray multi-node tensor-parallel deployment completely blocked by pending placement groups and raylet heartbeat failures
Environment:
Ray version: 2.x
vLLM version: 0.9.2
Python version: 3.9
OS / Container base: Linux (CentOS-based UBI8 in Kubernetes)
Cloud / Infrastructure: AWS based Kubernetes cluster (pods scheduled ...
3 votes, 1 answer, 129 views
In Apache Ignite the Replication mode and Partition mode do not work together
I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SqlFieldsQuery API.
Recently, I modified the cache configuration for some tables to use ...
1 vote, 0 answers, 32 views
Spark DSv2 Options vs Properties
I'm playing around with making a DSv2 data source, and I'm a bit confused about what the differences between the "options" and "properties" args passed to some of the TableProvider ...
0 votes, 0 answers, 60 views
Get two different nodes to access and distribute the same SQL table in Apache Spark?
I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as the master.
I ran the ...
3 votes, 2 answers, 200 views
How Ray async actors handle calls to sync methods
I'm working with Ray async actors and I want to understand exactly what happens—at a deep technical level—when a synchronous method is called on such an actor.
I know that calling a synchronous method ...
0 votes, 0 answers, 46 views
How to best partition my data with a 32 core EMR instance and make sure I max out the parallelize feature?
I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has:
A few high-frequency categories (e.g., 90% of records fall into 2-3 ...
0 votes, 0 answers, 132 views
How to set up MS-MPI multi-machine communication between two Windows 11 systems?
I'm trying to set up a multi-machine communication environment using MS-MPI on two Windows 11 laptops, but I'm encountering some issues. Here are the details of my setup:
Environment Details:
...
1 vote, 1 answer, 96 views
Distributed REST API Calls using Spark while maintaining consistency
I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...
0 votes, 0 answers, 18 views
MLP Speed-Up in PySpark fluctuates with more cores – possible cache memory issue?
I have conducted experiments running the MLP (Multi-Layer Perceptron) algorithm on a PC cluster with Apache Spark, with configurations ranging from small data to large ...
0 votes, 0 answers, 316 views
PyTorch DDP Multi-Node Training: ncclInternalError: Internal check failed. Bootstrap : no socket interface found
I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...
0 votes, 1 answer, 773 views
Clearing Cached Data on Databricks Cluster
The problem I am facing is that my "used" memory is only around 16GB, but the cached memory takes up so much space that I am forced to use a compute with more memory (64GB).
So I ...
1 vote, 0 answers, 89 views
Segmentation Fault During Validation with MirroredStrategy on Multiple GPUs
I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...
0 votes, 1 answer, 94 views
I want to use the distributed package in PyTorch for point-to-point communication between two ranks, but it fails with an error
def runTpoly(rank, size, pp, cs, pkArithmetics_evals,
             pkSelectors_evals, domain):
    init_process(rank, size)
    group2 = torch.distributed.new_group([1,2])
    if rank == 0:
        device ...
0 votes, 0 answers, 74 views
Vertex AI Reduction Server returning 500 Internal Error
I am looking to fine-tune a pre-trained DeBERTa model on Vertex AI with PyTorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server.
I'm following this notebook: ...