Questions tagged [apache-spark]
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
34 questions
-2
votes
1
answer
92
views
Code that loops through a df, joins two separate dataframes, and sinks into Delta Lake; how do I make it run faster?
I have two dataframes: Budget and Forecast. For those dataframes, I'm trying to create a snapshot record by joining with the temp table snapshot_to_collect in a for loop. I'm ...
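A minimal PySpark sketch of the shape of this problem, assuming hypothetical table and column names (budget, snapshot_to_collect, snapshot_date); the usual speed-up is to replace the per-iteration join-and-write with a single join and a single Delta write:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    budget_df = spark.table("budget")                  # hypothetical source table
    snapshot_df = spark.table("snapshot_to_collect")   # dates to snapshot

    # One cross join replaces the Python for loop: every budget row is paired
    # with every snapshot date in a single distributed pass.
    result = budget_df.crossJoin(snapshot_df.select("snapshot_date"))

    # One append instead of one Delta write per loop iteration.
    result.write.format("delta").mode("append").saveAsTable("budget_snapshots")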
1
vote
1
answer
168
views
PySpark: Find the penultimate (2nd largest) value row-wise
I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment.
I could use some feedback with respect to:
Is the code structured ...
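For the row-wise second-largest value, one idiomatic approach is to pack the columns into an array, sort it descending, and take index 1; a minimal sketch with made-up column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(4, 9, 7), (1, 3, 2)], ["a", "b", "c"])

    cols = ["a", "b", "c"]
    # Sort each row's values descending inside an array; index 1 is then
    # the second-largest value of that row.
    second = F.sort_array(F.array(*[F.col(c) for c in cols]), asc=False)[1]
    df.withColumn("penultimate", second).show()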
1
vote
1
answer
91
views
PySpark: Create a column containing the minimum divided by maximum of each row
I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment.
I could use some feedback with respect to:
Is the code ...
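For the min-over-max ratio, least and greatest already operate row-wise across columns, so no array packing or UDF is needed; a minimal sketch with made-up data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(2, 8, 4), (5, 10, 20)], ["a", "b", "c"])

    cols = [F.col(c) for c in ["a", "b", "c"]]
    # least/greatest evaluate per row across the listed columns.
    df.withColumn("min_over_max", F.least(*cols) / F.greatest(*cols)).show()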
1
vote
1
answer
214
views
Find the youngest athlete to win a Gold medal
I have a dataframe loaded from a CSV file which contains data from the Olympic games. The goal is to find the athletes with the minimum age who won a Gold medal. I have managed to come up with the following code. Is ...
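A minimal sketch of one way to phrase this, assuming column names (Medal, Age, Name) that may differ from the actual CSV; joining back against the minimum keeps ties rather than a single arbitrary row:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    athletes = spark.read.csv("olympics.csv", header=True, inferSchema=True)

    gold = athletes.filter(F.col("Medal") == "Gold")
    min_age = gold.agg(F.min("Age").alias("min_age"))
    # The join keeps every athlete tied at the minimum age, not just one.
    gold.join(min_age, gold["Age"] == min_age["min_age"]).select("Name", "Age").show()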
0
votes
1
answer
143
views
Graph coloring problem with Spark (Java)?
I am trying to create the algorithm described in the image from this link
Simply put, any two adjacent nodes must not share tuples, and if they do, they are colored with the same color. If there's a ...
3
votes
2
answers
963
views
PySpark SCD Type 1
I am using PySpark in Azure Databricks to try to create an SCD Type 1.
I would like to know whether this is an efficient way of doing it.
Here is my SQL table:
...
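On Databricks, SCD Type 1 is often expressed as a Delta Lake MERGE, which overwrites matched rows and keeps no history; a sketch with hypothetical table and key names, not necessarily the asker's own approach:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    updates = spark.table("staging_customers")           # hypothetical staging data
    target = DeltaTable.forName(spark, "dim_customers")  # hypothetical dimension

    (target.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()      # Type 1: overwrite in place, no history
        .whenNotMatchedInsertAll()
        .execute())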
1
vote
1
answer
176
views
Group events close in time into sessions and assign unique session IDs
The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve.
Given is a DataFrame with events, each with a user ID and a timestamp.
...
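The standard window-function approach to sessionization: flag each event whose gap to the previous event exceeds a timeout, then take a running sum of the flags as the session number. A sketch with an assumed schema and timeout; concatenating user_id with session_id would make the IDs globally unique:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.createDataFrame(
        [("u1", 100), ("u1", 130), ("u1", 4000), ("u2", 50)],
        ["user_id", "ts"],
    )

    gap = 1800  # session timeout in seconds; an assumption
    w = Window.partitionBy("user_id").orderBy("ts")

    sessions = (events
        .withColumn("prev_ts", F.lag("ts").over(w))
        .withColumn("new_session", (F.col("ts") - F.col("prev_ts") > gap).cast("int"))
        .fillna({"new_session": 1})  # the first event of each user opens a session
        .withColumn("session_id", F.sum("new_session").over(w)))
    sessions.show()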
4
votes
1
answer
218
views
Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions
I need help rewriting my code to be less repetitive. I am used to coding procedurally, not object-oriented. My Scala program is for Databricks.
How would you combine cmd 3 and 5? Does ...
2
votes
0
answers
1k
views
Spark Scala: SQL rlike vs Custom UDF
I have a scenario where 10K+ regular expressions are stored in a table, along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using "spark sql rlike" ...
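The question is about Scala, but the trade-off is the same in PySpark: an rlike join is effectively a cross join with a regex predicate, so broadcasting the small pattern table avoids a shuffle, yet every input row is still tested against every pattern. A minimal sketch with made-up data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    patterns = spark.createDataFrame([(r"err.*",), (r"warn\d+",)], ["pattern"])
    lines = spark.createDataFrame([("error 42",), ("all good",)], ["line"])

    # Broadcast the small pattern table; the regex predicate makes this a
    # broadcast nested-loop join rather than a hash join.
    matched = lines.join(F.broadcast(patterns), F.expr("line rlike pattern"))
    matched.show()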
2
votes
2
answers
823
views
Scala app to transpose columns into rows
This is the first application, or really any Scala at all, that I have ever written. So far it functions as I hoped it would. I just found this community and would love some peer review of possible ...
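The question is Scala, but the underlying operation (unpivoting columns into rows) is usually written with the stack() SQL function; a PySpark sketch with invented column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0, 20.0)], ["id", "jan", "feb"])

    # stack(n, label1, col1, ...) emits one (month, amount) row per column.
    long_df = df.select(
        "id",
        F.expr("stack(2, 'jan', jan, 'feb', feb) as (month, amount)"),
    )
    long_df.show()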
2
votes
1
answer
71
views
Filtering and creating new columns by condensing the lists for each item information
I am trying to improve my programming skills at work (analyst), and one of the engineering projects I worked on is around ETL. Essentially, we roll up each individual's account information to a single row ...
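Rolling many account rows up to one row per individual is typically a groupBy with collect_list (or collect_set when duplicates should be dropped); a sketch with hypothetical column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    accounts = spark.createDataFrame(
        [("p1", "chk", 100), ("p1", "sav", 500), ("p2", "chk", 42)],
        ["person_id", "acct_type", "balance"],
    )

    # One row per person, with the account details condensed into list columns.
    rolled = accounts.groupBy("person_id").agg(
        F.collect_list("acct_type").alias("acct_types"),
        F.collect_list("balance").alias("balances"),
    )
    rolled.show(truncate=False)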
2
votes
0
answers
281
views
Quantile calculation in PySpark MLlib
I am trying to compute quantiles for each column of the table for various firms using Spark 1.6.
I have around 5000 entries in firm_list and 300 entries in attr_lst...
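Note the question targets Spark 1.6; on Spark 2.0+ the DataFrame API has approxQuantile, which trades a tunable relative error for speed. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(float(x),) for x in range(100)], ["attr1"])

    # The last argument is the relative error: 0.0 is exact, larger is faster.
    quartiles = df.approxQuantile("attr1", [0.25, 0.5, 0.75], 0.01)
    print(quartiles)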
3
votes
0
answers
56
views
Spark takes a long time checking whether an array of items is present in another array
I am new to Spark. I have two dataframes, df1 and df2. df1 has three rows; df2 has more than a few million rows. I want to check whether all items in df2 are in a transaction of df1, and if so, sum up the costs. ...
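Since df1 is tiny, broadcasting it and using array functions avoids a shuffle; a sketch assuming both sides carry array columns (the real schema may differ):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(["a", "b"],)], ["transaction"])
    df2 = spark.createDataFrame([(["a", "b"], 10), (["a", "c"], 5)], ["items", "cost"])

    # A df2 row qualifies when none of its items fall outside the transaction;
    # the cross join is cheap because df1 has only a handful of rows.
    matched = (df2.crossJoin(F.broadcast(df1))
        .filter(F.size(F.array_except("items", "transaction")) == 0))
    matched.agg(F.sum("cost").alias("total_cost")).show()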
4
votes
3
answers
1k
views
Binary check code in pyspark
I am trying to find out whether a column is binary. If a column only contains 1 or 0 then I flag it as binary, else ...
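One single-pass way to flag binary columns: check, per column, whether any value falls outside {0, 1}. A sketch (nulls are treated as binary-compatible here, which is an assumption):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (0, 7)], ["flag", "count"])

    # True for a column when no value lies outside {0, 1}; one aggregation
    # covers every column in a single pass over the data.
    checks = df.agg(*[
        (F.max(F.when(~F.col(c).isin(0, 1), 1).otherwise(0)) == 0).alias(c)
        for c in df.columns
    ])
    checks.show()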
1
vote
1
answer
176
views
Managing PySpark DataFrames
I successfully wrote a small script using PySpark to retrieve and organize data from a large .xml file. Being new to PySpark, I am wondering if there is a better way to write the ...
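Large XML files are commonly read with the spark-xml package rather than hand-rolled parsing; the package coordinates, rowTag value, and path below are all assumptions about the asker's setup:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
        .getOrCreate())

    df = (spark.read.format("xml")
        .option("rowTag", "record")  # hypothetical element marking one row
        .load("data.xml"))
    df.printSchema()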