Questions tagged [apache-spark]

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

-2 votes
1 answer
92 views

Code that loops through a DataFrame, joins two separate DataFrames, and sinks into Delta Lake; how do I make it run faster?

I have two DataFrames: Budget and Forecast. For those DataFrames, I'm trying to create snapshot records by joining with the temp table snapshot_to_collect in a for loop. I'm ...
user282159
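A set-based rewrite is the usual fix here: one union plus one join, then a single Delta write, instead of joining inside a loop. A minimal sketch in PySpark, assuming hypothetical table names and a snapshot_id join key (the question doesn't show its schema):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    budget = spark.table("budget")                  # hypothetical source tables
    forecast = spark.table("forecast")
    snapshots = spark.table("snapshot_to_collect")  # temp table from the question

    # One union + one join replaces the per-iteration joins; broadcasting the
    # small snapshot table avoids shuffling the large side. (unionByName with
    # allowMissingColumns requires Spark 3.1+.)
    result = (budget.unionByName(forecast, allowMissingColumns=True)
                    .join(F.broadcast(snapshots), "snapshot_id"))

    # A single append instead of one write per loop iteration.
    result.write.format("delta").mode("append").save("/mnt/delta/snapshots")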
1 vote
1 answer
168 views

PySpark: Find the penultimate (2nd-largest) value row-wise

I am working out some PySpark exercises and trying to write efficient, best-practice code to be ready to work in a production environment. I could use some feedback with respect to: Is the code structured ...
Jongert • 29
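One idiomatic row-wise answer is to pack the columns into an array, sort it descending, and take the second element. A minimal sketch with invented columns:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(10, 40, 30), (5, 5, 2)], ["a", "b", "c"])

    # Sort the row's values descending inside an array; element [1] is the
    # second-largest value (ties count twice).
    second = F.sort_array(F.array("a", "b", "c"), asc=False)[1]
    df.withColumn("second_largest", second).show()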
1 vote
1 answer
91 views

PySpark: Create a column containing the minimum divided by maximum of each row

I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment. I could use some feedback with respect to: Is the code ...
Jongert • 29
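For this one, least and greatest do the row-wise work without building an array. A minimal sketch with invented columns; note that division by a zero maximum yields null in Spark SQL rather than an error:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(10, 40, 30), (5, 5, 2)], ["a", "b", "c"])

    # least/greatest evaluate across columns within each row.
    ratio = F.least("a", "b", "c") / F.greatest("a", "b", "c")
    df.withColumn("min_over_max", ratio).show()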
1 vote
1 answer
214 views

Find the youngest athlete to win a Gold medal

I have a DataFrame loaded from a CSV file containing data from the Olympic Games. The goal is to find the athletes with the minimum age who won a Gold medal. I have managed to come up with the following code. Is ...
Bagira • 191
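A filter-then-aggregate approach answers this in two short steps. A minimal sketch, assuming hypothetical Medal and Age column names and file path:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    athletes = spark.read.csv("/path/to/olympics.csv",
                              header=True, inferSchema=True)

    gold = athletes.filter(F.col("Medal") == "Gold")
    min_age = gold.agg(F.min("Age")).first()[0]   # youngest gold-winning age
    gold.filter(F.col("Age") == min_age).show()   # ties return several athletes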
0 votes
1 answer
143 views

Graph coloring problem with Spark (Java)?

I am trying to implement the algorithm described in the image from this link. Simply put, any two adjacent nodes must not share tuples, and if they do, they are colored with the same color. If there's a ...
John Campbell
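Without the linked image the exact algorithm is unclear, but the adjacency it describes (nodes sharing tuples) can be derived with a self-join. A rough sketch, in PySpark rather than Java and with every name invented:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    node_tuples = spark.createDataFrame(
        [("n1", "t1"), ("n2", "t1"), ("n3", "t2")], ["node", "tuple_id"])

    # Two nodes are adjacent iff they share at least one tuple; the
    # inequality keeps each unordered pair once.
    edges = (node_tuples.alias("a")
             .join(node_tuples.alias("b"), "tuple_id")
             .where(F.col("a.node") < F.col("b.node"))
             .select(F.col("a.node").alias("src"),
                     F.col("b.node").alias("dst"))
             .distinct())
    edges.show()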
3 votes
2 answers
963 views

PySpark SCD Type 1

I am using PySpark in Azure Databricks to try to create an SCD Type 1. I would like to know if this is an efficient way of doing it. Here is my SQL table: ...
AGB • 31
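On Databricks the natural implementation of SCD Type 1 (overwrite in place, keep no history) is a Delta Lake MERGE. A minimal sketch, assuming hypothetical table names and a customer_id business key:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    target = DeltaTable.forName(spark, "dim_customer")  # hypothetical dimension
    updates = spark.table("staging_customer")           # hypothetical staging data

    # Type 1 semantics: matched keys are overwritten, new keys inserted.
    (target.alias("t")
           .merge(updates.alias("s"), "t.customer_id = s.customer_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())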
1 vote
1 answer
176 views

Group events close in time into sessions and assign unique session IDs

The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve. The input is a DataFrame of events, each with a user ID and a timestamp. ...
Tobias Hermann
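The standard window-function recipe: flag each event whose gap from the previous event of the same user exceeds the threshold, then take a running sum of the flags as the session number. A minimal sketch with an assumed 30-minute gap:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.createDataFrame(
        [("u1", "2021-01-01 10:00:00"), ("u1", "2021-01-01 10:10:00"),
         ("u1", "2021-01-01 12:00:00")], ["user_id", "ts"]
    ).withColumn("ts", F.col("ts").cast("timestamp"))

    GAP = 30 * 60                     # assumed session gap, in seconds
    w = Window.partitionBy("user_id").orderBy("ts")
    secs = F.col("ts").cast("long")

    sessions = (events
        # 1 when the gap to the previous event exceeds GAP (or there is none)
        .withColumn("new_session",
                    F.coalesce(((secs - F.lag(secs).over(w)) > GAP).cast("int"),
                               F.lit(1)))
        # the running sum of flags numbers the sessions per user
        .withColumn("session_id",
                    F.concat_ws("-", "user_id",
                                F.sum("new_session").over(w).cast("string"))))
    sessions.show()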
4 votes
1 answer
218 views

Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions

I need help in rewriting my code to be less repetitive. I am used to coding procedurally, not object-oriented. My Scala program is for Databricks. How would you combine cmd 3 and 5? Does ...
Dung Tran • 143
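The notebook cells in question aren't shown, but the general cure for near-identical cells is one parameterized function applied with different arguments. An illustrative sketch with invented names, in PySpark, though the same shape carries over to Scala:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(100.0, 90.0)], ["budget", "forecast"])

    def adjust(frame, src_col, rate):
        # One definition replaces each copy-pasted cell; only the
        # arguments differ between uses.
        return frame.withColumn(f"{src_col}_adjusted", F.col(src_col) * rate)

    result = (df.transform(lambda d: adjust(d, "budget", 1.05))
                .transform(lambda d: adjust(d, "forecast", 0.95)))
    result.show()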
2 votes
0 answers
1k views

Spark Scala: SQL rlike vs Custom UDF

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using "spark sql rlike" ...
Wiki_91 • 43
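The rlike variant can be expressed directly as a non-equi join condition; whether it beats a UDF with pre-compiled patterns depends on the data, but broadcasting the small regex table matters either way. A minimal sketch with invented rows (PySpark, but the same join works in Scala):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    patterns = spark.createDataFrame([(1, "^ERR.*"), (2, ".*timeout.*")],
                                     ["rule_id", "pattern"])
    incoming = spark.createDataFrame([("ERR disk full",), ("socket timeout",)],
                                     ["message"])

    # Non-equi join: every message is tested against every broadcast pattern.
    matched = incoming.join(F.broadcast(patterns),
                            F.expr("message rlike pattern"))
    matched.show()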
2 votes
2 answers
823 views

Scala app to transpose columns into rows

This is the first application, or really any Scala, I have ever written. So far it functions as I hoped it would. I just found this community and would love some peer review of possible ...
GenericDisplayName
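Transposing columns into rows is usually a one-liner with the stack SQL function. A minimal sketch with invented columns; shown in PySpark, but the selectExpr string is identical in Scala:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10, 20)], ["id", "q1", "q2"])

    # stack(n, label1, value1, ...) emits n rows per input row.
    long_df = df.selectExpr(
        "id", "stack(2, 'q1', q1, 'q2', q2) as (metric, value)")
    long_df.show()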
2 votes
1 answer
71 views

Filtering and creating new columns by condensing the lists for each item information

I am trying to improve my programming skills at work (analyst), and one of the engineering projects I worked on is around ETL. Essentially, we roll up all individuals' account information to a single row ...
Rob • 121
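Rolling many account rows up to one row per individual is a groupBy with collecting aggregates. A minimal sketch with invented columns:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    accounts = spark.createDataFrame(
        [("p1", "a1", 100.0), ("p1", "a2", 50.0), ("p2", "a3", 75.0)],
        ["individual_id", "account_id", "balance"])

    # One row per individual: the account list is condensed into an array.
    rolled = (accounts.groupBy("individual_id")
              .agg(F.collect_list("account_id").alias("accounts"),
                   F.sum("balance").alias("total_balance")))
    rolled.show()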
2 votes
0 answers
281 views

Quantile calculation in PySpark MLlib

I am trying to find out quantiles for each column on the table for various firms using Spark 1.6. I have around 5000 entries in firm_list and 300 entries in attr_lst...
Vishwanath560
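On Spark 2.0+, approxQuantile computes this in one pass (2.2+ for the multi-column form); on 1.6, as in the question, the usual substitute is the Hive percentile_approx UDAF. A minimal sketch of the former with invented columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
                               ["attr1", "attr2"])

    # One job for all listed columns; the last argument is the relative error.
    quartiles = df.approxQuantile(["attr1", "attr2"], [0.25, 0.5, 0.75], 0.01)
    print(dict(zip(["attr1", "attr2"], quartiles)))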
3 votes
0 answers
56 views

Spark takes a long time checking whether an array of items is present in another array

I am new to Spark. I have two DataFrames, df1 and df2. df1 has three rows; df2 has more than a few million rows. I want to check whether all items in df2 are in the transactions of df1 and, if so, sum up the costs. ...
priya • 131
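With only three rows in df1, collecting its items to the driver and testing containment with array_except avoids the expensive join entirely. A minimal sketch, assuming (the question doesn't show the schema) that df2 carries its items in an array column:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([("i1",), ("i2",), ("i3",)], ["item"])
    df2 = spark.createDataFrame([(["i1", "i2"], 5.0), (["i1", "i9"], 7.0)],
                                ["items", "cost"])

    # df1 is tiny, so collecting it locally is cheap; array_except keeps
    # only the items of each df2 row that df1 does not contain.
    known = F.array(*[F.lit(r[0]) for r in df1.select("item").collect()])
    covered = df2.filter(F.size(F.array_except("items", known)) == 0)
    covered.agg(F.sum("cost").alias("total_cost")).show()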
4 votes
3 answers
1k views

Binary check code in pyspark

I am trying to find out whether a column is binary. If a column only contains 1 or 0 then I flag it as binary, else ...
Shankar Panda
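The flag can be computed for every column in one aggregation pass rather than one distinct per column. A minimal sketch (it counts nulls as acceptable, which may or may not match the intent):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 3), (0, 1)], ["flag", "score"])

    # Per column: max of a "non-binary value seen" indicator; 0 means binary.
    checks = [(F.max(F.when(~F.col(c).isin(0, 1), 1).otherwise(0)) == 0)
              .alias(c) for c in df.columns]
    df.agg(*checks).show()   # flag -> true, score -> false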
1 vote
1 answer
176 views

Managing PySpark DataFrames

I was able to write a small script using PySpark to retrieve and organize data from a large .xml file. Being new to PySpark, I am wondering if there is any better way to write the ...
Wilson • 111
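For large XML files the usual route is the spark-xml package (com.databricks:spark-xml) rather than hand parsing. A minimal sketch, assuming that package is attached and using a hypothetical row tag and path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Requires spark-xml on the cluster, e.g.
    #   --packages com.databricks:spark-xml_2.12:0.15.0
    df = (spark.read.format("xml")
          .option("rowTag", "record")     # hypothetical element marking a row
          .load("/path/to/data.xml"))     # hypothetical path
    df.printSchema()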
