Questions tagged [apache-spark]
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
34 questions
-2
votes
1
answer
92
views
Code that loops through a df, joins two separate dataframes, and sinks into Delta Lake; how do I make it run faster?
I have two dataframes: Budget and Forecast. For those dataframes, I'm trying to create a snapshot record by joining with the temp table snapshot_to_collect in a for loop. I'm ...
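A minimal PySpark sketch of the shape of this problem, assuming hypothetical table and column names (budget, snapshot_to_collect, snapshot_date); the usual speed-up is to replace the per-iteration join-and-write with a single join and a single Delta write:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    budget_df = spark.table("budget")                  # hypothetical source table
    snapshot_df = spark.table("snapshot_to_collect")   # dates to snapshot

    # One cross join replaces the Python for loop: every budget row is paired
    # with every snapshot date in a single distributed pass.
    result = budget_df.crossJoin(snapshot_df.select("snapshot_date"))

    # One append instead of one Delta write per loop iteration.
    result.write.format("delta").mode("append").saveAsTable("budget_snapshots")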
1
vote
1
answer
168
views
PySpark: Find the penultimate (2nd largest) value row-wise
I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment.
I could use some feedback with respect to:
Is the code structured ...
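For the row-wise second-largest value, one idiomatic approach is to pack the columns into an array, sort it descending, and take index 1; a minimal sketch with made-up column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(4, 9, 7), (1, 3, 2)], ["a", "b", "c"])

    cols = ["a", "b", "c"]
    # Sort each row's values descending inside an array; index 1 is then
    # the second-largest value of that row.
    second = F.sort_array(F.array(*[F.col(c) for c in cols]), asc=False)[1]
    df.withColumn("penultimate", second).show()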
1
vote
1
answer
91
views
PySpark: Create a column containing the minimum divided by maximum of each row
I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment.
I could use some feedback with respect to:
Is the code ...
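For the min-over-max ratio, least and greatest already operate row-wise across columns, so no array packing or UDF is needed; a minimal sketch with made-up data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(2, 8, 4), (5, 10, 20)], ["a", "b", "c"])

    cols = [F.col(c) for c in ["a", "b", "c"]]
    # least/greatest evaluate per row across the listed columns.
    df.withColumn("min_over_max", F.least(*cols) / F.greatest(*cols)).show()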
1
vote
1
answer
214
views
Find the youngest athlete to win a Gold medal
I have a dataframe loaded from a CSV file which contains data from the Olympic games. The goal is to find the athletes with the minimum age who won a Gold medal. I have managed to come up with the following code. Is ...
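A minimal sketch of one way to phrase this, assuming column names (Medal, Age, Name) that may differ from the actual CSV; joining back against the minimum keeps ties rather than a single arbitrary row:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    athletes = spark.read.csv("olympics.csv", header=True, inferSchema=True)

    gold = athletes.filter(F.col("Medal") == "Gold")
    min_age = gold.agg(F.min("Age").alias("min_age"))
    # The join keeps every athlete tied at the minimum age, not just one.
    gold.join(min_age, gold["Age"] == min_age["min_age"]).select("Name", "Age").show()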
0
votes
1
answer
143
views
Graph coloring problem with Spark (Java)?
I am trying to create the algorithm described in the image from this link
Simply put, any two adjacent nodes must not share tuples, and if they do, they are colored with the same color. If there's a ...
3
votes
2
answers
963
views
PySpark SCD Type 1
I am using PySpark in Azure Databricks to try to create an SCD Type 1.
I would like to know whether this is an efficient way of doing it.
Here is my SQL table:
...
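On Databricks, SCD Type 1 is often expressed as a Delta Lake MERGE, which overwrites matched rows and keeps no history; a sketch with hypothetical table and key names, not necessarily the asker's own approach:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    updates = spark.table("staging_customers")           # hypothetical staging data
    target = DeltaTable.forName(spark, "dim_customers")  # hypothetical dimension

    (target.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()      # Type 1: overwrite in place, no history
        .whenNotMatchedInsertAll()
        .execute())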
1
vote
1
answer
176
views
Group events close in time into sessions and assign unique session IDs
The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve.
Given is a DataFrame with events, each with a user ID and a timestamp.
...
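The standard window-function approach to sessionization: flag each event whose gap to the previous event exceeds a timeout, then take a running sum of the flags as the session number. A sketch with an assumed schema and timeout; concatenating user_id with session_id would make the IDs globally unique:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.createDataFrame(
        [("u1", 100), ("u1", 130), ("u1", 4000), ("u2", 50)],
        ["user_id", "ts"],
    )

    gap = 1800  # session timeout in seconds; an assumption
    w = Window.partitionBy("user_id").orderBy("ts")

    sessions = (events
        .withColumn("prev_ts", F.lag("ts").over(w))
        .withColumn("new_session", (F.col("ts") - F.col("prev_ts") > gap).cast("int"))
        .fillna({"new_session": 1})  # the first event of each user opens a session
        .withColumn("session_id", F.sum("new_session").over(w)))
    sessions.show()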
4
votes
1
answer
218
views
Rewriting Scala code in object-oriented style to reduce repetitive use of similar functions
I need help rewriting my code to be less repetitive. I am used to coding procedurally, not object-oriented. My Scala program is for Databricks.
How would you combine cmd 3 and 5? Does ...
2
votes
0
answers
1k
views
Spark Scala: SQL rlike vs Custom UDF
I have a scenario where 10K+ regular expressions are stored in a table, along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using "spark sql rlike" ...
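The question is about Scala, but the trade-off is the same in PySpark: an rlike join is effectively a cross join with a regex predicate, so broadcasting the small pattern table avoids a shuffle, yet every input row is still tested against every pattern. A minimal sketch with made-up data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    patterns = spark.createDataFrame([(r"err.*",), (r"warn\d+",)], ["pattern"])
    lines = spark.createDataFrame([("error 42",), ("all good",)], ["line"])

    # Broadcast the small pattern table; the regex predicate makes this a
    # broadcast nested-loop join rather than a hash join.
    matched = lines.join(F.broadcast(patterns), F.expr("line rlike pattern"))
    matched.show()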
2
votes
2
answers
823
views
Scala app to transpose columns into rows
This is the first application, or really any Scala at all, that I have ever written. So far it functions as I hoped it would. I just found this community and would love some peer review of possible ...
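The question is Scala, but the underlying operation (unpivoting columns into rows) is usually written with the stack() SQL function; a PySpark sketch with invented column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0, 20.0)], ["id", "jan", "feb"])

    # stack(n, label1, col1, ...) emits one (month, amount) row per column.
    long_df = df.select(
        "id",
        F.expr("stack(2, 'jan', jan, 'feb', feb) as (month, amount)"),
    )
    long_df.show()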
2
votes
1
answer
71
views
Filtering and creating new columns by condensing the lists for each item information
I am trying to improve my programming skills at work (analyst), and one of the engineering projects I worked on is around ETL. Essentially, we roll up each individual's account information to a single row ...
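Rolling many account rows up to one row per individual is typically a groupBy with collect_list (or collect_set when duplicates should be dropped); a sketch with hypothetical column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    accounts = spark.createDataFrame(
        [("p1", "chk", 100), ("p1", "sav", 500), ("p2", "chk", 42)],
        ["person_id", "acct_type", "balance"],
    )

    # One row per person, with the account details condensed into list columns.
    rolled = accounts.groupBy("person_id").agg(
        F.collect_list("acct_type").alias("acct_types"),
        F.collect_list("balance").alias("balances"),
    )
    rolled.show(truncate=False)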
2
votes
0
answers
281
views
Quantile calculation in PySpark MLlib
I am trying to compute quantiles for each column of the table for various firms using Spark 1.6.
I have around 5000 entries in firm_list and 300 entries in attr_lst...
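Note the question targets Spark 1.6; on Spark 2.0+ the DataFrame API has approxQuantile, which trades a tunable relative error for speed. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(float(x),) for x in range(100)], ["attr1"])

    # The last argument is the relative error: 0.0 is exact, larger is faster.
    quartiles = df.approxQuantile("attr1", [0.25, 0.5, 0.75], 0.01)
    print(quartiles)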
3
votes
0
answers
56
views
Spark takes a long time checking whether an array of items is present in another array
I am new to Spark. I have two dataframes, df1 and df2. df1 has three rows; df2 has more than a few million rows. I want to check whether all items in df2 are in a transaction of df1, and if so, sum up the costs. ...
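Since df1 is tiny, broadcasting it and using array functions avoids a shuffle; a sketch assuming both sides carry array columns (the real schema may differ):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(["a", "b"],)], ["transaction"])
    df2 = spark.createDataFrame([(["a", "b"], 10), (["a", "c"], 5)], ["items", "cost"])

    # A df2 row qualifies when none of its items fall outside the transaction;
    # the cross join is cheap because df1 has only a handful of rows.
    matched = (df2.crossJoin(F.broadcast(df1))
        .filter(F.size(F.array_except("items", "transaction")) == 0))
    matched.agg(F.sum("cost").alias("total_cost")).show()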
4
votes
3
answers
1k
views
Binary check code in pyspark
I am trying to find out whether a column is binary. If a column only contains 1 or 0 then I flag it as binary, else ...
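One single-pass way to flag binary columns: check, per column, whether any value falls outside {0, 1}. A sketch (nulls are treated as binary-compatible here, which is an assumption):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (0, 7)], ["flag", "count"])

    # True for a column when no value lies outside {0, 1}; one aggregation
    # covers every column in a single pass over the data.
    checks = df.agg(*[
        (F.max(F.when(~F.col(c).isin(0, 1), 1).otherwise(0)) == 0).alias(c)
        for c in df.columns
    ])
    checks.show()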
1
vote
1
answer
176
views
Managing PySpark DataFrames
I successfully wrote a small script using PySpark to retrieve and organize data from a large .xml file. Being new to PySpark, I am wondering if there is a better way to write the ...
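Large XML files are commonly read with the spark-xml package rather than hand-rolled parsing; the package coordinates, rowTag value, and path below are all assumptions about the asker's setup:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
        .getOrCreate())

    df = (spark.read.format("xml")
        .option("rowTag", "record")  # hypothetical element marking one row
        .load("data.xml"))
    df.printSchema()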