
Using pySpark, I need to filter an RDD that is a list of strings:

In [74]: sc.textFile("*.txt").collect()
Out[74]:
['laber\tblubber',
 'foo\tbar',
 'dummy\tdumdum',
 'col1\tcol2\tcol3\tcol4\tcol5',
 ' 1\t2\t3\t4\t5',
 ' 11\t22\t33\t44\t44',
 ' 9\t8\t7\t6\t5',
 'laber\tblubber',
 'foo\tbar',
 'dummy\tdumdum',
 'col1\tcol2\tcol3\tcol4\tcol5',
 ' 99\t2\t3\t4\t5',
 ' 99\t22\t33\t44\t44',
 ' 99\t8\t7\t6\t5']

I would like to filter out any line that does not start with a space. I know this can be achieved with:

sc.textFile("*.txt").filter(lambda x: x[0] == " ")

However, I would like maximum performance, and in my understanding Python lambdas add overhead and cannot be optimized well by the query planner.

How can I use Spark native functions on an RDD?

I am expecting something like this:

sc.textFile("*.txt").filter("substr(_, 0, 1) == ' '")
  • Native Spark SQL functions work on dataframes. Can you use dataframes instead of RDDs? Commented May 21, 2021 at 8:37
  • I could. What is the performance impact of round-tripping rdd -> df -> rdd? (I need an RDD to feed back into the CSV parser, which I am not keen on implementing myself with regex and such, for reasons of performance and edge cases.) Commented May 21, 2021 at 8:40
  • Not sure if it gives better performance, but you can try spark.createDataFrame(sc.textFile("*.txt").map(Row)).filter("substring(_1, 1, 1) == ' '").rdd.map(lambda x: x[0]) Commented May 21, 2021 at 8:50
  • Can you post the original input file? Looking at your output, I believe we should be able to read this file into a dataframe as well using "spark.read.csv" with a "\t" delimiter; once we have it in a dataframe we should be able to filter using native Spark functions. Commented May 22, 2021 at 16:05

1 Answer


You can use Spark SQL functions, like:

df = spark.sql("""
SELECT value FROM text.`./`
WHERE value LIKE ' %'
""")

(The text data source exposes each line as a single string column named value.)

I have never loaded a text file like this (mostly Parquet, JSON, or CSV), but I believe it should also work. Have a look at this Spark SQL docs entry.


4 Comments

The question is about RDDs
Yes, but it is also about native functions optimized for performance. I believe SQL functions are the way to go.
Obviously the best would be to see a benchmark of the different methods
RDDs are old hat
