
Using pySpark, I need to filter an RDD that is a list of strings:

In [74]: sc.textFile("*.txt").collect()
Out[74]:
['laber\tblubber',
 'foo\tbar',
 'dummy\tdumdum',
 'col1\tcol2\tcol3\tcol4\tcol5',
 ' 1\t2\t3\t4\t5',
 ' 11\t22\t33\t44\t44',
 ' 9\t8\t7\t6\t5',
 'laber\tblubber',
 'foo\tbar',
 'dummy\tdumdum',
 'col1\tcol2\tcol3\tcol4\tcol5',
 ' 99\t2\t3\t4\t5',
 ' 99\t22\t33\t44\t44',
 ' 99\t8\t7\t6\t5']

I would like to filter out any line that does not start with a space. I know this can be achieved with:

sc.textFile("*.txt").filter(lambda x: x[0] == " ")

However, I would like maximum performance, and in my understanding Python lambdas add overhead and cannot be optimized well by the query planner.

How can I use Spark native functions on an RDD?

I am expecting something like this:

sc.textFile("*.txt").filter("substr(_, 0, 1) == ' '")
  • Native Spark SQL functions work on dataframes. Can you use dataframes instead of RDDs? Commented May 21, 2021 at 8:37
  • I could. What is the performance impact of round-tripping rdd -> df -> rdd? (I need an RDD to feed back into the CSV parser, which I am not keen on implementing myself with regex and such, for reasons of performance and edge cases.) Commented May 21, 2021 at 8:40
  • Not sure if it gives better performance, but you can try spark.createDataFrame(sc.textFile("*.txt").map(Row)).filter("substring(_1, 1, 1) == ' '").rdd.map(lambda x: x[0]) Commented May 21, 2021 at 8:50
  • Can you post the original input file? Looking at your output, I believe we should be able to read this file into a dataframe as well using "spark.read.csv" with a "\t" delimiter; once we have it in a dataframe we should be able to filter using native Spark functions. Commented May 22, 2021 at 16:05

1 Answer


You can use Spark SQL functions, like:

df = spark.sql("""
SELECT value FROM text.`./`
WHERE value LIKE ' %'
""")

(The text data source exposes each line as a single string column named value.)

I have never loaded a text file like this (mostly Parquet, JSON, or CSV), but I believe it should also work. Have a look at this Spark SQL docs entry.


4 Comments

The question is about RDDs
Yes, but it is also about native functions optimized for performance. I believe SQL functions are the way to go.
Obviously the best would be to see a benchmark of the different methods
RDDs are old hat
