Using PySpark, I need to filter an RDD of strings:
In [74]: sc.textFile("*.txt").collect()
Out[74]:
['laber\tblubber',
 'foo\tbar',
 'dummy\tdumdum',
 'col1\tcol2\tcol3\tcol4\tcol5',
 ' 1\t2\t3\t4\t5',
 ' 11\t22\t33\t44\t44',
 ' 9\t8\t7\t6\t5',
 'laber\tblubber',
 'foo\tbar',
 'dummy\tdumdum',
 'col1\tcol2\tcol3\tcol4\tcol5',
 ' 99\t2\t3\t4\t5',
 ' 99\t22\t33\t44\t44',
 ' 99\t8\t7\t6\t5']
I would like to filter out any line that does not start with a space. This, I know, can be achieved with:
sc.textFile("*.txt").filter(lambda x: x[0] == " ")
However, I would like maximum performance, and in my understanding Python lambdas add overhead and cannot be optimized well by the query planner.
How can I use Spark native functions on an RDD?
I am expecting something like this:
sc.textFile("*.txt").filter("substr(_, 0, 1) == ' '")
    
from pyspark.sql import Row
spark.createDataFrame(sc.textFile("*.txt").map(Row)).filter("substring(_1, 1, 1) == ' '").rdd.map(lambda x: x[0])
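If the input does not have to go through sc.textFile at all, a shorter route is to read the files as a DataFrame directly and filter with a native column expression. A minimal sketch, assuming the goal is just an RDD of the raw lines that begin with a space:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# spark.read.text gives one string column named "value", one row per input line
lines = spark.read.text("*.txt")

# native column expression, no Python lambda in the filter itself
kept = lines.filter(col("value").startswith(" "))

# back to an RDD of plain strings (this last step is still a Python lambda)
rdd = kept.rdd.map(lambda r: r.value)

Note that converting back to an RDD leaves the optimized DataFrame execution path anyway, so the lambda-free filter pays off most if the downstream processing can also stay in the DataFrame API.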