Load a variable into a dataframe

Question

In PySpark, I am trying to load a dataframe from a string variable.

My variable is a multi-line string:

string_data = """
 Name|age|city
 david|23|London
 krish|24|Bali
 john|56|Goa
"""

I want to load this data into a DataFrame in PySpark. I thought of using Datasets, but they are not available in PySpark.

Using Pandas, I used to write like this:

from io import StringIO
import pandas as pd

string2 = StringIO(string_data)
df = pd.read_csv(string2, sep='|')

mck · Accepted Answer · 2021-02-01 16:16:02Z


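You can split the string on newline characters, parallelize the resulting list into an RDD of strings, and pass that RDD to spark.read.csv, which accepts an RDD of CSV rows as well as a file path.

sc = spark.sparkContext  # already defined in the pyspark shell
df = spark.read.csv(sc.parallelize(string_data.split('\n')), sep='|', header=True)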

df.show() 
+-----+---+------+
| Name|age|  city|
+-----+---+------+
|david| 23|London|
|krish| 24|  Bali|
| john| 56|   Goa|
+-----+---+------+
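
If the string has blank first or last lines, or stray spaces around each row as in the sample above, it may be safer to tidy the lines before parallelizing them. Below is a minimal sketch of that idea, not a definitive implementation. The helper names clean_lines and string_to_df are hypothetical (they are not part of PySpark), and string_to_df assumes you pass in an existing SparkSession.

```python
# Sample text from the question; the blank first/last lines and the
# leading spaces are kept on purpose to show what the cleanup does.
string_data = """
 Name|age|city
 david|23|London
 krish|24|Bali
 john|56|Goa
"""


def clean_lines(text):
    """Split text into lines, trim surrounding whitespace, drop empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def string_to_df(spark, text, sep="|"):
    """Build a DataFrame from delimited text held in a Python string.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession.
    """
    rdd = spark.sparkContext.parallelize(clean_lines(text))
    # spark.read.csv accepts an RDD of strings in place of a file path.
    return spark.read.csv(rdd, sep=sep, header=True)


# Usage, assuming `spark` already exists (as it does in the pyspark shell):
# df = string_to_df(spark, string_data)
# df.show()
```

Note that every column is read as a string unless you also pass inferSchema=True or an explicit schema to spark.read.csv.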
