In PySpark, I am trying to load a DataFrame from a string variable.

My variable is a multi-line string:

string_data = """
Name|age|city
david|23|London
krish|24|Bali
john|56|Goa
"""

I want to load this data into a PySpark DataFrame. I thought of using Datasets, but they are not available in PySpark.

With pandas, I would write:

from io import StringIO
import pandas as pd

string2 = StringIO(string_data)
df = pd.read_csv(string2, sep='|')

1 Answer

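You can split the string on newline characters, parallelize the resulting lines into an RDD, and pass that RDD to spark.read.csv, which accepts an RDD of strings as well as a file path. Here sc is the SparkContext (spark.sparkContext).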

df = spark.read.csv(sc.parallelize(string_data.split('\n')), sep='|', header=True)

df.show() 
+-----+---+------+
| Name|age|  city|
+-----+---+------+
|david| 23|London|
|krish| 24|  Bali|
| john| 56|   Goa|
+-----+---+------+
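
Note that the triple-quoted string starts and ends with a newline, so split('\n') yields an empty string at both ends of the list. If you would rather clean the lines yourself than rely on how the CSV reader treats blank lines, here is a minimal sketch. The helper names clean_lines and string_to_df are hypothetical, made up for illustration, and string_to_df assumes you already have an active SparkSession to pass in.

```python
string_data = """
Name|age|city
david|23|London
krish|24|Bali
john|56|Goa
"""


def clean_lines(text):
    """Split a multi-line string into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def string_to_df(spark, text, sep="|"):
    """Build a DataFrame from delimited text held in a string.

    `spark` is assumed to be an existing SparkSession (hypothetical helper,
    not part of PySpark itself).
    """
    # Distribute the cleaned lines as an RDD of strings.
    rdd = spark.sparkContext.parallelize(clean_lines(text))
    # spark.read.csv accepts an RDD of strings in place of a path.
    return spark.read.csv(rdd, sep=sep, header=True)


# The first cleaned line is the header row; the rest are data rows.
lines = clean_lines(string_data)
print(lines)
```

With a session available, string_to_df(spark, string_data).show() should print the same table as above. Stripping each line also guards against stray indentation when the string is defined inside an indented block of code.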