I am trying to read a pipe-delimited text file into a PySpark DataFrame with separate columns, but I am unable to do so when I specify the format as 'text'. It works fine when I give the format as csv.
This is the code I think is correct, since it is a text file, but all the columns end up in a single column.
df = spark.read.format('text').options(header=True).options(sep='|').load("path\\test.txt")
df.show()
+--------------------+
|               value|
+--------------------+
|Name|Color|Size|O...|
|Rabbit|Brown|7|Wa...|
| Horse|Green|28|Dock|
|  Pig|Orange|17|Port|
|Cow|Blue|23|Wareh...|
|  Bird|Yellow|2|Dock|
|   Dog|Brown|10|Port|
|Carrot Man|Orange...|
+--------------------+
This piece of code works correctly and splits the data into separate columns, but I have to give the format as csv even though the file is actually .txt.
df = spark.read.format('csv').options(header=True).options(sep='|').load("path\\test.txt")
df.show()
+----------+------+----+---------+
|      Name| Color|Size|   Origin|
+----------+------+----+---------+
|    Rabbit| Brown|   7|Warehouse|
|     Horse| Green|  28|     Dock|
|       Pig|Orange|  17|     Port|
|       Cow|  Blue|  23|Warehouse|
|      Bird|Yellow|   2|     Dock|
|       Dog| Brown|  10|     Port|
|Carrot Man|Orange|  22|Warehouse|
+----------+------+----+---------+
sep is not a valid option for text. spark.apache.org/docs/latest/sql-data-sources-text.html