
I am working with a dataset with the following Timestamp format: yyyy-MM-dd HH:mm:ss

When I output the data to csv the format changes to something like this: 2019-04-29T00:15:00.000Z

Is there any way to get it to the original format like: 2019-04-29 00:15:00

Do I need to convert that column to string and then push it to csv?

I am saving my file to csv like so:

df.coalesce(1).write.format("com.databricks.spark.csv"
                                       ).mode('overwrite'
                                             ).option("header", "true"
                                               ).save("date_fix.csv")

2 Answers


Alternative

Spark >= 2.0.0

Set option("timestampFormat", "yyyy-MM-dd HH:mm:ss") for format("csv"):

df.coalesce(1).write.format("csv"
                            ).mode('overwrite'
                            ).option("header", "true"
                            ).option("timestampFormat", "yyyy-MM-dd HH:mm:ss"
                            ).save("date_fix.csv")

As per the documentation:

timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.

That default is exactly why your output shows the T and .000Z: it is the ISO 8601 pattern the CSV writer falls back to when no timestampFormat is set.

Spark < 2.0.0

Set option("dateFormat", "yyyy-MM-dd HH:mm:ss") for format("com.databricks.spark.csv"):

df.coalesce(1).write.format("com.databricks.spark.csv"
                            ).mode('overwrite'
                            ).option("header", "true"
                            ).option("dateFormat", "yyyy-MM-dd HH:mm:ss"
                            ).save("date_fix.csv")

As per the documentation:

dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf()

Reference: the databricks/spark-csv README.


1 Comment

I didn't know this, this is awesome! Every day is a school day. Have an upvote! :)

Yes, that's correct. The easiest way to achieve this is with pyspark.sql.functions.date_format, for example:

import pyspark.sql.functions as f

# date_format returns a StringType column in the requested pattern;
# withColumn returns a new DataFrame, so capture the result
df = df.withColumn(
    "date_column_formatted",
    f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss")
)

More info about it here https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.date_format. Hope this helps!
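
Putting this together with the write step from the question, a minimal end-to-end sketch (the column name timestamp and the output path are carried over from the question; adjust to your schema) might look like:

import pyspark.sql.functions as f

# Replace the TimestampType column with a plain string in the desired pattern,
# so the CSV writer emits it verbatim instead of applying its own format
df_out = df.withColumn(
    "timestamp",
    f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss")
)

df_out.coalesce(1).write.format("csv") \
    .mode('overwrite') \
    .option("header", "true") \
    .save("date_fix.csv")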

3 Comments

So my column is already correctly formatted as Timestamp in pyspark. My issue is that when I land my file to csv, it tacks on the T and .000Z. That's what I need to get rid of when saving my file to csv.
I understand - the problem is going to remain however, because your column is likely TimestampType, which will always get converted to the ISO 8601 format. Maybe try just casting, rather than converting using date_format?
Ok makes sense, so cast my timestamp column to string and that should work?
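
For reference, a minimal sketch of the cast approach discussed in this comment thread (assuming the column is named timestamp, as in the question): casting TimestampType to string uses Spark's default rendering, which is yyyy-MM-dd HH:mm:ss for whole-second values.

import pyspark.sql.functions as f

# cast("string") renders a whole-second timestamp as yyyy-MM-dd HH:mm:ss,
# so the CSV writer sees an ordinary string and leaves it untouched
df = df.withColumn("timestamp", f.col("timestamp").cast("string"))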
