
I am working with a dataset with the following Timestamp format: yyyy-MM-dd HH:mm:ss

When I output the data to csv the format changes to something like this: 2019-04-29T00:15:00.000Z

Is there any way to get it to the original format like: 2019-04-29 00:15:00

Do I need to convert that column to string and then push it to csv?

I am saving my file to csv like so:

df.coalesce(1).write.format("com.databricks.spark.csv"
                                       ).mode('overwrite'
                                             ).option("header", "true"
                                               ).save("date_fix.csv")

2 Answers


Alternative

Spark >= 2.0.0

Set option("timestampFormat", "yyyy-MM-dd HH:mm:ss") for format("csv"):

df.coalesce(1).write.format("csv"
                            ).mode('overwrite'
                            ).option("header", "true"
                            ).option("timestampFormat", "yyyy-MM-dd HH:mm:ss"
                            ).save("date_fix.csv")

As per the documentation:

timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.

That default is exactly why your output shows the T and .000Z: it is the ISO 8601 pattern the CSV writer falls back to when no timestampFormat is set.

Spark < 2.0.0

Set option("dateFormat", "yyyy-MM-dd HH:mm:ss") for format("com.databricks.spark.csv"):

df.coalesce(1).write.format("com.databricks.spark.csv"
                            ).mode('overwrite'
                            ).option("header", "true"
                            ).option("dateFormat", "yyyy-MM-dd HH:mm:ss"
                            ).save("date_fix.csv")

As per the documentation:

dateFormat: specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default, it is null which means trying to parse times and date by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf()

Reference: the databricks/spark-csv README.


1 Comment

I didn't know this, this is awesome! Every day is a school day. Have an upvote! :)

Yes, that's correct. The easiest way to achieve this is with pyspark.sql.functions.date_format, for example:

import pyspark.sql.functions as f

# date_format returns a StringType column in the requested pattern;
# withColumn returns a new DataFrame, so capture the result
df = df.withColumn(
    "date_column_formatted",
    f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss")
)

More info about it here https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.date_format. Hope this helps!
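
Putting this together with the write step from the question, a minimal end-to-end sketch (the column name timestamp and the output path are carried over from the question; adjust to your schema) might look like:

import pyspark.sql.functions as f

# Replace the TimestampType column with a plain string in the desired pattern,
# so the CSV writer emits it verbatim instead of applying its own format
df_out = df.withColumn(
    "timestamp",
    f.date_format(f.col("timestamp"), "yyyy-MM-dd HH:mm:ss")
)

df_out.coalesce(1).write.format("csv") \
    .mode('overwrite') \
    .option("header", "true") \
    .save("date_fix.csv")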

3 Comments

So my column is already correctly formatted as Timestamp in pyspark. My issue is that when I land my file to csv, it tacks on the T and .000Z. That's what I need to get rid of when saving my file to csv.
I understand - the problem is going to remain however, because your column is likely TimestampType, which will always get converted to the ISO 8601 format. Maybe try just casting, rather than converting using date_format?
Ok makes sense, so cast my timestamp column to string and that should work?
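
For reference, a minimal sketch of the cast approach discussed in this comment thread (assuming the column is named timestamp, as in the question): casting TimestampType to string uses Spark's default rendering, which is yyyy-MM-dd HH:mm:ss for whole-second values.

import pyspark.sql.functions as f

# cast("string") renders a whole-second timestamp as yyyy-MM-dd HH:mm:ss,
# so the CSV writer sees an ordinary string and leaves it untouched
df = df.withColumn("timestamp", f.col("timestamp").cast("string"))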
