PySpark won't convert timestamp

Question

I have a very simple CSV, call it test.csv

name,timestamp,action
A,2012-10-12 00:30:00.0000000,1
B,2012-10-12 01:00:00.0000000,2 
C,2012-10-12 01:30:00.0000000,2 
D,2012-10-12 02:00:00.0000000,3 
E,2012-10-12 02:30:00.0000000,1

I'm trying to read it using PySpark and add a new column indicating the month.

First I read in the data, and everything looks OK.

df = spark.read.csv('test.csv', inferSchema=True, header=True)
df.printSchema()
df.show()

Output:

root
 |-- name: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- action: double (nullable = true)

+----+-------------------+------+
|name|          timestamp|action|
+----+-------------------+------+
|   A|2012-10-12 00:30:00|   1.0|
|   B|2012-10-12 01:00:00|   2.0|
|   C|2012-10-12 01:30:00|   2.0|
|   D|2012-10-12 02:00:00|   3.0|
|   E|2012-10-12 02:30:00|   1.0|
+----+-------------------+------+

But when I try to add my column, the formatting option doesn't seem to do anything.

from pyspark.sql.functions import col, to_date
df.withColumn('month', to_date(col('timestamp'), format='MMM')).show()

Output:

+----+-------------------+------+----------+
|name|          timestamp|action|     month|
+----+-------------------+------+----------+
|   A|2012-10-12 00:30:00|   1.0|2012-10-12|
|   B|2012-10-12 01:00:00|   2.0|2012-10-12|
|   C|2012-10-12 01:30:00|   2.0|2012-10-12|
|   D|2012-10-12 02:00:00|   3.0|2012-10-12|
|   E|2012-10-12 02:30:00|   1.0|2012-10-12|
+----+-------------------+------+----------+

What's going on here?

Yes. According to the documentation on the Oracle page, MMM should accomplish that, but no format I've tried has any effect. docs.oracle.com/javase/tutorial/i18n/format/…
– Mark Dunne, Commented Dec 9, 2017 at 15:59
There is a built-in function called month: spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/…
– Anahcolus, Commented Dec 9, 2017 at 16:10
@RameshMaharjan That's very useful, I didn't know functions like that existed! However, you'll appreciate that this was a simplified example, and I would still like to get custom formatting working, or understand why it doesn't work.
– Mark Dunne, Commented Dec 9, 2017 at 16:16
What you are doing is a column-based transformation. The link above also lists a to_date function, but that one doesn't take a format parameter, so it's not working for you. I guess what you are looking for is a UDF.
– Anahcolus, Commented Dec 9, 2017 at 16:19

Alper t. Turker · Accepted Answer · 2017-12-09 16:43:35Z

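to_date with a format is for parsing string columns: the format describes how the input string is laid out. Your column is already a timestamp, so there is nothing to parse, which is why the format appears to have no effect. To render a timestamp as a formatted string, what you need is date_format: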

from pyspark.sql.functions import col, date_format

df.withColumn('month', date_format(col('timestamp'), format='MMM')).show()

# +----+-------------------+------+-----+
# |name|          timestamp|action|month|
# +----+-------------------+------+-----+
# |   A|2012-10-12 00:30:00|   1.0|  Oct|
# |   B|2012-10-12 01:00:00|   2.0|  Oct|
# |   C|2012-10-12 01:30:00|   2.0|  Oct|
# |   D|2012-10-12 02:00:00|   3.0|  Oct|
# |   E|2012-10-12 02:30:00|   1.0|  Oct|
# +----+-------------------+------+-----+
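
If the parse-versus-format split is unfamiliar, a rough analogy in plain Python may help. This is only a sketch with the standard library's datetime module, not Spark code: strptime plays the role of to_date (the format describes the input), and strftime plays the role of date_format (the format describes the output). The variable names are made up for illustration, and the %-style directives here are Python's, not the Java-style patterns such as MMM that Spark uses.

```python
from datetime import datetime

# One value from the question's CSV, as Spark displays it.
raw = "2012-10-12 00:30:00"

# Parsing: string -> datetime. The format describes how the INPUT is laid out,
# which is the role the format argument plays in to_date.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Formatting: datetime -> string. The format describes the desired OUTPUT,
# which is the role the format argument plays in date_format.
month_abbrev = parsed.strftime("%b")

print(parsed.month)   # 10
print(month_abbrev)   # Oct (with Python's default C locale)
```

Handing an output pattern to the parsing step is the same mix-up as passing 'MMM' to to_date on a column that is already a timestamp.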
