
I have a very simple CSV, call it test.csv:

name,timestamp,action
A,2012-10-12 00:30:00.0000000,1
B,2012-10-12 01:00:00.0000000,2 
C,2012-10-12 01:30:00.0000000,2 
D,2012-10-12 02:00:00.0000000,3 
E,2012-10-12 02:30:00.0000000,1

I'm trying to read it using pyspark and add a new column indicating the month.

First I read in the data, and everything looks ok.

df = spark.read.csv('test.csv', inferSchema=True, header=True)
df.printSchema()
df.show()

Output:

root
 |-- name: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- action: double (nullable = true)

+----+-------------------+------+
|name|          timestamp|action|
+----+-------------------+------+
|   A|2012-10-12 00:30:00|   1.0|
|   B|2012-10-12 01:00:00|   2.0|
|   C|2012-10-12 01:30:00|   2.0|
|   D|2012-10-12 02:00:00|   3.0|
|   E|2012-10-12 02:30:00|   1.0|
+----+-------------------+------+

But when I try to add my column, the formatting option doesn't seem to do anything.

from pyspark.sql.functions import to_date, col
df.withColumn('month', to_date(col('timestamp'), format='MMM')).show()

Output:

+----+-------------------+------+----------+
|name|          timestamp|action|     month|
+----+-------------------+------+----------+
|   A|2012-10-12 00:30:00|   1.0|2012-10-12|
|   B|2012-10-12 01:00:00|   2.0|2012-10-12|
|   C|2012-10-12 01:30:00|   2.0|2012-10-12|
|   D|2012-10-12 02:00:00|   3.0|2012-10-12|
|   E|2012-10-12 02:30:00|   1.0|2012-10-12|
+----+-------------------+------+----------+

What's going on here?

  • What do you want to convert it to? A month? Commented Dec 9, 2017 at 15:55
  • Yes. According to the documentation on the Oracle page, MMM should accomplish that, but no format I've tried has any effect. docs.oracle.com/javase/tutorial/i18n/format/… Commented Dec 9, 2017 at 15:59
  • There is a built-in function called month: spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/… (see the sketch after these comments). Commented Dec 9, 2017 at 16:10
  • @RameshMaharjan That's very useful; I didn't know functions like that existed! However, you'll appreciate that this was a simplified example, and I would still like to get custom formatting working, or understand why it doesn't work. Commented Dec 9, 2017 at 16:16
  • What you are doing is a column-based transformation, and there is also a to_date function in the link above, which doesn't take a format parameter; thus it's not working for you. I guess what you are looking for is a UDF. Commented Dec 9, 2017 at 16:19
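
As a side note on the built-in month function mentioned in the comments, here is a minimal sketch (reusing df from the question) that returns the month as a number rather than a formatted name:

from pyspark.sql.functions import col, month

# month() extracts the month number (1-12) from a date/timestamp column
df.withColumn('month', month(col('timestamp'))).show()
# Every row in this example would get 10 in the new column.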

1 Answer


to_date with a format is used to parse string-type columns. What you need here is date_format:

from pyspark.sql.functions import col, date_format

df.withColumn('month', date_format(col('timestamp'), format='MMM')).show()

# +----+-------------------+------+-----+
# |name|          timestamp|action|month|
# +----+-------------------+------+-----+
# |   A|2012-10-12 00:30:00|   1.0|  Oct|
# |   B|2012-10-12 01:00:00|   2.0|  Oct|
# |   C|2012-10-12 01:30:00|   2.0|  Oct|
# |   D|2012-10-12 02:00:00|   3.0|  Oct|
# |   E|2012-10-12 02:30:00|   1.0|  Oct|
# +----+-------------------+------+-----+
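
For contrast, a minimal sketch of what to_date with a format string is actually for: parsing string columns. The column names ts_str and as_date below are hypothetical, used only to illustrate the round trip:

from pyspark.sql.functions import col, to_date

# Cast the timestamp to a string, then parse that string back into a date.
# to_date with a pattern only makes sense on string input like this.
df2 = (df.withColumn('ts_str', col('timestamp').cast('string'))
         .withColumn('as_date', to_date(col('ts_str'), 'yyyy-MM-dd HH:mm:ss')))
df2.select('ts_str', 'as_date').show()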

