pandas creating new table from two tables

Question

I have to join two tables and create a table with dates, but my code is way too long and I believe I did it the long way. Apparently the solution to this needed only 22 lines. Is there another, shorter way to approach this problem? Here is the question

Here is my code. Again, I believe it is too long and I think there is a shorter way to do this.

import numpy as np
import pandas as pd
import datetime

#YOUR CODE GOES HERE#

def get_month(i):
    """this function returns the number of the month based on stringinput"""
    if i == "January":
        return 1
    elif i == "February":
        return 2
    elif i == "March":
        return 3
    elif i == "April":
        return 4
    elif i == "May":
        return 5
    elif i == "June":
        return 6
    elif i == "July":
        return 7
    elif i == "August":
        return 8
    elif i == "September":
        return 9
    elif i == "October":
        return 10
    elif i == "November":
        return 11
    elif i == "December":
        return 12

def get_reformatted_date(s):
    """this function reformats a datetime object to the output we're looking for"""
    return s.strftime("%d-%b-%y")


month_names = []
tab1 = pd.read_csv("data1.csv")
tab2 = pd.read_csv("data2.csv")
tab1_tweets = tab1['Tweet'].tolist()[::-1]
tab2_tweets = tab2['Tweet'].tolist()[::-1]
tab1_months = tab1['Month'].tolist()[::-1]
tab2_months = tab2['Month'].tolist()[::-1]
tab1_days = tab1['Day'].tolist()[::-1]
tab2_days = tab2['Day'].tolist()[::-1]
tab1_years = tab1['Year'].tolist()[::-1]
tab2_years = tab2['Year'].tolist()[::-1]
all_dates = []
all_tweets = []
tab1_count = 0
tab2_count = 0
for i in range(len(tab1_tweets) + len(tab2_tweets)):
    if(tab1_count < len(tab1_years) and tab2_count < len(tab2_years)):
        t1_date = datetime.date(tab1_years[tab1_count], tab1_months[tab1_count], tab1_days[tab1_count])
        t2_date = datetime.date(tab2_years[tab2_count], get_month(tab2_months[tab2_count]), tab2_days[tab2_count])
        if t1_date > t2_date:
            all_dates.append(t1_date)
            all_tweets.append(tab1_tweets[tab1_count])
            tab1_count += 1
        else:
            all_dates.append(t2_date)
            all_tweets.append(tab2_tweets[tab2_count])
            tab2_count += 1
    elif(tab2_count < len(tab2_years)):
        t2_date = datetime.date(tab2_years[tab2_count], get_month(tab2_months[tab2_count]), tab2_days[tab2_count])
        all_dates.append(t2_date)
        all_tweets.append(tab2_tweets[tab2_count])
        tab2_count += 1
    else:
        t1_date = datetime.date(tab1_years[tab1_count], tab1_months[tab1_count], tab1_days[tab1_count])
        all_dates.append(t1_date)
        all_tweets.append(tab1_tweets[tab1_count])
        tab1_count += 1

table_data = {'Date': all_dates, 'Tweet': all_tweets}
df = pd.DataFrame(table_data)
df['Date'] = df['Date'].apply(get_reformatted_date)
print(df)

data1.csv is

Tweet                 Month Day  Year
Hello World             6    2    2013
I want ice-cream!       7    23   2013
Friends will be friends 9    30   2017
Done with school        12   12   2017

data2.csv is

Month   Day Year    Hour    Tweet
January 2   2015    12  Happy New Year
March   21  2016    7   Today is my final
May     30  2017    23  Summer is about to begin
July    15  2018    11  Ocean is still cold

We had a datetime.date() object that takes in dates, but I didn't know that it only stores the date rather than printing it, so I used s.strftime('%d-%b-%y') to print it out. I think there is a shorter way, since I took the table and turned every column into a list.
– clumbzy1, Commented May 21, 2018 at 0:13
It's tough to confirm that my answer is good since you don't share the CSV files that must be processed. Can you share them, or at least make the headers visible in the question?
– Psyche, Commented May 21, 2018 at 0:14
Please post data1.csv and data2.csv as text. Then people can cut and paste it to try to help you.
– sacuL, Commented May 21, 2018 at 0:14
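
On the store-versus-print point raised in the first comment: a datetime.date object only holds the date, and strftime() produces a formatted string from it whenever one is needed. A minimal sketch, assuming an English locale for the month names (the variable names here are illustrative, not from the question):

```python
import datetime

# A date object stores year, month and day; it has no fixed display format.
d = datetime.date(2018, 7, 15)

# strftime() builds a string in the requested layout without changing d.
formatted = d.strftime('%d-%b-%y')
print(formatted)

# A full month name can be turned into its number with strptime and '%B',
# which could stand in for a long if/elif chain (assumes English month names).
month_number = datetime.datetime.strptime('July', '%B').month
print(month_number)
```

Because the string is only built on demand, the dates can stay as date objects for sorting and comparison and be formatted once at the end.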

sacuL · Accepted Answer · 2018-05-21 01:01:43Z

I think that you can theoretically do this whole thing in one line:

finaldf = (pd.concat([pd.read_csv('data1.csv',
                            parse_dates={'Date':['Year', 'Month', 'Day']}),
                      pd.read_csv('data2.csv',
                            parse_dates={'Date':['Year', 'Month', 'Day']})
                      [['Date', 'Tweet']]])
            .sort_values('Date', ascending=False))

But for the sake of readability, it's better to split it into a few lines:

df1 = pd.read_csv('data1.csv', parse_dates={'Date':['Year', 'Month','Day']})
df2 = pd.read_csv('data2.csv', parse_dates={'Date':['Year', 'Month','Day']})

finaldf = (pd.concat([df1, df2[['Date', 'Tweet']]])
          .sort_values('Date', ascending=False))

For what you're trying to do, the main things to read up on are the parse_dates argument of pandas read_csv, and pd.concat for concatenating dataframes.

Edit: in order to get the dates in the correct format as you have in your example output, you can call this after the code above, using Series.dt.strftime():

finaldf['Date'] = finaldf['Date'].dt.strftime('%d-%b-%y')

Yes, it works, but I think the date formatting is wrong because we had to use pandas.datetime() and produce the date shown in the chart above. I thought that using datetime() printed the date, but I did not know it stored the value. So I think we had to make a new list so that it is able to store it?
See my edit, you can just add that last line. This isn't the only way to do things, and you could probably come up with a way using pd.to_datetime(), but I believe this is an efficient way to get what you are looking for!
I see!! Yeah, my way took forever because I had to turn everything into strings. Thank you.
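
The pd.to_datetime() route mentioned in the comments could look roughly like the sketch below. It is not the accepted answer's code: the two files are inlined as comma-separated text (an assumption, since the question shows them as aligned columns), and the month names in data2.csv are parsed with an explicit '%d %B %Y' format, which assumes English month names. As far as I know, recent pandas releases deprecate the dict form of parse_dates for combining columns, so building the dates after reading may also be the more future-proof option.

```python
import io
import pandas as pd

# Sample data from the question, written out as CSV text.
# The comma-separated layout is an assumption for this sketch.
DATA1 = """Tweet,Month,Day,Year
Hello World,6,2,2013
I want ice-cream!,7,23,2013
Friends will be friends,9,30,2017
Done with school,12,12,2017
"""

DATA2 = """Month,Day,Year,Hour,Tweet
January,2,2015,12,Happy New Year
March,21,2016,7,Today is my final
May,30,2017,23,Summer is about to begin
July,15,2018,11,Ocean is still cold
"""

df1 = pd.read_csv(io.StringIO(DATA1))
df2 = pd.read_csv(io.StringIO(DATA2))

# data1 has numeric year/month/day columns; to_datetime can assemble them
# from a frame whose columns are named year, month and day.
df1['Date'] = pd.to_datetime(
    df1[['Year', 'Month', 'Day']].rename(columns=str.lower))

# data2 spells the month out, so join the parts into one string
# and parse it with an explicit format.
date_text = (df2['Day'].astype(str) + ' ' + df2['Month'] + ' '
             + df2['Year'].astype(str))
df2['Date'] = pd.to_datetime(date_text, format='%d %B %Y')

# Stack the two tables, newest first, then format the dates for display.
finaldf = (pd.concat([df1[['Date', 'Tweet']], df2[['Date', 'Tweet']]],
                     ignore_index=True)
             .sort_values('Date', ascending=False)
             .reset_index(drop=True))
finaldf['Date'] = finaldf['Date'].dt.strftime('%d-%b-%y')
print(finaldf)
```

Sorting happens while the column still holds real datetimes; the strftime call comes last because the formatted strings would not sort chronologically.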
