Pandas multiindex from series of dataframes

Question

I have a series of dataframes with identical structure that represent results of a simulation for each hour of the year. Each simulation contains results for a series of coordinates (x,y).

Each dataframe is imported from a csv file that has time information only in the file name. Example:

results_YYMMDDHH.csv

contains data such

   x   y         a         b
 0.0 0.0  0.318705 -0.871259
 0.1 0.0 -0.937012  0.704270
 0.1 0.1 -0.032225 -1.939544
 0.0 0.1 -1.874781 -0.033073

I would like to create a single MultiIndexed Dataframe (level 0 is time and level 1 is (x,y)) that would allow me to perform various operations like averages, sums, max, etc. between these dataframes using the resampling or groupby methods. For each time step

The resulting dataframe should look something like this

                       x   y         a         b
2010-01-01 10:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.934512  0.745270
                     0.1 0.1 -0.0334525 -1.963544
                     0.0 0.1 -1.835781 -0.067573

2010-01-01 11:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.923012  0.745670
                     0.1 0.1 -0.035225 -1.963544
                     0.0 0.1 -1.835781 -0.067573
.................
.................
2010-12-01 10:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.923012  0.723270
                     0.1 0.1 -0.034225 -1.963234
                     0.0 0.1 -1.835781 -0.067233

You can imagine this for each hour of the year. I would like now to be able to calculate for example the average for the whole year or the average for June. Also any other function like the number of hours above a certain threshold or between a min and a max value. Please bear in mind that the result should be in any of these operations a DataFrame. For example the monthly averages should look like

              x   y     a     b
2010-01     0.0 0.0  0.45 -0.13
2010-02     0.1 0.0  0.55 -0.87
2010-03     0.1 0.1  0.24 -0.83
2010-04     0.0 0.1  0.11 -0.87

How do I build this MultiIndexed dataframe? I picture this like a timeseries of dataframes.

why do you think this needs to be multiindexed? it seems like all you need is a single index that is the timestamp of the file. your desired output, fwiw, is singly-indexed. — acushner
– acushner, Commented Feb 26, 2015 at 19:16
You have a point. Apologies, I have been looking at these data for so long that I kinda think other people know them! I have added further clarifications. — Rojj
– Rojj, Commented Feb 26, 2015 at 19:40

Brian from QuantRocket · Accepted Answer · 2015-02-24 19:39:56Z

I would make a Panel then convert it into a multiindexed DataFrame using to_frame():

In [29]: df1 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))

In [30]: df2 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))

In [31]: df1
Out[31]:
          a         b
0  0.318705 -0.871259
1 -0.937012  0.704270
2 -0.032225 -1.939544
3 -1.874781 -0.033073

In [32]: data = {datetime.datetime(2010,6,21,10,0,0): df1, datetime.datetime(2010,6,22,10,0,0): df2}

In [33]: p = pd.Panel(data)

In [34]: p.to_frame()
Out[34]:
             2010-06-21 10:00:00  2010-06-22 10:00:00
major minor
0     a                 0.318705             0.318705
      b                -0.871259            -0.871259
1     a                -0.937012            -0.937012
      b                 0.704270             0.704270
2     a                -0.032225            -0.032225
      b                -1.939544            -1.939544
3     a                -1.874781            -1.874781
      b                -0.033073            -0.033073

Depending on how you want to look at your data, you can use swapaxes to rearrange it:

In [35]: p.swapaxes("major", "items").to_frame()
Out[35]:
                                  0         1         2         3
major               minor
2010-06-21 10:00:00 a      0.318705 -0.937012 -0.032225 -1.874781
                    b     -0.871259  0.704270 -1.939544 -0.033073
2010-06-22 10:00:00 a      0.318705 -0.937012 -0.032225 -1.874781
                    b     -0.871259  0.704270 -1.939544 -0.033073

Thanks Brian, but it does not really look like my sample, does it?
It's not clear whether your example is the input you have or the output you want to get. It's helpful to give both and specify which is which.

Brian from QuantRocket · Accepted Answer · 2015-02-27 02:00:27Z

Here is a different answer from my earlier one, in light of the more fully explained question. Iterate through the files and read them into pandas, parse the date and add it to the dataframe, then use set_index to create your multiindex. Once you've got all your dataframes, use pd.concat to combine them:

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename)
    df["datetime"] = datetime.datetime.strptime(filename[8:18], "%Y%m%d%H")
    dataframes.append(df.set_index(["datetime","x", "y"]))

combined_df = pd.concat(dataframes)

Collectives™ on Stack Overflow

Pandas multiindex from series of dataframes

2 Answers 2

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Related