I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.
I am using the Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. Before posting, I read the following pages and threads: https://pandas.pydata.org/pandas-docs/stable/advanced.html, the xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html. All of them explain how to subset a DataFrame using either a hierarchical index or hierarchical columns, but not both.
Here's the data.
import pandas as pd
import numpy as np
from numpy import nan as NA
# Hierarchical index on both rows and columns
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1','key2']
data.columns.names = ['state','color']
Here are my questions:
Question:1 I'd like to access key1 = a, key2 = 1, state = Title1 (column), and color = A (column).
After some trial and error, I found that this version works. I don't really know why; my hypothesis is that data.loc['a',1] returns an object that is still indexed by state and color, which is then subset step by step by each following .loc call:
data.loc['a',1].loc['Title1'].loc['A']
Is there a better way to do the subsetting above?
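For comparison, here is a minimal sketch of two single-call alternatives to the chained version. It assumes a recent pandas release (I have not checked it against 0.21.0), and the names by_label, row_mask, col_mask and by_mask are only illustrative. Note that ('Title1', 'A') matches two columns in this frame, so each selection returns two values rather than a scalar.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
data = pd.DataFrame(
    np.arange(36).reshape(6, 6),
    index=[['a'] * 2 + ['b'] + ['c'] + ['d'] * 2, [1, 2, 3, 1, 3, 1]],
    columns=[['Title1'] * 3 + ['Title2'] * 3, ['A'] * 2 + ['B'] * 2 + ['C'] * 2],
)
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']

# One tuple per axis: (key1, key2) for the rows, (state, color) for the columns.
by_label = data.loc[('a', 1), ('Title1', 'A')]

# The same selection written as explicit boolean masks on each level.
row_mask = (data.index.get_level_values('key1') == 'a') & (
    data.index.get_level_values('key2') == 1
)
col_mask = (data.columns.get_level_values('state') == 'Title1') & (
    data.columns.get_level_values('color') == 'A'
)
by_mask = data.loc[row_mask, col_mask]
```

The mask form is more verbose, but it does not depend on how pandas resolves tuple keys against a non-unique MultiIndex.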
Question:2 How do I subset the data after deleting the indices?
data_wo_index = data.reset_index()
I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.
I tried one step at a time, but even the first step (i.e. subsetting key1 = a) gave me an error:
data_wo_index[data_wo_index['key1']=='a']
Exception: cannot handle a non-unique multi-index!
I don't know why Pandas still thinks there is a MultiIndex. I have already reset the index.
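One possible explanation, offered as a guess rather than a confirmed diagnosis: reset_index() only moves the row levels into columns, so the columns are still a two-level MultiIndex, and ('Title1', 'A') appears in it twice. The sketch below checks that and then flattens the column labels before filtering. The names flat, still_multi and subset and the '_' separator are my own choices for illustration, and it assumes a recent pandas release.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
data = pd.DataFrame(
    np.arange(36).reshape(6, 6),
    index=[['a'] * 2 + ['b'] + ['c'] + ['d'] * 2, [1, 2, 3, 1, 3, 1]],
    columns=[['Title1'] * 3 + ['Title2'] * 3, ['A'] * 2 + ['B'] * 2 + ['C'] * 2],
)
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']

flat = data.reset_index()

# The row index is gone, but the columns are still hierarchical.
still_multi = isinstance(flat.columns, pd.MultiIndex)

# Join the two column levels into one string, skipping the empty second
# level that reset_index() adds for key1 and key2.
flat.columns = ['_'.join(str(part) for part in col if part != '')
                for col in flat.columns]

# With single-level column labels this is an ordinary boolean filter.
subset = flat[flat['key1'] == 'a']
```

The flattened labels still contain duplicates ('Title1_A' occurs twice), which is fine for row filtering but would need renaming before selecting those columns by name.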
Question:3 If I run data.columns, I get the following output:
MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
           names=['state', 'color'])
It seems to me that the column names are also an index. I say this because the output is a MultiIndex object, which is the same class I see when I run data.index:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
           names=['key1', 'key2'])
I am unsure why the column names are also an object of the MultiIndex class. If they are, why do we need to set aside a few columns (e.g. key1 and key2 in the example above) as indices? In other words, why can't we just use column-based indices? (As a comparison, in data.table in R, we can setkey on whatever columns we want.)
Question:4 Why are column names an object of the MultiIndex class? It would be great if someone could offer a theoretical treatment of this.
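To make the observation behind Questions 3 and 4 concrete, here is a small sketch showing that both axes of the frame carry the same kind of label object, and that transposing simply swaps them. It does not explain the design rationale; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
data = pd.DataFrame(
    np.arange(36).reshape(6, 6),
    index=[['a'] * 2 + ['b'] + ['c'] + ['d'] * 2, [1, 2, 3, 1, 3, 1]],
    columns=[['Title1'] * 3 + ['Title2'] * 3, ['A'] * 2 + ['B'] * 2 + ['C'] * 2],
)
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']

# Both axes are labelled by a MultiIndex.
rows_are_multi = isinstance(data.index, pd.MultiIndex)
cols_are_multi = isinstance(data.columns, pd.MultiIndex)

# Transposing swaps the two axes, labels and level names included.
transposed = data.T
swapped = (transposed.index.equals(data.columns)
           and transposed.columns.equals(data.index))
```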
As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead-end.