3

I have a Pandas Dataframe created from a dictionary with the following code:

import pandas as pd
pd.set_option('max_colwidth', 150)

df = pd.DataFrame.from_dict(data, orient= 'index', columns = ['text'])
df

The output is as follows:

                                            text
./form/2003Q4/0001041379_2003-12-15.html    \n10-K\n1\ng86024e10vk.htm\nAFC ENTERPRISES\n\n\n\nAFC ENTERPRISES\n\n\n\nTable of Contents\n\n\n\n\n\n\n\nUNITED STATES SECURITIES AND EXCHANGE\n...
./form/2007Q2/0001303804_2007-04-17.html    \n10-K\n1\na07-6053_210k.htm\nANNUAL REPORT PURSUANT TO SECTION 13 AND 15(D)\n\n\n\n\n\n\n   \nUNITED\nSTATES\nSECURITIES AND EXCHANGE\nCOMMISSION...
./form/2007Q2/0001349848_2007-04-02.html    \n10-K\n1\nff060310k.txt\n\n UNITED STATES\n SECURITIES AND EXCHANGE COMMISSION\n ...
./form/2014Q1/0001141807_2014-03-31.html    \n10-K\n1\nf32414010k.htm\nFOR THE FISCAL YEAR ENDED DECEMBER 31, 2013\n\n\n\nf32414010k.htm\n\n\n\n\n\n\n\n\n\n\nUNITED STATES\nSECURITIES AND EX...
./form/2007Q2/0001341853_2007-04-02.html    \n10-K\n1\na07-9697_110k.htm\n10-K\n\n\n\n\n\n\n   \n \nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n \nFORM 10-K\n ...

I need to split the first column (the index) into three separate columns, Year & Qtr, CIK, Filing Data. So the values in these columns from the first row would be: 2003Q4, 0001041379, 2003-12-15.

I think that if this was in a proper column that I could do this using code similar to Example #2 found here:

https://www.geeksforgeeks.org/python-pandas-split-strings-into-two-list-columns-using-str-split/

However I am thrown by the fact that it is the index that I need to split, and not a named column.

Is there a way to separate the index or do I need to somehow save this as another column, and is this possible?

I'd appreciate any help. I am a newbie, so I don't always understand the more difficult solutions. Thanks in advance.

4
  • 2
    Use df['Data'] = df.index then reset the index. Then you just need to use str.split on data on each delimiter to get the data you need. Commented Jan 13, 2020 at 20:18
  • 1
    @EdekiOkoh Thank you for your answer. Such a simple answer, but I couldn't find that anywhere else! Can I credit you for that answer at all? Commented Jan 13, 2020 at 20:25
  • 2
    Many methods available to Series are also available to pandas Index objects: df.index.str.extract(r'.\/form\/(.*)\/(.*)_(.*).html') Commented Jan 13, 2020 at 20:29
  • No need. Glad I could help Commented Jan 13, 2020 at 20:58

1 Answer 1

1

The fact that the column is the index makes no difference when extracting components from it but you need to be careful when assigning those components back to the original dataframe.

# Extract the components from the index
# pandas allowed us to name the columns via named captured groups
pattern = r'(?P<Quarter>\d{4}Q\d)\/(?P<CIK>\d+)_(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})'
tmp = df.index.str.extract(pattern) \
        .assign(Date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']]))

# Since `df` and `tmp` are both dataframe, assignments between them
# will be based row label. We want them to align by position (i.e. 
# line 1 to line 1) so we have to convert the right hand side to
# numpy array
cols = ['Quarter', 'CIK', 'Date']
df[cols] =  tmp[cols].values
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.