Hi everyone,
I have a dataframe with 2 million unique student codes and two other columns: initial and final year. I need to create a new dataframe with only two columns (student code and year), with one row for each year the student remained enrolled. For instance, if the student with code 1234567 studied from 2013 to 2015, the new dataframe must have three rows, as shown below:
| COD | YEAR |
|-------- | ------ |
| 1234567 | 2013 |
| 1234567 | 2014 |
| 1234567 | 2015 |
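For concreteness, here is that expected output restated as a DataFrame (just the single example student from the table above; the column names are the ones I am using):

import pandas as pd

# expected output for student 1234567: one row per year from 2013 to 2015
expected = pd.DataFrame({
    'COD': [1234567, 1234567, 1234567],
    'YEAR': [2013, 2014, 2015],
})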
I have the following for loop working:
import pandas as pd
import numpy as np

# creating a sample df
df = pd.DataFrame({
    'COD': np.random.randint(100, 1000000, size=18),
    'YEAR_INCLUSION': [2017, 2018, 2020] * 6,
    'YEAR_END': [2019, 2020, 2021] * 6,
})

newdf = pd.DataFrame(columns=['COD', 'YEAR'])
for index, row in df.iterrows():
    for i in range(row['YEAR_INCLUSION'], row['YEAR_END'] + 1):
        # append one (COD, YEAR) row per year the student was enrolled
        newdf = pd.concat([newdf, pd.DataFrame.from_records([{'COD': row['COD'], 'YEAR': i}])])
The problem is time. Even after splitting the data into smaller dataframes, it takes too long: a 411,000-row chunk takes 16 to 20 hours.
I tried the same code with itertuples, but it was significantly slower, even though itertuples is generally known to be faster than iterrows:
newdf = pd.DataFrame(columns=['COD', 'YEAR'])
for row in df.itertuples():
    for i in range(row.YEAR_INCLUSION, row.YEAR_END + 1):
        # same per-row concat as before, just reading the namedtuple fields
        newdf = pd.concat([newdf, pd.DataFrame.from_records([{'COD': row.COD, 'YEAR': i}])])
I couldn't figure out how to use map or apply here, which supposedly would give much better results.
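For what it's worth, the closest I got was a rough sketch along these lines (this is just my guess at what an apply-based version might look like, combined with explode; I'm not sure it's the right approach or that it would actually be faster than the loops above):

# rough sketch: build the list of enrolled years per row with apply,
# then explode that list into one row per (COD, YEAR) pair
years = df.apply(lambda r: list(range(r['YEAR_INCLUSION'], r['YEAR_END'] + 1)), axis=1)
newdf = df.assign(YEAR=years).explode('YEAR')[['COD', 'YEAR']].reset_index(drop=True)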
Thanks in advance for the help!