Pandas, get first and last column index for row value

Question

I have the following dataframe:

columns = pd.date_range(start="2022-05-21", end="2022-06-30")
data = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
]
df = pd.DataFrame(data, columns=columns)
 2022-05-21  2022-05-22  2022-05-23  ...  2022-06-28  2022-06-29  2022-06-30
0           0           0           0  ...           5           5           5
1           5           5           5  ...           1           1           1
2           5           5           5  ...           5           5           5

I have to take the first and last column index for every distinct value in the order they are. The correct output for this dataframe will be:

[
    [
        {'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'}, 
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
    ],
    [
        {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
        {'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}
    ],
    [
        {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
    ]
]

My best approach for the moment is:

series_set = df.apply(frozenset, axis=1)
container = []
for index in range(len(df.index)):
    row = df.iloc[[index]]
    values = series_set.iloc[[index]]

    inner_container = []
    for value in values[index]:
        single_value_series = row[row.columns[row.isin([value]).all()]]
        dates = single_value_series.columns
        result = dict(value=value, start=dates[0].strftime("%Y-%m-%d"), end=dates[-1].strftime("%Y-%m-%d"))
        inner_container.append(result)

    container.append(inner_container)

The result is:

[
    [
        {'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'}, 
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
    ],
    [
        {'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}, 
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}
    ],
    [
        {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
        {'value': 5, 'start': '2022-05-21', 'end': '2022-06-30'}
    ]
]

It has several problems, only the first array is correct :) When I convert dataframe to frozenset it is sorted and order is changed and also if some value appears more than once it is removed.

I will appreciate any idea and guidance. What I want to avoid is iterating the dataframe.

Thank you!

jezrael · Accepted Answer · 2022-03-18 10:18:57Z

You can first transpose DataFrame by DataFrame.T and then aggregate minimal and maximal index with convert values to strings by Series.dt.strftime, last convert to dictionaries by DataFrame.to_dict.

For get consecutive groups is compared shifted values with Series.cumsum.

df1 = df.T.reset_index()
L = [df1.groupby(df1[x].ne(df1[x].shift()).cumsum())
        .agg(value=(x, 'first'),
             start=('index', 'min'),
             end=('index', 'max'))
        .assign(start=lambda x: x['start'].dt.strftime('%Y-%m-%d'),
                end=lambda x: x['end'].dt.strftime('%Y-%m-%d'))
        .to_dict(orient='records') for x in df1.columns.drop('index')]
print (L)
[[{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'}, 
  {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
  {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}],
 [{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}, 
  {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
  {'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}],
 [{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}, 
  {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, 
  {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}]]

The last array should look like this: ` [ {'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}, {'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'}, {'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'} ] ` See how I have 5 on two places before and after 2. Thank you very much!

Collectives™ on Stack Overflow

Pandas, get first and last column index for row value

1 Answer 1

2 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Related