I have a dataframe with columns code and images.

Column images is a string of urls joined by a comma: <URL>,<URL2>,...

Column code is NOT unique and I need to make it unique but store all images (from all variants) in a new column images_all.

For example:

code something images
1    x         url1,url2,url3
1    x         url1,url4

The result should be:

code something images_all
1    x         url1,url2,url3,url4

I did

grouped = csv.groupby('code')
csv = csv.drop_duplicates(subset=['code'], keep='last')
csv['images_all'] = csv.apply(lambda r:  list(set(
    [image for image in grouped.get_group(r['code'])['images']]
)))

which raises:

KeyError: 'code'

But even if it didn't raise this error, images_all wouldn't be [url1,url2,url3,url4]. Instead, it would be ["url1,url2,url3","url1,url4"].

Do you know how to fix it?

EDIT

I also want to keep the other columns. They are the same for all rows with the same code, which is why I just drop_duplicates and keep the last row.

1 Answer

Use GroupBy.transform with a custom function that splits each string on commas, flattens the pieces, converts them to a set to drop duplicates, and finally joins the unique values back together:

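# Split each comma-joined string, flatten, and deduplicate with a set
# (set order is arbitrary, so the URL order in images_all may vary)
f = lambda x: ','.join(set([z for y in x for z in y.split(',')]))
df['images_all'] = df.groupby('code')['images'].transform(f)
print(df)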
   code something          images           images_all
0     1         x  url1,url2,url3  url1,url3,url2,url4
1     1         x       url1,url4  url1,url3,url2,url4
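
To end up with one row per code and a stable URL order, something like the following may work. It is only a sketch under a few assumptions: the sample frame is rebuilt from the question's example, `join_unique` and `result` are names made up for illustration, and `dict.fromkeys` is swapped in for `set` so that URLs stay in first-seen order. As a side note, the `KeyError: 'code'` in the question most likely comes from `DataFrame.apply` defaulting to `axis=0`, which passes columns rather than rows to the lambda.

```python
import pandas as pd

# Sample data rebuilt from the question's example (assumed, for illustration).
df = pd.DataFrame({
    "code": [1, 1],
    "something": ["x", "x"],
    "images": ["url1,url2,url3", "url1,url4"],
})

def join_unique(urls):
    """Hypothetical helper: flatten comma-joined URL strings, keep first occurrences."""
    flat = (u for s in urls for u in s.split(","))
    # dict.fromkeys drops duplicates while preserving insertion order, unlike set.
    return ",".join(dict.fromkeys(flat))

# Broadcast the combined string to every row of its group.
df["images_all"] = df.groupby("code")["images"].transform(join_unique)

# Keep one row per code; the other columns are identical within a group.
result = (
    df.drop_duplicates(subset=["code"], keep="last")
      .drop(columns="images")
      .reset_index(drop=True)
)
print(result)
```

Because the other columns are the same for every row sharing a code, calling drop_duplicates after the transform keeps them without any extra aggregation.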