1

I have a dataframe (the sample looks like this)

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,M       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots S                              S        330
Variation      2.7      Boots M                              M        330
Variable       3        Helmet           Helmet Sizes      E42,E41
Variation      3.8      Helmet E42                          E42       89
Variation      3.2      Helmet E41                          E41       89

What I want to do is sort the values based on Size so the final data frame should look like this:

  Type          SKU      Description   FullDescription        Size      Price
    Variable       2        Boots          Shoes on sale       S,M,XL        
    Variation      2.6      Boots S                             S       330
    Variation      2.7      Boots M                             M        330
    Variation      2.5      Boots XL                            XL        330
    Variable       3        Boots           Helmet Sizes       E41,E42
    Variation      3.2      Helmet E41                          E41       89
    Variation      3.8      Helmet E42                          E42       89

I am able to successfully get the results using this code

sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'Variable').cumsum()) 
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df

The issue is that the given code dosen't work on dataframe

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,3XL       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots 3XL                            3XL        330
Variation      2.7      Boots S                              S        330
Variable       3        Helmet           Helmet Sizes      S19, S9
Variation      3.8      Helmet E42                          S19       89
Variation      3.2      Helmet E41                          S9       89

it gives the results 'S,3XL,XL' and 'S19,S9' whereas I want the results as

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       S,XL,3XL       
Variation      2.7      Boots S                             S          330
Variation      2.5      Boots XL                            XL        330
Variation      2.6      Boots 3XL                           3XL        330
Variable       3        Helmet           Helmet Sizes      S9,S19
Variation      3.2      Helmet E41                          S9        89
Variation      3.8      Helmet E42                          S19       89

also in case of more sizes, the order should be 'XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL' and in case of second example, 'S9,S19,M9,M19,L9 and so on'

This is what I have done so far but it's not working and showing the wrong order

sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'variable').cumsum())
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)

1 Answer 1

2

Step 1: recreate the data

import pandas as pd

#----------------------#
# Recreate the dataset #
#----------------------#
# raw input data_1 = """ Variable|2|Boots|Shoes on sale|XL,S,M|  
                         Variation|2.5|Boots XL||XL|330 Variation|2.6|Boots S||S|330 
                         Variation|2.7|Boots M||M|330 Variable|3|Helmet|Helmet Sizes|E42,E41| 
                         Variation|3.8|Helmet E42||E42|89 
                         Variation|3.2|Helmet E41||E41|89"""

data_2 = """ Variable|2|Boots|Shoes on sale|XL,S,3XL| 
             Variation|2.5|Boots XL||XL|330 
             Variation|2.6|Boots 3XL||3XL|330 
             Variation|2.7|Boots S||S|330 
             Variable|3|Helmet|Helmet Sizes|S19, S9| 
             Variation|3.8|Helmet E42||S19|89 
             Variation|3.2|Helmet E41||S9|89"""

# Construct 1 data set
data = 'Type|SKU|Description|FullDescription|Size|Price'
data += data_2 # this can also be data_1  or data_1 + data_2

# pre-process: split the lines and values into a list of lists.
data = [row.split('|') for row in data.split('\n')]

#-------------#
# create a df #
#-------------#
df = pd.DataFrame(data[1:], columns=data[0]) df

Temp Result

Type    SKU     Description     FullDescription          Size   Price
0   Variable    2               Boots   Shoes on sale   XL,S,3XL    
1   Variation   2.5             Boots XL                XL          330
2   Variation   2.6             Boots 3XL               3XL         330
3   Variation   2.7             Boots S                 S           330
4   Variable    3               Helmet  Helmet Sizes    S19, S9     
5   Variation   3.8             Helmet E42              S19         89
6   Variation   3.2             Helmet E41              S9          89

Step 2: Create a priority dict

I'm not really into fashion + I'm also a guy --> (I'm only familiar with S M L XL)
But feel free to re order them or add extra sizes into the list

# Prioritize the sizes
# ps, i don't know the order :) 
priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
priority_dict

Temp Result

{'XXS': 0,
 'XS': 1,
 'S': 2,
 'M': 3,
 'L': 4,
 'XL': 5,
 'XXL': 6,
 '3XL': 7,
 '4XL': 8,
 '5XL': 9,
 'E41': 10,
 'E42': 11,
 'S9': 12,
 'S19': 13}

Step 3: Create a list of tuples from the string of sizes

# Split the string  "SIZE" into a list    "XL,S,M" --> ["XL", "S", "M"]
# And, add the value from our priority dict to it  --> [(5, "XL"), (2, "S"), (3, "M")]
# Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip())  for size in sizes.split(',')]) for sizes in df.Size]
df

Temp Result

Type    SKU     Description     FullDescription          Size       Price  TMP_SIZE
0   Variable    2               Boots   Shoes on sale   XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]
1   Variation   2.5             Boots XL                XL          330    [(5, XL)]
2   Variation   2.6             Boots 3XL               3XL         330    [(7, 3XL)]
3   Variation   2.7             Boots S                 S           330    [(2, S)]
4   Variable    3               Helmet  Helmet Sizes    S19, S9            [(12, S9), (13, S19)]
5   Variation   3.8             Helmet E42              S19         89     [(13, S19)]
6   Variation   3.2             Helmet E41              S9          89     [(12, S9)]

Step 4: Clean TEMP_SIZE

# Create a new SIZE
# loop over the TMPS_SIZE and create a string from the second value of the tuplelist --> ', '.join( my_list )

df['NEW_SIZE'] = [', '.join([ size[1]for size in sizes ]) for sizes in df["TMP_SIZE"] ]

Temp Result

Type    SKU     Description     ...     Size        Price  TMP_SIZE                       NEW_SIZE
0   Variable    2               ...     XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]  S, XL, 3XL
1   Variation   2.5             ...     XL          330    [(5, XL)]                    XL
2   Variation   2.6             ...     3XL         330    [(7, 3XL)]                   3XL
3   Variation   2.7             ...     S           330    [(2, S)]                     S
4   Variable    3               ...     S19, S9            [(12, S9), (13, S19)]        S9, S19
5   Variation   3.8             ...     S19         89     [(13, S19)]                  S19
6   Variation   3.2             ...     S9          89     [(12, S9)]                   S9

Step 5: grp

add your grp

#grp
df['grp']= (df['Type'] == 'Variable').cumsum()
df

Step 6: sort

in the last step, you can sort everything (I think that you need to sort the TMP_SIZE separately)

# sort the dataset
df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
df.sort_values(by=['grp', 'Type'])
Sign up to request clarification or add additional context in comments.

8 Comments

hI, thanks for the detailed answer, it works perfectly well for the given dataset but doesn't work on original dataset that contains a lot more variance in type of sizeslike EC 40-EC42 and Ladies XS, Ladies XL. so it's throwing me 'TypeError: '<' not supported between instances of 'int' and 'NoneType'' . any way around?please.
I think that it is because of the fact, that EC 40-EC42 is not in your priority dictionary --> { s.strip() for sizes in df.Size for s in [ size for size in sizes.split(',')]} - set(priority_dict) <-- and check if that list is empty? (this means that all the values are in the priority_dict (+ if je doe dict.get( key, 'standard_value') you can always give an standard value with it, if you don't, then that value is None)
+ --> the string " EC 40-EC42 " is that 1 size? or 2 sizes? in that case, you should split with an regex (thus nu only on ',' but also on '-' ) and clean your data further
The sizes are this way '''EC 38-42,EC 43-46', 'EC 38-42', 'EC 43-46', 'EC 46,EC 48', 'EC 38,EC 40', 'EC 39-41,EC 42-XL,EC 45-47,EC 48-51' so it is considered as one value in a column e.g EC 38-42 is in one row and EC 43-46 in other row and so on
@hyeri --> doe you've tried to check this? : { s.strip() for sizes in df.Size for s in [ size for size in sizes.split(',')]} - set(priority_dict) and see if it is empty?
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.