I have the following DataFrame:

file     size
abc1.txt  2.1 MB
abc2.txt  1.0 MB
abc3.txt  1.5 MB
abc4.txt  767.9 KB

When I plot this data with plt.plot(df['file'], df['size']), the KB and MB values are ordered incorrectly. How can I sort them so that the KB values come first, followed by the MB values?

767.9 KB  1.0 MB  1.5 MB  2.1 MB

2 Answers

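import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'file': [1,2,3,4], 'size': ['2.1 MB', '1.0 MB', '1.5 MB', '767.9 KB']})
# Multipliers to bytes (decimal units: 1 KB = 1000 bytes)
cv = {'': 1, 'KB': 1e3, 'MB': 1e6, 'GB': 1e9, 'TB': 1e12}
# Split "2.1 MB" into number and unit; a bare number is taken as bytes
df['size_bytes'] = df['size'].apply(lambda x: float(x.split()[0])*cv[x.split()[1]]
                                    if len(x.split())==2 else float(x))
fig, ax = plt.subplots()
ax.plot(df['file'], df['size_bytes'])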

And if you want the y axis in human readable form

def to_human_readable(size):
    power = 1000
    n = 0
    mem = {0 : '', 1: 'KB', 2: 'MB', 3: 'GB', 4: 'TB'}
    # Stop at the largest known unit so mem[n] cannot raise a KeyError
    while size >= power and n < max(mem):
        size /=  power
        n += 1
    return "{0} {1}".format(size, mem[n])

ax.set_yticklabels([to_human_readable(v) if v >= 0 else ' ' for v in  
                    ax.get_yticks(minor=False)])

(In digital storage, 1 KB = 1000 bytes.)
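
The plot uses the rows in their original order. To get the ordering the question asks for (767.9 KB, 1.0 MB, 1.5 MB, 2.1 MB), one option is to convert each string to bytes and sort on that value. Here is a minimal sketch in plain Python, assuming decimal units (1 KB = 1000 bytes); `parse_size` and `UNIT_FACTORS` are names made up for this illustration and are not part of pandas or matplotlib:

```python
# Hypothetical lookup table: multipliers to bytes, decimal units assumed.
UNIT_FACTORS = {'': 1, 'B': 1, 'KB': 1e3, 'MB': 1e6, 'GB': 1e9, 'TB': 1e12}

def parse_size(text):
    """Convert a string such as '767.9 KB' to a number of bytes."""
    parts = text.split()
    number = float(parts[0])
    # A bare number with no unit is treated as bytes.
    unit = parts[1].upper() if len(parts) == 2 else ''
    return number * UNIT_FACTORS[unit]

sizes = ['2.1 MB', '1.0 MB', '1.5 MB', '767.9 KB']
# Sort the original strings by their numeric value in bytes.
ordered = sorted(sizes, key=parse_size)
print(ordered)  # ['767.9 KB', '1.0 MB', '1.5 MB', '2.1 MB']
```

With the DataFrame, the same function could probably be applied as `df['size_bytes'] = df['size'].map(parse_size)`, followed by `df = df.sort_values('size_bytes')` before plotting.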

First, your sizes are being read as strings, so no ordering of them would be meaningful, and the spacing between the points is not representative either.

Also in general I'd say it's poor practice to have different units on the same axis. Better to convert to the same unit:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame([['abc1.txt',  '2.1 MB'],
                   ['abc2.txt',  '1.0 MB'],
                   ['abc3.txt',  '1.5 MB'],
                   ['abc4.txt',  '767.9 KB']], columns=["file", 'size'])

# This is a list comprehension that splits the number out of the string, converts it to a float, 
# and divides it by 1000 if the other part of the string is 'KB'.
df['size_float'] = [float(x[0])/1000 if x[1]=='KB' else float(x[0]) for x in df['size'].str.split()]
plt.plot(df['file'],df['size_float'])

2 Comments

Shouldn't I divide by 1024?
Lol, sure, if you want. :) It looks like your MB numbers are only accurate to one significant figure so it won't make much difference at the end of the day.
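
To put a rough number on that, here is a quick check of how far the choice of 1000 versus 1024 moves the one KB value in this data set. It is only arithmetic on the question's 767.9 KB figure, and the variable names are illustrative:

```python
size_kb = 767.9

mb_decimal = size_kb / 1000   # decimal convention: 1 MB = 1000 KB
mb_binary = size_kb / 1024    # binary convention: 1024 KB per MB

print(round(mb_decimal, 4))
print(round(mb_binary, 4))
print(round(mb_decimal - mb_binary, 4))
```

Either way the value stays well below 1.0 MB, so the ordering of the four files does not change.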
