1

If I had data such as:

data = [[3, 2014], [4, 2014], [6, 2013], [6,2013]] etc...

What is the best way to calculate the sum by year in Python?

4
  • Is data a list of lists or a NumPy array? Commented Mar 20, 2015 at 17:25
  • The data is a list of lists. But I would think the easiest way is to convert it to a NumPy array? Commented Mar 20, 2015 at 17:25
  • Are you wedded to using numpy? This is a simple groupby operation, and while you can do that in numpy it would take less time to do it in pandas than it took to write this sentence. Commented Mar 20, 2015 at 17:32
  • A numpy array would be easy if all the years had the same number of entries and they occurred in a regular pattern. Then you could slice and reshape to produce an array with one year per row. But if things are irregular, a default dictionary or groupby approach is better. Commented Mar 20, 2015 at 17:44
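
For the irregular case mentioned in the last comment, here is a minimal sketch of a groupby approach using only the standard library. It assumes the commenter's "groupby" can be read as itertools.groupby (it may equally have meant pandas), and the names by_year and sums are just illustrative.

```python
from itertools import groupby
from operator import itemgetter

data = [[3, 2014], [4, 2014], [6, 2013], [6, 2013]]

# itertools.groupby only merges adjacent items, so sort by year first.
by_year = sorted(data, key=itemgetter(1))

# Sum the values (first element of each row) within each year group.
sums = {year: sum(value for value, _ in rows)
        for year, rows in groupby(by_year, key=itemgetter(1))}

print(sums)  # {2013: 12, 2014: 7}
```

The sort makes this O(n log n), so for plain sums a dictionary loop is usually the simpler choice.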

5 Answers 5

3

I would use a dict if you need both the year and sum:

from collections import defaultdict

data = [[3, 2014], [4, 2014], [6, 2013], [6,2013]]
d = defaultdict(int)

for v, k in data:
    d[k] += v
print(d)

Prints:

defaultdict(<type 'int'>, {2013: 12, 2014: 7})

5 Comments

Thanks!! What about using NumPy? Would it be easier?
@Ben, I'm not sure it would be easier. If you want to be able to look up which year is associated with which sum, then a dict seems to be pretty much exactly what you want. Another option would be pandas, but I really think a dict is what you want.
All solutions work fine, but I just want to point out one thing: sometimes the simplest solution is the fastest. I'm sure speed is not an issue here, but I did a quick timeit on my machine for this solution, @Marcus's, and my own. For 100,000 iterations: this one 0.88s, Marcus's 6.9s, mine 0.21s.
@JulienSpronck, a defaultdict will be more efficient once you test it on something containing more than 4 elements. Try data = [choice(random.data) for _ in range(10000)]
@JulienSpronck, that should be random.choice(data)!
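
A rough way to try the comparison from these comments on a larger input might look like the sketch below. It assumes the two contenders are the defaultdict loop and a plain-dict loop; the function names, the seed, and the repeat count are made up for illustration, and timings will differ from machine to machine.

```python
import random
import timeit
from collections import defaultdict

data = [[3, 2014], [4, 2014], [6, 2013], [6, 2013]]

# Build a larger sample, as suggested in the comment above.
random.seed(0)  # arbitrary seed, only so runs are repeatable
big = [random.choice(data) for _ in range(10000)]

def with_defaultdict(rows):
    d = defaultdict(int)
    for v, k in rows:
        d[k] += v
    return dict(d)

def with_plain_dict(rows):
    d = {}
    for v, k in rows:
        if k not in d:
            d[k] = v
        else:
            d[k] += v
    return d

# Both approaches must agree before their timings mean anything.
assert with_defaultdict(big) == with_plain_dict(big)

for fn in (with_defaultdict, with_plain_dict):
    print(fn.__name__, timeit.timeit(lambda: fn(big), number=100))
```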
1

I'm not sure I understand the question, but here is a simple answer that needs no extra modules.

dic = {}

for dat, year in data:
    if year not in dic:
        dic[year] = dat
    else:
        dic[year] += dat

or if you prefer

dic = {}
for dat, year in data:
    dic[year] = dat if year not in dic else dic[year] + dat
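
Another spelling of the same idea, assuming the same data list as in the question, uses dict.get with a default of 0 so that no membership test is needed:

```python
data = [[3, 2014], [4, 2014], [6, 2013], [6, 2013]]

dic = {}
for dat, year in data:
    # get() returns 0 the first time a year is seen
    dic[year] = dic.get(year, 0) + dat

print(dic)  # {2014: 7, 2013: 12}
```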


1

As DSM suggested, this is easy with pandas and groupby:

import pandas as pd
data = [[3, 2014], [4, 2014], [6, 2013], [6,2013]]
df = pd.DataFrame(data, columns=['value', 'year'])
df.groupby(['year']).sum()

which returns:

      value
year       
2013     12
2014      7

This is nice because you can easily get more information, such as the mean, median, or standard deviation:

df.groupby(['year']).mean()
df.groupby(['year']).median() 
df.groupby(['year']).std() 
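
If pandas is not installed, similar per-year statistics can be sketched with only the standard library. This swaps in collections.defaultdict and the statistics module in place of groupby; it is not what pandas does internally, and the names groups and stats are just illustrative.

```python
from collections import defaultdict
from statistics import mean, median

data = [[3, 2014], [4, 2014], [6, 2013], [6, 2013]]

# Collect all values for each year instead of summing right away.
groups = defaultdict(list)
for value, year in data:
    groups[year].append(value)

# Compute several statistics per year from the collected values.
stats = {
    year: {"sum": sum(vals), "mean": mean(vals), "median": median(vals)}
    for year, vals in groups.items()
}

print(stats[2014])  # {'sum': 7, 'mean': 3.5, 'median': 3.5}
```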


1

There's a specific Python standard library class for that, Counter:

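from collections import Counter
from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

data = [[3, 2014], [4, 2014], [6, 2013], [6,2013]]

# Build one single-entry Counter per row, keyed by year, then add them all up
counters = [Counter({row[1]: row[0]}) for row in data]
result = reduce(add, counters)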

The result is a Counter, which behaves like a dict:

{2013: 12, 2014: 7}


0

You can use Counter() and +=.

import collections
data = [[3, 2014], [4, 2014], [6, 2013], [6,2013]]

c = collections.Counter()

for i, j in data:
    c += collections.Counter({j: i})

print(c)

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.

You can add Counters, for example:

a = collections.Counter(a=1, b=2)
b = collections.Counter(a=3, c=3)    
print(a+b)

prints Counter({'a': 4, 'c': 3, 'b': 2}).
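
One caveat worth knowing when summing with Counter: the + and += operators drop any key whose resulting count is zero or negative, whereas update() keeps them. That does not matter for the positive values in the question, but it would if the values could be negative. A small sketch (the negative numbers are made up for illustration):

```python
import collections

# Addition discards keys whose total is zero or negative.
added = collections.Counter({2014: 5}) + collections.Counter({2014: -5, 2013: -2})
print(added)  # Counter()

# update() adds counts in place and keeps every key.
updated = collections.Counter({2014: 5})
updated.update({2014: -5, 2013: -2})
print(updated[2014], updated[2013])  # 0 -2
```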
