
So, I have a 3-dimensional list. For example:

A=[[[1,2,3],[4,5,6],[7,8,9]],...,[[2,4,1],[1,4,6],[1,2,4]]]

I want to process each 2-dimensional list in A independently, but they all go through the same processing. If I do it sequentially, I do:

for i in range(len(A)):
    A[i]=process(A[i])

But it takes a very long time. Could you tell me how to parallelize this with data parallelism in Python?

  • Parallelization requires threading. That requires a lot of work. Since lists are mutable, you'll likely have to create a copy for each individual thread (if I'm smart at programming). Copying/slicing the list will likely take more time than processing it in a single thread. Commented Oct 27, 2016 at 1:43
  • List of options here... wiki.python.org/moin/ParallelProcessing Commented Oct 27, 2016 at 1:45
  • @Zizouz212 So there's no way that's more efficient than processing it sequentially? Commented Oct 27, 2016 at 1:56
  • Threading in Python is actually not parallel at all and would slow things down, because the Global Interpreter Lock only executes one Python instruction at a time regardless of the number of threads. It's only suitable for doing something while waiting for I/O on another thread (see the sketch after these comments). True multiprocessing would work, but sharing the data requires locking and would also quite likely slow things down if there's a lot of writing to the array. Commented Oct 27, 2016 at 1:56
  • @cricket_007 Thanks, I'll try it. Commented Oct 27, 2016 at 1:56
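
A minimal sketch of that GIL point (not from the original thread; busy_work is a made-up stand-in for an expensive process(), and the pool sizes are arbitrary): a thread-backed pool runs CPU-bound work at roughly sequential speed, while a process pool actually runs it in parallel.

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

def busy_work(n):
    # Pure-Python loop: it holds the GIL, so threads can't run it in parallel
    total = 0
    for i in xrange(n):
        total += i * i
    return total

if __name__ == '__main__':
    jobs = [10 ** 6] * 8

    start = time.time()
    ThreadPool(4).map(busy_work, jobs)
    print 'threads:   %.2fs' % (time.time() - start)  # roughly sequential speed

    start = time.time()
    Pool(4).map(busy_work, jobs)
    print 'processes: %.2fs' % (time.time() - start)  # close to a 4x speedup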

1 Answer


If you have multiple cores and processing each 2-dimensional list is an expensive operation, you could use Pool from multiprocessing. Here's a short example that squares the numbers in separate processes:

import multiprocessing as mp

A = [[[1,2,3],[4,5,6],[7,8,9]],[[2,4,1],[1,4,6],[1,2,4]]]

def square(l):
    # Square every number in one 2-dimensional list
    return [[x * x for x in sub] for sub in l]

if __name__ == '__main__':
    # One worker process per CPU core; map splits A across them
    pool = mp.Pool(processes=mp.cpu_count())
    res = pool.map(square, A)
    pool.close()
    pool.join()

    print res

Output:

[[[1, 4, 9], [16, 25, 36], [49, 64, 81]], [[4, 16, 1], [1, 16, 36], [1, 4, 16]]]

Pool.map behaves like the built-in map while splitting the iterable across worker processes. It also takes an optional third parameter called chunksize that defines how big the chunks submitted to the workers are.
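
For example (a sketch, not part of the original answer; the values for A and chunksize are made up, and the best chunksize depends on your data), with many small 2-dimensional lists a larger chunksize cuts inter-process communication overhead:

import multiprocessing as mp

A = [[[i, i + 1], [i + 2, i + 3]] for i in xrange(1000)]  # many small 2-D lists

def square(l):
    return [[x * x for x in sub] for sub in l]

if __name__ == '__main__':
    pool = mp.Pool(processes=mp.cpu_count())
    # chunksize=50: each task message carries 50 sublists instead of 1
    res = pool.map(square, A, 50)
    pool.close()
    pool.join()
    print len(res)  # 1000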


7 Comments

Thank you very much! I'll try it later because my program has been running sequentially for 30 minutes now. So I don't have to split A into parts myself; Python will do the splitting?
@user7077941 No, you don't have to split A. You might want to set the third parameter, chunksize, though, depending on your data.
Wow, such a simple way. Thank you. ^^
Hi, I'm confused. I tried using your code, but somehow an AttributeError appeared on with mp.Pool(processes=mp.cpu_count()) as pool. What should I do?
@user7077941 I didn't notice the python-2.7 tag, so I was running it with 3.5; I've edited the answer to work on Python 2.7.
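
For the record (a detail from the Python docs, not from this thread): Pool only became usable as a with-statement context manager in Python 3.3, which is why the with form raises AttributeError on 2.7. A sketch of both idioms:

import multiprocessing as mp

A = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

def square(l):
    return [[x * x for x in sub] for sub in l]

if __name__ == '__main__':
    # Python 2.7: manage the pool's lifetime by hand
    pool = mp.Pool(processes=mp.cpu_count())
    try:
        res = pool.map(square, A)
    finally:
        pool.close()
        pool.join()
    print res

    # Python 3.3+ only (the line that failed on 2.7 above):
    # with mp.Pool(processes=mp.cpu_count()) as pool:
    #     res = pool.map(square, A)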
