
I had the following code running serially to request a bunch of urls:

# takes 5 hours
data = []
for url in urls:
    data.append(request_url(url))
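
(request_url isn't defined in the question; for concreteness, a hypothetical version using the requests library might look like this:)

import requests

# hypothetical stand-in for the request_url helper used above,
# assuming the requests library
def request_url(url):
    response = requests.get(url, timeout=30)
    return response.text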

I then converted it to run in parallel using multiprocessing:

# takes 50 minutes
from multiprocessing import Pool

p = Pool(20)
data = p.map(request_url, urls)

If I wanted to improve the speed further, how would I spread this process across multiple servers? What would be a good method to do this?

There is as yet no "plain vanilla standard" for how best to distribute work among multiple servers. If you totally trust your servers and network, RabbitMQ may offer one approach; if you don't, something more robust such as MapReduce (see e.g. github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/…) or its popular implementation Hadoop may free you from substantial "housekeeping work". But it's important that you consider and express exactly what constraints you need to satisfy! Commented Jan 3, 2015 at 3:17
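
A minimal sketch of the queue-based approach the comment alludes to, assuming the pika client and a reachable RabbitMQ broker (the broker host and queue name below are placeholders, not details from the comment); each worker server would run a consumer like this:

import pika

def on_message(channel, method, properties, body):
    url = body.decode()
    result = request_url(url)   # same helper as in the question
    # ... store the result somewhere shared (database, object store, etc.)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("broker-host"))
channel = connection.channel()
channel.queue_declare(queue="urls", durable=True)
channel.basic_qos(prefetch_count=20)   # cap in-flight work per server
channel.basic_consume(queue="urls", on_message_callback=on_message)
channel.start_consuming()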

1 Answer


You could use pathos (and its sister package pyina) to help you figure out exactly how you want to distribute the code in parallel.

pathos provides a unified API for parallel processing across threading, multiprocessing, and sockets. The API provides Pool objects with blocking, non-blocking iterative, and asynchronous map and pipe methods. pyina extends this API to MPI and to schedulers like Torque and Slurm. You can generally nest these constructs, which gives you heterogeneous and hierarchical parallel distributed computing.
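
As a minimal sketch of that unified map API, using the urls and request_url from the question (exact module paths may differ slightly across pathos versions):

from pathos.pools import ProcessPool

p = ProcessPool(20)
data = p.map(request_url, urls)      # blocking map, as in multiprocessing
future = p.amap(request_url, urls)   # asynchronous map
data = future.get()                  # collect results when ready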

You shouldn't need to modify your code at all to use pathos (and pyina).
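
For instance, swapping multiprocessing's Pool for pathos' socket-based pool is roughly a drop-in change. This is a sketch, where the server addresses are placeholders pointing at hosts running ppserver workers (from pp/ppft):

from pathos.pools import ParallelPool

# distribute the same map over sockets; "server1:35000" and
# "server2:35000" are placeholder host:port addresses of running
# ppserver instances
p = ParallelPool(ncpus=20, servers=("server1:35000", "server2:35000"))
data = p.map(request_url, urls)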

There are a few examples of this on SO, including these: Python Multiprocessing with Distributed Cluster Using Pathos and Python Multiprocessing with Distributed Cluster

and in the examples directories of pathos, pyina, and mystic, found here: https://github.com/uqfoundation


1 Comment

I should also mention that IPython parallel and scoop are both coming along (catching up?) in their hierarchical parallel computing, and may soon provide methods equivalent to what I mention above. The examples I give use pathos.pp (parallel python), while IPython favors zmq. I've also used rpyc as an alternative, but didn't find it as lightweight as pp.
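
For comparison, a sketch of the roughly equivalent IPython parallel idiom (packaged as ipyparallel in newer releases), assuming a cluster already started with ipcluster start:

import ipyparallel as ipp

rc = ipp.Client()                  # connect to the running cluster
view = rc.load_balanced_view()     # spread tasks across the engines
# request_url must be importable on the engines for this to run
data = view.map_sync(request_url, urls)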
