
I have a file that I want to process in Python. Each line in this file is a path to an image, and I would like to call a feature extraction algorithm on each image.

I would like to divide the file into smaller chunks, and each chunk will be processed in a separate, parallel process. What are the good state-of-the-art libraries or solutions for this kind of multiprocessing in Python?

  • docs.python.org/2/library/multiprocessing.html Commented Oct 29, 2014 at 17:47
  • Python has a Global Interpreter Lock (wiki.python.org/moin/GlobalInterpreterLock), which means threading is only useful when each thread spends time waiting (such as for a response from a server) ...to actively do work in parallel you need multiprocessing Commented Oct 29, 2014 at 17:49
  • Thanks Anentropic, I will check your links. I guess I need to divide the data (the file) explicitly and pass each chunk as an argument to a function then. Commented Oct 29, 2014 at 17:52
  • @Anentropic: that is not true if the computation functions can release the GIL (e.g., functions from the numpy, lxml, and regex modules can release the GIL and run in parallel without multiple processes). Here's a code example (ctypes releases the GIL before calling C functions). Commented Oct 29, 2014 at 18:06
  • true, just as a general guideline you need to be aware of the limitations of the GIL though Commented Oct 29, 2014 at 18:11
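The explicit chunking mentioned in the comments can be sketched as follows. Note that Pool's chunksize argument (used in the answer below) does this splitting automatically; chunks() here is a hypothetical helper for illustration, not part of any library:

```python
def chunks(seq, size):
    # yield successive size-length slices of a list
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# example: split ten image paths into chunks of three
paths = ['img%d.png' % i for i in range(10)]
parts = list(chunks(paths, 3))
# parts[0] is ['img0.png', 'img1.png', 'img2.png']; the last chunk holds the remainder
```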

1 Answer


Your description suggests that a simple thread (or process) pool would work:

#!/usr/bin/env python
from multiprocessing.dummy import Pool  # thread pool
from tqdm import tqdm  # $ pip install tqdm # simple progress report

def mp_process_image(filename):
    try:
        return filename, process_image(filename), None
    except Exception as e:
        return filename, None, str(e)

def main():
    # consider every non-blank line in the input file to be an image path
    image_paths = (line.strip()
                   for line in open('image_paths.txt') if line.strip())
    pool = Pool()  # number of threads equal to number of CPUs
    it = pool.imap_unordered(mp_process_image, image_paths, chunksize=100)
    for filename, result, error in tqdm(it):
        if error is not None:
            print(filename, error)

if __name__ == "__main__":
    main()

I assume that process_image() is CPU-bound and that it releases the GIL, i.e., it does the main job in a C extension such as OpenCV. If process_image() doesn't release the GIL, then remove the word .dummy from the Pool import to use processes instead of threads.
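A minimal process-based variant might look like the following sketch. Here process_image() is a trivial stand-in that just returns the path length; a real feature extractor would be CPU-bound:

```python
from multiprocessing import Pool  # process pool: no .dummy, so the GIL doesn't matter

def process_image(filename):
    # stand-in for a real CPU-bound feature extractor (assumption for this sketch)
    return len(filename)

def mp_process_image(filename):
    # same wrapper idea as the answer: never let one bad image kill a worker
    try:
        return filename, process_image(filename), None
    except Exception as e:
        return filename, None, str(e)

if __name__ == '__main__':
    paths = ['a.png', 'bb.png', 'ccc.png']
    with Pool() as pool:  # one worker process per CPU by default
        for filename, result, error in pool.imap_unordered(mp_process_image, paths):
            if error is not None:
                print(filename, error)
```

The `if __name__ == '__main__':` guard is required with process pools on platforms that spawn workers by importing the main module (e.g. Windows).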


2 Comments

Thanks Sebastian for the elegant solution. My last question is: how can I collect the results in order (the result of the first chunk, then the 2nd chunk, etc.)?
@Rami: If you need the results in order then use imap() instead of imap_unordered().
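The difference can be sketched like this; mp_process_image is reduced to a trivial stand-in so that the ordering is the only thing demonstrated:

```python
from multiprocessing.dummy import Pool  # thread pool, as in the answer above

def mp_process_image(filename):
    # trivial stand-in for the answer's wrapper function
    return filename, len(filename), None

paths = ['first.png', 'second.png', 'third.png']
pool = Pool(2)
# imap() yields results in input order, even if later items finish first;
# imap_unordered() yields them in completion order instead
ordered = [name for name, result, error in pool.imap(mp_process_image, paths)]
pool.close()
pool.join()
# ordered == ['first.png', 'second.png', 'third.png']
```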
