
I am new to Python and have never tried multithreading. My objective is to read a set of files and extract some specific data from each one. I already have code that does the task correctly, but it takes a lot of time because a few of the files are very large.

import os

final_output = []
for file in os.listdir(file_path):
    # collect the useful lines from one file and append them to the overall list
    final_string = error_collector(file_path, file)
    final_output = final_output + final_string

The error_collector function reads each line of a file, extracts the useful information, and returns a list for that file, which I concatenate onto final_output so that all the information ends up in a single list.

What I want is some way to process the files in parallel instead of reading them one at a time.

Can someone please help?

5 Comments
  • As a note - threading when disk IO bound is going to be a significant code overhead for most likely very little gain (if any). Commented Aug 7, 2015 at 9:51
  • You might be better off building the list in one go instead of repeatedly resizing it, thus avoiding memory allocation overhead - something like: final_output = [x for file in os.listdir(file_path) for x in error_collector(file_path, file)] Commented Aug 7, 2015 at 9:54
  • @Jon, even then, if the final list ends up not being very long, this will not provide substantial improvement. As you pointed out, IO limitations are the bottleneck here, so there is little hope of improvement. How about an SSD? Commented Aug 7, 2015 at 9:58
  • Don't write at the same time; threading is not your solution! Maybe you need to use an SSD. Commented Aug 7, 2015 at 10:04
  • Can you confirm that the process is IO limited? If you can read the file in WITHOUT doing any analysis and then compare the read time to total processing time we may be able to provide better advice (see the sketch after this list). Commented Aug 7, 2015 at 10:06
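
A rough sketch of that check (file_path and error_collector are the names from the question); if the read-only pass already accounts for most of the total time, the job is I/O bound and parallelism will not help much:

import os
import time

# pass 1: raw reading only, no parsing
t0 = time.time()
for name in os.listdir(file_path):
    with open(os.path.join(file_path, name), "rb") as f:
        f.read()
read_time = time.time() - t0

# pass 2: the full pipeline from the question (read + parse)
t0 = time.time()
results = []
for name in os.listdir(file_path):
    results += error_collector(file_path, name)
total_time = time.time() - t0

print("read only: %.1fs, read + parse: %.1fs" % (read_time, total_time))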

2 Answers


Using mmap can improve the speed of reading files.

If the data that is to be read is relatively small compared to the total size of the file, doing this in combination with Pool.map is a good strategy.
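
A minimal sketch of that combination, assuming the per-file work is a simple line scan (file_path and the b"ERROR" marker are placeholders for whatever error_collector actually looks for):

import mmap
import os
from multiprocessing import Pool

def scan_file(name):
    # map the file read-only and scan it line by line without loading it all at once
    path = os.path.join(file_path, name)
    matches = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                if b"ERROR" in line:
                    matches.append(line.decode(errors="replace").rstrip())
    return matches

if __name__ == "__main__":
    with Pool() as pool:
        per_file = pool.map(scan_file, os.listdir(file_path))
    # pool.map returns one list per file; flatten into a single list
    final_output = [line for lines in per_file for line in lines]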




What you want here is multiprocessing. In CPython, multithreading will not speed up CPU-bound work because the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so it effectively uses one CPU core.

One way to do it is:

import os
from itertools import chain
from multiprocessing import Pool

fl = os.listdir(file_path)

def fun(i):
    # each worker processes one file and returns its list of results
    return error_collector(file_path, fl[i])

if __name__ == '__main__':
    with Pool(4) as p:
        # p.map returns one list per file; chain them into a single flat list
        final_output = list(chain.from_iterable(p.map(fun, range(len(fl)))))

EDIT: If the bottleneck really is disk I/O, you can store your data in a better format (e.g. use the pickle module and store it as binary).
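
A minimal sketch of that idea, assuming the already-extracted results are held in a variable called results (a placeholder name):

import pickle

# one-time: save the extracted results as a binary file
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# later runs: reload them without re-parsing the original text files
with open("results.pkl", "rb") as f:
    results = pickle.load(f)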

6 Comments

It seems this application is limited by reading the files so multiprocessing will not be helpful in this case.
@tnknepp: But apart from some initial overhead it will not be slower than multithreading.
Agreed. I guess the real question is: how much of the time is spent reading the file from disk and how much is spent processing. Perhaps Shobhit can confirm that this process is IO limited or not.
Good point, I started storing my data in pickles months ago and the IO improvement is substantial. However, if you are doing a "one-and-done" analysis then there is no point.
True. If you read files once, an SSD is the only choice. But on the other hand: receive the data and pickle it. Then when you really need it, unpickle it. I can think of very few use cases where this is not applicable.
