
I am new to Python and have never tried multithreading. My objective is to read a set of files and extract some specific data from each one. I already have code that does the task correctly, but it takes a lot of time because a few of the files are very large.

import os

final_output = []
for file in os.listdir(file_path):
    # collect the useful lines from one file and append them to the overall list
    final_string = error_collector(file_path, file)
    final_output = final_output + final_string

The error_collector function reads each line of a file, extracts the useful information, and returns a list for that file, which I concatenate onto final_output so that all the information ends up in a single list.

What I want is some way to process the files in parallel instead of reading them one at a time.

Can someone please help?

5 Comments
  • As a note - threading when disk IO bound is going to be a significant code overhead for most likely very little gain (if any). Commented Aug 7, 2015 at 9:51
  • You might be better off building the list in one go instead of repeatedly resizing it, thus avoiding memory allocation overhead - something like: final_output = [x for file in os.listdir(file_path) for x in error_collector(file_path, file)] Commented Aug 7, 2015 at 9:54
  • @Jon, even then, if the final list ends up not being very long, this will not provide substantial improvement. As you pointed out, IO limitations are the bottleneck here, so there is little hope of improvement. How about an SSD? Commented Aug 7, 2015 at 9:58
  • Don't write at the same time; threading is not your solution! Maybe you need to use an SSD. Commented Aug 7, 2015 at 10:04
  • Can you confirm that the process is IO limited? If you can read the file in WITHOUT doing any analysis and then compare the read time to total processing time we may be able to provide better advice (see the sketch after this list). Commented Aug 7, 2015 at 10:06
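
A rough sketch of that check (file_path and error_collector are the names from the question); if the read-only pass already accounts for most of the total time, the job is I/O bound and parallelism will not help much:

import os
import time

# pass 1: raw reading only, no parsing
t0 = time.time()
for name in os.listdir(file_path):
    with open(os.path.join(file_path, name), "rb") as f:
        f.read()
read_time = time.time() - t0

# pass 2: the full pipeline from the question (read + parse)
t0 = time.time()
results = []
for name in os.listdir(file_path):
    results += error_collector(file_path, name)
total_time = time.time() - t0

print("read only: %.1fs, read + parse: %.1fs" % (read_time, total_time))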

2 Answers


Using mmap can improve the speed of reading files.

If the data that is to be read is relatively small compared to the total size of the file, doing this in combination with Pool.map is a good strategy.
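
A minimal sketch of that combination, assuming the per-file work is a simple line scan (file_path and the b"ERROR" marker are placeholders for whatever error_collector actually looks for):

import mmap
import os
from multiprocessing import Pool

def scan_file(name):
    # map the file read-only and scan it line by line without loading it all at once
    path = os.path.join(file_path, name)
    matches = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                if b"ERROR" in line:
                    matches.append(line.decode(errors="replace").rstrip())
    return matches

if __name__ == "__main__":
    with Pool() as pool:
        per_file = pool.map(scan_file, os.listdir(file_path))
    # pool.map returns one list per file; flatten into a single list
    final_output = [line for lines in per_file for line in lines]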




What you want here is multiprocessing. In CPython, multithreading will not speed up CPU-bound work because the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so it effectively uses one CPU core.

One way to do it is:

import os
from itertools import chain
from multiprocessing import Pool

fl = os.listdir(file_path)

def fun(i):
    # each worker processes one file and returns its list of results
    return error_collector(file_path, fl[i])

if __name__ == '__main__':
    with Pool(4) as p:
        # p.map returns one list per file; chain them into a single flat list
        final_output = list(chain.from_iterable(p.map(fun, range(len(fl)))))

EDIT: If the bottleneck really is disk I/O, you can store your data in a better format (e.g. use the pickle module and store it as binary).
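
A minimal sketch of that idea, assuming the already-extracted results are held in a variable called results (a placeholder name):

import pickle

# one-time: save the extracted results as a binary file
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# later runs: reload them without re-parsing the original text files
with open("results.pkl", "rb") as f:
    results = pickle.load(f)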

6 Comments

It seems this application is limited by reading the files so multiprocessing will not be helpful in this case.
@tnknepp: But apart from some initial overhead it will not be slower than multithreading.
Agreed. I guess the real question is: how much of the time is spent reading the file from disk and how much is spent processing. Perhaps Shobhit can confirm that this process is IO limited or not.
Good point, I started storing my data in pickles months ago and the IO improvement is substantial. However, if you are doing a "one-and-done" analysis then there is no point.
True. If you read files once, an SSD is the only choice. But on the other hand: receive the data and pickle it. Then when you really need it, unpickle it. I can think of very few use cases where this is not applicable.
