Timeline for General multithreaded file processing
Current License: CC BY-SA 3.0
22 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| May 6, 2016 at 5:18 | comment | added | user3798283 | Sorry, I have one question: the process_file function is defined with 5 template parameters but called with only ? Won't that be a compile-time error? Second, what purpose is the reverse_complement function serving? | |
| Sep 11, 2014 at 20:33 | comment | added | Daniel | @LokiAstari I've added the thread safe queue code to the question. | |
| Sep 11, 2014 at 20:32 | history | edited | Daniel | CC BY-SA 3.0 |
Added the thread safe queue code.
|
| Sep 11, 2014 at 16:09 | comment | added | Loki Astari | @Daniel: I'll take a look today see if I can spot anything. | |
| Sep 11, 2014 at 16:08 | comment | added | Loki Astari | @Surt: It's also hard to imagine the code being slower than the serial version (when you just parallelize the work units). But also because lock-less queues are notoriously hard to get correct. Any tiny mistake tends to make them look like they are working but kills throughput. But you are correct, it's hard to tell without all the code. Also, don't assume an SSD yet (lots of machines with rotating drives are still out there). | |
| Sep 11, 2014 at 15:19 | comment | added | Surt | @Loki, why do you think the queue would be the problem? If Daniel is using a good SSD he would have some 500 MB/sec read/write, 250 MB/sec each way; his CPU would have a 125 *GB*/s throughput in its L1 cache (on a reasonably new system), and the complexity of the task is O(1): a swap with 7 memory loads. I have difficulty seeing how it could be the queue, unless he makes a task switch at each access. | |
| Sep 10, 2014 at 21:44 | comment | added | Daniel | Chapter 4, page 74. I added an 'emplace' method that takes a variadic template input too; didn't seem to make a whole lot of difference, mind. | |
| Sep 10, 2014 at 19:43 | comment | added | Loki Astari | @Daniel: Which chapter did you take your queue from? I see several in the book source, so it's hard to test. | |
| Sep 10, 2014 at 17:11 | comment | added | Loki Astari | @Daniel: Sounds like you have throughput issues with your lock-less queue to me (which is a common situation). | |
| Sep 10, 2014 at 16:06 | comment | added | Daniel | @LokiAstari I tried your suggestion of doing the reading & writing in the main thread (including creating the worker pool first and detaching), and this was actually slower than the naive single-threaded method. | |
| Sep 8, 2014 at 8:59 | comment | added | programmerjake |
If num_threads in process_file is less than or equal to 2, your program won't work properly: either there will be 0 worker threads, or the unsigned value will wrap around and it will try to create billions of threads.
|
|
| Sep 8, 2014 at 0:22 | answer | added | Surt | timeline score: 4 | |
| Sep 6, 2014 at 21:52 | comment | added | Loki Astari | Also note: a lock-less queue does not mean faster. In fact, when used incorrectly they are usually slower than a queue that uses locks. | |
| Sep 6, 2014 at 21:51 | comment | added | Loki Astari | It's probably not overhead from concurrency. You have two serial processes that have not been parallelized: reading and writing are still both sequential. Also, overlapping read/write operations does not necessarily make the application faster, as the device may not be able to handle parallel operations (try RAIDing some drives together). So your measurements of the improvement are flawed. Personally I would not bother to put read/write in separate threads. Do all the reading/writing from the main thread. | |
| Sep 6, 2014 at 18:43 | comment | added | Daniel | I tested on something a bit more taxing (Smith-Waterman local alignment) on the same smallish Fasta input (aligned against a pre-defined sequence). Single-threaded ~64 secs, multi-threaded ~40 secs, using 4 cores, so only 2 cores are actually doing the processing. I would have hoped for a little better (less than half the runtime of the single-threaded version), but I suppose there's always going to be a penalty for concurrency mechanisms. | |
| Sep 6, 2014 at 18:08 | history | edited | Daniel | CC BY-SA 3.0 |
More specific question
|
| Sep 6, 2014 at 18:05 | comment | added | Daniel | Depends on the use case. I mention this at the end of my question. In the example I provide (reverse complementing a Fasta file), the processing step isn't particularly complex, but I still get a small runtime performance gain over a single threaded approach on my 4 core machine. But I imagine this would indeed be more useful in attacking more complex tasks (optimal local alignment comes to mind), and for use on clusters etc. | |
| Sep 6, 2014 at 17:21 | history | edited | Jamal | CC BY-SA 3.0 |
deleted 28 characters in body
|
| Sep 6, 2014 at 13:31 | comment | added | janos |
Is the implementation of process_item CPU-intensive? If not, then you're unlikely to see big improvements from multithreading, because I/O will be the bottleneck. The more CPU-intensive process_item is, the more visible the improvement.
|
|
| Sep 6, 2014 at 11:45 | history | tweeted | twitter.com/#!/StackCodeReview/status/508219490375450625 | ||
| Sep 6, 2014 at 11:24 | review | First posts | |||
| Sep 6, 2014 at 11:43 | |||||
| Sep 6, 2014 at 11:22 | history | asked | Daniel | CC BY-SA 3.0 |