Timeline for General multithreaded file processing
Current License: CC BY-SA 3.0
22 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| May 6, 2016 at 5:18 | comment | added | user3798283 | Sorry, I have one question: the process_file function is defined with 5 template parameters but called with only ? Won't that be a compile-time error? Second, what purpose is the reverse_complement function serving? | |
| Sep 11, 2014 at 20:33 | comment | added | Daniel | @LokiAstari I've added the thread safe queue code to the question. | |
| Sep 11, 2014 at 20:32 | history | edited | Daniel | CC BY-SA 3.0 |
Added the thread safe queue code.
|
| Sep 11, 2014 at 16:09 | comment | added | Loki Astari | @Daniel: I'll take a look today see if I can spot anything. | |
| Sep 11, 2014 at 16:08 | comment | added | Loki Astari | @Surt: It's also hard to imagine the code being slower than the serial version (when you just parallelize the work units). But also because lock-less queues are notoriously hard to get correct. Any tiny mistake tends to make them look like they are working but kills throughput. But you are correct, it's hard to tell without all the code. Also, don't assume an SSD yet (lots of machines with rotating drives are still out there). | |
| Sep 11, 2014 at 15:19 | comment | added | Surt | @Loki, why do you think the queue would be the problem? If Daniel is using a good SSD he would have some 500 MB/sec read/write, 250 MB/sec each way; his CPU would have a 125 *GB*/s throughput in its L1 cache (on a reasonably new system), and the complexity of the task is O(1): a swap with 7 memory loads. I have difficulty seeing how it could be the queue, unless he makes a task switch at each access. | |
| Sep 10, 2014 at 21:44 | comment | added | Daniel | Chapter 4, page 74. I added an 'emplace' method that takes a variadic template input too; didn't seem to make a whole lot of difference, mind. | |
| Sep 10, 2014 at 19:43 | comment | added | Loki Astari | @Daniel: Which chapter did you take your queue from? I see several in the book source, so it's hard to test. | |
| Sep 10, 2014 at 17:11 | comment | added | Loki Astari | @Daniel: Sounds like you have throughput issues with your lock-less queue to me (which is a common situation). | |
| Sep 10, 2014 at 16:06 | comment | added | Daniel | @LokiAstari I tried your suggestion of doing the reading & writing in the main thread (including creating the worker pool first and detaching), and this was actually slower than the naive single-threaded method. | |
| Sep 8, 2014 at 8:59 | comment | added | programmerjake |
If num_threads in process_file is less than or equal to 2, your program won't work properly: either there will be 0 worker threads, or the unsigned value will wrap around and it will try to create billions of threads.
|
|
| Sep 8, 2014 at 0:22 | answer | added | Surt | timeline score: 4 | |
| Sep 6, 2014 at 21:52 | comment | added | Loki Astari | Also note: a lock-less queue does not mean faster. In fact, when used incorrectly they are usually slower than a queue that uses locks. | |
| Sep 6, 2014 at 21:51 | comment | added | Loki Astari | It's probably not overhead from concurrency. You have two serial processes that have not been parallelized: reading and writing are still both sequential. Also, overlapping read/write operations does not necessarily make the application faster, as the device may not be able to handle parallel operations (try RAIDing some drives together). So your measurements of the improvement are flawed. Personally I would not bother to put read/write in separate threads. Do all the reading/writing from the main thread. | |
| Sep 6, 2014 at 18:43 | comment | added | Daniel | I tested on something a bit more taxing (Smith-Waterman local alignment) on the same smallish Fasta input (aligned against a pre-defined sequence). Single-threaded ~64 secs, multi-threaded ~40 secs, using 4 cores, so only 2 cores are actually doing the processing. I would have hoped for a little better (less than half the runtime of the single-threaded version), but I suppose there's always going to be a penalty for concurrency mechanisms. | |
| Sep 6, 2014 at 18:08 | history | edited | Daniel | CC BY-SA 3.0 |
More specific question
|
| Sep 6, 2014 at 18:05 | comment | added | Daniel | Depends on the use case. I mention this at the end of my question. In the example I provide (reverse complementing a Fasta file), the processing step isn't particularly complex, but I still get a small runtime performance gain over a single threaded approach on my 4 core machine. But I imagine this would indeed be more useful in attacking more complex tasks (optimal local alignment comes to mind), and for use on clusters etc. | |
| Sep 6, 2014 at 17:21 | history | edited | Jamal | CC BY-SA 3.0 |
deleted 28 characters in body
|
| Sep 6, 2014 at 13:31 | comment | added | janos |
Is the implementation of process_item CPU-intensive? If not, then you're unlikely to see big improvements from multithreading, because I/O will be the bottleneck. The more CPU-intensive process_item is, the more visible the improvement.
|
|
| Sep 6, 2014 at 11:45 | history | tweeted | twitter.com/#!/StackCodeReview/status/508219490375450625 | ||
| Sep 6, 2014 at 11:24 | review | First posts | |||
| Sep 6, 2014 at 11:43 | |||||
| Sep 6, 2014 at 11:22 | history | asked | Daniel | CC BY-SA 3.0 |