  • So kernel1 modifies image1, then kernel2 modifies that already-modified image1 and passes it on to kernel3? Commented Oct 14, 2014 at 11:27
  • Yes. kernel1 changes image1, the resulting image1 is passed to kernel2, and kernel2's result is then passed to kernel3. Commented Oct 14, 2014 at 11:34
  • You could use a CPU parallel threading model, like OpenMP, and create one stream for each OMP thread. Place one while loop in each OMP thread, and have the while loops individually draw new images to be processed from a queue. I'd be very surprised if you got much of a performance improvement this way, unless your kernels are trivially small. Commented Oct 16, 2014 at 2:42
  • Sorry, I was stuck fine-tuning the algorithm itself, which had nothing to do with CUDA, hence the delay in replying. Why do you say "I'd be very surprised if you got much of a performance improvement this way, unless your kernels are trivially small"? What is the reasoning behind that? Each of my images is either 230x230 pixels or 16384x7 pixels, so processing multiple images in parallel should give me a speedup, right? Also, is there no way to do this without using OpenMP? Commented Oct 22, 2014 at 5:45
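A minimal sketch of the approach suggested in the Oct 16 comment, under stated assumptions: OpenMP threads, each owning one CUDA stream, each running a while loop that pulls the next image from a shared queue. A shared counter incremented with `#pragma omp atomic capture` stands in for the queue. `kernel1`/`kernel2`/`kernel3` are hypothetical per-pixel stand-ins for the asker's real kernels, `processImages` is a hypothetical helper name, and the images are assumed to be stored back to back as `float`s. The build line (`nvcc -Xcompiler -fopenmp`) is also an assumption about the toolchain. As the comment warns, whether this beats processing images one after another depends on how much each kernel already occupies the GPU.

```cuda
// Build (assumption): nvcc -Xcompiler -fopenmp pipeline.cu
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for kernel1/kernel2/kernel3: one thread per pixel.
__global__ void kernel1(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] += 1.0f;
}
__global__ void kernel2(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] *= 2.0f;
}
__global__ void kernel3(float* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] -= 1.0f;
}

// Runs kernel1 -> kernel2 -> kernel3 on every image stored back to back in `images`.
// Each OpenMP thread owns one stream and one device buffer and repeatedly takes the
// next unprocessed image index from a shared counter. Returns true if no CUDA call failed.
bool processImages(std::vector<float>& images, int pixelsPerImage, int numThreads) {
    assert(pixelsPerImage > 0 && images.size() % pixelsPerImage == 0);
    const int numImages = static_cast<int>(images.size() / pixelsPerImage);
    const size_t bytes = static_cast<size_t>(pixelsPerImage) * sizeof(float);
    const int threadsPerBlock = 256;
    const int blocks = (pixelsPerImage + threadsPerBlock - 1) / threadsPerBlock;
    int next = 0;    // shared queue head: index of the next image to process
    int errors = 0;  // per-thread failure counts, summed by the reduction clause

    #pragma omp parallel num_threads(numThreads) reduction(+ : errors)
    {
        cudaStream_t stream;
        float* d_img = nullptr;
        bool haveStream = cudaStreamCreate(&stream) == cudaSuccess;
        bool ok = haveStream && cudaMalloc(&d_img, bytes) == cudaSuccess;

        while (ok) {
            int idx;
            #pragma omp atomic capture
            idx = next++;
            if (idx >= numImages) break;  // queue drained

            float* h_img = images.data() + static_cast<size_t>(idx) * pixelsPerImage;
            // std::vector memory is pageable, so these copies will not overlap with other
            // streams' work; pinned host memory (e.g. cudaMallocHost) is needed for that.
            cudaMemcpyAsync(d_img, h_img, bytes, cudaMemcpyHostToDevice, stream);
            // Work in one stream runs in issue order, so each kernel sees the
            // previous kernel's output for this image.
            kernel1<<<blocks, threadsPerBlock, 0, stream>>>(d_img, pixelsPerImage);
            kernel2<<<blocks, threadsPerBlock, 0, stream>>>(d_img, pixelsPerImage);
            kernel3<<<blocks, threadsPerBlock, 0, stream>>>(d_img, pixelsPerImage);
            cudaMemcpyAsync(h_img, d_img, bytes, cudaMemcpyDeviceToHost, stream);
            // Wait for this image before reusing d_img; other threads' streams keep running.
            ok = cudaGetLastError() == cudaSuccess &&
                 cudaStreamSynchronize(stream) == cudaSuccess;
        }
        if (!ok) ++errors;
        if (d_img) cudaFree(d_img);
        if (haveStream) cudaStreamDestroy(stream);
    }
    return errors == 0;
}
```

For example, `processImages(imgs, 230 * 230, 2)` would process a batch of 230x230 images with two host threads. Each thread blocks on its own stream only, so while one thread waits for its image, kernels issued by the other threads can still be queued on the GPU.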