I have written a CUDA kernel that processes an image. Depending on the output of the processed image, I may have to call the kernel again to re-tune the image.
For example, consider an image with 9 pixels:
1 2 3
4 5 6
7 8 9
Suppose that, depending on its neighboring values, the value 9 changes to 10. Since a value has changed, I have to re-process the new image with the same kernel:
1 2 3
4 5 6
7 8 10
I have already written the algorithm to process the image in a single iteration. The way I'm planning to implement the iterations in CUDA is the following:
__global__ void process_image_GPU(unsigned int *d_input, unsigned int *d_output, int dataH, int dataW, unsigned int *val) {
    __shared__ unsigned int sh_map[TOTAL_WIDTH][TOTAL_WIDTH];
    // Do processing
    // If any thread changes a pixel value during processing, it increments the change counter:
    { atomicAdd(val, 1); }
}
int main(int argc, char *argv[]) {
    // Allocate d_input, d_output and call cudaMemcpy
    unsigned int *x, *val;
    x = (unsigned int *)malloc(sizeof(unsigned int));
    x[0] = 0;
    cudaMalloc((void **)&val, sizeof(unsigned int));
    cudaMemcpy((void *)val, (void *)x, sizeof(unsigned int), cudaMemcpyHostToDevice);
    process_image_GPU<<<dimGrid, dimBlock>>>(d_input, d_output, rows, cols, val);
    cudaMemcpy((void *)x, (void *)val, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    if (x[0] != 0) {  // test the copied-back counter, not the pointer x itself
        // Call the kernel again
    }
}
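To make the plan concrete, here is a minimal sketch of the same idea written as a complete host-side loop. It is only an illustration under stated assumptions, not my actual algorithm: the update rule (each pixel becomes the maximum of itself and its 4-neighbours) is a hypothetical stand-in for the real processing, and the names `propagate_max` and `run_until_stable` are made up for this example. It also assumes the change counter is reset before every launch and that the input and output buffers are swapped between passes.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical stand-in rule (NOT the real processing): each pixel becomes the
// maximum of itself and its in-bounds 4-neighbours. Reads only from `in` and
// writes only to `out`, so threads never see half-updated data.
__global__ void propagate_max(const unsigned int *in, unsigned int *out,
                              int rows, int cols, unsigned int *changed) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r >= rows || c >= cols) return;

    int idx = r * cols + c;
    unsigned int v = in[idx];
    if (r > 0)        v = max(v, in[idx - cols]);
    if (r < rows - 1) v = max(v, in[idx + cols]);
    if (c > 0)        v = max(v, in[idx - 1]);
    if (c < cols - 1) v = max(v, in[idx + 1]);

    out[idx] = v;
    if (v != in[idx]) atomicAdd(changed, 1u);  // record that this pass changed something
}

// Relaunches the kernel until a pass reports no changes. Returns the number of
// kernel launches (including the final pass that changed nothing), or -1 on a
// CUDA error. `image` is overwritten with the stable result.
int run_until_stable(std::vector<unsigned int> &image, int rows, int cols) {
    size_t bytes = image.size() * sizeof(unsigned int);
    unsigned int *d_in = nullptr, *d_out = nullptr, *d_changed = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_out, bytes) != cudaSuccess ||
        cudaMalloc(&d_changed, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_out); cudaFree(d_changed);
        return -1;
    }
    cudaMemcpy(d_in, image.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);

    unsigned int h_changed = 0;
    int launches = 0;
    bool ok = true;
    do {
        // The counter must be reset before every pass.
        cudaMemset(d_changed, 0, sizeof(unsigned int));
        propagate_max<<<grid, block>>>(d_in, d_out, rows, cols, d_changed);
        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(&h_changed, d_changed, sizeof(unsigned int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        std::swap(d_in, d_out);  // the latest output is the next pass's input
        ++launches;
    } while (ok && h_changed != 0);

    if (ok) {
        ok = cudaMemcpy(image.data(), d_in, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_changed);
    return ok ? launches : -1;
}
```

With this stand-in rule, the 3x3 example image above settles with every pixel equal to 9 after 5 launches (4 passes that change something plus one that confirms nothing changed). The cost per pass is one 4-byte device-to-host copy plus a kernel launch.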
Is this the only way to do it? Is there a more efficient way to implement the same thing?
Thanks a lot for your time.