GPU optimization for vectorized code

Question

function w=oja(X, varargin)

% get the dimensionality
[m n] = size(X);

% random initial weights
w = randn(m,1);

options = struct( ...
    'rate', .00005, ...
    'niter', 5000, ...
    'delta', .0001);
options = getopt(options, varargin);
success = 0;

% run through all input samples
for iter = 1:options.niter
    y = w'*X;
    for ii = 1:n       
        % y is a scalar, not a vector
        w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
    end
end
if (any(~isfinite(w)))
    warning('Lost convergence; lower learning rate?');
end

end

size(X)= 400 153600

This code implements oja's rule and runs slow. I am not able to vectorize it any more. To make it run faster I wanted to do computations on the GPU, therefore I changed

X=gpuArray(X)

But the code instead ran slower. The computation used seems to be compatible with GPU. Please let me know my mistake.

Profile Code Output:

Complete details: https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing

Remember Early optimization is the root of all evil. Can you profile your code? what part is the slowest part? can you post the result of MATLABs profiler? — Ander Biguri
– Ander Biguri, Commented Nov 10, 2015 at 8:54
@AnderBiguri Hi, Thanks for the reply! Please check Profile Code Output in question now. — Abhishek Bhatia
– Abhishek Bhatia, Commented Nov 10, 2015 at 9:13
You are right. Avoid GPU in here, it makes it go incredibly slow, and I have a TESLA k40. — Ander Biguri
– Ander Biguri, Commented Nov 10, 2015 at 9:39
You have a serial dependency in your inner loop - on each iteration the result w depends on the value of w from the previous iteration. So no parallelisation for you! — Paul R
– Paul R, Commented Nov 10, 2015 at 10:10
Abhishek Bhatia, @PaulR is correct. You have a dependency in using the previous iteration's calculations in the current iteration. You can't make your loop parallel. — rayryeng
– rayryeng, Commented Nov 13, 2015 at 0:18

Ander Biguri · Accepted Answer · 2015-11-10 11:18:01Z

This is not a full answer on how to solve it, but more an explanation why GPUs does not speed up, but actually enormously slow down your code.

GPUs are fantastic to speed up code that is parallel, meaning that they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU cant go over 16). However, GPU processors are very slow! Nowadays a decent CPU has around 2~3Ghz speed while a modern GPU has 700Mhz. This means that a CPU is much faster than a GPU, but as GPUs can do lots of things at the same time they can win overall.

Once I saw it explained as: What do you prefer, A million dollar sports car or a scooter? A million dolar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and good introduction to GPU)

Back to your code: your code is incredibly sequential. Every inner iteration depends in the previous one and the same with the outer iteration. You can not run 2 of these in parallel, as you need the result from one iteration to run the next one. This means that you will not get a pizza order until you have delivered the last one, thus what you want is to deliver 1 by 1, as fast as you can (so sports car is better!).

And actually, each of these 1 line equations is incredibly fast! If I run 50 of them in my computer I get 13.034 seconds on that line which is 1.69 microseconds per iteration (7680000 calls).

Thus your problem is not that your code is slow, is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for these kind of things.

Thus, unfortunately, GPUs suck for sequential code and your code is very sequential, therefore you can not use GPUs to speed up. An HPC will neither help, because every loop iteration depends in the previous one (no parfor :( ).

So, as far I can say, you will need to deal with it.

Thanks for the detailed answer. I wonder if somehow it could be vectorized more to reduce the calls. The dependency between iterations limits me.
@AbhishekBhatia as you say, the dependency limits you. I tried to think of a vectorized code, but actually the equation is computed already at a very very high speed. I honestly dont think there is any way of speeding up this.
Yeah you are right. I wanted to reduce the function calls instead. They could maybe speed it up.

Collectives™ on Stack Overflow

GPU optimization for vectorized code

1 Answer 1

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Linked

Related