I am trying to understand automatic parallelization, of which auto-vectorization is a special case. As I understand it, auto-vectorization works more or less like this:
The compiler takes parts of the code that are "serial" and transforms them into operations on a vector. For example:
X1 = Y1 + Z1;
X2 = Y2 + Z2;
...
Xn = Yn + Zn;
is transformed to
for (int i = 0; i < n; i++)
    X[i] = Y[i] + Z[i];
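To make the example concrete, here is a minimal C sketch of the loop I have in mind. The function name `add_arrays`, the `float` element type, and the `restrict` qualifiers are my own assumptions for illustration, not something a particular compiler requires.

```c
#include <stddef.h>

/* Element-wise sum: x[i] = y[i] + z[i] for every i in [0, n).
   Assumption: the arrays do not overlap. The restrict qualifiers state
   that promise to the compiler, which generally makes it easier for an
   optimizer to vectorize a loop like this one. */
void add_arrays(float *restrict x, const float *restrict y,
                const float *restrict z, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = y[i] + z[i];
    }
}
```

Whether a given compiler actually vectorizes this depends on the optimization level and the target, so treat it only as the kind of candidate loop I am asking about.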
As I understand it, branches are among the more expensive parts of code execution, so if n is very large, the branch taken on every loop iteration adds noticeable overhead. To reduce it, the compiler would unroll the loop into something more like:
for (int i = 0; i < n; i += 4) {   // assumes n is a multiple of 4
    X[i]     = Y[i]     + Z[i];
    X[i + 1] = Y[i + 1] + Z[i + 1];
    X[i + 2] = Y[i + 2] + Z[i + 2];
    X[i + 3] = Y[i + 3] + Z[i + 3];
}
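Written out as a compilable C function, the unrolled version might look like the sketch below. The name `add_arrays_unrolled` and the unroll factor of 4 are assumptions for illustration. I have also added a remainder loop so that n does not have to be a multiple of 4; this is hand-written unrolling, not the output of any specific compiler.

```c
#include <stddef.h>

/* Hand-unrolled element-wise sum with an assumed unroll factor of 4. */
void add_arrays_unrolled(float *x, const float *y, const float *z, size_t n)
{
    size_t i = 0;

    /* Main loop: four additions per iteration, so the loop condition is
       checked once per four elements instead of once per element. */
    for (; i + 4 <= n; i += 4) {
        x[i]     = y[i]     + z[i];
        x[i + 1] = y[i + 1] + z[i + 1];
        x[i + 2] = y[i + 2] + z[i + 2];
        x[i + 3] = y[i + 3] + z[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements one at a time. */
    for (; i < n; i++) {
        x[i] = y[i] + z[i];
    }
}
```

Note that every statement here is still an ordinary scalar addition: the loop body is just repeated four times.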
So my question is: why would the compiler vectorize my code if, in so many cases, operations performed on arrays inside a loop end up unrolled into multiple separate statements anyway? Isn't it counterproductive to roll the code into a loop over arrays and then unroll that loop?