3

I am trying to understand automatic parallelization and a special case of that is auto vectorization. As I understand it auto-vectorization is more or less:

The compiler takes parts of the code which are "serial" and transforms them to a vector. For example:

X1=Y1+Z1
X2=Y2+Z2
Xn=Yn+Zn

is transformed to

for(int i=0;i<n;i++)
X[i]=Y[i]+Z[i]

Now we all know that the slowest part of code execution is branches. And if N is really really big then our code would really slow down in this case. So the compiler in this case would unroll our loop to something more like

for(int i=0;i<n;i+=4) 
X[i]=Y[i]+Z[i]
X[i+1]=Y[i+1]+Z[i+1]
X[i+2]=Y[i+2]+Z[i+2]
X[i+3]=Y[i+3]+Z[i+3]

So my question is why would the compiler vectorize my code if in so many cases of actions performed within loop on arrays are unrolled to multiple actions? Isn't it counter productive to roll code in a loop inside arrays and then unroll that loop?

6
  • 5
    The fact that loop unrolling exists and is sometimes useful in no way contradicts the fact that vectorization is a technique that is sometimes useful. Obviously these times will generally be different times. Just think of it as the equivalent of owning both an umbrella and a shower! Commented Nov 11, 2014 at 8:27
  • I would need a bit more explanation and use cases of each in order to understand Commented Nov 11, 2014 at 8:28
  • 1
    @KilianFoth Actually, I'd go further than that: the unrolling of that loop is a necessary precondition for auto vectorization working: auto vectorization does not transform blocks of code into loops, it transforms groups of similar statements into single instructions. Commented Nov 11, 2014 at 8:29
  • @Jules Well, not quite, there is also SLP vectorization which transforms straight-line code without loops (basically, basic blocks). Commented Nov 11, 2014 at 9:13
  • 2
    Single instruction, multiple data (SIMD) e.g. using SSE on modern x86 processors. Commented Sep 13, 2015 at 13:55

2 Answers 2

10

Auto vectorization doesn't actually work quite the way you think it does. It does not turn your initial set of instructions into a loop: see, for example, the results of GCC compiling your code.

In general, auto-vectorization takes an unrolled loop, and transforms it so that the multiple statements you show in your example of an unrolled loop are implemented using a single assembly-language level instruction.

8
  • Could you also explain me a bit the assembly code in the link so I can understand what happens there? (or simply explain my in high level what is being done in general instead of explaining code bit by bit) Commented Nov 11, 2014 at 8:42
  • The code generated there isn't vectorized. You can see that if you remove the vectorize option from the compile command, it generates the same code. All that assembly code is doing is doing a bunch of single adds. Commented Nov 11, 2014 at 8:51
  • To be fair, vectorization is implied by -O3 (I only included the option so that it was obvious it was turned on). Commented Nov 11, 2014 at 8:55
  • 6
    Vectorized code uses a single instruction to perform multiple operations. In the case of code like your sample it would use the "paddd" instruction in sse2, which performs a simultaneous addition of 4 pairs of 32 bit integers. Commented Nov 11, 2014 at 9:30
  • 2
    Yes, although most processors are, these days. Even high end phones these days (NEON is a set of vector instructions supported by ARM Cortex processors, I believe). Commented Nov 11, 2014 at 10:00
8

I think you are missing the most important step. After the compiler turns your multiple statements into a loop, it then uses vector instructions such as SSE instructions to do multiple iterations of the loop in parallel. If your hardware doesn't have vector instructions, then there would be no point. So instead of unrolling the loop like you showed, it would be more like this:

for (i=0;i<n;i+=8)
    __vector_add8(&x[i], &y[i], &z[i]);

Where __vector_add8 compiles to a vector instruction on your hardware that does 8 adds in parallel.

3
  • so basically vectorization is transforming lines of code in vectors in order to exploit vector instructions? So __vector_add8 would translate to 8 different additions? How? I do not mean how it's done, I mean what it's done. Does it translate to adding the first 8 parts of of the arrays? Similarly to let's say unrolling a basic loop with only one action to 8 actions for each loop count? Commented Nov 11, 2014 at 9:22
  • 1
    Yes essentially that's it. Most of the time, your code will already be in a loop of some sort. The compiler can then combine loop iterations (such as the 8 to 1 in my example), taking advantage of the fact that there are hardware vector instructions that do multiple operations in parallel. Take SSE, for example, on x86 platforms. There is an SSE instruction called addps (add packed scalar) that adds one 128-bit vector to another (in other words 4 floating points at a time). Other chip families have their own versions of vector instructions but they are all similar. Commented Nov 11, 2014 at 9:56
  • en.wikipedia.org/wiki/SIMD Commented Jun 3, 2017 at 6:31

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.