Do we really need auto-vectorization?

Question

I am trying to understand automatic parallelization and a special case of that is auto vectorization. As I understand it auto-vectorization is more or less:

The compiler takes parts of the code which are "serial" and transforms them to a vector. For example:

X1=Y1+Z1
X2=Y2+Z2
Xn=Yn+Zn

is transformed to

for(int i=0;i<n;i++)
X[i]=Y[i]+Z[i]

Now we all know that the slowest part of code execution is branches. And if N is really really big then our code would really slow down in this case. So the compiler in this case would unroll our loop to something more like

for(int i=0;i<n;i+=4) 
X[i]=Y[i]+Z[i]
X[i+1]=Y[i+1]+Z[i+1]
X[i+2]=Y[i+2]+Z[i+2]
X[i+3]=Y[i+3]+Z[i+3]

So my question is why would the compiler vectorize my code if in so many cases of actions performed within loop on arrays are unrolled to multiple actions? Isn't it counter productive to roll code in a loop inside arrays and then unroll that loop?

The fact that loop unrolling exists and is sometimes useful in no way contradicts the fact that vectorization is a technique that is sometimes useful. Obviously these times will generally be different times. Just think of it as the equivalent of owning both an umbrella and a shower! — Kilian Foth
– Kilian Foth, Commented Nov 11, 2014 at 8:27
I would need a bit more explanation and use cases of each in order to understand — John Demetriou
– John Demetriou, Commented Nov 11, 2014 at 8:28
@KilianFoth Actually, I'd go further than that: the unrolling of that loop is a necessary precondition for auto vectorization working: auto vectorization does not transform blocks of code into loops, it transforms groups of similar statements into single instructions. — Jules
– Jules, Commented Nov 11, 2014 at 8:29
@Jules Well, not quite, there is also SLP vectorization which transforms straight-line code without loops (basically, basic blocks). — user7043
– user7043, Commented Nov 11, 2014 at 9:13
Single instruction, multiple data (SIMD) e.g. using SSE on modern x86 processors. — CodesInChaos
– CodesInChaos, Commented Sep 13, 2015 at 13:55

Michael Kohne · Accepted Answer · 2015-01-08 12:52:10Z

10

Auto vectorization doesn't actually work quite the way you think it does. It does not turn your initial set of instructions into a loop: see, for example, the results of GCC compiling your code.

In general, auto-vectorization takes an unrolled loop, and transforms it so that the multiple statements you show in your example of an unrolled loop are implemented using a single assembly-language level instruction.

edited Jan 8, 2015 at 12:52

Michael Kohne

10.2k1 gold badge39 silver badges46 bronze badges

answered Nov 11, 2014 at 8:39

Jules

17.9k2 gold badges40 silver badges66 bronze badges

Could you also explain me a bit the assembly code in the link so I can understand what happens there? (or simply explain my in high level what is being done in general instead of explaining code bit by bit)

John Demetriou
– John Demetriou

2014-11-11 08:42:45 +00:00
Commented Nov 11, 2014 at 8:42
The code generated there isn't vectorized. You can see that if you remove the vectorize option from the compile command, it generates the same code. All that assembly code is doing is doing a bunch of single adds.

JS1
– JS1

2014-11-11 08:51:34 +00:00
Commented Nov 11, 2014 at 8:51
To be fair, vectorization is implied by -O3 (I only included the option so that it was obvious it was turned on).

Jules
– Jules

2014-11-11 08:55:14 +00:00
Commented Nov 11, 2014 at 8:55
6

Vectorized code uses a single instruction to perform multiple operations. In the case of code like your sample it would use the "paddd" instruction in sse2, which performs a simultaneous addition of 4 pairs of 32 bit integers.

Jules
– Jules

2014-11-11 09:30:22 +00:00
Commented Nov 11, 2014 at 9:30
2

Yes, although most processors are, these days. Even high end phones these days (NEON is a set of vector instructions supported by ARM Cortex processors, I believe).

Jules
– Jules

2014-11-11 10:00:11 +00:00
Commented Nov 11, 2014 at 10:00

| Show 3 more comments

JS1 · Accepted Answer · 2014-11-11 08:43:20Z

8

I think you are missing the most important step. After the compiler turns your multiple statements into a loop, it then uses vector instructions such as SSE instructions to do multiple iterations of the loop in parallel. If your hardware doesn't have vector instructions, then there would be no point. So instead of unrolling the loop like you showed, it would be more like this:

for (i=0;i<n;i+=8)
    __vector_add8(&x[i], &y[i], &z[i]);

Where __vector_add8 compiles to a vector instruction on your hardware that does 8 adds in parallel.

edited Nov 11, 2014 at 8:43

answered Nov 11, 2014 at 8:36

JS1

2161 silver badge7 bronze badges

so basically vectorization is transforming lines of code in vectors in order to exploit vector instructions? So __vector_add8 would translate to 8 different additions? How? I do not mean how it's done, I mean what it's done. Does it translate to adding the first 8 parts of of the arrays? Similarly to let's say unrolling a basic loop with only one action to 8 actions for each loop count?

John Demetriou
– John Demetriou

2014-11-11 09:22:42 +00:00
Commented Nov 11, 2014 at 9:22
1

Yes essentially that's it. Most of the time, your code will already be in a loop of some sort. The compiler can then combine loop iterations (such as the 8 to 1 in my example), taking advantage of the fact that there are hardware vector instructions that do multiple operations in parallel. Take SSE, for example, on x86 platforms. There is an SSE instruction called addps (add packed scalar) that adds one 128-bit vector to another (in other words 4 floating points at a time). Other chip families have their own versions of vector instructions but they are all similar.

JS1
– JS1

2014-11-11 09:56:11 +00:00
Commented Nov 11, 2014 at 9:56
en.wikipedia.org/wiki/SIMD

phuclv
– phuclv

2017-06-03 06:31:33 +00:00
Commented Jun 3, 2017 at 6:31

Add a comment |

Stack Exchange Network

Do we really need auto-vectorization?

2 Answers 2

Hot Network Questions

Do we really need auto-vectorization?

2 Answers 2

Related

Hot Network Questions