Skip to main content
Added info about environment
Source Link
user1118321
  • 11.9k
  • 1
  • 20
  • 46
#include <immintrin.h>
typedef struct RGBA8Pixel {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
    uint8_t alpha;
} RGBA8Pixel;

typedef union intVec8 {
    __m512i ivec;
    int64_t vec[8];
} intVec8;

typedef union doubleVec8 {
    __m512d dvec;
    double vec[8];
} doubleVec8;

I'm working on macOS using Xcode 10.2.1 using the Intel data types and intrinsics found in <immintrin.h>.

typedef struct RGBA8Pixel {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
    uint8_t alpha;
} RGBA8Pixel;

typedef union intVec8 {
    __m512i ivec;
    int64_t vec[8];
} intVec8;

typedef union doubleVec8 {
    __m512d dvec;
    double vec[8];
} doubleVec8;
#include <immintrin.h>
typedef struct RGBA8Pixel {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
    uint8_t alpha;
} RGBA8Pixel;

typedef union intVec8 {
    __m512i ivec;
    int64_t vec[8];
} intVec8;

typedef union doubleVec8 {
    __m512d dvec;
    double vec[8];
} doubleVec8;

I'm working on macOS using Xcode 10.2.1 using the Intel data types and intrinsics found in <immintrin.h>.

Tweeted twitter.com/StackCodeReview/status/1137645749125222401
edited tags
Link
200_success
  • 145.6k
  • 22
  • 191
  • 481
Source Link
user1118321
  • 11.9k
  • 1
  • 20
  • 46

SIMD Mandelbrot calculation

I was messing around with GPU compute shaders the other day and created a Mandelbrot shader. Unfortunately, Metal doesn't support double-precision in compute shaders, so beyond a certain zoom level, I need to switch back to the CPU. In doing so, I decided to try writing SIMD code for the calculations to make it faster.

In the code below I'm using AVX512 instructions, and I do get a speedup over the scalar code. I break the image into 64x64 pixel tiles and farm them out to available cores. For the scalar code on one particular test image, the average time to calculate a tile is 0.757288 seconds. For the SIMD version below it's 0.466437. That's about a 33% increase, which is OK. Given that I'm calculating 8 times as many pixels at once, I was hoping for more.

These are some useful types I use in the code.

typedef struct RGBA8Pixel {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
    uint8_t alpha;
} RGBA8Pixel;

typedef union intVec8 {
    __m512i ivec;
    int64_t vec[8];
} intVec8;

typedef union doubleVec8 {
    __m512d dvec;
    double vec[8];
} doubleVec8;

And here's my function for calculating 1 64x64 tile:

- (void)calculateSIMDFromRow:(int)startPixelRow
                       toRow:(int)endPixelRow
                     fromCol:(int)startPixelCol
                       toCol:(int)endPixelCol;
{
    if (!_keepRendering)
    {
        return;
    }
    
    const doubleVec8 k0s = {
        .vec[0] = 0.0,
        .vec[1] = 0.0,
        .vec[2] = 0.0,
        .vec[3] = 0.0,
        .vec[4] = 0.0,
        .vec[5] = 0.0,
        .vec[6] = 0.0,
        .vec[7] = 0.0,
    };
    
    const intVec8   k1s = {
        .vec[0] = 1,
        .vec[1] = 1,
        .vec[2] = 1,
        .vec[3] = 1,
        .vec[4] = 1,
        .vec[5] = 1,
        .vec[6] = 1,
        .vec[7] = 1,
    };
    
    const doubleVec8    k2s = {
        .vec[0] = 2.0,
        .vec[1] = 2.0,
        .vec[2] = 2.0,
        .vec[3] = 2.0,
        .vec[4] = 2.0,
        .vec[5] = 2.0,
        .vec[6] = 2.0,
        .vec[7] = 2.0,
    };
    
    const doubleVec8    k4s = {
        .vec[0] = 4.0,
        .vec[1] = 4.0,
        .vec[2] = 4.0,
        .vec[3] = 4.0,
        .vec[4] = 4.0,
        .vec[5] = 4.0,
        .vec[6] = 4.0,
        .vec[7] = 4.0,
    };
    
    UInt64      maxIterations   = [self maxIterations];
    NSSize      viewportSize    = [self viewportSize];
    for (int row = startPixelRow; (row < endPixelRow) && (_keepRendering); ++row)
    {
        RGBA8Pixel* nextPixel   = _outputBitmap + (row * (int)viewportSize.width) + startPixelCol;
        double      yCoord      = _yCoords [ row ];
        doubleVec8  yCoords;
        for (int i = 0; i < 8; i++)
        {
            yCoords.vec [ i ] = yCoord;
        }
        double*     nextXCoord  = &_xCoords [ startPixelCol ];
        for (int col = startPixelCol; (col < endPixelCol) && (_keepRendering); col += 8)
        {
            __m512d as = _mm512_load_pd(nextXCoord);
            nextXCoord += 8;
            __m512d bs = yCoords.dvec;
            __m512d cs = as;
            __m512d ds = bs;
            
            UInt64 scalarIters = 1;
            __m512i iterations  = k1s.ivec;
            __m512d dists       = k0s.dvec;
            __mmask8 allDone    = 0;
            while ((allDone != 0xFF) && (_keepRendering))
            {
                // newA = a * a - b * b + c
                __m512d newA;
                __m512d newB;
                newA = _mm512_mul_pd(as, as);
                newA = _mm512_sub_pd(newA, _mm512_mul_pd(bs, bs));
                newA = _mm512_add_pd(newA, cs);
                
                //double    newB    = 2 * a * b + d;
                newB = _mm512_mul_pd(_mm512_mul_pd(k2s.dvec, as), bs);
                newB = _mm512_add_pd(newB, ds);
                
                as = newA;
                bs = newB;
                
                dists = _mm512_mul_pd(newB, newB);
                dists = _mm512_add_pd(_mm512_mul_pd(newA, newA), dists);
                __mmask8 escaped = _mm512_cmplt_pd_mask(dists, k4s.dvec);

                iterations = _mm512_mask_add_epi64(iterations, escaped, iterations, k1s.ivec);
                scalarIters++;
                __mmask8 hitMaxIterations = (scalarIters == maxIterations) ? 0xFF : 0;
                
                allDone = ~escaped | hitMaxIterations;
            }
            
            intVec8 iters = { .ivec = iterations };
            for (int i = 0; i < 8; i++)
            {
                UInt64  nextIteration = iters.vec [ i ];
                if (nextIteration == maxIterations)
                {
                    *nextPixel = kBlack;
                }
                else
                {
                    *nextPixel = kPalette [ nextIteration % kPaletteSize ];
                }
                nextPixel++;
            }
        }
    }
}

I'm new to Intel SIMD instructions and frankly find them quite confusing. If there are better ways to do any of the above, please let me know. I tried using the fused multiply-add and multiply-add-negate instructions, and they made the code significantly slower than using 2 or 3 separate instructions, unfortunately.