The biggest performance problem, and a common mistake, is doing this:

_mm256_extract_epi32(vec_multi, 0) + _mm256_extract_epi32(vec_multi, 1) + _mm256_extract_epi32(vec_multi, 2) + _mm256_extract_epi32(vec_multi, 3) + _mm256_extract_epi32(vec_multi, 4) + _mm256_extract_epi32(vec_multi, 5) + _mm256_extract_epi32(vec_multi, 6) + _mm256_extract_epi32(vec_multi, 7);

This dominates not just horizontal screen space, but also the performance (or lack thereof) of the loop.

Actual extracts (an extract with an index of 0 is turned into vmovd by a reasonable compiler, which is less of a problem) have a throughput of only 1 per cycle on typical CPUs. There are 7 of them here, so even with just this horizontal addition the loop body could only execute once every 7 cycles (and there is other stuff in there too, so in practice it's worse).

There are slightly faster ways to do horizontal sums. For example (not really tuned, just a simple "better than totally naive" hsum):

int hsum_epi32(__m256i x)
{
    __m128i l = _mm256_castsi256_si128(x);      // low half, free (the index-0 extract is implicit)
    __m128i h = _mm256_extracti128_si256(x, 1); // high half
    l = _mm_add_epi32(l, h);                    // 4 partial sums
    l = _mm_hadd_epi32(l, l);                   // [s0+s1, s2+s3, s0+s1, s2+s3]
    return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1);
}
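
To shave a little more, hadd itself decodes to two shuffle uops plus an add on typical Intel CPUs, so a version built from plain shuffles is slightly cheaper. A sketch of that (untested; hsum_epi32_shuffle is just an illustrative name):

int hsum_epi32_shuffle(__m256i x)
{
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(x),
                              _mm256_extracti128_si256(x, 1));         // fold 256 -> 128
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // swap 64-bit halves
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // swap adjacent elements
    return _mm_cvtsi128_si32(s);                                         // every lane now holds the total
}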

But none of those are great either. Ideally there should just be no horizontal operation in the inner loop. Even more ideally, not anywhere. And that can be arranged: a block of 8 results can be computed by rearranging the computation so that the horizontal operations turn into broadcasts. Broadcasting from memory is pretty cheap (naturally it sort of "wastes" the load by loading only one element, but it's not a slow operation), and there are far fewer broadcasts than there were extracts, so that's probably better.

Sort of like this (to show the general idea, not tested):

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j += 8) {
    auto sum = _mm256_setzero_si256();
    for (int k = 0; k < N; k++) {
      // broadcast one element of mat1 and multiply it by 8 elements of mat2's row k
      auto bc_mat1 = _mm256_set1_epi32(mat1[i][k]);
      auto vec_mat2 = _mm256_loadu_si256((__m256i*)&mat2[k][j]);
      auto prod = _mm256_mullo_epi32(bc_mat1, vec_mat2);
      sum = _mm256_add_epi32(sum, prod);
    }
    // sum now holds 8 finished entries of the result row, no horizontal op needed
    _mm256_storeu_si256((__m256i*)&result[i][j], sum);
  }
}

There is some room for improvement. The memory access pattern through mat2 is not ideal: the inner loop walks down the rows, and each 32-byte load uses only half of a 64-byte cache line, so for a large enough matrix half of every cache miss is wasted. It's probably better to unroll the middle loop by 2 (or 4). As a bonus, we get to re-use the broadcast from mat1. Sort of like this (not tested, showing 2x):

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j += 16) {
    auto sumA = _mm256_setzero_si256();
    auto sumB = _mm256_setzero_si256();
    for (int k = 0; k < N; k++) {
      // one broadcast now feeds two multiplies, and the two loads
      // together consume a full cache line of mat2's row k
      auto bc_mat1 = _mm256_set1_epi32(mat1[i][k]);
      auto vecA_mat2 = _mm256_loadu_si256((__m256i*)&mat2[k][j]);
      auto vecB_mat2 = _mm256_loadu_si256((__m256i*)&mat2[k][j + 8]);
      auto prodA = _mm256_mullo_epi32(bc_mat1, vecA_mat2);
      auto prodB = _mm256_mullo_epi32(bc_mat1, vecB_mat2);
      sumA = _mm256_add_epi32(sumA, prodA);
      sumB = _mm256_add_epi32(sumB, prodB);
    }
    _mm256_storeu_si256((__m256i*)&result[i][j], sumA);
    _mm256_storeu_si256((__m256i*)&result[i][j + 8], sumB);
  }
}

But it can probably still be improved further, and I didn't test anything.
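
One detail the sketches above gloss over: they assume N is a multiple of 8 (or 16 for the unrolled version). If it isn't, a plain scalar loop can handle the leftover columns, something like this (untested sketch; the names match the loops above):

// Hypothetical cleanup for the last N % 8 columns (assumes the vector loop
// above stopped at j = N - N % 8)
for (int i = 0; i < N; i++) {
  for (int j = N - N % 8; j < N; j++) {
    int sum = 0;
    for (int k = 0; k < N; k++)
      sum += mat1[i][k] * mat2[k][j];
    result[i][j] = sum;
  }
}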
