I have the following code:
function TSliverHelper.SlowNorth: TSlice;
var
i: integer;
begin
// Add pixels 0,1,2
// This means expanding every bit into a byte
// Or rather every byte into an int64;
for i:= 0 to 7 do begin
Result.Data8[i]:= TSuperSlice.Lookup012[Self.bytes[i]];
end;
end;
This uses a straightforward lookup table, but LUTs tend to be slow and pollute the cache. It takes about 2860 ms for 100,000,000 items.
The following approach is faster (1797 ms, about 37% less time):
function TSliverHelper.North: TSlice;
const
SliverToSliceMask: array[0..7] of byte = ($01,$02,$04,$08,$10,$20,$40,$80);
asm
//RCX = @Self (a pointer to an Int64)
//RDX = @Result (a pointer to an array[0..63] of byte)
movq xmm0,[rcx] //Get the sliver
mov r9,$8040201008040201
movq xmm15,r9 //[rip+SliverToSliceMask] //Get the mask
movlhps xmm15,xmm15 //extend it
mov r8,$0101010101010101 //Shuffle mask
movq xmm14,r8 //00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 01
pslldq xmm14,8 //01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00
movdqa xmm1,xmm0 //make a copy of the sliver
//bytes 0,1
pshufb xmm1,xmm14 //copy the first two bytes across
pand xmm1,xmm15 //Mask off the relevant bits
pcmpeqb xmm1,xmm15 //Expand a bit into a byte
movdqu [rdx],xmm1
//bytes 2,3
psrldq xmm0,2 //shift in the next two bytes
movdqa xmm2,xmm0
pshufb xmm2,xmm14 //copy the next two bytes across
pand xmm2,xmm15 //Mask off the relevant bits
pcmpeqb xmm2,xmm15 //Expand a bit into a byte
movdqu [rdx+16],xmm2
//bytes 4,5
psrldq xmm0,2 //shift in the next two bytes
movdqa xmm3,xmm0
pshufb xmm3,xmm14 //copy the next two bytes across
pand xmm3,xmm15 //Mask off the relevant bits
pcmpeqb xmm3,xmm15 //Expand a bit into a byte
movdqu [rdx+32],xmm3
//bytes 6,7
psrldq xmm0,2 //shift in the next two bytes
movdqa xmm4,xmm0
pshufb xmm4,xmm14 //copy the final two bytes across
pand xmm4,xmm15 //Mask off the relevant bits
pcmpeqb xmm4,xmm15 //Expand a bit into a byte
//Store the data
movdqu [rdx+48],xmm4
end;
However, that is a lot of code. I'm hoping there's a way to do this with fewer instructions that runs faster.
The way the code works is simple.
First, each input byte is replicated 8 times (pshufb). Next, one bit per copy is isolated by ANDing with the 01,02,04,...,80 mask (pand). Finally, because the surviving bit sits at a different position in each byte, it is expanded into a full byte ($FF or $00) by comparing the result for equality with the same mask (pcmpeqb).
The opposite operation is a simple PMOVMSKB.
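To make the steps above concrete, here is a minimal scalar sketch in plain Pascal of what the SSE version computes, plus the reverse packing step. It is only a reference model for checking results, not a candidate for the fast path. The names TSliverBytes, TSliceBytes, ExpandBits and PackBits are hypothetical stand-ins and are not part of my actual TSliverHelper/TSlice code.

```pascal
program ExpandBitsSketch;
{$APPTYPE CONSOLE}

type
  // Hypothetical stand-ins for the 8-byte sliver and the 64-byte slice.
  TSliverBytes = array[0..7] of Byte;
  TSliceBytes = array[0..63] of Byte;

// Scalar model of pshufb + pand + pcmpeqb:
// bit b of input byte i becomes output byte i*8+b ($FF if set, $00 if clear).
procedure ExpandBits(const Sliver: TSliverBytes; var Slice: TSliceBytes);
var
  i, b: Integer;
  Mask: Byte;
begin
  for i := 0 to 7 do
    for b := 0 to 7 do
    begin
      Mask := 1 shl b;                     // one entry of the 01,02,04,...,80 mask
      if (Sliver[i] and Mask) = Mask then  // pand, then compare with the mask
        Slice[i * 8 + b] := $FF
      else
        Slice[i * 8 + b] := $00;
    end;
end;

// Scalar model of the reverse step (what pmovmskb does per 16 bytes):
// the top bit of every slice byte is collected back into one sliver bit.
procedure PackBits(const Slice: TSliceBytes; var Sliver: TSliverBytes);
var
  i, b: Integer;
begin
  for i := 0 to 7 do
  begin
    Sliver[i] := 0;
    for b := 0 to 7 do
      if (Slice[i * 8 + b] and $80) <> 0 then
        Sliver[i] := Sliver[i] or (1 shl b);
  end;
end;

var
  Sliver, Back: TSliverBytes;
  Slice: TSliceBytes;
  i: Integer;
  Same: Boolean;
begin
  for i := 0 to 7 do
    Sliver[i] := 0;
  Sliver[0] := $05;  // bits 0 and 2 set
  Sliver[7] := $80;  // bit 7 set

  ExpandBits(Sliver, Slice);

  // Show the eight output bytes produced from input byte 0.
  for i := 0 to 7 do
    Write(Slice[i], ' ');
  WriteLn;           // prints: 255 0 255 0 0 0 0 0

  // Packing the slice again should give back the original sliver.
  PackBits(Slice, Back);
  Same := True;
  for i := 0 to 7 do
    if Back[i] <> Sliver[i] then
      Same := False;
  if Same then
    WriteLn('round trip ok')
  else
    Halt(1);
end.
```

Comparing the 64 bytes written by the asm routine against ExpandBits for a handful of inputs is a cheap way to check any shorter instruction sequence before timing it.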
I can use AVX1 code, but not AVX2.