These sorts of problems are a fun challenge, and you are right, shifting is a great solution. I have played with it, and can get up to 5X performance improvements by applying a few tricks.
The first trick, is to split the long values up in to collections of independent 2-bit groups... in bits, it would be the two groups:
00xx00xx00xx00xx.....00xx00xx
xx00xx00xx00xx00.....xx00xx00
by taking these alternate groups, there's some tricks we can play. For a start, we can do some bit manipulation and arithmetic on the 2-bit groups without overflowing in to the neighbours (because the neighbour is at least 2 bits away).
Additionally, we can shift the second part by two bits, and use the exact same code to process the second part... it's easier to show, than to explain... but, more explanation first...
With the 2-bit blocks, we can turn the a - b in to a + (-b) using two's compliment. To do this, we take our 2-bit number and make it a 3-bit number (it's OK, we have the space because our neighbour is far away...).
So, our b bits become a 3-bit 2's compliment:
00 -> 000
01 -> 111
10 -> 110
11 -> 101
We then add that to our a bits, using regular addition. Some of these may overflow in to the 4th bit, but we will truncate that overflow. For example, if the a-bits were 11 and the b-bits were 10, the whole process is:
start
a 0011
b 0010
b's 2's comp
a 0011
b' 0110
add a+b'
a 0011
b' 0110
s 1001
now we truncate the overflow to leave a 3-bit value of:
s 001
which is the difference.
Note that the math is more complicated if the b value is larger (a negative result). Compare the above with a 10 and b 11:
start
a 0010
b 0011
b's 2's comp
a 0010
b' 0101
add a+b'
a 0010
b' 0101
s 0111
Now s (the sum), is truncated back to 3 bits, and the sign bit is set, and the difference is thus -1.
If the value is negative, we negate it back to positive (abs value). This can be done by running the 2's compliment again if the value is negative.
This all sounds awfully complicated, but, the point is that, in a long, you can do 16 2-bit processes at the same time... here's the code:
private static final long COMP = 0x1111111111111111L;
private static final long TWOBITS = 0x3333333333333333L;
private static final long THREEBITS = 0x7777777777777777L;
private static final long SIGNBIT = 0x4444444444444444L;
private static final long FOURSET = 0x0f0f0f0f0f0f0f0fL;
static long twoBitDiff(final long x, long y) {
// two's compliment on 2-bit sets of y - to make 3bit values.
// third bit is the sign bit.
y &= TWOBITS;
y ^= THREEBITS;
y += COMP;
long diff = (x & TWOBITS) + y;
// negate any negative results - (abs value)
// two's complement again
long sign = (diff & SIGNBIT) >>> 2;
diff ^= sign | (sign << 1);
diff += sign;
return diff & TWOBITS;
}
So, the above method will calculate the difference between the value on alternating 2-bit blocks, leaving the difference at their relative positions in the output.
Put this together with a control-method, that first computes the differences on one set of 2-bits, then the other set of 2-bits, and efficiently adds up the results:
static int shiftySum(long x, long y) {
long diff = twoBitDiff(x, y) + twoBitDiff(x >>> 2, y >>> 2);
diff += diff >>> 4;
// ensure low-order unused bits are clean.
diff &= FOURSET;
diff += diff >>> 8;
diff += diff >>> 16;
diff += diff >>> 32;
return (int)(diff & 0xff);
}
Running the above through some performance benchmarks, comparing the funnySum with the shiftySum, and maaartySum, and now freakySum, I get the performance results:
Task TBSums -> Funny: (Unit: MILLISECONDS)
Count : 136 Average : 36.8288
Fastest : 35.8241 Slowest : 46.3444
95Pctile : 38.5454 99Pctile : 41.3839
TimeBlock : 37.836 37.312 36.661 36.396 36.788 36.555 36.394 36.482 36.212 37.607
Histogram : 136
Task TBSums -> Freaky: (Unit: MILLISECONDS)
Count : 833 Average : 6.0031
Fastest : 5.7492 Slowest : 14.5596
95Pctile : 6.2877 99Pctile : 8.4362
TimeBlock : 5.941 5.871 5.890 5.872 5.888 6.282 6.128 6.061 6.051 6.050
Histogram : 832 1
Task TBSums -> Shifty: (Unit: MILLISECONDS)
Count : 857 Average : 5.8345
Fastest : 5.5854 Slowest : 11.3062
95Pctile : 6.1534 99Pctile : 7.7529
TimeBlock : 5.800 5.663 5.771 5.701 5.740 5.995 5.929 5.838 5.912 5.999
Histogram : 854 3
Task TBSums -> Maaarty: (Unit: MILLISECONDS)
Count : 1000 Average : 3.6067
Fastest : 3.4531 Slowest : 6.9772
95Pctile : 3.8111 99Pctile : 4.5753
TimeBlock : 3.581 3.510 3.522 3.524 3.698 3.672 3.641 3.655 3.658 3.606
Histogram : 999 1
There is something to be said for maaartinus's new implementation...