gnasher729

If you try this on macOS, you’ll have an extreme fight on your hands. At boot time, macOS installs code optimised for your particular processor at a fixed address. This is done for memcpy, memmove, memset (plus memset variants for two-, four- and eight-byte values), and for some atomic operations.

The memcpy on my current computer uses vector instructions, cache-control instructions that are not available in C, and all the tricks in the book. You basically have no chance of beating it. And if you do beat it on one computer, the same trick won’t work on another.

As far as your code is concerned: You should try to align the pointers first.

If count >= 1 and dst is not two-byte aligned -> copy 1 byte.
If count >= 2 and dst is not four-byte aligned -> copy 2 bytes.
If count >= 4 and dst is not eight-byte aligned -> copy 4 bytes.

Then you copy eight bytes at a time, followed by another 4 bytes, another 2 bytes, and a final byte, as needed.
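The steps above could be sketched roughly as follows. This is only an illustration, not a tuned implementation: the name `copy_aligned` is made up for this example, and the fixed-size `memcpy` calls stand in for single 2-, 4- and 8-byte loads and stores (compilers typically lower them to exactly that), which avoids undefined behaviour when `src` is not aligned the same way as `dst`. It also assumes that the low bits of a pointer converted to `uintptr_t` reflect its alignment, which is true on common platforms but not guaranteed by the C standard.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper illustrating the align-first strategy.
 * Not a replacement for the system memcpy. */
void *copy_aligned(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: bring d up to eight-byte alignment in steps of 1, 2 and 4 bytes. */
    if (count >= 1 && ((uintptr_t)d & 1)) { *d = *s;         d += 1; s += 1; count -= 1; }
    if (count >= 2 && ((uintptr_t)d & 2)) { memcpy(d, s, 2); d += 2; s += 2; count -= 2; }
    if (count >= 4 && ((uintptr_t)d & 4)) { memcpy(d, s, 4); d += 4; s += 4; count -= 4; }

    /* Body: eight bytes at a time. */
    while (count >= 8) { memcpy(d, s, 8); d += 8; s += 8; count -= 8; }

    /* Tail: fewer than eight bytes remain, so the low bits of count
     * say which of the 4-, 2- and 1-byte copies are still needed. */
    if (count & 4) { memcpy(d, s, 4); d += 4; s += 4; }
    if (count & 2) { memcpy(d, s, 2); d += 2; s += 2; }
    if (count & 1) { *d = *s; }

    return dst;
}
```

Only dst is aligned here; src may still be misaligned, which is one of the cases where a processor-specific memcpy is likely to do better.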
