This may be a naive question, but why is `_mm_sfence` called when size >= L2 cache size? https://github.com/skywind3000/FastMemcpy/blob/master/FastMemcpy.h#L680 Also, what if the hard-coded L2 cache size (0x200000) does not match the machine's actual L2 cache size? Does that have any impact on performance?
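
For context, here is my understanding of the pattern I am asking about, as a minimal sketch and not the library's actual code. I am assuming the large-copy branch writes with non-temporal stores (`_mm_stream_si128`) and then issues the fence. `copy_sketch` and `CACHE_THRESHOLD` are hypothetical names I made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#include <xmmintrin.h>  /* SSE: _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold, mirroring the 0x200000 value from the question. */
#define CACHE_THRESHOLD ((size_t)0x200000)

/* Hypothetical helper: small copies go through memcpy, large copies use
 * non-temporal (streaming) stores followed by a store fence. */
static void *copy_sketch(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);

    if (size < CACHE_THRESHOLD) {
        memcpy(d, s, size);
        return dst;
    }

    /* _mm_stream_si128 requires a 16-byte aligned destination, so copy
     * the unaligned head bytes first. */
    size_t head = (size_t)((-(uintptr_t)d) & 15);
    memcpy(d, s, head);
    d += head;
    s += head;
    size -= head;

    /* Main loop: unaligned load, streaming store, 16 bytes at a time. */
    for (; size >= 16; d += 16, s += 16, size -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    _mm_sfence();  /* the fence the question is about */

    /* Copy the remaining tail (fewer than 16 bytes). */
    memcpy(d, s, size);
    return dst;
}
```

In this sketch the fence is only reached on the streaming path, which is how I read the linked line. If the threshold is set too low or too high for the real cache, I would expect the only difference to be which path a given copy size takes.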