Timeline for Solving an easy LeetCode "Merge Strings Alternately"

Current License: CC BY-SA 4.0

15 events

when toggle format	what		by	license	comment
Feb 28 at 1:59	comment	added	Peter Cordes		@nocomment: Indeed, x86-64 GCC vectorizes with `punpckl/hbw`; AArch64 Clang vectorizes with `st2` interleaving stores (NEON or SVE). godbolt.org/z/64sTEPMrc So yes, for inputs of similar length that are 16 bytes or longer, there's a huge speedup, especially when `min_len % 16` is small.
Feb 28 at 1:54	comment	added	Peter Cordes		@nocomment: Yes, I'd certainly hope compilers would auto-vectorize this with x86 `punpcklbw` or ARM `zip` or `uzp`. That's what makes this answer 8x to 16x faster than the others for medium-sized strings that fit in L1d or L2 cache: compilers will only auto-vectorize loops whose trip-count can be calculated ahead of the first iteration. (We need the size ahead of the first iteration so even manually vectorizing we need two passes over the inputs, first for strlen then for copying, unless we over-allocate and shrink, or we rely on a good realloc to use e.g. Linux `mremap`)
Feb 28 at 0:36	comment	added	user272752		@chux If you're accepting kibitzing from the peanut gallery: The `for()` could be replaced with a downcount of `len_min` (no need for `i`), then the ternary using `(*word1)? word1 : word2;` instead of comparing `len`s.... Cheers! `:-)`...
Feb 27 at 23:43	comment	added	no comment		Then again, maybe the compiler unrolls your loop and does something like `for (size_t i = 0; i + 8 < len_min; i+=8) { copy 8 pairs without checking for ends }`? Completely saving comparisons and jumps like that would be better, I guess.
Feb 27 at 23:30	comment	added	no comment		Yeah, I have no idea about compiler optimizations. I just suspect that they recognize the duplicate `*word1` access and only do it once, so that we get that almost for free. Hmm, I don't love my lengthy swap, but I think I still prefer it over your code duplication.
Feb 27 at 23:15	comment	added	chux		@nocomment OK. then is comes down to `while (word1)` vs `for (size_t i = 0; i < len_min; i++) {` vs other variations. In general, I have found compilers are very good* with `for(i=0; i<n; i++) {` loops. Still hard to beat `while (ptr)`. Perhaps `char dest = merged; if (len1 < len2) { while (word1) { dest++ = word1++; dest++ = word2++; } strcpy(dest, word2); } else { while (word2) { dest++ = word1++; dest++ = word2++; } strcpy(dest, word1); }`?
Feb 27 at 22:59	comment	added	no comment		You wouldn't need your extra `i` and `len_min`. The loop would be `while (*word1)`. Like this.
Feb 27 at 22:51	comment	added	chux		@nocomment Ok, then please explain further "You could reduce the loop further by ensuring that word1 isn't longer than word2" with more detail. Since this answer's loop only iterates for the minimum of `len1` and `len2`, how does also "ensuring that word1 isn't longer than word2" help?
Feb 27 at 22:44	comment	added	no comment		I don't think I will. It's a potential optimization, so I wouldn't want to post it as an answer without a benchmark, and I'm not competent enough with C to do that properly.
Feb 27 at 22:36	comment	added	chux		@no comment Interesting. Consider posting your idea as an answer.
Feb 27 at 22:27	comment	added	no comment		You could reduce the loop further by ensuring that word1 isn't longer than word2. If it is, then write its first char and swap the words. Then do the loop without `i` and stop it when word1 ends. (Though I have no idea whether that's faster.)
Feb 27 at 21:27	history	edited	chux	CC BY-SA 4.0	Code correction
Feb 27 at 14:57	history	edited	chux	CC BY-SA 4.0	added 110 characters in body
Feb 27 at 14:36	history	edited	chux	CC BY-SA 4.0	edited body
Feb 27 at 14:31	history	answered	chux	CC BY-SA 4.0