Timeline for Solving an easy LeetCode "Merge Strings Alternately"
Current License: CC BY-SA 4.0
15 events
| when toggle format | what | by | license | comment | |
|---|---|---|---|---|---|
| Feb 28 at 1:59 | comment | added | Peter Cordes |
@nocomment: Indeed, x86-64 GCC vectorizes with punpckl/hbw; AArch64 Clang vectorizes with st2 interleaving stores (NEON or SVE). godbolt.org/z/64sTEPMrc So yes, for inputs of similar length that are 16 bytes or longer, there's a huge speedup, especially when min_len % 16 is small.
|
|
| Feb 28 at 1:54 | comment | added | Peter Cordes |
@nocomment: Yes, I'd certainly hope compilers would auto-vectorize this with x86 punpcklbw or ARM zip or uzp. That's what makes this answer 8x to 16x faster than the others for medium-sized strings that fit in L1d or L2 cache: compilers will only auto-vectorize loops whose trip-count can be calculated ahead of the first iteration. (We need the size ahead of the first iteration so even manually vectorizing we need two passes over the inputs, first for strlen then for copying, unless we over-allocate and shrink, or we rely on a good realloc to use e.g. Linux mremap)
|
|
| Feb 28 at 0:36 | comment | added | user272752 |
@chux If you're accepting kibitzing from the peanut gallery: The for() could be replaced with a downcount of len_min (no need for i), then the ternary using (*word1)? word1 : word2; instead of comparing lens.... Cheers! :-)...
|
|
| Feb 27 at 23:43 | comment | added | no comment |
Then again, maybe the compiler unrolls your loop and does something like for (size_t i = 0; i + 8 < len_min; i+=8) { copy 8 pairs without checking for ends }? Completely saving comparisons and jumps like that would be better, I guess.
|
|
| Feb 27 at 23:30 | comment | added | no comment |
Yeah, I have no idea about compiler optimizations. I just suspect that they recognize the duplicate *word1 access and only do it once, so that we get that almost for free. Hmm, I don't love my lengthy swap, but I think I still prefer it over your code duplication.
|
|
| Feb 27 at 23:15 | comment | added | chux |
@nocomment OK. then is comes down to while (*word1) vs for (size_t i = 0; i < len_min; i++) { vs other variations. In general, I have found compilers are very good with for(i=0; i<n; i++) { loops. Still hard to beat while (*ptr). Perhaps char *dest = merged; if (len1 < len2) { while (*word1) { *dest++ = *word1++; *dest++ = *word2++; } strcpy(dest, word2); } else { while (*word2) { *dest++ = *word1++; *dest++ = *word2++; } strcpy(dest, word1); }?
|
|
| Feb 27 at 22:59 | comment | added | no comment |
You wouldn't need your extra i and len_min. The loop would be while (*word1). Like this.
|
|
| Feb 27 at 22:51 | comment | added | chux |
@nocomment Ok, then please explain further "You could reduce the loop further by ensuring that word1 isn't longer than word2" with more detail. Since this answer's loop only iterates for the minimum of len1 and len2, how does also "ensuring that word1 isn't longer than word2" help?
|
|
| Feb 27 at 22:44 | comment | added | no comment | I don't think I will. It's a potential optimization, so I wouldn't want to post it as an answer without a benchmark, and I'm not competent enough with C to do that properly. | |
| Feb 27 at 22:36 | comment | added | chux | @no comment Interesting. Consider posting your idea as an answer. | |
| Feb 27 at 22:27 | comment | added | no comment |
You could reduce the loop further by ensuring that word1 isn't longer than word2. If it is, then write its first char and swap the words. Then do the loop without i and stop it when word1 ends. (Though I have no idea whether that's faster.)
|
|
| Feb 27 at 21:27 | history | edited | chux | CC BY-SA 4.0 |
Code correction
|
| Feb 27 at 14:57 | history | edited | chux | CC BY-SA 4.0 |
added 110 characters in body
|
| Feb 27 at 14:36 | history | edited | chux | CC BY-SA 4.0 |
edited body
|
| Feb 27 at 14:31 | history | answered | chux | CC BY-SA 4.0 |