Timeline for When to optimize for memory vs performance speed for a method?
Current License: CC BY-SA 4.0
32 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Sep 8, 2018 at 3:06 | comment | added | Phil Frost | | @Kevin JIT warmup has already been discussed at length in the comments. |
| Sep 8, 2018 at 1:20 | comment | added | Kevin | | Re why order matters for PyPy: Did you burn through the JIT warmup period? |
| Sep 7, 2018 at 15:28 | comment | added | Peter Cordes | | @chux: gcc also has to undo the abs and turn it back into a range-check (like I commented on another answer that suggested the abs() "optimization"), then apply the unsigned-compare trick: (unsigned)(a+b+999) <= 1998U. (I can't repro the a+b+1000 < 2000 output on Godbolt with -O1 or -O3 with a few gcc versions: godbolt.org/z/d0isuc.) Other compilers (like clang) don't manage to undo it. |
| Sep 7, 2018 at 14:39 | comment | added | chux | | !(a + b > 1000 or a + b < -1000) and abs(a + b) <= 1000 can produce different functionality when a+b == INT_MIN, given the usual trouble with abs(INT_MIN). This compiler happened to produce the desired code, yet was not obliged to with abs(a + b) <= 1000. |
| Sep 7, 2018 at 12:00 | comment | added | Phil Frost | | @PeterCordes timeit times running the callable number times, and then it repeats that test repeat times, and I print the fastest of those repeats, the assumption being that the fastest time is the intrinsic speed of the function without any JIT overhead, interruptions by other processes on the host, cache misses, etc. The CPython timeit documentation concurs, but for some reason PyPy has decided this is "misleading", though they don't say why. Maybe it's re-compiling the function periodically even though nothing has changed. Maybe it's a bug. I don't really know. |
| Sep 7, 2018 at 11:40 | comment | added | Phil Frost | | @maple_shaft I seriously doubt it. The code and the input are the same every time, and yet if I run it a few million times it somehow becomes up to three orders of magnitude slower. |
| Sep 7, 2018 at 11:35 | comment | added | maple_shaft♦ | | I suspect the reason the order of your benchmarking seems to matter might have to do with branch prediction in the CPU: stackoverflow.com/questions/11227809/… |
| S Sep 7, 2018 at 11:30 | history | suggested | user142543 | CC BY-SA 4.0 | Add Python syntax highlighting |
| Sep 7, 2018 at 10:39 | review | Suggested edits | | | S Sep 7, 2018 at 11:30 |
| Sep 7, 2018 at 5:39 | comment | added | Doc Brown | | @Corey: this answer is actually telling you exactly what I wrote in my answer: there is no difference when you use a decent compiler, so instead focus on readability. Of course, it looks better founded; maybe you believe me now. |
| Sep 7, 2018 at 0:27 | comment | added | Peter Cordes | | My point was that the PyPy doc says it prints average and standard deviation, so internally timeit must still record timestamps around this near-trivial thing. It doesn't explain this crazy "cool-down" effect, but it does mean there's huge overhead. (See clflush to invalidate cache line via C function and my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for how hard it is to time very short intervals, even with non-portable raw rdtsc.) |
| Sep 6, 2018 at 22:36 | comment | added | Phil Frost | | No, the number parameter times calling the passed function a lot of times; in my most recent tests, 100000000 times. I grok JIT warmup and all that; the baffling thing is that the first thing to get benchmarked is the fastest. That's not warm-up, it's cool-down. |
| Sep 6, 2018 at 22:34 | comment | added | Peter Cordes | | Oh, timeit is still trying to time each call separately? This function is way too short for that; call overhead will dominate. To realistically let a JIT do anything, you need to call it in a loop over an array, or with input feeding output, or something, and time the whole loop. |
| Sep 6, 2018 at 22:33 | comment | added | Peter Cordes | | The first one runs after import timeit, while the others run right after a print returns. Perhaps there's some warm-up effect there, or influence on what the JIT does? What if you collect all 3 results before printing anything? |
| Sep 6, 2018 at 22:32 | comment | added | Phil Frost | | @PeterCordes Yeah, increasing the iterations by 100x, whichever thing is tested first is still an order of magnitude faster. Oddly, if I benchmark all three implementations once, then again, then a third time, on each iteration the same implementation is slower than it was the first time it was benchmarked. I kinda wonder if timeit is somehow different under PyPy. doc.pypy.org/en/latest/cpython_differences.html says "The timeit module behaves differently under PyPy: it prints the average time and the standard deviation, instead of the minimum, since the minimum is often misleading." |
| Sep 6, 2018 at 22:26 | comment | added | Peter Cordes | | Anyway, if you just crank up the repeat-count iterations so they take about 0.5 to 1.0 seconds, do the results stay similar? Your PyPy results are definitely surprising. (I don't normally look at Python, though, so IDK what kind of gotchas might exist for timeit on such a short function.) Oh, I just tried it. Those times are per-call averages, not the actual total measurement interval. So IDK. |
| Sep 6, 2018 at 22:25 | comment | added | Peter Cordes | | Yeah, the ramp-up effect would normally make the earlier stuff slower. The times aren't so short that the ~8us of pause while the CPU switches frequency and voltage (Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC) should be making anything slower. I just mentioned it as another reason why your bench interval is too short. Perhaps it just interprets at first, and then stops to JIT, like the HotSpot JVM? If it decided to stop and JIT-compile right before the last iteration, that would suck. |
| Sep 6, 2018 at 21:49 | comment | added | Phil Frost | | @PeterCordes Wouldn't that mean the first run is slower, not faster? Not sure how much optimization the PyPy JIT is capable of; there's a module that can inspect the generated machine code but I haven't had time to play with it. |
| Sep 6, 2018 at 20:40 | comment | added | Peter Cordes | | Those PyPy total times seem very short for a language that stops and JIT-compiles. It's also short enough that CPU frequency ramp-up from idle to max turbo might be a factor, especially if you're not using a Skylake (with hardware P-states). Benchmarking in seconds instead of core clock cycles requires controlling for CPU frequency. As a sanity check, does the time scale linearly with the repeat count? If not, you're measuring overhead. Also, won't a JIT be able to inline and hoist the calc out of a loop, unless you use the output of one as input to the next (measuring latency, not throughput)? |
| Sep 6, 2018 at 15:05 | history | edited | Phil Frost | CC BY-SA 4.0 | added 247 characters in body |
| Sep 6, 2018 at 14:56 | comment | added | Phil Frost | | @Corey see edit. |
| Sep 6, 2018 at 14:56 | history | edited | Phil Frost | CC BY-SA 4.0 | answer followup question |
| Sep 6, 2018 at 14:12 | comment | added | Alex Celeste | | Readability can potentially make a program easier to optimize too. The compiler can easily rewrite it to use equivalent logic like the above, but only if it can actually figure out what you're trying to do. If you use a lot of old-school bit hacks, cast back and forth between ints and pointers, reuse mutable storage, etc., it may be much harder for the compiler to prove that a transformation is equivalent, and it'll just leave what you wrote, which may be suboptimal. |
| Sep 6, 2018 at 13:57 | comment | added | Pieter B | | @VisualMelon funnily enough the positive check "return (((a + b) >= -1000) && ((a+b) <= 1000));" gives a different result: sharplab.io/… |
| Sep 6, 2018 at 13:56 | comment | added | Corey P | | @PhilFrost This was an incredible answer and exactly what I was looking for. Follow-up question: what changes when this code is in an interpreted language instead of compiled? Does the optimization still matter, or does it have the same result? |
| Sep 6, 2018 at 13:41 | history | edited | Deduplicator | CC BY-SA 4.0 | added 3 characters in body |
| Sep 6, 2018 at 13:39 | history | edited | Phil Frost | CC BY-SA 4.0 | added 32 characters in body |
| Sep 6, 2018 at 13:27 | history | edited | Phil Frost | CC BY-SA 4.0 | added 75 characters in body |
| Sep 6, 2018 at 13:26 | history | edited | Deduplicator | CC BY-SA 4.0 | added syntax-highlighting |
| Sep 6, 2018 at 13:13 | vote | accept | Corey P | | |
| Sep 6, 2018 at 12:53 | comment | added | VisualMelon | | To provide an example in C#: SharpLab produces identical asm for both methods (Desktop CLR v4.7.3130.00 (clr.dll) on x86) |
| Sep 6, 2018 at 12:37 | history | answered | Phil Frost | CC BY-SA 4.0 | |
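
Peter Cordes's Sep 7, 2018 at 15:28 comment refers to the unsigned-compare trick for range checks. The following Python sketch is my illustration, not code from the thread: it emulates 32-bit unsigned wraparound with a mask (Python ints never overflow on their own) and uses the +1000/2000 constants for the inclusive check in the answer; the +999/1998 variant in the comment is the compiler's output for a slightly different range.

```python
# A sketch of the unsigned-compare trick: -1000 <= s <= 1000 can be tested
# with one comparison by shifting the range to start at zero. In C this is
# (unsigned)(s + 1000) <= 2000U; here we emulate 32-bit wraparound with a
# mask because Python ints are arbitrary-precision.
MASK32 = 0xFFFFFFFF

def range_check(s):
    return not (s > 1000 or s < -1000)

def unsigned_trick(s):
    # Out-of-range sums wrap around to huge unsigned values and fail
    # the single compare.
    return ((s + 1000) & MASK32) <= 2000

# Spot-check that the two forms agree over a window around the valid range.
assert all(range_check(s) == unsigned_trick(s) for s in range(-5000, 5001))
```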
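chux's point about abs(INT_MIN) (Sep 7, 2018 at 14:39) can be made concrete. This sketch is mine, not from the thread: it emulates 32-bit two's-complement arithmetic to show the two checks disagreeing at a+b == INT_MIN. In real C this is undefined behavior, and the wraparound shown here is only the typical outcome.

```python
# A sketch of the abs(INT_MIN) hazard, emulating 32-bit two's-complement
# wraparound. In C, abs(INT_MIN) is undefined behavior; the typical
# wraparound result is INT_MIN itself, which is still negative.
INT_MIN = -2**31

def wrap32(x):
    # Map an integer onto the 32-bit two's-complement range.
    return (x + 2**31) % 2**32 - 2**31

def abs_check(s):
    # Emulates abs(a + b) <= 1000 with a wrapping abs().
    return wrap32(abs(s)) <= 1000

def explicit_check(s):
    # Emulates !(a + b > 1000 || a + b < -1000).
    return not (s > 1000 or s < -1000)

print(abs_check(INT_MIN))       # True  (wrapped abs stays INT_MIN, <= 1000)
print(explicit_check(INT_MIN))  # False (correctly reports out of range)
```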
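Phil Frost's Sep 7, 2018 at 12:00 comment describes how timeit's number and repeat parameters interact with taking the minimum. A minimal sketch of that methodology follows; the benchmarked function is a stand-in, not the answer's exact code.

```python
import timeit

# Stand-in for the function under test.
def sum_in_range(a=555, b=444):
    return not (a + b > 1000 or a + b < -1000)

NUMBER = 1_000_000  # calls per trial
REPEAT = 5          # independent trials

# repeat() runs NUMBER calls, REPEAT times over. Taking the minimum trial
# is CPython's documented way to estimate intrinsic speed, on the theory
# that slower trials reflect interference (other processes, cache misses).
trials = timeit.repeat(sum_in_range, number=NUMBER, repeat=REPEAT)
print(min(trials) / NUMBER, "seconds per call (best trial)")
```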
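Peter Cordes's suggestion (Sep 6, 2018 at 22:34) to time a whole loop over an array, rather than each call, could look something like this sketch; the data set and names are illustrative assumptions.

```python
import timeit

def sum_in_range(a, b):
    return not (a + b > 1000 or a + b < -1000)

# Enough varied inputs that per-call overhead no longer dominates and a
# JIT has a real loop to optimize.
pairs = [(i % 4001 - 2000, (i * 7) % 4001 - 2000) for i in range(100_000)]

def whole_loop():
    count = 0
    for a, b in pairs:
        if sum_in_range(a, b):
            count += 1
    return count

# Time the entire loop, not individual calls.
print(timeit.timeit(whole_loop, number=100), "seconds for 100 loop passes")
```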
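The ordering experiment discussed around Sep 6, 2018 at 22:32-22:33 (benchmark every implementation, collect all results, print only at the end) might be structured like this; the three variants are my guesses at the answer's implementations, not its verbatim code.

```python
import timeit

def v_explicit(a, b): return not (a + b > 1000 or a + b < -1000)
def v_abs(a, b):      return abs(a + b) <= 1000
def v_chained(a, b):  return -1000 <= a + b <= 1000

results = []
for round_no in range(3):  # three full rounds, as described in the thread
    for fn in (v_explicit, v_abs, v_chained):
        t = timeit.timeit(lambda: fn(555, 444), number=1_000_000)
        results.append((round_no, fn.__name__, t))

# Print only after all timing is done, so I/O between measurements
# can't influence warm-up or "cool-down" effects.
for round_no, name, t in results:
    print(round_no, name, t)
```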
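Peter Cordes's sanity check (Sep 6, 2018 at 20:40), that measured time should scale linearly with the repeat count, is easy to automate. This sketch assumes the same stand-in function as above.

```python
import timeit

def sum_in_range(a=555, b=444):
    return not (a + b > 1000 or a + b < -1000)

# If time doesn't grow roughly linearly with the call count, the
# measurement is dominated by fixed overhead (or a JIT recompile),
# not by the function under test.
for n in (10_000, 100_000, 1_000_000):
    t = timeit.timeit(sum_in_range, number=n)
    print(f"{n:>9} calls: {t:.6f} s  ({t / n * 1e9:.1f} ns/call)")
```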