32 events
when | toggle format | what | by | license | comment
Sep 8, 2018 at 3:06 comment added Phil Frost @Kevin JIT warmup has already been discussed at length in the comments.
Sep 8, 2018 at 1:20 comment added Kevin Re why order matters for PyPy: Did you burn through the JIT warmup period?
Sep 7, 2018 at 15:28 comment added Peter Cordes @chux: gcc also has to undo the abs and turn it back into a range-check (like I commented on another answer that suggested the abs() "optimization"), then apply the unsigned-compare trick: (unsigned)(a+b+999) <= 1998U. (I can't repro the a+b+1000 < 2000 output on Godbolt with -O1 or -O3 with a few gcc versions. godbolt.org/z/d0isuc). Other compilers (like clang) don't manage to undo it.
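The unsigned-compare trick mentioned in this comment can be emulated in Python by masking the sum to 32 bits; this is only a sketch of what the compiler does on actual unsigned machine integers, and the 999/1998 bounds follow the form quoted above:

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound

def in_range_naive(a, b):
    # Two-sided range check: -999 <= a+b <= 999
    s = a + b
    return -999 <= s <= 999

def in_range_trick(a, b):
    # (unsigned)(a + b + 999) <= 1998U: a negative sum wraps around to a
    # huge unsigned value, so a single unsigned compare covers both bounds.
    return ((a + b + 999) & MASK32) <= 1998

# The two agree wherever the sum stays far from 32-bit overflow.
assert all(in_range_naive(a, b) == in_range_trick(a, b)
           for a in range(-2000, 2001, 7)
           for b in (-1500, 0, 1500))
```

Unlike the abs() rewrite, this form has no trouble spot at INT_MIN: the wraparound is the whole point of the unsigned comparison rather than undefined behavior.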
Sep 7, 2018 at 14:39 comment added chux !(a + b > 1000 or a + b < -1000) and abs(a + b) <= 1000 can produce different functionality when a+b == INT_MIN, given the usual trouble with abs(INT_MIN). This compiler happened to produce the desired code, yet was not obliged to with abs(a + b) <= 1000.
Sep 7, 2018 at 12:00 comment added Phil Frost @PeterCordes timeit times running the callable number times, and then it repeats that test repeat times, and I print the fastest of those repeats, the assumption being the fastest time is the intrinsic speed of the function without any JIT overhead, interruptions by other processes on the host, cache misses, etc. The cpython timeit documentation concurs, but for some reason pypy has decided this is "misleading" but they don't say why. Maybe it's re-compiling the function periodically even though nothing has changed. Maybe it's a bug. I don't really know.
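A minimal sketch of the timing approach described in this comment, using a trivial stand-in for the function under test (the function f and the counts here are illustrative, not the original benchmark):

```python
import timeit

def f(a=1, b=2):
    # Hypothetical stand-in for the benchmarked function.
    return -1000 <= a + b <= 1000

# Each entry in `times` is the total time for `number` calls; repeat()
# performs that measurement `repeat` times. Taking the minimum discards
# repeats inflated by JIT pauses, other processes, cache misses, etc.
times = timeit.repeat(f, number=100_000, repeat=5)
per_call = min(times) / 100_000
print(f"{per_call:.3e} s per call")
```

Under PyPy, as the linked docs note, the timeit command-line interface reports average and standard deviation instead of this minimum, which is part of what makes the results above hard to compare.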
Sep 7, 2018 at 11:40 comment added Phil Frost @maple_shaft I seriously doubt it. The code and the input is the same every time, and yet if I run it a few million times it somehow becomes up to three orders of magnitude slower.
Sep 7, 2018 at 11:35 comment added maple_shaft I suspect the reason why the order of your benchmarking seems to matter might have to do with branch prediction in the CPU. stackoverflow.com/questions/11227809/…
S Sep 7, 2018 at 11:30 history suggested user142543 CC BY-SA 4.0
Add Python syntax highlighting
Sep 7, 2018 at 10:39 review Suggested edits
S Sep 7, 2018 at 11:30
Sep 7, 2018 at 5:39 comment added Doc Brown @Corey: this answer is actually telling you exactly what I wrote in my answer: there is no difference when you use a decent compiler, so instead focus on readability. Of course, it looks better founded - maybe you believe me now.
Sep 7, 2018 at 0:27 comment added Peter Cordes My point was that the PyPy doc says it prints average and standard deviation, so internally timeit must still record timestamps around this near-trivial thing. It doesn't explain this crazy "cool-down" effect, but it does mean there's huge overhead. (See clflush to invalidate cache line via C function and my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for how hard it is to time very short intervals, even with non-portable raw rdtsc.)
Sep 6, 2018 at 22:36 comment added Phil Frost No, the number parameter calls the passed function a lot of times - in my most recent tests, 100000000 times. I grok JIT warmup and all that; the baffling thing is the first thing to get benchmarked is the fastest. That's not warm-up, it's cool-down.
Sep 6, 2018 at 22:34 comment added Peter Cordes Oh, timeit is still trying to time each call separately? This function is way too short for that, call overhead will dominate. To realistically let a JIT do anything, you need to call it in a loop over an array, or with input feeding output, or something, and time the whole loop.
Sep 6, 2018 at 22:33 comment added Peter Cordes The first one runs after import timeit, while the others run right after a print returns. Perhaps there's some warm-up effect there, or influence on what the JIT does? What if you collect all 3 results before printing anything?
Sep 6, 2018 at 22:32 comment added Phil Frost @PeterCordes Yeah, increasing the iterations by 100x, whichever thing is tested first is still an order of magnitude faster. Oddly, if I benchmark all three implementations once, then again, then a third time, on each iteration the same implementation is slower than it was the first time it was benchmarked. I kinda wonder if timeit is somehow different under pypy. doc.pypy.org/en/latest/cpython_differences.html says "The timeit module behaves differently under PyPy: it prints the average time and the standard deviation, instead of the minimum, since the minimum is often misleading."
Sep 6, 2018 at 22:26 comment added Peter Cordes Anyway, if you just crank up the repeat-count iterations so they take about 0.5 to 1.0 seconds, do the results stay similar? Your PyPy results are definitely surprising. (I don't normally look at Python, though, so IDK what kind of gotchas might exist for timeit on such a short function.) Oh, I just tried it. Those times are per-call averages, not the actual total measurement interval. So IDK.
Sep 6, 2018 at 22:25 comment added Peter Cordes Yeah, the ramp-up effect would normally make the earlier stuff slower. The times aren't so short that the ~8us of pause while the CPU switches frequency and voltage (Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC) shouldn't be making anything slower. I just mentioned it as another reason why your bench interval is too short. Perhaps it just interprets at first, and then stops to JIT, like the HotSpot JVM? If it decided to stop and JIT-compile right before the last iteration, that would suck.
Sep 6, 2018 at 21:49 comment added Phil Frost @PeterCordes Wouldn't that mean the first run is slower, not faster? Not sure how much optimization the pypy jit is capable of -- there's a module that can inspect the generated machine code but I haven't had time to play with it.
Sep 6, 2018 at 20:40 comment added Peter Cordes Those PyPy total times seem very short for a language that stops and JIT-compiles. It's also short enough that CPU frequency ramp-up from idle to max turbo might be a factor, especially if you're not using a Skylake (with hardware P-states). Benchmarking in seconds instead of core clock cycles requires controlling for CPU frequency. As a sanity check, does the time scale linearly with the repeat count? If not, you're measuring overhead. Also, won't a JIT be able to inline and hoist the calc out of a loop? Unless you use the output of one as input to the next, measuring latency not tput.
Sep 6, 2018 at 15:05 history edited Phil Frost CC BY-SA 4.0
added 247 characters in body
Sep 6, 2018 at 14:56 comment added Phil Frost @Corey see edit.
Sep 6, 2018 at 14:56 history edited Phil Frost CC BY-SA 4.0
answer followup question
Sep 6, 2018 at 14:12 comment added Alex Celeste Readability can potentially make a program easier to optimize too. The compiler can easily rewrite to use equivalent logic like it is above, only if it can actually figure out what you're trying to do. If you use a lot of old-school bithacks, cast back and forth between ints and pointers, reuse mutable storage etc. it may be much harder for the compiler to prove that a transformation is equivalent, and it'll just leave what you wrote, which may be suboptimal.
Sep 6, 2018 at 13:57 comment added Pieter B @VisualMelon funnily enough the positive check: "return (((a + b) >= -1000) && ((a+b) <= 1000));" gives a different result: sharplab.io/…
Sep 6, 2018 at 13:56 comment added Corey P @PhilFrost This was an incredible answer and exactly what I was looking for. Follow up question, what changes when this code is in an interpreted language instead of compiled? Then, does the optimization matter or does it have the same result?
Sep 6, 2018 at 13:41 history edited Deduplicator CC BY-SA 4.0
added 3 characters in body
Sep 6, 2018 at 13:39 history edited Phil Frost CC BY-SA 4.0
added 32 characters in body
Sep 6, 2018 at 13:27 history edited Phil Frost CC BY-SA 4.0
added 75 characters in body
Sep 6, 2018 at 13:26 history edited Deduplicator CC BY-SA 4.0
added syntax-highlighting
Sep 6, 2018 at 13:13 vote accept Corey P
Sep 6, 2018 at 12:53 comment added VisualMelon To provide an example in C#: SharpLab produces identical asm for both methods (Desktop CLR v4.7.3130.00 (clr.dll) on x86)
Sep 6, 2018 at 12:37 history answered Phil Frost CC BY-SA 4.0