32 events
when | toggle format | what | by | license | comment
Sep 8, 2018 at 3:06 comment added Phil Frost @Kevin JIT warmup has already been discussed at length in the comments.
Sep 8, 2018 at 1:20 comment added Kevin Re why order matters for PyPy: Did you burn through the JIT warmup period?
Sep 7, 2018 at 15:28 comment added Peter Cordes @chux: gcc also has to undo the abs and turn it back into a range-check (like I commented on another answer that suggested the abs() "optimization"), then apply the unsigned-compare trick: (unsigned)(a+b+999) <= 1998U. (I can't repro the a+b+1000 < 2000 output on Godbolt with -O1 or -O3 with a few gcc versions. godbolt.org/z/d0isuc). Other compilers (like clang) don't manage to undo it.
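The unsigned-compare trick mentioned in this comment can be emulated in Python by masking the sum to 32 bits; this is only a sketch of what the compiler does on actual unsigned machine integers, and the 999/1998 bounds follow the form quoted above:

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound

def in_range_naive(a, b):
    # Two-sided range check: -999 <= a+b <= 999
    s = a + b
    return -999 <= s <= 999

def in_range_trick(a, b):
    # (unsigned)(a + b + 999) <= 1998U: a negative sum wraps around to a
    # huge unsigned value, so a single unsigned compare covers both bounds.
    return ((a + b + 999) & MASK32) <= 1998

# The two agree wherever the sum stays far from 32-bit overflow.
assert all(in_range_naive(a, b) == in_range_trick(a, b)
           for a in range(-2000, 2001, 7)
           for b in (-1500, 0, 1500))
```

Unlike the abs() rewrite, this form has no trouble spot at INT_MIN: the wraparound is the whole point of the unsigned comparison rather than undefined behavior.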
Sep 7, 2018 at 14:39 comment added chux !(a + b > 1000 or a + b < -1000) and abs(a + b) <= 1000 can produce different functionality when a+b == INT_MIN, given the usual trouble with abs(INT_MIN). This compiler happened to produce the desired code, yet was not obliged to with abs(a + b) <= 1000.
Sep 7, 2018 at 12:00 comment added Phil Frost @PeterCordes timeit times running the callable number times, and then it repeats that test repeat times, and I print the fastest of those repeats, the assumption being the fastest time is the intrinsic speed of the function without any JIT overhead, interruptions by other processes on the host, cache misses, etc. The cpython timeit documentation concurs, but for some reason pypy has decided this is "misleading" but they don't say why. Maybe it's re-compiling the function periodically even though nothing has changed. Maybe it's a bug. I don't really know.
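A minimal sketch of the timing approach described in this comment, using a trivial stand-in for the function under test (the function f and the counts here are illustrative, not the original benchmark):

```python
import timeit

def f(a=1, b=2):
    # Hypothetical stand-in for the benchmarked function.
    return -1000 <= a + b <= 1000

# Each entry in `times` is the total time for `number` calls; repeat()
# performs that measurement `repeat` times. Taking the minimum discards
# repeats inflated by JIT pauses, other processes, cache misses, etc.
times = timeit.repeat(f, number=100_000, repeat=5)
per_call = min(times) / 100_000
print(f"{per_call:.3e} s per call")
```

Under PyPy, as the linked docs note, the timeit command-line interface reports average and standard deviation instead of this minimum, which is part of what makes the results above hard to compare.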
Sep 7, 2018 at 11:40 comment added Phil Frost @maple_shaft I seriously doubt it. The code and the input is the same every time, and yet if I run it a few million times it somehow becomes up to three orders of magnitude slower.
Sep 7, 2018 at 11:35 comment added maple_shaft I suspect the reason why the order of your benchmarking seems to matter might have to do with branch prediction in the CPU. stackoverflow.com/questions/11227809/…
S Sep 7, 2018 at 11:30 history suggested user142543 CC BY-SA 4.0
Add Python syntax highlighting
Sep 7, 2018 at 10:39 review Suggested edits
S Sep 7, 2018 at 11:30
Sep 7, 2018 at 5:39 comment added Doc Brown @Corey: this answer is actually telling you exactly what I wrote in my answer: there is no difference when you use a decent compiler, so instead focus on readability. Of course, it looks better founded - maybe you believe me now.
Sep 7, 2018 at 0:27 comment added Peter Cordes My point was that the PyPy doc says it prints average and standard deviation, so internally timeit must still record timestamps around this near-trivial thing. It doesn't explain this crazy "cool-down" effect, but it does mean there's huge overhead. (See clflush to invalidate cache line via C function and my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for how hard it is to time very short intervals, even with non-portable raw rdtsc.)
Sep 6, 2018 at 22:36 comment added Phil Frost No, the number parameter calls the passed function a lot of times - in my most recent tests, 100000000 times. I grok JIT warmup and all that; the baffling thing is the first thing to get benchmarked is the fastest. That's not warm-up, it's cool-down.
Sep 6, 2018 at 22:34 comment added Peter Cordes Oh, timeit is still trying to time each call separately? This function is way too short for that, call overhead will dominate. To realistically let a JIT do anything, you need to call it in a loop over an array, or with input feeding output, or something, and time the whole loop.
Sep 6, 2018 at 22:33 comment added Peter Cordes The first one runs after import timeit, while the others run right after a print returns. Perhaps there's some warm-up effect there, or influence on what the JIT does? What if you collect all 3 results before printing anything?
Sep 6, 2018 at 22:32 comment added Phil Frost @PeterCordes Yeah, increasing the iterations by 100x, whichever thing is tested first is still an order of magnitude faster. Oddly, if I benchmark all three implementations once, then again, then a third time, on each iteration the same implementation is slower than it was the first time it was benchmarked. I kinda wonder if timeit is somehow different under pypy. doc.pypy.org/en/latest/cpython_differences.html says "The timeit module behaves differently under PyPy: it prints the average time and the standard deviation, instead of the minimum, since the minimum is often misleading."
Sep 6, 2018 at 22:26 comment added Peter Cordes Anyway, if you just crank up the repeat-count iterations so they take about 0.5 to 1.0 seconds, do the results stay similar? Your PyPy results are definitely surprising. (I don't normally look at Python, though, so IDK what kind of gotchas might exist for timeit on such a short function.) Oh, I just tried it. Those times are per-call averages, not the actual total measurement interval. So IDK.
Sep 6, 2018 at 22:25 comment added Peter Cordes Yeah, the ramp-up effect would normally make the earlier stuff slower. The times aren't so short that the ~8us of pause while the CPU switches frequency and voltage (Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC) shouldn't be making anything slower. I just mentioned it as another reason why your bench interval is too short. Perhaps it just interprets at first, and then stops to JIT, like the HotSpot JVM? If it decided to stop and JIT-compile right before the last iteration, that would suck.
Sep 6, 2018 at 21:49 comment added Phil Frost @PeterCordes Wouldn't that mean the first run is slower, not faster? Not sure how much optimization the pypy jit is capable of -- there's a module that can inspect the generated machine code but I haven't had time to play with it.
Sep 6, 2018 at 20:40 comment added Peter Cordes Those PyPy total times seem very short for a language that stops and JIT-compiles. It's also short enough that CPU frequency ramp-up from idle to max turbo might be a factor, especially if you're not using a Skylake (with hardware P-states). Benchmarking in seconds instead of core clock cycles requires controlling for CPU frequency. As a sanity check, does the time scale linearly with the repeat count? If not, you're measuring overhead. Also, won't a JIT be able to inline and hoist the calc out of a loop? Unless you use the output of one as input to the next, measuring latency not tput.
Sep 6, 2018 at 15:05 history edited Phil Frost CC BY-SA 4.0
added 247 characters in body
Sep 6, 2018 at 14:56 comment added Phil Frost @Corey see edit.
Sep 6, 2018 at 14:56 history edited Phil Frost CC BY-SA 4.0
answer followup question
Sep 6, 2018 at 14:12 comment added Alex Celeste Readability can potentially make a program easier to optimize too. The compiler can easily rewrite to use equivalent logic like it is above, only if it can actually figure out what you're trying to do. If you use a lot of old-school bithacks, cast back and forth between ints and pointers, reuse mutable storage etc. it may be much harder for the compiler to prove that a transformation is equivalent, and it'll just leave what you wrote, which may be suboptimal.
Sep 6, 2018 at 13:57 comment added Pieter B @VisualMelon funnily enough the positive check: "return (((a + b) >= -1000) && ((a+b) <= 1000));" gives a different result: sharplab.io/…
Sep 6, 2018 at 13:56 comment added Corey P @PhilFrost This was an incredible answer and exactly what I was looking for. Follow up question, what changes when this code is in an interpreted language instead of compiled? Then, does the optimization matter or does it have the same result?
Sep 6, 2018 at 13:41 history edited Deduplicator CC BY-SA 4.0
added 3 characters in body
Sep 6, 2018 at 13:39 history edited Phil Frost CC BY-SA 4.0
added 32 characters in body
Sep 6, 2018 at 13:27 history edited Phil Frost CC BY-SA 4.0
added 75 characters in body
Sep 6, 2018 at 13:26 history edited Deduplicator CC BY-SA 4.0
added syntax-highlighting
Sep 6, 2018 at 13:13 vote accept Corey P
Sep 6, 2018 at 12:53 comment added VisualMelon To provide an example in C#: SharpLab produces identical asm for both methods (Desktop CLR v4.7.3130.00 (clr.dll) on x86)
Sep 6, 2018 at 12:37 history answered Phil Frost CC BY-SA 4.0