I have a large calculation to do. While I could utilize all cores, I wondered: is there any reason to leave one core unused? (The calculation is CPU-only, no I/O.) Or am I underestimating the OS, and it would know how to handle this and do proper context switching even if I utilize all cores?
- Utilizing all cores is a good start, and some superstition about the OS behaving better with "-1 cores" is probably just that: superstition. But you should actually profile how it behaves for your calculation, your hardware, and your operating system. (Doc Brown, Aug 13, 2017 at 8:31)
- In many cases, using #cores+1 makes a lot of sense. If you just use #cores, then any unexpected blocking (such as a page fault) needlessly forces a core to be idle. (David Schwartz, Aug 14, 2017 at 5:00)
 
5 Answers
Major operating systems are mature enough to know how to handle processes which use every available core. Other processes may (and often will) be affected, but the computation won't become slower because you used every available core.
The choice of the number of cores depends more on your intention of doing something else while the calculation is being performed.
If, on a desktop machine, you want to be able to use your web browser or watch a video while the computation is being done, you'd better keep one core free for it. In the same way, if the server is doing two things (such as doing computations and, at the same time, processing and reporting its metrics), keeping a core free for the side task could be a good idea.
On the other hand, if your priority is to make the computation as fast as possible, you have to use all the cores.
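The "keep one core free for interactive use" idea can be sketched with Python's `multiprocessing` (the function `crunch` and its workload are made-up stand-ins for the real calculation):

```python
import os
from multiprocessing import Pool

def crunch(n):
    # stand-in for one unit of the CPU-only calculation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # leave one core free for the browser / side tasks,
    # but never go below one worker
    workers = max(1, (os.cpu_count() or 1) - 1)
    with Pool(processes=workers) as pool:
        results = pool.map(crunch, [100_000] * 32)
    print(len(results), "blocks done on", workers, "workers")
```

If the machine is dedicated and speed is the only priority, drop the `- 1` and use every core.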
- Modern OS schedulers are actually pretty good at keeping interactive programs interactive when there's high CPU usage, as long as the interactive programs aren't also using a lot of CPU (which, granted, can be a problem with modern bloated web apps). (James_pic, Aug 13, 2017 at 22:23)
- Note: even on servers, if you want to be able to ssh in and get a snappy answer, leaving core 0 alone might be useful. (Matthieu M., Aug 14, 2017 at 6:58)
 
It depends.
If the machine is dedicated to this computation, you should use all cores – unused computing resources don't speed things up.
If you are using a realtime scheduler, a non-preemptive scheduler, or processor affinity, then you should be a bit more careful, because it's easy to accidentally starve other processes of all computing resources. However, you would have to manually change these settings for something to go wrong, so by default there's no problem here on most OSes.
If the machine is not dedicated to the computation, giving 100% to the computation may not be ideal. For example, if you're using a web browser while the computation is running. Because the load of your machine will occasionally peak above 100%, it will feel sluggish. Throughput-oriented tasks like the computation will not really be slowed down, but latency-sensitive tasks like GUIs will not react as quickly. It is then sensible to only start NPROC-1 threads/processes for the computation. Alternatively, explicitly using a lower priority for the computation than for normal tasks could solve this problem, in which case the computation should use NPROC processes to not waste any resources.
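The second option, lowering the computation's priority so it can safely use all NPROC cores, can be sketched as follows. This is a Unix-only sketch, since it relies on `os.nice`; `crunch` is again an illustrative stand-in:

```python
import os
from multiprocessing import Pool

def crunch(n):
    # stand-in for one unit of the computation
    return sum(i * i for i in range(n))

def lower_priority():
    # raise each worker's niceness so that interactive tasks win
    # any contention for the CPU (Unix-only)
    os.nice(10)

if __name__ == "__main__":
    nproc = os.cpu_count() or 1
    # at low priority, the computation can use every core without
    # making latency-sensitive tasks feel sluggish
    with Pool(processes=nproc, initializer=lower_priority) as pool:
        results = pool.map(crunch, [200_000] * 64)
    print(len(results), "blocks done")
```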
- "if you're using a web browser while the computation is running […] it will feel sluggish. Throughput-oriented tasks like the computation will not really be slowed down, but latency-sensitive tasks like GUIs will not react as quickly. […] explicitly using a lower priority for the computation than for normal tasks could solve this problem" – And that is why the process priority value on Unix is called "niceness" and is configured using a utility named nice. (Jörg W Mittag, Aug 13, 2017 at 11:12)
- "unused computing resources don't speed things up" – technically, they could. Using fewer cores may allow a higher clock rate and reduce synchronisation; that may or may not speed things up. (Davidmh, Aug 13, 2017 at 14:54)
- In addition to @Davidmh's notes: usually, on the CPU side, L1$ and L2$ are shared to some extent between threads, and L3$ is shared across the whole socket, so using more threads might cause increased cache misses, slowing down processes. Especially if the process is memory-bound instead of processor-bound. (Maja Piechotka, Aug 13, 2017 at 18:57)
- If you set thread/process priority levels appropriately, you can mitigate the impact of background work on interactive processes. I've run distributed computing apps on my personal machine for over a decade, and with CPU compute tasks running at low priority my ability to use browsers and other normal desktop apps is unimpaired. Resource sharing on the GPU isn't as advanced, and I've run into occasional problems with GPU-accelerated HTML5 video (never mind games) while running GPU compute in the background. Multi-threaded games can be problematic even with light graphics; Windows starves threads 2+. (Dan Is Fiddling By Firelight, Aug 13, 2017 at 19:29)
 
I'm somewhat circumspect about agreeing with @motoDrizzt, below, given his negative votes :), but that has indeed been my actual experience: more is better, even beyond the actual number of cores (but not thousands). For example, take a look at http://www.forkosh.com/images/avoronoi.gif where each 2D plane of that 3D Voronoi diagram can be generated independently. The program takes an nfork=n query-string attribute to fork off the calculations for n planes "simultaneously".
With a four-core processor, the (user) time to complete the diagram decreases pretty much linearly with nfork, up to about nfork=8 (four cores hyperthreaded). Beyond 8, time still decreases, although more slowly. And beyond about 16 or so, there is no further noticeable improvement. I haven't analyzed this behavior at all, but naively attribute it to the OS (Linux Slackware 14.2 x64 in this case) juggling processes to even further reduce overall idle time.
The best choice is system-dependent, so what you want to do is run both versions on a real system and then check how the system responds. Can you still use a browser, a text editor, and other things on your system? And is performance better when using n threads rather than n-1? What happens if you run the app together with another app that tries to use all CPUs?
And then you need to consider hyperthreading. With four cores plus hyperthreading, you could run 8 threads, or 7. Again, try out the responsiveness of the system and the time to finish.
And finally, consider splitting your work into more blocks than threads. The reason is that different threads will finish the job at different times, and then you want some work left over to hand to the faster threads. Otherwise you'll have to wait until the last thread is finished.
PS. "Hyperthreading can't help with FPU intensive code because there is only one FPU". Absolutely wrong. It is incredibly difficult, even with FPU intensive code, to make full use of the FPU due to latencies. Hyperthreading helps because there are twice as many independent operations available for scheduling.
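The advice above about splitting the work into more blocks than threads can be sketched like this (a toy workload; the block sizes and counts are arbitrary):

```python
import os
from multiprocessing import Pool

def crunch(n):
    # stand-in for one block of the real computation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    workers = os.cpu_count() or 1
    # hand out many more blocks than workers, so faster workers keep
    # picking up fresh blocks instead of idling while the slowest
    # worker finishes its last one
    blocks = [50_000 + i * 1_000 for i in range(workers * 8)]
    with Pool(processes=workers) as pool:
        # chunksize=1 keeps the work queue fine-grained
        results = pool.map(crunch, blocks, chunksize=1)
    print(len(results), "blocks done")
```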
I don't know how to write this in a way that does not sound "bad", so just take it as a friendly remark, OK?
Given that an average PC already runs a thousand or more threads, what makes you think that using 8 vs. 7 will make any difference? :-)
Use as many threads as possible. And if you don't have to care about OS responsiveness, and your threads run for quite a long time (more than a second), you can even experiment with using twice the number of cores.
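Since the only reliable answer is to measure, a small benchmarking sketch like the one below can time the same workload with n-1, n, and 2n workers (the block size is arbitrary; scale it so one run takes at least a few seconds on your machine):

```python
import os
import time
from multiprocessing import Pool

def crunch(n):
    # stand-in for one block of the real computation
    return sum(i * i for i in range(n))

def timed_run(workers, blocks):
    # time a full pass over the blocks with the given pool size
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(crunch, blocks)
    return time.perf_counter() - start

if __name__ == "__main__":
    ncpu = os.cpu_count() or 1
    blocks = [300_000] * (ncpu * 4)
    for workers in (max(1, ncpu - 1), ncpu, ncpu * 2):
        elapsed = timed_run(workers, blocks)
        print(f"{workers:2d} workers: {elapsed:.3f} s")
```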
- But most of these thousands of threads don't use 100 % CPU, do they? (Andreas Rejbrand, Aug 13, 2017 at 10:41)
- Using twice the number of cores does not generally improve computation times. In fact, using more than the number of physical cores is not generally beneficial, even if you have more logical cores (through HyperThreading etc.; although this may depend on the exact task you're performing). Source: experience from the past, using MATLAB Parallel Processing. (Sanchises, Aug 13, 2017 at 11:40)
- @Sanchises This is because hyperthreading leverages quasi-parallel instruction interleaving: it's effective for branchy and memory-heavy code. Matrix calculations are very FPU-intense, and there is only one FPU per physical core, so hyperthreading can't help you. (J..., Aug 13, 2017 at 13:42)