add an example for multi-arch builds #118

mattkretz · 2016-04-07T09:26:07Z

The example should have one file compiled with the vc_compile_for_all_implementations macro. It should then call into the correct one at runtime, as determined by the results from cpuid.

milianw · 2018-12-03T22:17:41Z

what I gathered so far:

documentation on the cmake macro: https://vcdevel.github.io/Vc-1.4/buildsystem.html#buildsystem_macros
in the code you compile via the macro, you should use Vc::CurrentImplementation::current() as a template tag
then you can dispatch at runtime in a switch over the currently supported implementation using e.g. Vc::bestImplementationSupported()
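A minimal sketch of the dispatch pattern described in these notes, written without Vc so that it stands alone. Everything named here (`Impl`, `Kernel`, `bestSupported`, `dispatchSum`) is hypothetical: `Impl` stands in for the tag that `Vc::CurrentImplementation::current()` would supply, and `bestSupported()` stands in for `Vc::bestImplementationSupported()`, using the GCC/Clang builtin `__builtin_cpu_supports` for the cpuid query. This is neither Vc's nor Krita's code.

```cpp
#include <cstddef>

// Hypothetical stand-in for an implementation tag.
enum class Impl { Scalar, SSE2, AVX, AVX2 };

// In a real multi-arch build this template would live in the one source file
// that the build system compiles several times with different flags, each
// time instantiating it for a different tag. Here all instantiations share
// one translation unit, so they generate identical code; only the dispatch
// structure is illustrated.
template <Impl I>
struct Kernel {
    static float sum(const float* data, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            acc += data[i];
        }
        return acc;
    }
};

// Ask the CPU at runtime which tag is the best one it supports.
// Falls back to Scalar on compilers or targets without the x86 builtins.
inline Impl bestSupported() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return Impl::AVX2;
    if (__builtin_cpu_supports("avx")) return Impl::AVX;
    if (__builtin_cpu_supports("sse2")) return Impl::SSE2;
#endif
    return Impl::Scalar;
}

// The runtime switch: pick the instantiation that matches the tag.
inline float dispatchSum(Impl impl, const float* data, std::size_t n) {
    switch (impl) {
    case Impl::AVX2: return Kernel<Impl::AVX2>::sum(data, n);
    case Impl::AVX: return Kernel<Impl::AVX>::sum(data, n);
    case Impl::SSE2: return Kernel<Impl::SSE2>::sum(data, n);
    case Impl::Scalar: break;
    }
    return Kernel<Impl::Scalar>::sum(data, n);
}
```

A caller would write `dispatchSum(bestSupported(), data, n)`, typically querying `bestSupported()` once at startup and caching the result.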

Here are some Krita code snippets that I found for this purpose:

milianw · 2018-12-03T22:29:24Z

@mattkretz: from what I understand, the above wouldn't be sufficient to get the best performance for AVX2, as noted in the README for the N-vortex solver:

N-vortex solver showing simdized iteration over many std::vector. Note how important the -march flag is, compared to plain -mavx2 -mfma.

If we just dispatch on AVX2 + FMA, then we miss out on the skylake optimizations :( How big are they, i.e. would you say it's worthwhile to add per-arch dispatch?

mattkretz · 2018-12-06T10:33:24Z

Thanks for adding more details and examples to this issue!

The exact type to dispatch on is a trade-off between code size bloat and perfect support for all possible hardware. It depends on the algorithms and performance requirements what the right trade-off for a specific application is. In any case, just using the CurrentImplementation helper type is a good starting point.

Regarding -m<ISA extension> vs. -march=<CPU name>:

The difference can be very significant. In the N-vortex solver case it's due to the loop being limited by loads and stores. And since older CPUs couldn't do 256-bit loads, the generic optimization of GCC splits AVX loads/stores into two 128-bit loads/stores. This hurts a lot because the CPU is limited by the number of load instructions (i.e. 2 loads + 1 store per clock cycle), not the number of bytes.
In other cases there might be no (or no noticeable) difference at all.
My recommendation nowadays is to compile for different -march and dispatch on those. But Vc 1.4 doesn't implement it this way yet. I'd be open to patches that don't break Krita (but allow Krita to easily switch). Specifically the logic starting at https://github.com/VcDevel/Vc/blob/1.4/cmake/VcMacros.cmake#L449 would need to be duplicated using -march.

mattkretz added type: enhancement type: documentation labels Apr 7, 2016
