The Wayback Machine - https://web.archive.org/web/20201013120951/https://github.com/VcDevel/Vc/issues/118
add an example for multi-arch builds #118

Open
mattkretz opened this issue Apr 7, 2016 · 3 comments

Comments

@mattkretz (Member) commented Apr 7, 2016

The example should have one file compiled with the vc_compile_for_all_implementations macro. At runtime it should then call into the correct implementation, as determined by the results from cpuid.

@milianw commented Dec 3, 2018

What I have gathered so far:

  • Documentation on the CMake macro: https://vcdevel.github.io/Vc-1.4/buildsystem.html#buildsystem_macros
  • In the code you compile via the macro, you should use Vc::CurrentImplementation::current() as a template tag.
  • You can then dispatch at runtime with a switch over the supported implementations, using e.g. Vc::bestImplementationSupported().
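
The pattern described in the list above can be sketched without Vc as follows. This is only an illustration under stated assumptions: `Impl`, `sumKernel`, `detectBestImpl` and `sumDispatch` are hypothetical names, not part of Vc. The enum stands in for the implementation tag, and the GCC/Clang builtin `__builtin_cpu_supports` stands in for whatever runtime query the real code would use (e.g. Vc::bestImplementationSupported()). In a real multi-arch build each instantiation of the kernel would come from a separate compilation of the same source file with different ISA flags; here all instantiations are compiled with the same flags, so only the dispatch structure is shown.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the implementation tag; this is not Vc's enum.
enum class Impl { Scalar, SSE2, AVX, AVX2 };

// Human-readable name of a tag, useful for logging which path was taken.
const char* implName(Impl i) {
    switch (i) {
        case Impl::Scalar: return "scalar";
        case Impl::SSE2:   return "sse2";
        case Impl::AVX:    return "avx";
        case Impl::AVX2:   return "avx2";
    }
    return "unknown";
}

// The kernel is a template over the tag. In a real multi-arch build, each
// instantiation would be produced by compiling the same file once per ISA.
template <Impl I>
float sumKernel(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Assumption: runtime detection via GCC/Clang builtins on x86.
// On other compilers or architectures this falls back to Scalar.
Impl detectBestImpl() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return Impl::AVX2;
    if (__builtin_cpu_supports("avx"))  return Impl::AVX;
    if (__builtin_cpu_supports("sse2")) return Impl::SSE2;
#endif
    return Impl::Scalar;
}

// Runtime dispatch: a switch maps the runtime value to a compile-time tag.
float sumDispatch(const std::vector<float>& v, Impl impl) {
    switch (impl) {
        case Impl::AVX2:   return sumKernel<Impl::AVX2>(v);
        case Impl::AVX:    return sumKernel<Impl::AVX>(v);
        case Impl::SSE2:   return sumKernel<Impl::SSE2>(v);
        case Impl::Scalar: return sumKernel<Impl::Scalar>(v);
    }
    return sumKernel<Impl::Scalar>(v);
}
```

A caller would typically query the implementation once, e.g. `const Impl best = detectBestImpl();`, and then call `sumDispatch(data, best)` so the detection cost is not paid on every call.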

Here are some Krita code snippets that I found for this purpose:

@milianw commented Dec 3, 2018

@mattkretz: From what I understand, the above wouldn't be sufficient to get the best performance for AVX2, as noted in the README for the N-vortex solver:

N-vortex solver showing simdized iteration over many std::vector. Note how important the -march flag is, compared to plain -mavx2 -mfma.

If we just dispatch on AVX2 + FMA, then we miss out on the Skylake optimizations :( How big are they? That is, would you say it's worthwhile to add per-arch dispatch?

@mattkretz (Member, Author) commented Dec 6, 2018

Thanks for adding more details and examples to this issue!

The exact type to dispatch on is a trade-off between code size bloat and perfect support for all possible hardware. The right trade-off for a specific application depends on its algorithms and performance requirements. In any case, just using the CurrentImplementation helper type is a good starting point.

Regarding -m<ISA extension> vs. -march=<CPU name>:

  • The difference can be very significant. In the N-vortex solver case it's because the loop is limited by loads and stores. Since older CPUs couldn't do 256-bit loads, GCC's generic optimization splits AVX loads/stores into two 128-bit loads/stores. This hurts a lot because the CPU is limited by the number of load instructions (i.e. 2 loads + 1 store per clock cycle), not by the number of bytes.
  • In other cases there might be no (or no noticeable) difference at all.
  • My recommendation nowadays is to compile for different -march values and dispatch on those. But Vc 1.4 doesn't implement it this way yet. I'd be open to patches that don't break Krita (but allow Krita to switch easily). Specifically, the logic starting at https://github.com/VcDevel/Vc/blob/1.4/cmake/VcMacros.cmake#L449 would need to be duplicated using -march.
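
The load-throughput argument in the first bullet can be made concrete with a back-of-the-envelope model. This is a rough sketch, not a measurement: the helper names `loadCount` and `minLoadCycles` are hypothetical, the limit of 2 loads per clock cycle is taken from the comment above, and stores, caches, and arithmetic are ignored entirely.

```cpp
#include <cstddef>

// Number of load instructions needed to read `bytes` bytes when each load
// moves `loadWidthBytes` bytes (rounded up).
constexpr std::size_t loadCount(std::size_t bytes, std::size_t loadWidthBytes) {
    return (bytes + loadWidthBytes - 1) / loadWidthBytes;
}

// Lower bound on clock cycles if the core issues at most `loadsPerCycle`
// load instructions per cycle (assumed to be 2, as stated above).
constexpr std::size_t minLoadCycles(std::size_t loads,
                                    std::size_t loadsPerCycle = 2) {
    return (loads + loadsPerCycle - 1) / loadsPerCycle;
}

// Example: 4096 bytes of input (1024 floats of 4 bytes each).
constexpr std::size_t kBytes = 4096;
constexpr std::size_t kFullLoads  = loadCount(kBytes, 32);  // 256-bit loads
constexpr std::size_t kSplitLoads = loadCount(kBytes, 16);  // 128-bit halves

// Splitting every 256-bit load into two 128-bit loads doubles the number of
// load instructions, and with it the load-bound cycle estimate.
static_assert(kSplitLoads == 2 * kFullLoads, "split loads double the count");
static_assert(minLoadCycles(kSplitLoads) == 2 * minLoadCycles(kFullLoads),
              "load-bound cycle estimate doubles as well");
```

In this model the same 4096 bytes need 128 load instructions (at least 64 cycles) with 256-bit loads, but 256 load instructions (at least 128 cycles) when each load is split in two, even though the number of bytes moved is identical.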