The Wayback Machine - https://web.archive.org/web/20220127220539/https://github.com/AccelerateHS/accelerate/issues/140
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of highly skewed multidimensional reductions #140

Open
tmcdonell opened this issue Dec 19, 2013 · 6 comments
Open

Performance of highly skewed multidimensional reductions #140

tmcdonell opened this issue Dec 19, 2013 · 6 comments

Comments

@tmcdonell
Copy link
Member

@tmcdonell tmcdonell commented Dec 19, 2013

Performance of multidimensional reductions is not good when the array is highly skewed. For example, a fold where the number of columns is (innermost dimension) is very small. See also this thread:

https://groups.google.com/forum/#!topic/accelerate-haskell/KAFYUz4Sjsk

Multidimensional reduction uses one thread block per reduction; so an (Z :. m :. n) sized matrix uses m thread blocks. If n is very small, then many threads in the block sit idle. We could change this to a warp-per-reduction style, which is actually the strategy segmented fold uses. This will likely have a negative impact if m is small and n large.

It would be possible to generate both variants and choose dynamically which to execute. That implies compiling four kernels per reduction (because fusion; initial vs. recursive step).

@yongqli
Copy link

@yongqli yongqli commented Dec 8, 2014

I have ran into this problem as well, is there a temporary workaround?

@yongqli
Copy link

@yongqli yongqli commented Dec 8, 2014

For now, we are using this function, adapted from David Darais in the above thread, as a replacement for fold:

fold2 :: forall sh a. (Elt a, Shape sh, Slice sh)
      => (Exp a -> Exp a -> Exp a)
      -> Exp a
      -> Acc (Array (sh :. Int) a)
      -> Acc (Array sh a)
fold2 f x0 xs =
  generate (indexTail $ shape xs) $ \i -> sfoldl f x0 i xs
@tmcdonell
Copy link
Member Author

@tmcdonell tmcdonell commented Dec 8, 2014

I have not yet had the time to improve the code generation for this type of problem. If you have a good non-synthetic test case you don't mind sharing that I can use to optimise this, please do send it our way.

@yongqli
Copy link

@yongqli yongqli commented Dec 8, 2014

Actually, what is wrong with the generate approach? If the number of columns is, say, only 5, we would want a single thread to compute it, correct? What is the mapping strategy for generate?

@tmcdonell
Copy link
Member Author

@tmcdonell tmcdonell commented Jan 11, 2018

@RasmusWL had interesting work on this at FHPC'17 in the context of Futhark; we should steal his ideas!

@tmcdonell tmcdonell removed this from the _|_ milestone Jan 11, 2018
@RasmusWL
Copy link

@RasmusWL RasmusWL commented Jan 11, 2018

@tmcdonell go ahead! See my thesis and the paper for details ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment