Performance of highly skewed multidimensional reductions #140

tmcdonell · 2013-12-19T06:33:10Z

Performance of multidimensional reductions is poor when the array is highly skewed: for example, a fold where the number of columns (the innermost dimension) is very small. See also this thread:

https://groups.google.com/forum/#!topic/accelerate-haskell/KAFYUz4Sjsk

Multidimensional reduction uses one thread block per reduction, so a matrix of shape (Z :. m :. n) uses m thread blocks. If n is very small, most threads in each block sit idle. We could change this to a warp-per-reduction style, which is the strategy segmented fold already uses. However, that will likely hurt performance when m is small and n is large.
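To put rough numbers on the idle-thread problem, here is a minimal sketch in plain Haskell. It is a simplified model, not the backend's real scheduling: it assumes each thread loads at most one element in the first step, and the block size of 256 is an assumed example value (the warp size of 32 is the usual NVIDIA figure). The names `idleThreads` and `assumedBlockSize` are made up for this illustration.

```haskell
-- NVIDIA warps have 32 threads; the block size is an ASSUMED example value,
-- not something taken from the Accelerate code generator.
warpSize, assumedBlockSize :: Int
warpSize         = 32
assumedBlockSize = 256

-- Number of threads in a group of the given size that have no element to
-- load when the group reduces one row of length n (simplified model).
idleThreads :: Int -> Int -> Int
idleThreads groupSize n = groupSize - min groupSize n

report :: String -> Int -> Int -> String
report label groupSize n =
  label ++ ": " ++ show (idleThreads groupSize n)
        ++ " of " ++ show groupSize ++ " threads idle"

main :: IO ()
main = do
  let n = 5  -- a very short innermost dimension
  putStrLn (report "block per reduction" assumedBlockSize n)
  putStrLn (report "warp per reduction" warpSize n)
```

Under these assumptions a row of 5 elements leaves 251 of 256 threads idle with one block per reduction, versus 27 of 32 with one warp per reduction.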

It would be possible to generate both variants and choose dynamically which one to execute. That implies compiling four kernels per reduction (because of fusion: initial vs. recursive step).
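A dynamic choice between the two variants could look something like the sketch below. This is only an illustration: the `Strategy` type, the `chooseStrategy` function and the cutoff of 32 are all hypothetical, nothing here is measured, and a real heuristic would probably also look at m.

```haskell
-- Hypothetical names for the two code-generation variants discussed above.
data Strategy = BlockPerReduction | WarpPerReduction
  deriving (Eq, Show)

-- ASSUMED cutoff: use a warp per reduction when a whole row fits in one
-- warp (32 threads). This threshold is a guess, not a tuned value.
chooseStrategy :: Int -> Strategy
chooseStrategy n
  | n <= 32   = WarpPerReduction
  | otherwise = BlockPerReduction

main :: IO ()
main = mapM_ (print . chooseStrategy) [5, 4096]
```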

yongqli · 2014-12-08T13:17:23Z

I have run into this problem as well. Is there a temporary workaround?

yongqli · 2014-12-08T13:20:43Z

For now, we are using this function, adapted from David Darais in the above thread, as a replacement for fold:

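{-# LANGUAGE ScopedTypeVariables, TypeOperators #-}
import Data.Array.Accelerate

-- Sequentially fold the innermost dimension for each index of the outer shape.
fold2 :: forall sh a. (Elt a, Shape sh, Slice sh)
      => (Exp a -> Exp a -> Exp a)
      -> Exp a
      -> Acc (Array (sh :. Int) a)
      -> Acc (Array sh a)
fold2 f x0 xs =
  generate (indexTail $ shape xs) $ \i -> sfoldl f x0 i xs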
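
As a usage sketch (not from the original thread), the following runs fold2 on a small 3×2 matrix with the reference interpreter. It assumes an Accelerate version that still exports sfoldl and provides Data.Array.Accelerate.Interpreter; the input array `skewed` and the helper `rowSums` are made up for illustration.

```haskell
{-# LANGUAGE ScopedTypeVariables, TypeOperators #-}

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate
  (Acc, Array, DIM2, Elt, Exp, Shape, Slice, Z(..), (:.)(..))
import Data.Array.Accelerate.Interpreter (run)

-- fold2 as posted above: one sequential fold per index of the outer shape.
fold2 :: forall sh a. (Elt a, Shape sh, Slice sh)
      => (Exp a -> Exp a -> Exp a)
      -> Exp a
      -> Acc (Array (sh :. Int) a)
      -> Acc (Array sh a)
fold2 f x0 xs =
  A.generate (A.indexTail $ A.shape xs) $ \i -> A.sfoldl f x0 i xs

-- Illustrative input: 3 rows, 2 columns, filled in row-major order with 1..6.
skewed :: Array DIM2 Int
skewed = A.fromList (Z :. 3 :. 2) [1 .. 6]

-- Sum along the innermost dimension, giving one result per row.
rowSums :: [Int]
rowSums = A.toList (run (fold2 (+) 0 (A.use skewed)))

main :: IO ()
main = print rowSums  -- rows are (1,2), (3,4), (5,6), so this prints [3,7,11]
```

Swapping the interpreter's `run` for a GPU backend's `run` should be the only change needed to try this on real hardware, assuming that backend is installed.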

tmcdonell · 2014-12-08T15:53:56Z

I have not yet had the time to improve the code generation for this type of problem. If you have a good non-synthetic test case you don't mind sharing that I can use to optimise this, please do send it our way.

yongqli · 2014-12-08T17:37:03Z

Actually, what is wrong with the generate approach? If the number of columns is, say, only 5, we would want a single thread to compute it, correct? What is the mapping strategy for generate?

tmcdonell · 2018-01-11T00:47:10Z

@RasmusWL had interesting work on this at FHPC'17 in the context of Futhark; we should steal his ideas!

RasmusWL · 2018-01-11T10:13:39Z

@tmcdonell go ahead! See my thesis and the paper for details ;)

tmcdonell mentioned this issue Nov 14, 2016

accelerate-cuda: launch time-out when fusing fold1 with replicate #193

Closed

tmcdonell added this to the _|_ milestone Apr 14, 2017

tmcdonell removed the cuda backend label Jan 11, 2018

tmcdonell removed this from the _|_ milestone Jan 11, 2018

tmcdonell added llvm-ptx help wanted labels Jan 11, 2018

tmcdonell added the good first issue label Jan 11, 2018

AccelerateHS / accelerate Public
