
The program consists of a pipeline of operations:

  • breaking up the input file into lines
  • parsing each line as a complex number
  • updating the histogram vector
  • writing out the PPM file

The first step to optimizing the program is to measure how long each of these steps is taking.

criterion is a great tool, but because it runs the test function multiple times it isn't well suited for operations which take many seconds. Also, criterion is designed to get precise timings, and we just need ballpark figures. For our timings we're just going to use the shell's time command.

I first created a test file containing a million random complex numbers and ran the entire pipeline to get a baseline reading. Creating the PPM from the million complex numbers took 33 secs. To understand whether that was a reasonable number I wrote a quick Perl script that just reads in and adds up the real parts of the complex numbers:

my $sum = 0;
while (<>) {
  if (m/^(.*?)[+]/) { $sum += $1; }
}
print $sum, "\n";

and it only took 1.7 seconds. Clearly there is a big problem somewhere with the Haskell program.

The next step is to write alternative main functions which cut off the pipeline at various stages, e.g.:

A main which just prints out the number of lines read:

-- count the lines only; laziness means the per-line parsing/conversion never runs
mainLength path = do
    rawData <- liftA LB.words (LB.readFile path)
    let formattedData = map (extr . LP.parse parseComplex) rawData
    print $ length formattedData

A main which just adds up the real parts of the parsed numbers:

-- summing the real parts forces every line to be parsed and converted to Double
mainSum path = do
    rawData <- liftA LB.words (LB.readFile path)
    let formattedData = map (extr . LP.parse parseComplex) rawData
        rpart (Complex r _) = r
    print $ sum $ map rpart formattedData
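For reference, these driver functions rely on imports roughly like the following; the exact module choices are my guess, inferred from the LB and LP prefixes and the use of liftA (extr and parseComplex come from the original code):

import Control.Applicative (liftA)
import qualified Data.ByteString.Lazy.Char8 as LB
import qualified Data.Attoparsec.ByteString.Lazy as LP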

The timings I got were as follows:

mainLength   0.18 secs
mainSum     17.37 secs

So what work is being done in mainSum that is not being done in mainLength? Thanks to laziness, it is the conversion of the parsed bytestrings to Double values: length only counts the list cells, so the parsed values are never evaluated, whereas sum has to force every one of them.
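A quick GHCi session makes the difference concrete: length only walks the list spine and never looks at the elements, while sum has to evaluate every element.

ghci> length [undefined, undefined]     -- elements never forced
2
ghci> sum [1, undefined :: Double]      -- every element gets evaluated
*** Exception: Prelude.undefined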

A quick scan of the code reveals that you are using read to perform this conversion. read is notoriously slow and should be avoided in performance-critical code.
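I can't see the exact conversion in your code, but it presumably amounts to something like the following (toDoubleSlow is a made-up name): unpacking the ByteString to a String and then running the generic Read parser, which allocates heavily for every number.

toDoubleSlow :: LB.ByteString -> Double
toDoubleSlow = read . LB.unpack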

A replacement for read :: ByteString -> Double

A quick search turns up this Stack Overflow question:

http://stackoverflow.com/questions/4489518/efficient-number-reading-in-haskell

Mostly because the answer was submitted by Don Stewart, I decided on the bytestring-lexing package and came up with this version of parseComplex:

parseComplex' = do
  r <- P.takeTill (== '+')   -- the real part is everything before the '+'
  P.char '+'
  i <- P.takeTill (== '*')   -- the imaginary part is everything before the '*'
  P.string iT                -- or just skip this
  return $ Complex (toDouble r) (toDouble i)
  where
    -- readSigned and readDecimal come from the bytestring-lexing package
    toDouble s = case readSigned readDecimal s of
                   Nothing     -> 0
                   Just (d, _) -> d
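For completeness, this version needs imports along these lines; the exact module layout is my assumption and has shifted between bytestring-lexing versions, so check the haddocks for the version you install:

import qualified Data.Attoparsec.ByteString.Char8 as P
import Data.ByteString.Lex.Integral   (readSigned)
import Data.ByteString.Lex.Fractional (readDecimal)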

The timing for mainSum using parseComplex' is:

mainSum'  0.6 secs

So now the first two stages of the pipeline are fast. The next step is to figure out why updating the histogram vector and writing out the PPM file are taking so long.
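The same trick works for that stage: add one more driver that builds the histogram but skips the PPM output. This is only a sketch - buildHistogram is a hypothetical name standing in for whatever your code does to fill the vector, and V is assumed to be Data.Vector.Unboxed imported qualified:

mainHistogram path = do
    rawData <- liftA LB.words (LB.readFile path)
    let formattedData = map (extr . LP.parse parseComplex') rawData
    -- summing the counts forces the whole histogram without writing the PPM
    print $ V.sum (buildHistogram formattedData)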

Update:

I think you'll find that improving the Double parsing is about the best you can do for a single-threaded program. To scale to 600M points you are going to have to use multiple threads / machines. Fortunately this is a classic map-reduce problem, so there are a lot of tools and libraries (not necessarily in Haskell) that you can draw upon.

If you just want to scale on a single machine using multiple threads, you can put together your own solution using a module like Control.Monad.Par. For a clustering solution you'll probably have to use a third-party framework like Hadoop in which case you might be interested in the hadron package - there is also a video describing it here: https://vimeo.com/90189610
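As a rough illustration of the single-machine option, here is a minimal sketch (not from your program - parSumReals and chunkSize are names I made up) that splits a list into chunks and reduces them in parallel with Control.Monad.Par:

import Control.Monad.Par (runPar, parMap)

-- sum a list of Doubles by reducing fixed-size chunks in parallel (the "map"),
-- then combining the per-chunk sums sequentially (the "reduce")
parSumReals :: Int -> [Double] -> Double
parSumReals chunkSize xs = sum (runPar (parMap sum (chunks xs)))
  where
    chunks [] = []
    chunks ys = let (h, t) = splitAt chunkSize ys in h : chunks t

Applied to your program, the per-chunk work would be parsing and histogramming a slice of the input, and the reduce step would merge the partial histogram vectors.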
