
I've got some strings that I want to parse into a list of "chunks". My strings look like this

"some text [[anchor]] some more text, [[another anchor]]. An isolated ["

And I expect to get back something like this

[
   TextChunk "some text ",
   Anchor "anchor",
   TextChunk " some more text, "
   Anchor "another anchor",
   TextChunk ". An isolated ["
]

I've managed to write a function and types that do what I need, but they seem overly ugly. Is there a nicer way to do this?

data Token = TextChunk String | Anchor String deriving (Show)
data TokenizerMode = EatString | EatAnchor deriving (Show)

tokenize :: [String] -> [Token]
tokenize xs =
  let (_, _, tokens) = tokenize' (EatString, unlines xs, [TextChunk ""])
  in reverse tokens

tokenize' :: (TokenizerMode, String, [Token]) -> (TokenizerMode, String, [Token])
-- If we're starting an anchor, add a new empty anchor and switch modes
tokenize' (EatString, '[':'[':xs, tokens) = tokenize' (EatAnchor, xs, Anchor "" : tokens)
-- If we're ending an anchor, add a new empty text chunk and switch modes
tokenize' (EatAnchor, ']':']':xs, tokens) = tokenize' (EatString, xs, TextChunk "" : tokens)
-- Otherwise, if we've got stuff to consume, append it to the current token
tokenize' (EatString, x:xs, TextChunk t : tokens) = tokenize' (EatString, xs, TextChunk (t ++ [x]) : tokens)
tokenize' (EatAnchor, x:xs, Anchor t : tokens)    = tokenize' (EatAnchor, xs, Anchor (t ++ [x]) : tokens)
-- If we've got nothing more to consume, we're done
tokenize' (EatString, [], tokens) = (EatString, [], tokens)
-- We'll only get here if we're given an invalid string
tokenize' xx = error ("Error parsing .. so far " ++ show xx)
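
For reference, calling it on the example (as a single-element list, since tokenize takes a [String]) gives something like this - note the trailing newline that unlines adds to the last chunk:

> tokenize ["some text [[anchor]] some more text, [[another anchor]]. An isolated ["]
[TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor",TextChunk ". An isolated [\n"]
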
  • It's not really tokenizing, it's parsing. And for all your parsing needs, Parsec. Commented Apr 16, 2012 at 4:38
  • @CatPlusPlus agreed that it's parsing .. updated the text and title to match. Commented Apr 16, 2012 at 4:47
  • @CatPlusPlus Can you show me how this would look using parsec? I'm finding the docs/tutes a little obscure for my liking. Commented Apr 16, 2012 at 5:02
  • This is definitely tokenizing - you have an input string and you produce a list of tokens as output. I wouldn't use Parsec for this; I'd write it along the lines of the original code, but instead of having an enumerated type for state, use two mutually recursive functions for consuming plain_text and anchors. Commented Apr 16, 2012 at 7:32

3 Answers


This should work, including lone brackets:

import Control.Applicative ((<$>), (<*), (*>))
import Text.Parsec

data Text = TextChunk String
          | Anchor String
          deriving Show

chunkChar = noneOf "[" <|> try (char '[' <* notFollowedBy (char '[')) 
chunk     = TextChunk <$> many1 chunkChar
anchor    = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]")
content   = many (chunk <|> anchor)

parseS :: String -> Either ParseError [Text]
parseS input = parse content "" input

Note the use of try to allow backtracking when the chunkChar parser matches two opening brackets. Without try, the first bracket would have been consumed at that point.
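
For example, running it on the string from the question should give something like this:

> parseS "some text [[anchor]] some more text, [[another anchor]]. An isolated ["
Right [TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor",TextChunk ". An isolated ["]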




Here is a simplistic version using two mutually recursive functions.

module Tokens where

data Token = TextChunk String | Anchor String deriving (Show)

tokenize :: String -> [Token]
tokenize = textChunk emptyAcc


textChunk :: Acc -> String -> [Token]
textChunk acc []           = [TextChunk $ getAcc acc]
textChunk acc ('[':'[':ss) = TextChunk (getAcc acc) : anchor emptyAcc ss 
textChunk acc (s:ss)       = textChunk (snocAcc acc s) ss

anchor :: Acc -> String -> [Token]
anchor acc []              = error $ "Anchor not terminated" 
anchor acc (']':']':ss)    = Anchor (getAcc acc) : textChunk emptyAcc ss
anchor acc (s:ss)          = anchor (snocAcc acc s) ss


-- This is a Hughes list (also called DList) which allows 
-- efficient 'Snoc' (adding to the right end).
--
type Acc = String -> String

emptyAcc :: Acc
emptyAcc = id

snocAcc :: Acc -> Char -> Acc
snocAcc acc c = acc . (c:)

getAcc :: Acc -> String
getAcc acc = acc []
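
Running it on the question's input should produce something like this:

> tokenize "some text [[anchor]] some more text, [[another anchor]]. An isolated ["
[TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor",TextChunk ". An isolated ["]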

This version has one problem: it will generate empty TextChunks if the input starts or ends with an Anchor, or if there are two contiguous anchors in the text.

It is straightforward to add checks so that no TextChunk is generated when the accumulator is empty, but that makes the code about twice as long - maybe I would reach for Parsec after all...
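
As the comments below suggest, another option is to filter out the empty TextChunks in a post-processing pass. A minimal sketch of one way to do that (the dropEmpty helper is hypothetical, not part of the code above):

dropEmpty :: [Token] -> [Token]
dropEmpty = filter nonEmpty
  where
    nonEmpty (TextChunk "") = False
    nonEmpty _              = True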

2 Comments

If I cared about empty TextChunks I could filter them out as a post-process pretty easily.
Thanks for the performance pointer about appending to a list, and the DList workaround.

A solution using monadic Parsec, written for the specific shape of the example string:

import Text.ParserCombinators.Parsec

data Text = TextChunk String
          | Anchor String
          deriving Show

inputString = "some text [[anchor]] some more text, [[another anchor]]."  

content :: GenParser Char st [Text]
content = do
    s1 <- many (noneOf "[")
    string "[["
    s2 <- many (noneOf "]")
    string "]]"
    s3 <- many (noneOf "[")
    string "[["
    s4 <- many (noneOf "]")
    string "]]."
    return $ [TextChunk s1, Anchor s2, TextChunk s3, Anchor s4]


parseS :: String -> Either ParseError [Text]
parseS input = parse content "" input

How it works:

> parseS inputString 
Right [TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor"]
it :: Either ParseError [Text]

2 Comments

More generally, you can write content = many (chunk <|> anchor) with chunk = TextChunk <$> many1 (noneOf "[") and anchor = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]") (using some shortcuts from Control.Applicative). That should work with any combination of text chunks and anchors.
@hammar, that is almost OK, but I'm guessing it doesn't allow a '[' in the text. I'll add that to my example string to make it clearer that I only want "[[ stuff ]]" to be treated as an anchor and anything else to be put in a TextChunk.
