I'm trying to use Haskell to process some data I wish to analyse. This data is mostly structured, but inconsistently so. Dates may have a number of representations, though they always occur in the same place (the documents are XML).

The differing formats I have seen thus far are:

"25th February 1971"

"Thursday. 22nd June. 1972."

"3rd July. 1973."

"Thursday 17th October \r\n 1974."

"Friday, 5th March, 1976."

"25th April \r\n 1977."

"Tuesday 6th December 1983"

" 10 May 1988"

"October 20th 1988"

I don't really know where to start - any individual format I could deal with, but I'm not sure how to deal with all of them. I would like a function String -> Maybe Day.

2 Answers

There are several libraries on hackage for parsing dates:

You could then chain several such parsers together. Here is a hand-rolled "alternative" operator:

    -- Chain operator: keep the first result, falling back to p2 when p1 is Nothing
    (<||>) :: Maybe a -> Maybe a -> Maybe a
    p1 <||> p2 = case p1 of
                   Nothing -> p2
                   Just r  -> Just r

So you would write a parsing function for each format:

    p1 :: String -> Maybe Day

Then combine these like this:

    parseDate :: String -> Maybe Day
    parseDate s = p1 s <||> p2 s <||> p3 s

If you write a proper Parser you get this alternative operator (<|>) for free from Control.Applicative. Here's a tutorial on writing your own parsers.
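As a minimal sketch of that point: Maybe already has an Alternative instance, so (<|>) from Control.Applicative can chain String -> Maybe Day functions without a hand-rolled operator. The names dayMonthYear, monthDayYear, parseDate and parseDateWith below are illustrative assumptions, not part of any library. The two format parsers are built on parseTimeM from the time package and assume input that has already been cleaned of punctuation, weekday names and ordinal suffixes.

```haskell
import Control.Applicative ((<|>))
import Data.Foldable (asum)
import Data.Time (Day, defaultTimeLocale, parseTimeM)

-- Hypothetical per-format parsers (names are illustrative).
-- Both assume already-cleaned input such as "25 February 1971".
dayMonthYear :: String -> Maybe Day
dayMonthYear = parseTimeM True defaultTimeLocale "%d %B %Y"

monthDayYear :: String -> Maybe Day
monthDayYear = parseTimeM True defaultTimeLocale "%B %d %Y"

-- (<|>) on Maybe returns the first Just, like the hand-rolled (<||>).
parseDate :: String -> Maybe Day
parseDate s = dayMonthYear s <|> monthDayYear s

-- The same idea for an arbitrary list of format parsers.
parseDateWith :: [String -> Maybe Day] -> String -> Maybe Day
parseDateWith parsers s = asum (map ($ s) parsers)
```

With this approach, supporting a new format means adding one more function to the chain or to the list passed to parseDateWith.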

I would also recommend preprocessing the raw text by eliminating punctuation and maybe even the ordinal suffixes ("st", "nd", "rd", "th"), which makes parsing more robust and cuts down on the number of parsing functions you have to write. Also consider using Data.Text if you need better performance.
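One possible shape for that preprocessing step, using only base, is sketched below. The function name normalise is an assumption, and dropping weekday names goes a little beyond what is suggested above; it is included because the weekday is redundant once the day, month and year are known.

```haskell
import Data.Char (isDigit, isPunctuation, toLower)

-- Hypothetical cleaning step: strip punctuation, collapse all whitespace
-- (including line breaks), drop weekday names and remove ordinal suffixes.
normalise :: String -> String
normalise =
  unwords . filter (not . isWeekday) . map stripOrdinal . words
    . filter (not . isPunctuation)
  where
    isWeekday w =
      map toLower w `elem`
        [ "monday", "tuesday", "wednesday", "thursday"
        , "friday", "saturday", "sunday" ]
    -- "22nd" becomes "22"; words without leading digits are left alone.
    stripOrdinal w = case span isDigit w of
      (ds@(_ : _), suffix)
        | map toLower suffix `elem` ["st", "nd", "rd", "th"] -> ds
      _ -> w
```

After this step the sample inputs reduce to two shapes, "day month year" and "month day year", so two format parsers should be enough for the examples listed in the question.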


First, solve one problem at a time: restrict yourself to one of those formats and write a parser for it. Start by writing some tests for this parser.

Parsing in Haskell is quite different from parsing in other languages, where one usually reaches for regular expressions or similar means. In Haskell we have excellent libraries that provide parser combinators. The ones I have used are parsec and attoparsec.

  • Make datatypes for each component, or use the existing ones from the time package.
  • Write a parser for each month (Jan, Feb, ...) and then combine them. Watch out: March and May start with the same letters, so a plain choice between them is not enough; in parsec you need try so that a failed alternative backtracks. The same is true for January, June and July.
  • It is quite helpful to have tests for these simple parsers too, covering both the positive and the negative cases.
  • Write a parser for each day (1st, 2nd, 3rd, ... nth).
  • Combine them, again being careful: 11th and 12th both start with '1'.
  • Write a parser for years.
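The building blocks from the list above might look roughly like this in parsec. This is a sketch under assumptions: the names monthNames, monthNaive, month, dayOfMonth, year and parseMaybe are all illustrative, month names are matched case-sensitively and in full, and plain Int and Integer stand in for dedicated datatypes.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

monthNames :: [String]
monthNames =
  [ "January", "February", "March", "April", "May", "June"
  , "July", "August", "September", "October", "November", "December" ]

-- Naive version: on "May", `string "March"` consumes "Ma" before failing,
-- and parsec does not try later alternatives once input has been consumed.
monthNaive :: Parser Int
monthNaive = choice [ n <$ string name | (n, name) <- zip [1 ..] monthNames ]

-- `try` rewinds the input when an alternative fails part-way through.
month :: Parser Int
month = choice [ n <$ try (string name) | (n, name) <- zip [1 ..] monthNames ]

-- Day of month with an optional ordinal suffix ("1st", "22nd", "3rd", "11th").
-- Reading all the digits first sidesteps the "11th starts with 1" problem.
dayOfMonth :: Parser Int
dayOfMonth = do
  n <- read <$> many1 digit
  optional (choice (map (try . string) ["st", "nd", "rd", "th"]))
  if n >= 1 && n <= 31 then pure n else parserFail "day of month out of range"

-- Four-digit year.
year :: Parser Integer
year = read <$> count 4 digit

-- Hypothetical helper: run a parser over the whole input, discarding the error.
parseMaybe :: Parser a -> String -> Maybe a
parseMaybe p = either (const Nothing) Just . parse (p <* eof) ""
```

Here parseMaybe month "May" succeeds while parseMaybe monthNaive "May" fails, which is the kind of positive and negative case worth capturing in tests.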

Now you should have Parser Day, Parser Month and Parser Year at hand and maybe even Parser Weekday.

  • Combine those parsers into a parser for the one format you restricted yourself to, which gives you a Parser Day for the full date.
  • Now you should have enough utilities at hand to implement the rest yourself.
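To make the combining step concrete, here is a hedged sketch of one full-date parser in parsec for the "[weekday] day month year" shape. The sub-parsers are restated compactly so the block stands alone. All names (sep, oneOfWords, dayMonthYear, parseDayMonthYear) are assumptions for illustration, and treating spaces, line breaks, commas and full stops as interchangeable separators is a simplification based on the sample inputs.

```haskell
import Data.Time (Day, fromGregorianValid)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Separator: any run of spaces, line breaks, commas and full stops.
sep :: Parser ()
sep = skipMany (oneOf " \t\r\n.,")

-- Try each word in turn and return its 1-based position in the list.
oneOfWords :: [String] -> Parser Int
oneOfWords ws = choice [ n <$ try (string w) | (n, w) <- zip [1 ..] ws ]

weekday :: Parser Int
weekday = oneOfWords
  ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

month :: Parser Int
month = oneOfWords
  [ "January", "February", "March", "April", "May", "June"
  , "July", "August", "September", "October", "November", "December" ]

dayOfMonth :: Parser Int
dayOfMonth =
  read <$> many1 digit
    <* optional (choice (map (try . string) ["st", "nd", "rd", "th"]))

year :: Parser Integer
year = read <$> count 4 digit

-- "[weekday] day month year", e.g. "Thursday. 22nd June. 1972."
-- The weekday is parsed but not checked against the date.
dayMonthYear :: Parser Day
dayMonthYear = do
  sep
  optional (weekday *> sep)
  d <- dayOfMonth
  sep
  m <- month
  sep
  y <- year
  sep
  eof
  -- fromGregorianValid rejects impossible dates such as 31 February.
  maybe (parserFail "no such calendar date") pure (fromGregorianValid y m d)

parseDayMonthYear :: String -> Maybe Day
parseDayMonthYear = either (const Nothing) Just . parse dayMonthYear ""
```

An input such as "October 20th 1988" is rejected by this parser; it would need a second parser with the month first, chained with the first as an alternative.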

On a last note, there are plenty of tutorials for parsec and attoparsec out there; just use the search engine you mistrust least.

4 Comments

If you're not writing this as an exercise, I'd recommend a look at hackage.haskell.org to see whether someone else has implemented this already, which would make your job easier.
btw - if this is not enough information, just leave a comment - I will elaborate some more!
This isn't an exercise, and I would love to use someone else's library! Do you mean look for some pre-written parsers for Day, Month etc. on Hackage?
Yes - I think someone has already solved this problem, but this is not the place to ask for recommendations, so I guess you can ask in the #haskell IRC channel or maybe on the mailing list. This is but one candidate.
