I'm working on a parser in PHP (which is very new to me) to search through the following source:

http://web2.uconn.edu/wdlcalendar/index.php/month/list/2010-11-02/All/All/UConn_Master_Calendar1/

The goal of the parser is to store the desired information in a database on the local machine. For each event we want the date (e.g., November 1), the name of the event, the time of the event, and the link to a "more info" page for that particular event (the link is the hyperlink wrapped around the event name).

First part: I am getting the date (e.g., November 1) by using getElementsByTagName("h3"). However, two other h3 elements occur before the dates in the HTML document, and I do not want those.

QUESTION: Is there a way to tell the parser to start looking only after a particular part of the markup, or after a particular string?
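
One way this kind of filtering can be approached, as a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The sample markup, the two leading headings, and the container id "calendar" are assumptions made up for illustration; the real page has to be inspected to find the right offset or container.

```php
<?php
// Hypothetical markup standing in for the calendar page (assumption):
// two unwanted <h3> elements, then the date headings inside a container.
$html = <<<HTML
<html><body>
<h3>Site Title</h3>
<h3>Filter Options</h3>
<div id="calendar">
<h3>November 1</h3>
<h3>November 2</h3>
</div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Option 1: skip a known number of leading <h3> elements by position.
$headings = $doc->getElementsByTagName('h3');
$dates = [];
for ($i = 2; $i < $headings->length; $i++) {
    $dates[] = trim($headings->item($i)->textContent);
}

// Option 2: only look at <h3> elements inside a specific container.
// The id "calendar" is hypothetical.
$xpath = new DOMXPath($doc);
$scopedDates = [];
foreach ($xpath->query('//div[@id="calendar"]//h3') as $h3) {
    $scopedDates[] = trim($h3->textContent);
}

print_r($dates);
print_r($scopedDates);
```

Scoping by container (option 2) tends to survive page changes better than counting elements, since it does not depend on how many headings happen to come first.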

Second part: the other problem I'm having is that the link to the event page and the string representing the name of the event are lumped together within the same HTML tag. How do I pull that information out separately? As I understand it, getElementsByTagName() cannot do this on its own. Here is a portion of the HTML:

<a class="smoothbox" href="http://web2.uconn.edu/wdlcalendar/index.php/occurrence/57237">
WEAR RED DAY
<em>All Day</em>
</a>
</li>

The idea is that I'd like to have "WEAR RED DAY" (name), "All Day" (time), and "http://web2.uconn.edu/wdlcalendar/index.php/occurrence/57237" (link) as separate values to store in our database. How can I do this?

1 Answer

If you are writing an HTML parser by hand, you are doing it wrong. My suggestion is that you use an existing HTML parser. The other option is to try regular expressions, but that is likely to be a brittle, temporary solution that breaks if anything changes in your page format.
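
As an illustration of what an existing parser buys you, here is a minimal sketch that splits the anchor from the question into name, time, and link. It uses PHP's bundled DOM extension (DOMDocument and DOMXPath), which is one possible choice and not necessarily the parser this answer had in mind. The opening li tag is added so the fragment is well formed, and the $events array layout is a hypothetical shape chosen for the example.

```php
<?php
// The fragment from the question, with an opening <li> added.
$html = <<<HTML
<li>
<a class="smoothbox" href="http://web2.uconn.edu/wdlcalendar/index.php/occurrence/57237">
WEAR RED DAY
<em>All Day</em>
</a>
</li>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$events = [];

// Exact match on the class attribute; an anchor carrying several
// classes would need a contains()-style test instead.
foreach ($xpath->query('//a[@class="smoothbox"]') as $a) {
    // The link is the href attribute of the anchor.
    $link = $a->getAttribute('href');

    // The time is the text of the nested <em>, if there is one.
    $em = $a->getElementsByTagName('em')->item(0);
    $time = $em !== null ? trim($em->textContent) : '';

    // The name is the anchor's own text: join only its direct text-node
    // children, so the <em> text is left out.
    $name = '';
    foreach ($a->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $name .= $child->nodeValue;
        }
    }

    $events[] = ['name' => trim($name), 'time' => $time, 'link' => $link];
}

print_r($events);
```

Each entry in $events then holds the three values separately, ready to be passed to whatever database layer is in use.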

3 Comments

Suggested third-party alternatives to SimpleHtmlDom that actually use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
Suggesting SimpleHTMLDom and Regex is like telling the OP to choose between Plague and Cholera.
Fair enough, upvoting your additional suggestions. The main point here is that suggesting he continue on his path of trying to write his own parser is worse.
