I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra </span>-tags in each <li>.

<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>

My queries are:

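$lis = $dom->find('li');
foreach ($lis as $li) {
  // every <span> the parser considers a descendant of this <li>
  $spans = $li->find('span');
  foreach ($spans as $span) {
    echo $span->plaintext."<br>";
  }
  // separator between list items, as seen in the output below
  echo "-----------<br>";
}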

My output is:

Link asdasd 
Link asdasd2
-----------
Link asdasd2 
-----------

As you can see, find('span') finds two spans as children of the first <li>, and it takes the value of the second one from the next <span> it can find (even though that span is a child of the next <li>). Removing the trailing </span> fixes the problem.

My questions are:

  1. Why is this happening?

  2. How can I solve this particular case? Everything else works well and I'm not in a position to make big changes to my script. I can easily change the DOM queries if needed, though.

I am thinking about counting opening and closing tags and stripping one </span> if there are too many of them. Since they will always be <span>s, is there a smart way to check this with a regexp?

  • 1. Garbage In, Garbage Out. The class you're using isn't as robust as it claims. 2. For this particular case, fix the HTML. For a more general case, use a more robust HTML parser: DOMDocument. Commented Aug 5, 2013 at 0:23
  • I started doing this with DOMDocument, finally ending up in an error where I needed to compare string-lengths and I couldn't get the data into plaintext. The node data contained a lot of garbage and tags and stuff. This seemed much easier. I can not change the input HTML. Commented Aug 5, 2013 at 0:25

2 Answers

1

1) Simple HTML DOM is trying to fix your extra </span> by adding a <span> somewhere. So now you have an extra span that shouldn't be there. For the record, DOMDocument would do the same thing, although perhaps in a more predictable way.

2) Simplify:

foreach ($dom->find('li > span') as $span) {
  echo $span->plaintext."<br>";
}
//     Link asdasd    <br>     Link asdasd2    <br>

Now you've told it you only want the span that is a child of a li. Even better, do something like:

foreach ($dom->find('span.name') as $span) {
  echo $span->plaintext."<br>";
}

Use those attributes, that's what they're good for.
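For comparison, here is a minimal sketch of the same span.name selection using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The variable names and the whitespace normalisation are illustrative assumptions, not part of the original script. textContent plays the role of ->plaintext here, and libxml_use_internal_errors(true) keeps parse warnings about the malformed markup from being printed.

```php
<?php
// Sample input from the question, including the stray </span> in each <li>.
$html = <<<HTML
<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>
HTML;

// Collect libxml warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = [];
// Select only spans with class "name" that are direct children of an <li>.
foreach ($xpath->query('//li/span[@class="name"]') as $span) {
    // Collapse runs of whitespace so the text is comparable as a plain string.
    $names[] = trim(preg_replace('/\s+/', ' ', $span->textContent));
}

foreach ($names as $name) {
    echo $name, "\n";
}
```

Restricting the query to the class attribute means any span the parser might synthesise while repairing the markup is never selected in the first place.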

1 Comment

The problem I wrote here was simplified quite a bit to make it more readable. I needed the plaintext data and some other stuff from the <li>s as well. However, I solved the whole thing using your tip and some trickery with $f->parent(). Thanks!
1
// \s* matches the whitespace between the two closing tags; the duplicate is dropped
$newTxt = preg_replace('/<\/span>\s*<\/span>/', '</span>', $txt);
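A one-line replacement like this collapses any two adjacent closing tags, so it would also alter legitimately nested spans. As an alternative, the tag-counting idea from the question could be sketched as below. The helper name stripUnmatchedSpanClosers is hypothetical (it is not part of Simple HTML DOM), and the sketch assumes span tags never appear inside comments, scripts, or attribute values.

```php
<?php
// Hypothetical helper: removes every </span> that has no matching
// opening <span ...> before it, and leaves balanced pairs untouched.
function stripUnmatchedSpanClosers(string $html): string
{
    $depth = 0; // number of <span> elements currently open
    return preg_replace_callback(
        '/<span\b[^>]*>|<\/span\s*>/i',
        function (array $m) use (&$depth): string {
            if ($m[0][1] !== '/') { // opening tag: second character is not "/"
                $depth++;
                return $m[0];
            }
            if ($depth === 0) {     // closing tag with nothing open: drop it
                return '';
            }
            $depth--;               // closing tag that matches an open span
            return $m[0];
        },
        $html
    );
}

$txt = <<<HTML
<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>
HTML;

$newTxt = stripUnmatchedSpanClosers($txt);
echo substr_count($newTxt, '</span>'), "\n"; // prints 2: one closer per <li> remains
```

The cleaned string can then be handed to the parser in place of the raw page source, so the existing DOM queries do not need to change.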

The method 'find(x)' is an overloaded function that can return the equivalents of:

$e->getElementById(x);
$e->getElementsById(x);
$e->getElementByTagName(x); and
$e->getElementsByTagName(x);

Your first call ends up using the last of these. For the second $li, it uses the third. Which one gets used probably depends on an optimization around what the API thinks you are asking for. I guess you have found a bug in the API, because in both cases you were asking for the third call:

$e->getElementByTagName();

2 Comments

Thanks! I think I understood your English :)
Yeah, I considered the regex the most important part of my contribution, because the situation you described is definitely a simplification (class='name' and <a>Link</a>). My unspoken suggestion was that, if things did not work out with Simple HTML DOM, those basic methods would be a good alternative. I never have much patience with tools that do not deliver what they say they do; it makes them unpredictable. In the long run, the best solution is getting rid of the invalid HTML. That is not always possible, so I am glad you have found a non-intrusive solution.
