PHP Simple HTML DOM parser gives faulty data

Question

I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra </span> tag in each <li>.

<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>

My queries are:

$lis = $dom->find('li');
foreach ($lis as $li) {
  $spans = $li->find('span');
  foreach ($spans as $span) {
    echo $span->plaintext."<br>";
  }
}

My output is:

Link asdasd 
Link asdasd2
-----------
Link asdasd2 
-----------

As you can see, find('span') finds two spans as children of the first <li>, taking the value for the second one from the next <span> it can find (even though that span is a child of the next <li>). Removing the trailing </span> fixes the problem.

My questions are:

Why is this happening?
How can I solve this particular case? Everything else works well and I'm not in a position to make big changes to my script. I can easily change the DOM queries if needed, though.

I am thinking about counting opening and closing tags and stripping one </span> if there are too many of them. Since the extra tags will always be </span>s, is there a smart way to check for this with a regexp?

1. Garbage in, garbage out. The class you're using isn't as robust as it claims. 2. For this particular case, fix the HTML. For a more general case, use a more robust HTML parser: DOMDocument.
– user1864610, Commented Aug 5, 2013 at 0:23
I started doing this with DOMDocument and ended up with an error where I needed to compare string lengths and couldn't get the data into plaintext. The node data contained a lot of garbage, tags and so on. This seemed much easier. I cannot change the input HTML.
– Mattis, Commented Aug 5, 2013 at 0:25

pguardiario · Accepted Answer · 2013-08-05 01:48:58Z

1

1) Simple HTML DOM is trying to fix your extra </span> by adding a <span> somewhere. So now you have an extra span that shouldn't be there. For the record, DOMDocument would do the same thing, although perhaps in a more predictable way.

2) Simplify:

foreach ($dom->find('li > span') as $span) {
  echo $span->plaintext."<br>";
}
//     Link asdasd    <br>     Link asdasd2    <br>

Now you've told it you only want the span that is a child of a li. Even better, do something like:

foreach ($dom->find('span.name') as $span) {
  echo $span->plaintext."<br>";
}

Use those attributes, that's what they're good for.
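For comparison, here is a minimal sketch of the same "select by class" idea using PHP's bundled DOMDocument and DOMXPath, the parser suggested in the comments. This swaps in a different library than the one used in this answer. The variable names ($html, $names) are illustrative, and how libxml repairs the stray </span> can vary between versions, so check the result against your real input.

```php
<?php
// Markup from the question, including the surplus </span> in each <li>.
$html = <<<HTML
<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>
HTML;

// Collect libxml's complaints about the invalid markup instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Only spans that are direct children of an <li> and carry class="name".
$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//li/span[@class="name"]') as $span) {
    // textContent is the node's text without markup; collapse whitespace runs.
    $names[] = trim(preg_replace('/\s+/', ' ', $span->textContent));
}

echo implode("<br>", $names), "\n";
```

textContent plays the role of Simple HTML DOM's plaintext here, which addresses the difficulty of getting plain text out of DOMDocument nodes mentioned in the comments.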

edited Aug 5, 2013 at 1:48

answered Aug 5, 2013 at 1:38

pguardiario


1 Comment

Mattis Over a year ago

The problem I wrote here was simplified quite a bit so it was more readable. I needed the plaintext data and some other stuff from the li:s as well. However, I solved the whole thing using your tip and some tricking with $f->parent(). Thanks!

Loek Bergman · Accepted Answer · 2013-08-05 00:37:17Z

1

$newTxt = preg_replace('/<\/span>\s*<\/span>/', '</span>', $txt);
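As a rough illustration, the sketch below applies this kind of replacement to the markup from the question. It uses \s* between the two closing tags so that the line break and indentation separating them are matched. It then counts opening and closing span tags as a sanity check, along the lines of the counting idea in the question. The variable names $open and $close are only for illustration.

```php
<?php
// Markup from the question: each <li> carries one surplus </span>.
$txt = <<<HTML
<li>
  <span class="name">
    <a href="">Link</a> asdasd
  </span>
  </span>
</li>
<li>
  <span class="name">
    <a href="">Link</a> asdasd2
  </span>
  </span>
</li>
HTML;

// Collapse two closing span tags separated only by whitespace into one.
$newTxt = preg_replace('/<\/span>\s*<\/span>/', '</span>', $txt);

// Sanity check: opening and closing span tags should now balance.
$open  = preg_match_all('/<span\b/i', $newTxt);
$close = preg_match_all('/<\/span>/i', $newTxt);

echo "open: $open, close: $close\n"; // prints "open: 2, close: 2"
```

Note that this pattern would also collapse two legitimate consecutive closing tags of properly nested spans, so it is only safe when, as in the question, spans are never nested.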

The find(x) method is overloaded and can return the equivalent of any of these:

$e->getElementById(x);
$e->getElementsById(x);
$e->getElementByTagName(x);
$e->getElementsByTagName(x);

Your first call, $dom->find('li'), makes use of the last of these. The second, $li->find('span'), makes use of the third. The API probably picks one as an optimization, depending on what it thinks you are asking for. I guess you have found a bug in the API, because in both cases you were asking for the behavior of the third call:

$e->getElementByTagName();

answered Aug 5, 2013 at 0:37

Loek Bergman


2 Comments

Mattis Over a year ago

Thanks! I think I understood your english :)

Loek Bergman Over a year ago

Yeah, I considered the regex the most important part of my contribution, because the situation you described is definitely a simplification (class='name' and <a>Link</a>). My silent suggestion was that, if things did not work out with Simple HTML DOM, those basic methods would be a good alternative. I never have much patience with tools that do not deliver what they say they do; it makes them unpredictable. In the long run, the best solution is getting rid of the invalid HTML. That is not always possible, so I am glad you have found a non-intrusive solution.
