I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra closing </span> tag in each <li>:
<li>
<span class="name">
<a href="">Link</a> asdasd
</span>
</span>
</li>
<li>
<span class="name">
<a href="">Link</a> asdasd2
</span>
</span>
</li>
My queries are:
$lis = $dom->find('li');
foreach ($lis as $li) {
    $spans = $li->find('span');
    foreach ($spans as $span) {
        echo $span->plaintext."<br>";
    }
    // Separator after each <li>, as seen in the output below
    echo "-----------<br>";
}
My output is:
Link asdasd
Link asdasd2
-----------
Link asdasd2
-----------
As you can see, find('span') returns two spans as children of the first <li>: the second value comes from the next <span> it can find, even though that span is a child of the next <li>. Removing the extra trailing </span> fixes the problem.
My questions are:
Why is this happening?
How can I solve this particular case? Everything else works well and I'm not in a position to make big changes to my script. I can easily change the DOM queries if needed, though.
I am thinking about counting the opening and closing tags and stripping a </span> wherever there are too many of them. Since the extra tags will always be </span>s, is there a smart way to do this with a regexp?
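A minimal sketch of the counting idea, to make it concrete. The function name stripUnmatchedSpanClosers is hypothetical (it is not part of Simple HTML DOM), and the sketch assumes that no <span> tags appear inside comments or scripts and that no attribute value contains a ">" character. It walks over every span tag in order, tracks how many spans are currently open, and drops any </span> that has no opener left to close:

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM: removes every </span>
// that has no matching <span> before it, so the markup can be parsed safely.
function stripUnmatchedSpanClosers(string $html): string
{
    $depth = 0; // number of <span> elements currently open

    return preg_replace_callback(
        '~<span\b[^>]*>|</span\s*>~i', // any opening or closing span tag
        function (array $m) use (&$depth): string {
            if ($m[0][1] !== '/') {
                // Opening tag: one more span is open
                $depth++;
                return $m[0];
            }
            if ($depth > 0) {
                // Closing tag with a matching opener: keep it
                $depth--;
                return $m[0];
            }
            // Surplus closing tag: drop it
            return '';
        },
        $html
    );
}

$html = '<li><span class="name"><a href="">Link</a> asdasd</span></span></li>';
echo stripUnmatchedSpanClosers($html);
// prints: <li><span class="name"><a href="">Link</a> asdasd</span></li>
```

The cleaned string could then be handed to the parser, for example with str_get_html(), before running the same find() queries as above. A plain regexp without the counter would not be enough here, because it cannot tell a surplus </span> from a legitimate one.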