
I am trying to extract the text content of an arbitrary URL, without any HTML markup. This works, but PHP emits a bunch of warnings while reading the content at the given URL. How can I suppress these warnings?

<?php
$url= 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
echo $textContent;
?>

Warnings:

content-from-a-web-page, line: 255 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 255 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 273 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 273 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 412 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 412 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 551 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 551 in /opt/lampp/htdocs/FB/ec2/test.php on line 13

Warning: DOMDocument::loadHTMLFile(): ID display-name already defined in http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page, line: 731 in /opt/lampp/htdocs/FB/ec2/test.php on line 13
1 Answer

You can use libxml_use_internal_errors() and do the following:

libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
libxml_clear_errors();

As PeeHaa noted in the comments below, it's a good idea to restore the previous error-handling state afterwards. You can do it like this:

$errors = libxml_use_internal_errors(true); //store
$doc->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors($errors); //reset back to previous state

Here's how it works:

  • libxml_use_internal_errors(true) tells libxml to collect errors and warnings internally instead of outputting them. It returns the previous setting, which is stored in a variable
  • loadHTMLFile() then loads the HTML file
  • libxml_clear_errors() clears the error buffer
  • libxml_use_internal_errors($errors) restores the previous setting
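
If you want to look at the parser messages rather than just discard them, the same steps can be wrapped in a small helper. This is only a sketch: the function name loadHtmlQuietly() and the sample markup are made up for illustration, and it uses loadHTML() on a string instead of loadHTMLFile() on a URL so that it runs without network access. Which messages libxml reports for a given document can vary with the libxml version.

```php
<?php
// Hypothetical helper (not part of PHP): parse HTML without emitting warnings
// and hand the collected libxml errors back to the caller.
function loadHtmlQuietly(string $html, array &$errors = []): DOMDocument
{
    // Switch to internal error handling and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Copy the buffered LibXMLError objects, then empty the buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Restore whatever setting was active before this call.
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample markup (an assumption for this sketch): bare ampersand, duplicate id.
$html = '<html><body><p id="a">Tom &amp Jerry</p><p id="a">again</p></body></html>';

$doc = loadHtmlQuietly($html, $errors);

foreach ($errors as $error) {
    // Each entry carries level, code, message and line.
    fprintf(STDERR, "line %d: %s\n", $error->line, trim($error->message));
}
```

Because the errors are copied before the buffer is cleared, the caller can log or ignore them as needed, while nothing is printed to the page.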


5 Comments

Note that it is considered good practice to store the current state of libxml_use_internal_errors and reset it afterwards.
@PeeHaa: Good idea. I've added it to the answer :)
@AmalMurali: Thanks a lot. Can you please explain the difference between the two code snippets?
@Karimkhan: added an explanation to my answer.
If you call libxml_use_internal_errors again, you don't need to call libxml_clear_errors.
