
I am trying to scrape this URL, xxxx.fr, with cURL, but I cannot get the page's HTML code: both header and body are empty. The HTTP return code is 200. I tried other URLs (different domains) and they work like a charm. I also tried different User-Agent and Referer values.

Do you know what is wrong? At least, can someone try this code on their own server and let me know if they have the same issue?

Thank you

Below is my code:

  $url = 'http://www.xxxx.fr';

  $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
  $header[] = "Cache-Control: max-age=0";
  $header[] = "Connection: keep-alive";
  $header[] = "Keep-Alive: timeout=5, max=100";
  $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
  $header[] = "Accept-Language: en-us,en;q=0.5";
  $header[] = ""; // BROWSERS USUALLY LEAVE BLANK

  $curl = curl_init ();
  curl_setopt($curl, CURLOPT_URL, $url);
  curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
  curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0");
  curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
  curl_setopt($curl, CURLOPT_REFERER, "http://www.google.fr");
  curl_setopt($curl, CURLOPT_HEADER, 1);
  curl_setopt($curl, CURLINFO_HEADER_OUT, 1);
  curl_setopt($curl, CURLOPT_VERBOSE, 1);
  curl_setopt($curl, CURLOPT_COOKIEFILE, getcwd().'/cookies.txt');
  curl_setopt($curl, CURLOPT_COOKIEJAR, getcwd().'/cookies.txt');
  curl_setopt($curl, CURLOPT_TIMEOUT, 30);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
  $curlData = curl_exec($curl);

  $infos = curl_getinfo($curl);
  print_r($infos);

  curl_close ( $curl );

  echo "<hr>Page:<br />";
  echo htmlentities($curlData);

And here is the output of print_r($infos):

Array ( 
[url] => http://www.xxxx.fr 
[content_type] => text/html 
[http_code] => 200 
[header_size] => 625 
[request_size] => 465 
[filetime] => -1 
[ssl_verify_result] => 0
[redirect_count] => 0 
[total_time] => 0.032535 
[namelookup_time] => 0.001488 
[connect_time] => 0.002581 
[pretransfer_time] => 0.002639 
[size_upload] => 0 
[size_download] => 10234 
[speed_download] => 314553 
[speed_upload] => 0 
[download_content_length] => -1 
[upload_content_length] => 0 
[starttransfer_time] => 0.032088 
[redirect_time] => 0 
[certinfo] => Array ( ) 
[primary_ip] => xxx 
[primary_port] => 80 
[local_ip] => xxx 
[local_port] => 37319 
[redirect_url] => 
[request_header] => GET / HTTP/1.1 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0 Host: www.xxxx.fr Accept-Encoding: gzip,deflate Referer: http://www.google.fr Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 Cache-Control: max-age=0 Connection: keep-alive Keep-Alive: timeout=5, max=100 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Accept-Language: en-us,en;q=0.5 
) 
  • So $curlData is empty for you? That is coming back with HTML for me when I execute your code. Commented May 12, 2015 at 11:42
  • Also, when I try the curl -v command on my server I do get the HTML code, so I think something in my PHP / cURL config is blocking it, but I'm not sure what. Commented May 12, 2015 at 12:26
  • Actually, if I do a var_dump($curlData), it prints the HTML code, but otherwise it's empty. It's strange... Commented May 12, 2015 at 12:55

1 Answer

Edit:

htmlentities($curlData) returns an empty string because the fetched page is not valid UTF-8, which is the encoding htmlentities() assumes by default as of PHP 5.4.

This should work:

  htmlentities($curlData, ENT_QUOTES, 'ISO-8859-1');

As of the PHP 5.4 release, htmlspecialchars() no longer uses ISO-8859-1 as its default encoding; it uses UTF-8. You might expect htmlspecialchars() to skip non-UTF-8 byte sequences or translate them to a "not found" character. In fact, it returns a blank string: no error is generated, no error code is returned, no exception is raised. A blank string is simply returned when invalid UTF-8 sequences are passed in.
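To make the behaviour described above easier to reproduce without the original site, here is a minimal sketch. The sample string is hypothetical (it is not the OP's page), and the flags are passed explicitly because the default flags may differ between PHP versions, which can change what an invalid sequence produces. It assumes the input really is ISO-8859-1.

```php
<?php
// Hypothetical sample: "café" encoded as ISO-8859-1, so "é" is the single
// byte 0xE9, which is not a valid UTF-8 sequence on its own.
$latin1 = "caf\xE9";

// Treating the bytes as UTF-8 without ENT_SUBSTITUTE or ENT_IGNORE:
// the invalid sequence makes htmlentities() return an empty string.
$broken = htmlentities($latin1, ENT_QUOTES, 'UTF-8');

// Option 1: tell htmlentities() the actual encoding of the input.
$fixed = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');

// Option 2: keep UTF-8, but replace invalid sequences with U+FFFD
// instead of discarding the whole string.
$substituted = htmlentities($latin1, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($broken === '');        // bool(true)
var_dump($fixed);                // string(11) "caf&eacute;"
var_dump($substituted !== '');   // bool(true)
```

Option 1 preserves the accented characters, so it is probably the better fit when the page's charset is known; Option 2 only prevents the silent empty result.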


8 Comments

This isn't an answer. $curlData is populated. Also, an "answer" that just says "use" is pretty unhelpful. The OP knows nothing about what the code is doing or why they should be using it.
Did you run the code as is? It works. htmlentities() is not the issue; it will just bring the code back entitized, not empty, whereas the OP states the issue is that both header and body are empty. Snippet of return: &lt;/script&gt; &lt;/body&gt; &lt;/html&gt;
When I run his code, var_dump(htmlentities($curlData)); returns string(0) "", so I assume that is the problem.
I get back string(49035) "HTTP/1.1 200 OK with htmlentities and string(37714) "HTTP/1.1 200 OK with html_entity_decode. Neither of these is empty, so the function is not the issue as I see it.
@chris85 I updated my answer. I assume you tested his code with the same charset as the source; the problem appears when the script and the source use different encodings.