I want to use the PHP Simple HTML DOM Parser to scrape a website. The source markup is messy, like this:

      <font face="Arial" color="#ff0000">
      <p>Parameters</p>
      </font><font face="Arial" size="2" color="#ff0000">
      <p>Param1</p>
      </font><font face="Arial" size="2" color="#0000ff">
      <p>Details. (Lob., </font><i><font face="Arial"
      size="2" color="#ff0000">Co v</font><font face="Arial" size="2"
      color="#0000ff">.)</p>

Instead of putting "Details. (Lob., Co v.)" directly inside <p></p>, the page splits it across <font> and <i> tags. When I use this code:

foreach($html->find('p') as $p) 
{
  echo $p->plaintext.'<br>';
}

I only get "Details. (Lob.,": the output stops when it reaches <i> or <font>. How can I extract the whole line, "Details. (Lob., Co v.)"?

Thank you for your answer

  • Do you mean "scrape"? Just making sure. Commented Jan 23, 2017 at 21:13
  • Yes sorry, I mean scrape Commented Jan 23, 2017 at 21:54

1 Answer

You can use the strip_tags() function to remove the unnecessary tags. After removing them, you can run the DOM parser on the cleaned markup.

The strip_tags() function strips HTML, XML, and PHP tags from a string. Its optional second argument lists the tags to keep.

string strip_tags ( string $str [, string $allowable_tags ] )

You can read more about strip_tags() function on php.net

Example:

$html = '<font face="Arial" color="#ff0000">
    <p>Parameters</p>
    </font><font face="Arial" size="2" color="#ff0000">
    <p>Param1</p>
    </font><font face="Arial" size="2" color="#0000ff">
    <p>Details. (Lob., </font><i><font face="Arial"
    size="2" color="#ff0000">Co v</font><font face="Arial" size="2"
    color="#0000ff">.)</p>';

// Strip every tag except <p>; the input is the $html string defined above.
$html = strip_tags($html, '<p>');
echo $html;

Result:

  <p>Parameters</p>

  <p>Param1</p>

  <p>Details. (Lob., Co v.)</p>
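
To go from the cleaned markup to the paragraph text itself, the string can be handed to a parser. The following is only a minimal sketch: it swaps in PHP's built-in DOMDocument for Simple HTML DOM so that it runs without any extra library, and the variable names ($raw, $clean, $paragraphs) are made up for illustration. With Simple HTML DOM the same idea would presumably apply by parsing the cleaned string instead of the original page.

```php
<?php
// Sample markup from the question: <font> and <p> tags overlap each other.
$raw = '<font face="Arial" color="#ff0000">
    <p>Parameters</p>
    </font><font face="Arial" size="2" color="#ff0000">
    <p>Param1</p>
    </font><font face="Arial" size="2" color="#0000ff">
    <p>Details. (Lob., </font><i><font face="Arial"
    size="2" color="#ff0000">Co v</font><font face="Arial" size="2"
    color="#0000ff">.)</p>';

// Keep only <p> tags so each paragraph's text is contiguous again.
$clean = strip_tags($raw, '<p>');

// Parse the cleaned markup with the built-in DOM extension
// (an assumption of this sketch; the question uses Simple HTML DOM).
libxml_use_internal_errors(true);   // silence warnings about fragment markup
$doc = new DOMDocument();
$doc->loadHTML($clean);
libxml_clear_errors();

$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // Collapse any whitespace runs left behind by the removed tags.
    $paragraphs[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
}

print_r($paragraphs);
```

Each entry of $paragraphs should then hold the full text of one paragraph, including the complete "Details. (Lob., Co v.)" line.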