Get text from html in PHP

Question

I want to get plain text from HTML in PHP. I have tried the https://github.com/mtibben/html2text library, but it seems to fail in some scenarios. My HTML will contain heading, paragraph and div tags, and I need to return just the plain text from it.

Below is the code I tried

require_once('class.html2text.inc');
// The “source” HTML you want to convert.
$html = '<div class="mozaik-inner" style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:rgb(68,68,68);padding:0px 30px;margin:0px auto;width:600px;background-color:rgb(250,250,250);"><h2 style="font-family:Arial, Helvetica, sans-serif;font-size:18px;line-height:28.8px;color:#444444;padding:0px;margin:0px;">Account Details for $account_name :</h2><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;">TOID: $account_to_id_c</p><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;"> </p></div>';

// Instantiate a new instance of the class. Passing the string
// variable automatically loads the HTML for you.
$h2t = new html2text($html);

// Simply call the get_text() method for the class to convert
// the HTML to the plain text. Store it into the variable.
$text = $h2t->get_text();
echo $text;

The issue is that my HTML contains "Account Details for $account_name", which is incorrectly converted to all uppercase, and $account_name is removed from the output.

I need a way to get the text from HTML. div, p and heading tags can be converted to new lines.

Expected Output :

Account Details for $account_name :
TOID: $account_to_id_c

you need to use a DOM parser: php.net/manual/en/book.simplexml.php or php.net/manual/en/class.domdocument.php
– andrew, Commented May 9, 2018 at 5:45
Have tried strip_tags but I need to have new lines on div, p and heading tags.
– user3286692, Commented May 9, 2018 at 5:50
I would probably go with @andrew's suggestion and use DOMDocument, if you need to check visibility of the containing element as well.
– M. Eriksson, Commented May 9, 2018 at 5:53

Nigel Ren · Accepted Answer · 2018-05-09 06:17:21Z

It's difficult to know whether a solution will always work, but with the sample HTML you've included, the code below shows the general principle and should help...

// The “source” HTML you want to convert.
$html = '<div class="mozaik-inner" style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:rgb(68,68,68);padding:0px 30px;margin:0px auto;width:600px;background-color:rgb(250,250,250);"><h2 style="font-family:Arial, Helvetica, sans-serif;font-size:18px;line-height:28.8px;color:#444444;padding:0px;margin:0px;">Account Details for $account_name :</h2><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;">TOID: $account_to_id_c</p><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;"> </p></div>';

// Create a DOMDocument and load the HTML string into it.
// loadHTML() parses the string into a tree of nodes.
$h2t = new DOMDocument();
$h2t->loadHTML($html);

$contents = $h2t->getElementsByTagName('div');
$text = '';
foreach ( $contents[0]->childNodes as $content )   {
    $nodeType = $content->nodeName;
    if ( strtolower($nodeType[0]) == 'h' ){
        $text .= $content->textContent.PHP_EOL;
    }
    else    {
        $text .= $content->textContent;
    }
}
echo $text;

Which outputs...

Account Details for $account_name :
TOID: $account_to_id_c

The getElementsByTagName() call returns a list of nodes. In this instance the list contains only the one <div> tag, so [0] selects it. Then just iterate over its child nodes.

If the tag name starts with an 'h' (so <h1>, <h2>, and so on), put a new line after the text. Note that this check also matches any other tag beginning with 'h', such as <hr>. You could adapt this to pick out certain tags and do something specific with different content types.
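As one possible adaptation of the idea above, here is a minimal sketch that adds a new line after div, p and heading tags, which is what the question asks for. The function name htmlToText() and the $blockTags list are hypothetical and not part of the original code. The sample HTML is a shortened version of the one in the question, with the style attributes removed.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// walks the node tree recursively, collects the text nodes, and appends
// a new line after every element whose tag name is listed in $blockTags.
function htmlToText(DOMNode $node, array $blockTags = ['div', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Plain text: keep it as it is.
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            $inner = htmlToText($child, $blockTags);
            $text .= $inner;
            // Skip the line break for elements that hold only whitespace.
            if (in_array(strtolower($child->nodeName), $blockTags, true) && trim($inner) !== '') {
                $text .= PHP_EOL;
            }
        }
    }
    return $text;
}

// Shortened sample based on the HTML in the question.
$html = '<div><h2>Account Details for $account_name :</h2><p>TOID: $account_to_id_c</p><p> </p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// loadHTML() wraps a fragment in <html><body>, so start from <body>.
$body = $doc->getElementsByTagName('body')->item(0);

// trim() removes the whitespace left over from the empty <p> and the outer <div>.
$text = trim(htmlToText($body));
echo $text;
```

Nested block tags can produce consecutive line breaks with this approach, so you may want to collapse repeated new lines afterwards if your real HTML is more deeply nested.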

If your content is part of a larger page, you could narrow the way you find the content (for example) by using XPath...

$h2t = new DOMDocument();
$h2t->loadHTML($html);
$xp = new DOMXPath($h2t);

//$contents = $h2t->getElementsByTagName('div');
$contents = $xp->query("//div[@class='mozaik-inner']");

This finds a <div> tag whose class attribute is exactly 'mozaik-inner'. The rest of the code stays the same; only the way you locate the HTML to work with changes.
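For completeness, here is a sketch of the XPath variant put together as a whole script. It assumes a hypothetical larger page in which the target <div> carries more than one class, so it uses the common contains(concat(...)) idiom instead of an exact attribute match, and it checks that the query found something before reading the first node. The sample HTML and the extra "nav" and "wide" classes are made up for illustration.

```php
<?php
// Hypothetical page: the content we want sits next to other <div> tags,
// and its class attribute holds more than one class name.
$html = '<html><body><div class="nav">Menu</div>'
      . '<div class="mozaik-inner wide"><h2>Account Details for $account_name :</h2>'
      . '<p>TOID: $account_to_id_c</p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Matches "mozaik-inner" as one class among several.
// @class='mozaik-inner' would only match when it is the sole class.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' mozaik-inner ')]";
$contents = $xp->query($query);

$text = '';
if ($contents !== false && $contents->length > 0) {
    foreach ($contents->item(0)->childNodes as $content) {
        $text .= $content->textContent;
        // Same rule as before: new line after tags whose name starts with "h".
        if (strtolower($content->nodeName[0]) === 'h') {
            $text .= PHP_EOL;
        }
    }
}
echo $text;
```

If the query matches nothing, $text simply stays empty rather than raising an error on a missing node.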
