I want to get plain text from HTML in PHP. I have tried the https://github.com/mtibben/html2text library, but it seems to fail in some scenarios. My HTML will contain heading, paragraph and div tags, and I need to return just the plain text from it.

Below is the code I tried:

require_once('class.html2text.inc');
// The “source” HTML you want to convert.
$html = '<div class="mozaik-inner" style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:rgb(68,68,68);padding:0px 30px;margin:0px auto;width:600px;background-color:rgb(250,250,250);"><h2 style="font-family:Arial, Helvetica, sans-serif;font-size:18px;line-height:28.8px;color:#444444;padding:0px;margin:0px;">Account Details for $account_name :</h2><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;">TOID: $account_to_id_c</p><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;"> </p></div>';

// Instantiate a new instance of the class. Passing the string
// variable automatically loads the HTML for you.
$h2t = new html2text($html);

// Simply call the get_text() method for the class to convert
// the HTML to the plain text. Store it into the variable.
$text = $h2t->get_text();
echo $text;

The issue is that my HTML contains "Account Details for $account_name", which is incorrectly converted to all uppercase, and $account_name is removed from the result.

I need a way to get the text from the HTML, with div, p and heading tags converted to new lines.

Expected Output :

Account Details for $account_name :
TOID: $account_to_id_c 
  • You need to use a DOM parser: php.net/manual/en/book.simplexml.php or php.net/manual/en/class.domdocument.php Commented May 9, 2018 at 5:45
  • Have you tried strip_tags()? Demo: 3v4l.org/qpMkO Commented May 9, 2018 at 5:46
  • I have tried strip_tags(), but I need new lines for div, p and heading tags. Commented May 9, 2018 at 5:50
  • @MagnusEriksson just added expected output, thanks! Commented May 9, 2018 at 5:52
  • I would probably go with @andrew's suggestion and use DOMDocument, if you need to check visibility of the containing element as well. Commented May 9, 2018 at 5:53

1 Answer

It's difficult to know whether one solution will always work, but the code below handles the sample HTML you've included and shows the general principle, so it should help...

// The “source” HTML you want to convert.
$html = '<div class="mozaik-inner" style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:rgb(68,68,68);padding:0px 30px;margin:0px auto;width:600px;background-color:rgb(250,250,250);"><h2 style="font-family:Arial, Helvetica, sans-serif;font-size:18px;line-height:28.8px;color:#444444;padding:0px;margin:0px;">Account Details for $account_name :</h2><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;">TOID: $account_to_id_c</p><p style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:22.4px;color:#444444;padding:0px;margin:0px;"> </p></div>';

// Create a DOMDocument and parse the HTML string into it.
// Nothing is loaded until loadHTML() is called.
$h2t = new DOMDocument();
$h2t->loadHTML($html);

$contents = $h2t->getElementsByTagName('div');
$text = '';
foreach ( $contents[0]->childNodes as $content )   {
    $nodeType = $content->nodeName;
    if ( strtolower($nodeType[0]) == 'h' ){
        $text .= $content->textContent.PHP_EOL;
    }
    else    {
        $text .= $content->textContent;
    }
}
echo $text;

Which outputs...

Account Details for $account_name :
TOID: $account_to_id_c 

The getElementsByTagName() call returns a list of nodes, so [0] picks out the first (and, in this instance, only) <div> tag. The loop then iterates over that tag's child nodes.

If the tag name starts with an 'h' (so <h1>, <h2> and so on), a new line is added after the text. You could adapt this to pick out certain tags and do something specific with different content types.
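As a rough sketch of that adaptation, the version below walks the tree recursively and adds a line break after every tag in a list of block-level tags. The function name nodeToText() and the $blockTags list are hypothetical, chosen only for this illustration, and the sample HTML is a shortened copy of the one in the question with the style attributes removed.

```php
<?php
// Hypothetical helper (not part of DOMDocument): collect the text of a node,
// appending a line break after each element whose tag is in $blockTags.
function nodeToText(DOMNode $node, array $blockTags): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Plain text node: keep it as-is.
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            // Element: recurse first, then break the line if it is block-level.
            $text .= nodeToText($child, $blockTags);
            if (in_array(strtolower($child->nodeName), $blockTags, true)) {
                $text .= PHP_EOL;
            }
        }
    }
    return $text;
}

// Single quotes keep $account_name as literal text instead of interpolating it.
$html = '<div class="mozaik-inner"><h2>Account Details for $account_name :</h2>'
      . '<p>TOID: $account_to_id_c</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Assumed list of tags that should end with a new line.
$blockTags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div'];
$text = nodeToText($doc->getElementsByTagName('div')->item(0), $blockTags);
echo $text;
```

Because every listed tag contributes its own line break, deeply nested block tags can produce several blank lines in a row; collapsing those afterwards is left out here to keep the sketch short.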

If your content is part of a larger page, you could narrow the way you find the content (for example) by using XPath...

$h2t = new DOMDocument();
$h2t->loadHTML($html);
$xp = new DOMXPath($h2t);

//$contents = $h2t->getElementsByTagName('div');
$contents = $xp->query("//div[@class='mozaik-inner']"); 

This finds a <div> tag whose class attribute is exactly 'mozaik-inner'. The rest of the code stays the same; the only thing that changes is how the HTML to work with is located.
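One caveat with comparing @class directly is that it only matches when the attribute holds that single value. If the element might carry extra classes, a common XPath 1.0 idiom is to match the class name as a whole token. The sketch below assumes a hypothetical page with a second <div> and an extra 'wide' class, neither of which is in the question, and it checks the query result before using it.

```php
<?php
// Hypothetical page: an unrelated <div>, plus the target <div> with a second class.
$html = '<div class="header">Ignore me</div>'
      . '<div class="mozaik-inner wide"><h2>Account Details for $account_name :</h2>'
      . '<p>TOID: $account_to_id_c</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Pad the class list with spaces so "mozaik-inner" only matches as a whole token.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' mozaik-inner ')]";
$contents = $xp->query($query);

$text = '';
// query() returns false for an invalid expression and an empty list when nothing matches.
if ($contents !== false && $contents->length > 0) {
    foreach ($contents->item(0)->childNodes as $content) {
        $text .= $content->textContent . PHP_EOL;
    }
}
echo $text;
```

For brevity this puts each child's text on its own line, whatever the tag is.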
