5

I've got a bunch of HTML data that I'm writing to a PDF file using PHP. In the PDF, I want all of the HTML to be stripped and cleaned up. So for instance:

<ul>
    <li>First list item</li>
    <li>Second list item which is quite a bit longer</li>
    <li>List item with apostrophe 's 's</li>
</ul>

Should become:

First list item
Second list item which is quite a bit longer
List item with apostrophe 's 's

However, if I simply use strip_tags(), I get something like this:

   First list item&#8232;

   Second list item which is quite a bit
longer&#8232;

   List item with apostrophe &rsquo;s &rsquo;s

Also note the indentation of the output.

Any tips on how to properly clean up the HTML into nice, clean strings without messy whitespace and odd characters?

Thanks :)

4
  • 2
    I doubt that strip_tags() alone will encode your entities. Are you sure you're not missing a call to htmlentities somewhere? Commented May 4, 2012 at 7:28
  • 1
    The indenting is exactly what I'd expect, PHP is stripping the tags, but not the extra text around them. Commented May 4, 2012 at 7:30
  • Do you mean I should or shouldn't use htmlentities() somewhere? At this moment I'm not. The HTML data comes straight from a database. Commented May 4, 2012 at 7:32
  • htmlentities() is what produces things like &#8232;, so if you don't want them, you should not call it. Commented May 4, 2012 at 7:33

3 Answers 3

5

The characters seem to be HTML entities. Try:

$text = html_entity_decode(strip_tags($my_html_code), ENT_QUOTES, 'UTF-8');
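
For what it's worth, here is a minimal sketch of the same idea wrapped in a function. It assumes the input is UTF-8 and passes the charset explicitly, because before PHP 5.4 html_entity_decode() defaulted to ISO-8859-1, which cannot represent &rsquo;. It also assumes you want the U+2028 line separator (what &#8232; decodes to) dropped and the blank lines removed. The name html_to_plain_text is made up for illustration.

```php
<?php
// Hypothetical helper name; not a built-in PHP function.
function html_to_plain_text(string $html): string
{
    // Strip tags first, then decode, so a decoded "<" is never mistaken for a tag.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &#8232; decodes to U+2028 (LINE SEPARATOR); remove it and U+2029.
    // preg_replace() returns null on invalid UTF-8, so fall back to the decoded text.
    $text = preg_replace('/[\x{2028}\x{2029}]/u', '', $text) ?? $text;

    // Trim every line and keep only the ones that are not empty.
    $lines = array_map('trim', explode("\n", $text));
    $lines = array_filter($lines, static function (string $line): bool {
        return $line !== '';
    });

    return implode("\n", $lines);
}
```

Called on the list from the question, this should give one item per line with no leading indentation. Note that the apostrophes come out as the typographic character U+2019, so the PDF font has to support it.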

1 Comment

Perfect, this worked a treat for what I was having an issue with.
3

You can decode the entities left in the result of strip_tags() using html_entity_decode(), or remove them entirely using preg_replace():

$text = strip_tags($html_text);
$content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$text );

To remove whitespace from the beginning of each line, use ltrim():

$content = join("\n", array_map("ltrim", explode("\n", $content )));

To keep the apostrophes, replace &rsquo; with a plain apostrophe before removing the remaining entities:

$text = strip_tags($html_text);
$text = str_replace("&rsquo;","'", $text); 
$content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$text );
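
If more entities than &rsquo; need to survive, the steps above can be combined into one function. This is only a sketch: the function name strip_html_to_ascii and the contents of the entity map are assumptions for illustration, so extend the map to whatever your data actually contains.

```php
<?php
// Hypothetical helper; the entity map below is an illustrative assumption.
function strip_html_to_ascii(string $html): string
{
    $map = [
        '&rsquo;' => "'",
        '&lsquo;' => "'",
        '&rdquo;' => '"',
        '&ldquo;' => '"',
        '&amp;'   => '&',
        '&nbsp;'  => ' ',
    ];

    $text = strip_tags($html);

    // Swap the entities we want to keep for plain ASCII equivalents.
    $text = strtr($text, $map);

    // Remove any named or numeric entities that remain (same pattern as above).
    $text = preg_replace('/&#?[a-z0-9]{2,8};/i', '', $text);

    // Left-trim every line, as in the ltrim step above.
    return implode("\n", array_map('ltrim', explode("\n", $text)));
}
```

Mapping to ASCII rather than decoding is a deliberate choice here: it avoids depending on Unicode support in the PDF font, at the cost of losing any entity that is not in the map.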

2 Comments

That's great! Almost there. The only thing is that the apostrophes are now completely gone. Can that be fixed with a minor adjustment?
I used preg_replace like in your answer.
0

Use the PHP Tidy extension to clean your HTML. In your case, though, I'd use the DOMDocument class to extract the data from the HTML.
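
A rough sketch of the DOMDocument route, assuming the text you care about sits in <li> elements and the input is UTF-8. The function name list_items_from_html is hypothetical, and the XML encoding prefix is a common workaround rather than an official API for telling loadHTML() the charset.

```php
<?php
// Hypothetical helper; assumes the wanted text lives in <li> elements.
function list_items_from_html(string $html): array
{
    $doc = new DOMDocument();

    // The encoding hint is a widely used workaround so the fragment is read as UTF-8.
    // The LIBXML flags silence warnings about the markup being a fragment.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NOERROR | LIBXML_NOWARNING);

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent is already entity-decoded; collapse whitespace runs,
        // including U+2028 and U+2029, into single spaces.
        $flat = preg_replace('/[\s\x{2028}\x{2029}]+/u', ' ', $li->textContent);
        $items[] = trim((string) $flat);
    }

    return $items;
}
```

Joining the returned array with implode("\n", ...) gives one list item per line. Because the parser handles both the tags and the entities, no separate strip_tags() or html_entity_decode() call is needed.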
