Clean up HTML with PHP to create a clean string

Question

I've got a bunch of HTML data that I'm writing to a PDF file using PHP. In the PDF, I want all of the HTML to be stripped and cleaned up. So for instance:

<ul>
    <li>First list item</li>
    <li>Second list item which is quite a bit longer</li>
    <li>List item with apostrophe 's 's</li>
</ul>

Should become:

First list item
Second list item which is quite a bit longer
List item with apostrophe 's 's

However, if I simply use strip_tags(), I get something like this:

   First list item&#8232;

   Second list item which is quite a bit
longer&#8232;

   List item with apostrophe &rsquo;s &rsquo;s

Also note the indentation of the output.

Any tips on how to properly clean up the HTML into nice, clean strings without messy whitespace and odd characters?

Thanks :)

I doubt that strip_tags() alone will encode your entities. Are you sure you're not missing a call to htmlentities() somewhere?
– Yoshi, Commented May 4, 2012 at 7:28
The indenting is exactly what I'd expect: PHP is stripping the tags, but not the extra text around them.
– scragar, Commented May 4, 2012 at 7:30
Do you mean I should or shouldn't use htmlentities() somewhere? At this moment I'm not. The HTML data comes straight from a database.
– Rein, Commented May 4, 2012 at 7:32
htmlentities() is responsible for things like &nbsp;, so if you don't want them, you should not use it.
– Yoshi, Commented May 4, 2012 at 7:33

xCander · Accepted Answer · 2012-05-04 07:33:18Z

5

The characters seem to be HTML entities. Try:

$clean = html_entity_decode(strip_tags($my_html_code), ENT_QUOTES, 'UTF-8');
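
A small sketch of what that call leaves behind, with the flags and charset passed explicitly. The input string is made up to resemble the data in the question, so treat it as an assumption rather than the real HTML. The point is that a numeric entity like &#8232; does not vanish when decoded: it becomes an invisible U+2028 character that may still need handling.

```php
<?php
// Made-up input resembling the question's data (an assumption, not the real HTML).
$my_html_code = "<li>List item with apostrophe &rsquo;s &rsquo;s</li>&#8232;";

// Flags and charset are passed explicitly so the result does not depend on
// the defaults of the PHP version in use.
$text = html_entity_decode(strip_tags($my_html_code), ENT_QUOTES, 'UTF-8');

// &rsquo; is now a real U+2019 character. &#8232; has become U+2028
// (LINE SEPARATOR), which is still in the string; swap it for a newline.
// "\xE2\x80\xA8" is the UTF-8 byte sequence for U+2028.
$text = str_replace("\xE2\x80\xA8", "\n", $text);

echo $text;
```

Whether U+2028 should become a newline, a space, or nothing depends on how the PDF library treats it, so that last step is a judgment call.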

answered May 4, 2012 at 7:33


1 Comment

Mark Railton Over a year ago

Perfect, this worked a treat for what I was having an issue with.

Mouna Cheikhna · Accepted Answer · 2012-05-04 08:28:02Z

3

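You can decode the result of strip_tags() using html_entity_decode(), or remove the entities using preg_replace():

$text = strip_tags($html_text);
// Delete anything that looks like a named or numeric entity, e.g. &rsquo; or &#8232;
$content = preg_replace("/&#?[a-z0-9]{2,8};/i", "", $text);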

To remove whitespace from the beginning of each line, use ltrim():

$content = join("\n", array_map("ltrim", explode("\n", $content )));

To keep apostrophes, use this instead:

$text = strip_tags($html_text);
// Convert the apostrophe entity first, so the regex below does not delete it
$text = str_replace("&rsquo;", "'", $text);
$content = preg_replace("/&#?[a-z0-9]{2,8};/i", "", $text);
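
For what it's worth, the steps above can be rolled into one helper that decodes entities instead of deleting them, so apostrophes and other characters survive without a special case per entity. This is only a sketch: html_to_clean_text() is a hypothetical name, the input is assumed to be UTF-8, and treating U+2028/U+2029 as line breaks is my assumption about what the &#8232; entities were meant to be.

```php
<?php
/**
 * Sketch: turn an HTML fragment into trimmed, non-empty lines of text.
 * html_to_clean_text() is a hypothetical helper, not a PHP built-in.
 */
function html_to_clean_text(string $html): string
{
    // Remove the tags, then decode entities into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Split on ordinary newlines and on the Unicode line/paragraph
    // separators (U+2028, U+2029) that &#8232; and &#8233; decode to.
    // preg_split() returns false on invalid UTF-8; fall back to no lines.
    $lines = preg_split('/\r\n|\r|\n|\x{2028}|\x{2029}/u', $text) ?: [];

    // Trim both ends of every line, then drop the lines that end up empty.
    $lines = array_map('trim', $lines);
    $lines = array_filter($lines, function ($line) {
        return $line !== '';
    });

    return implode("\n", $lines);
}

$html = "<ul>\n    <li>First list item</li>&#8232;\n    <li>List item with apostrophe &rsquo;s</li>\n</ul>";
echo html_to_clean_text($html);
```

Note that this keeps the typographic apostrophe as-is; if the PDF font cannot render it, a str_replace() to a plain ' afterwards would still be needed.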

edited May 4, 2012 at 8:28

answered May 4, 2012 at 7:37


2 Comments

Rein Over a year ago

That's great! Almost there. The only thing is that the apostrophes are now completely gone. Can that be fixed with a minor adjustment?

Rein Over a year ago

I used preg_replace like in your answer.

s.webbandit · Accepted Answer · 2012-05-04 07:28:05Z

0

Use the PHP Tidy extension to clean your HTML. In your case, though, I'd use the DOMDocument class to extract the data from the HTML.
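
A rough sketch of the DOMDocument route, in case it helps. list_item_texts() is a hypothetical helper name; it assumes the DOM extension is enabled, that the input is UTF-8, and that the data you want lives in <li> elements as in the question's example. The parser decodes entities itself, so no html_entity_decode() call is needed.

```php
<?php
/**
 * Sketch: collect the text of every <li> element in an HTML fragment.
 * list_item_texts() is a hypothetical helper, not part of the DOM extension.
 */
function list_item_texts(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The meta tag hints to the parser that the fragment is UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent is already entity-decoded. Collapse runs of whitespace
        // (including U+2028/U+2029) into one space, then trim the ends.
        $items[] = trim(preg_replace('/[\s\x{2028}\x{2029}]+/u', ' ', $li->textContent));
    }

    return $items;
}

$html = "<ul><li>  First list item&#8232;</li><li>Second item which is\n   longer</li><li>It&rsquo;s here</li></ul>";
echo implode("\n", list_item_texts($html));
```

Working per element means a line break inside a list item is folded back into a single line, which the line-by-line trimming approaches cannot do.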

answered May 4, 2012 at 7:28
