0

I want to use DOMDocument to parse sting came from Rich-Text-Editor, exactly what I need are:

1) Allow only (div, p, span, b, ul, ol, li, blockquotem br) tags, remove others tags with its content

Edit: I'm using strip_tags() for this

2) allow only these styles:

  • style="font-weight:bold"
  • style="font-style: italic"
  • style="text-decoration: underline"

3) remove any attributes in the allowed tags like class, id ...etc except align attribute only

any ideas ?

3
  • 1
    Re your second point: What happens if an element has both bold and italic styles? And what if it's a <ul> or <li> element, because changing it to a <b> or and <i> tag would change how it works. Finally, I would point out that the <u> tag is deprecated; it is recommended to use the text-decoration:underline style instead. Commented Jul 30, 2011 at 20:18
  • @Spudley nice I edit the question to only allow these styles only Commented Jul 30, 2011 at 20:23
  • See question stackoverflow.com/questions/4979836/domdocument-in-php/… Commented Mar 12, 2014 at 8:29

4 Answers 4

1

I would recommend against trying to filter HTML input using DOMDocument for security reasons, in particular, due to the risk of cross-site scripting. You can easily take care of your requirements in 1 and 3 with a filter library like HTML Purifier. For the reasons Spudley mentions, number 2 is a little more difficult. I'd start by whitelisting those style attributes in HTML Purifier and then using some logic to scan for them after filtering, adding the appropriate tags inside that element.

Here's an example for using HTML Purifier how you want (taken from basic.php). The only things I've changed are the HTML.AllowedAttributes and HTML.AllowedElements settings.

<?php
// replace this with the path to the HTML Purifier library
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// configuration goes here:
$config->set('Core.Encoding', 'UTF-8'); // replace with your encoding
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); // replace with your doctype
$config->set('HTML.AllowedAttributes', '*.style, align');
$config->set('HTML.AllowedElements', 'div, p, span, b, ul, ol, li, blockquote, br');
$config->set('CSS.AllowedProperties', 'font-weight, font-style, text-decoration');


$purifier = new HTMLPurifier($config);

$html = '<div align="center" style="font-style:italic; color: red" title="removeme">Allowed</div> <img src="not_allowed.jpg" /> <script>not allowed</script>';

$filteredHtml = $purifier->purify($html);
echo '<pre>' . htmlspecialchars($filteredHtml) . '</pre>';

Which outputs:

<div align="center" style="font-style:italic;">Allowed</div>, 
Sign up to request clarification or add additional context in comments.

6 Comments

HTML Purifier comes with some basic sample code (docs/example/basic.php) to get you started, and there's plenty of documentation online.
Just please post sample code of using HTML Purifier to make sure that will do what I want
@D3VELOPER: Added an example using your tags. You should be able to figure it out from there. If you want to change any other configurations, look at this.
but this allow all style rules I just want to allow just 3 style rules
Added to the code snippet above - please look at the documentation I linked. It's in the CSS section.
|
0

Since you only want to allow a small number of HTML elements, you could consider cleaning up the HTML code with the PHP strip_tags() function prior to giving it to the DOMDocument classes.

This will certainly be easier than parsing the DOM yourself to find elements than need to be stripped.

That should deal with part 1 of your question.

It won't deal with parts 2 or 3, but it's a good start.

2 Comments

I'm already did that :) nice I hope to find answer for part 2 & 3
@D3VELOPER - see my comment on the question for the flaws I see in your plans for part 2.
0

I have a code exactly to do this but its rather undocumented and uses some code I do not own but in public domain. Its pretty easy to use and it ensures all tags are closed so they do not affect your code, use fix_html function for that. It can also limit use of tags and attributes strip_tags_attributes for this, also use strip_javascript to remove javascript functionality of any sort. I used this extensively but to be honest I do not know if this one is from production. For your second answer, I guess its best to remove styles all together so they can use <i> or <b> as they like. And please dont let anyone to use underline.

function findNodeValue($parent, $node) {
    $nodes=array();
    if(!is_a($parent, "DOMElement")) return NULL;

    foreach($parent->childNodes as $child)
        if($child->nodeName==$node) $nodes[]=$child;

    if(count($nodes)==0) return NULL;
    if(count($nodes)==1) return $nodes[0]->nodeValue;
    else {
        $ret=array();
        foreach($nodes as $node)
            $ret[]=$node->nodeValue;

        return $ret;
    }
}

function strip_javascript($filter){ 

    // realign javascript href to onclick 
    $filter = preg_replace("/href=(['\"]).*?javascript:(.*)?\\1/i", "onclick=' $2 '", $filter);

    //remove javascript from tags 
    while( preg_match("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // dump expressions from contibuted content 
    $filter = preg_replace("/:expression\(.*?((?>[^(.*?)]+)|(?R)).*?\)\)/i", "", $filter); 
    $filter = preg_replace("/<iframe.*?>/", "", $filter);
    $filter = preg_replace("/<\/iframe>/", "", $filter);

    while( preg_match("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // remove all on* events    
    while( preg_match("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", $filter, $match) ) {
        $filter = preg_replace("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", "<$1$3>", $filter); 
    }

    return $filter; 
}

function html2a ( $html ) {
  ini_set('pcre.backtrack_limit', 10000);
  ini_set('pcre.recursion_limit', 10000);

  if ( !preg_match_all( '@\<\s*?(\w+)((?:\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?)\>((?:(?>[^\<]*)|(?R))*)\<\/\s*?\\1(?:\b[^\>]*)?\>|\<\s*(\w+)(\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?\/?\>@uxis', $html = trim($html), $m, PREG_OFFSET_CAPTURE | PREG_SET_ORDER) )
    return $html;
  $i = 0;
  $ret = array();
  foreach ($m as $set) {
    if ( strlen( $val = trim( substr($html, $i, $set[0][1] - $i) ) ) )
      $ret[] = $val;
    $val = $set[1][1] < 0 
      ? array( 'tag' => strtolower($set[4][0]) )
      : array( 'tag' => strtolower($set[1][0]), 'val' => html2a($set[3][0]) );
    if ( preg_match_all( '/(\w+)\s*(?:=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+)))?/usix', isset($set[5]) && $set[2][1] < 0 ? $set[5][0] : $set[2][0],$attrs, PREG_SET_ORDER ) ) {
      foreach ($attrs as $a) {
        $val['attr'][$a[1]]=$a[count($a)-1];
      }
    }
    $ret[] = $val;
    $i = $set[0][1]+strlen( $set[0][0] );
  }
  $l = strlen($html);
  if ( $i < $l )
    if ( strlen( $val = trim( substr( $html, $i, $l - $i ) ) ) )
      $ret[] = $val;
  return $ret;
}

function a2html ( $a, $in = "" ) {
  if ( is_array($a) ) {
    $s = "";
    foreach ($a as $t)
      if ( is_array($t) ) {
        $attrs=""; 
        if ( isset($t['attr']) )
          foreach( $t['attr'] as $k => $v )
            $attrs.=" ${k}=".( strpos( $v, '"' )!==false ? "'$v'" : "\"$v\"" );
        $s.= $in."<".$t['tag'].$attrs.( isset( $t['val'] ) ? ">\n".a2html( $t['val'], $in).$in."</".$t['tag'] : "/" ).">";
      } else
        $s.= $in.$t."";
  } else {
    $s = empty($a) ? "" : $in.$a."";
  }
  return $s;
}

function remove_unclosed(&$a, $allowunclosed) {
    if(!is_array($a)) return;

    foreach($a as $k=>$tag) {
        if(is_array($tag)) {
            if(!isset($tag["val"]) && !in_array($tag["tag"],$allowunclosed)) {
                //var_dump($tag["tag"]);
                unset($a[$k]);
            } elseif(is_array(@$tag["val"]))
                remove_unclosed($a[$k]["val"], $allowunclosed);
        }
    }
}

function fix_html($html, $allowunclosed=array("br")) {
    $a = html2a($html);
    remove_unclosed($a, $allowunclosed);
    return a2html($a);
}

function strip_tags_ex($str,$allowtags) { 
    $strs=explode('<',$str); 
    $res=$strs[0]; 
    for($i=1;$i<count($strs);$i++) 
    { 
        if(!strpos($strs[$i],'>')) 
            $res = $res.'&lt;'.$strs[$i]; 
        else 
            $res = $res.'<'.$strs[$i]; 
    } 
    return strip_tags($res,$allowtags);    
}

function strip_tags_attributes($string,$allowtags=allowedtags,$allowattributes=allowedattributes){
    $string=strip_javascript($string);

    $string = strip_tags_ex($string,$allowtags); 

    if (!is_null($allowattributes)) { 
        if(!is_array($allowattributes)) 
            $allowattributes = explode(",",$allowattributes); 
        if(is_array($allowattributes)) 
            $allowattributes = implode(")(?<!",$allowattributes); 
        if (strlen($allowattributes) > 0) 
            $allowattributes = "(?<!".$allowattributes.")"; 
        $string = preg_replace_callback("/<[^>]*>/i",create_function( 
            '$matches', 
            'return preg_replace("/ [^ =]*'.$allowattributes.'=(\"[^\"]*\"|\'[^\']*\')/i", "", $matches[0]);'    
        ),$string); 
    } 
    return $string; 
}

I found the source for strip_javascript http://www.php.net/manual/en/function.strip-tags.php#89453 I don't know why its not there in the code already. Probably because no name, no email no identity to refer.

Comments

0
$allowedTags = array( 'div' => true, 'p' => true, 'span' => true, 'b' => true,
    'ul' => true, 'ol' => true, 'li' => true, 'blockquot' => true, 'em' => true, 'br' => true );

$allowedStyles = array( 'font-weight: bold' => true, 'font-style: italic' => true, 'text-decoration: underline' => true );

$allowedAttribs = array( 'align' => true );

$doc = new DOMDocument();
$doc->loadXML( '<doc><p style="font-weight: bold">test</p> <b align="left">asdfasd faksd</b><script>asdfasd</script></doc>' );

sanitizeNodeChildren( $doc->documentElement );

echo htmlentities( $doc->saveXml() );

function sanitizeNodeChildren( $parentNode ) {
    $node = $parentNode->firstChild;
    while( $node ) {
        if( !sanitizeNode( $node ) ) {
            $nodeToDelete = $node;
            $node = $node->nextSibling;
            $parentNode->removeChild( $nodeToDelete );
        } else {
            sanitizeNodeChildren( $node );
            $node = $node->nextSibling;
        }
    }
}

function sanitizeNode( $node ) {
    global $allowedTags, $allowedStyles, $allowedAttribs;
    if( $node->nodeType == XML_ELEMENT_NODE ) {
        if( !isset( $allowedTags[ $node->tagName ] ) ) return false;

        foreach( $node->attributes as $name => $attrNode ) {
            if( $name == 'style' ) {
                if( isset( $allowedStyles[ $attrNode->nodeValue ] ) ) continue;
            }
            if( isset( $allowedAttribs[ $name ] ) ) continue;
            $node->removeAttribute( $name );
        }
    }

    return true;
}

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.