This question (originally on Stack Overflow) is about how standardized and/or general the "load HTML" feature of DOM tools is when used for "fix HTML code" tasks, or alternatively (?) about using the basic Tidy module as a de facto standard.
Many complex algorithms can be simplified by using loadHTML to transform "invalid XHTML" into "valid XHTML". I see the "code correction" (performed by load HTML/save XML) as an alternative to a complex "XHTML interleaving algorithm", but I do not know how far this alternative is valid for expressing reusable algorithms that are independent of programming language, browser, etc.
An example shows it better: the "invert italics algorithm", implemented here with jQuery and illustrated below (sec. Illustrating) with Javascript and PHP.
If you do the "invert italics" task with an isolated routine, it will be a complex parser, because it needs to interleave tags. On the other hand, using loadHTML it will be simple (and faster, because it uses built-in functions!).
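To make the "simple" side of that comparison concrete, here is a minimal sketch of the string step only, before any DOM correction happens. The function name `invertItalicTags` is hypothetical (it is not part of any library), and the sample string is the one used in the PHP examples of this question.

```javascript
// Hypothetical helper: swap every <i> with </i> and vice versa,
// then wrap the whole fragment in one outer <i>...</i> pair.
// The result is deliberately NOT well-formed; a tolerant HTML parser
// (innerHTML, loadHTML, Tidy, ...) is expected to repair it afterwards.
function invertItalicTags(html) {
  const swapped = html.replace(/<(\/?)i>/g, (match, slash) =>
    slash ? '<i>' : '</i>'
  );
  return '<i>' + swapped + '</i>';
}

const sample =
  '<b><i>O</i>RIGINAL <big><i>with italics</i> and </big> withOUT</b>';

console.log(invertItalicTags(sample));
// prints: <i><b></i>O<i>RIGINAL <big></i>with italics<i> and </big> withOUT</b></i>
```

The printed string shows the mis-nested tags (for example `<i><b></i>`) that the "load HTML" step has to untangle; everything that would otherwise need an interleaving parser is delegated to that step.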
So, the questions (bearing this example in mind, if you prefer):
- Can I use this algorithm in any browser? Will this "loadHTML accepting errors" feature always be present?
- Can I use it in another (non-browser) context, such as "desktop XML processing"? Is it a "generic algorithm", valid for any DOM2-compliant framework? ... OR "HOW TO write a more generic algorithm with this DOM correction feature?"
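For the browser side of these questions, one possible way to phrase the "load HTML / save XML" round trip without jQuery is sketched below. It assumes the standard `DOMParser` (with the `'text/html'` type) and `XMLSerializer` interfaces are available; the function name `repairToXhtml` is hypothetical, and the sketch returns `null` where those interfaces do not exist (for example, in a plain non-browser JavaScript runtime). It is only a sketch of the idea, not a claim that every DOM toolkit behaves this way.

```javascript
// Hypothetical helper: parse a possibly mis-nested HTML fragment with the
// browser's tolerant HTML parser, then serialize the repaired tree as XML.
function repairToXhtml(htmlFragment) {
  // Assumption: these globals exist only in browser-like environments.
  if (typeof DOMParser === 'undefined' || typeof XMLSerializer === 'undefined') {
    return null; // no tolerant "load HTML" available here
  }
  const doc = new DOMParser().parseFromString(htmlFragment, 'text/html');
  const serializer = new XMLSerializer();
  let out = '';
  // Serialize only the children of <body>, i.e. the repaired fragment.
  for (const node of Array.from(doc.body.childNodes)) {
    out += serializer.serializeToString(node);
  }
  return out;
}
```

Note that `XMLSerializer` adds an XHTML `xmlns` attribute to the serialized elements, so any later regular-expression clean-up would have to take that into account.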
Scope of some terms used here:
HTML code correction: see the HTML Tidy lib and similar tools.
DOM tool: (perhaps) the most popular is the Gnome libxml2 parser, used in a lot of web browsers and in languages like Perl and PHP.
loadHTML: see jQuery.html(myHTMLfragment), Javascript innerHTML, the PHP loadHTML method, etc., or Tidy parseString, etc. PS: the same applies to the way back, with saveXML.
##Illustrating##
The basic algorithm of the example, the "Invert Italics Algorithm", implemented in Javascript and PHP.
###Javascript###
It runs (!). Illustrated with a jQuery implementation:

    // DOM correction as an alternative to an "XML interleaving algorithm"
    var s = $('#sample1').html(); // get the original HTML text fragment
    // INVERSION ALGORITHM: add and remove italics.
    s = "<i>" +
        s.replace(/<(\/?)i>/mg,
            function (m, p1) {
                return p1 ? '<i>' : '</i>';
            }
        ) +
        "</i>"; // not well-formed XHTML, but that is OK...
    $('#inverted').html(s); // ...the DOM gets it all right!
    // minor corrections, to clean up empty elements:
    s = $('#inverted').html();
    s = s.replace(/<([a-z]+)>(\s*)<\/\1>/mg, '$2'); // clean
    s = s.replace(/<([a-z]+)>(\s*)<\/\1>/mg, '$2'); // clean what remains
    $('#inverted').html(s);
    // END ALGORITHM
    alert(s);
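
The two identical "clean" passes above handle at most two levels of nested empty elements. A minimal sketch of a more general variant, assuming the same regular expression is acceptable (it only matches lowercase tags without attributes), repeats the replacement until nothing changes. The name `removeEmptyElements` is hypothetical.

```javascript
// Hypothetical helper: repeatedly strip elements that contain only
// whitespace, keeping the whitespace itself, until a pass changes nothing.
function removeEmptyElements(html) {
  const emptyElement = /<([a-z]+)>(\s*)<\/\1>/g;
  let previous;
  do {
    previous = html;
    html = html.replace(emptyElement, '$2');
  } while (html !== previous);
  return html;
}

console.log(removeEmptyElements('<b><i></i></b>x')); // prints: x
```

In the jQuery version this would replace the two `s.replace(...)` lines with a single call; the same loop translates directly to `preg_replace` in PHP.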
###PHP, with DOMDocument ###
The same, translated to PHP. It does not run very well...

    $sample1 = '<b><i>O</i>RIGINAL <big><i>with italics</i> and </big> withOUT</b>';
    $inverted = '... inverted will be here ...';
    echo $sample1;
    // DOM correction as an alternative to an "XML interleaving algorithm"
    $s = $sample1; // get the original HTML text fragment
    // INVERSION ALGORITHM: add and remove italics.
    $s = "<i>" . preg_replace_callback(
            '/<(\/?)i>/s',
            function ($m) { return $m[1] ? '<i>' : '</i>'; },
            $s
        ) .
        "</i>"; // not well-formed XHTML, but that is OK...
    $doc = new DOMDocument();
    @$doc->loadHTML($s); // ...oops, how do I say "DOM, do the corrections!" here??
    $s = $doc->saveXML();
    // minor corrections, to clean up empty elements:
    $s = preg_replace('/<([a-z]+)>(\s*)<\/\1>/s', '$2', $s); // clean
    $s = preg_replace('/<([a-z]+)>(\s*)<\/\1>/s', '$2', $s); // clean what remains
    // END ALGORITHM
    echo "\n\n$s";
###PHP, with Tidy ###
The same as the Javascript version, translated to PHP, and it shows the same results (!)

    $sample1 = '<b><i>O</i>RIGINAL <big><i>with italics</i> and </big> withOUT</b>';
    $inverted = '... inverted will be here ...';
    echo $sample1;
    // Tidy correction
    $s = $sample1; // get the original HTML text fragment
    // INVERSION ALGORITHM: add and remove italics.
    $s = "<i>" .
        preg_replace_callback(
            '/<(\/?)i>/s',
            function ($m) { return $m[1] ? '<i>' : '</i>'; },
            $s
        ) .
        "</i>"; // not well-formed XHTML, but that is OK...
    $config = array('show-body-only' => true, 'output-xhtml' => true);
    $tidy = new tidy;
    $tidy->parseString($s, $config, 'utf8');
    $s = (string) $tidy; // ... because Tidy corrects!
    // minor corrections, to clean up empty elements:
    $s = preg_replace('/<([a-z]+)>(\s*)<\/\1>/s', '$2', $s); // clean
    $s = preg_replace('/<([a-z]+)>(\s*)<\/\1>/s', '$2', $s); // clean what remains
    // END ALGORITHM
    echo "\n\n$s";