OK, I've been playing with this for a while. The result may not be the best or most direct solution (and, frankly, I disagree with your approach entirely if arbitrary users are going to be submitting the input), but it appears to "work". And, most importantly, it doesn't use regexes for parsing XML. :)
Faking the input
<?php
$str = <<<EOF
<code><div> a div.. </div></code>
<pre><div> a div.. </div></pre>
<div> this should be ignored </div>
EOF;
?>
Code
<?php
// Objects are passed by handle, so no reference parameters are needed.
function recurse($doc, $parent) {
    if (!$parent->hasChildNodes())
        return;
    foreach ($parent->childNodes as $elm) {
        if ($elm->nodeName == "code" || $elm->nodeName == "pre") {
            // Serialize every child to a string, removing it as we go
            $content = '';
            while ($elm->hasChildNodes()) { // `for` breaks the `removeChild`
                $child = $elm->childNodes->item(0);
                $content .= $doc->saveXML($child);
                $elm->removeChild($child);
            }
            // Re-insert the markup as a single text node
            $elm->appendChild($doc->createTextNode($content));
        }
        else {
            recurse($doc, $elm);
        }
    }
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<root>" . $str . "</root>");
// Iterate the DOM, finding <code> and <pre>
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
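Since arbitrary user input may not be well-formed XML, here is a rough sketch of the same steps wrapped in a function that returns null when parsing fails. The names `escapeCodeBlocks` and `flattenCodeNodes` and the `<root>` wrapper element are my own assumptions for illustration, and the type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: same recursion as above, written without references.
function flattenCodeNodes(DOMDocument $doc, DOMNode $parent): void
{
    if (!$parent->hasChildNodes()) {
        return;
    }
    foreach ($parent->childNodes as $elm) {
        if ($elm->nodeName === 'code' || $elm->nodeName === 'pre') {
            // Serialize each child to a string, then remove it.
            $content = '';
            while ($elm->hasChildNodes()) {
                $child = $elm->firstChild;
                $content .= $doc->saveXML($child);
                $elm->removeChild($child);
            }
            // Put the markup back as one text node.
            $elm->appendChild($doc->createTextNode($content));
        } else {
            flattenCodeNodes($doc, $elm);
        }
    }
}

// Hypothetical wrapper: returns the processed fragment, or null if the
// input is not well-formed XML.
function escapeCodeBlocks(string $fragment): ?string
{
    // Collect parse errors silently instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // "<root>" is an assumed wrapper name; XML needs a single root node.
    $loaded = $doc->loadXML('<root>' . $fragment . '</root>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    flattenCodeNodes($doc, $doc->documentElement);

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $node) {
        $out .= $doc->saveXML($node);
    }
    return $out;
}

echo escapeCodeBlocks('<code><b>bold</b></code><p>untouched</p>'), "\n";
var_dump(escapeCodeBlocks('<code><b>unclosed</code>'));
```

The second call is there to show the failure path: an unclosed tag makes `loadXML` return false, so the caller gets null and can decide how to report it.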
Output
<code>&lt;div&gt; a div.. &lt;/div&gt;</code>
<pre>&lt;div&gt; a div.. &lt;/div&gt;</pre>
<div> this should be ignored </div>
Proof
You can see it working here.
Note that it doesn't explicitly call htmlspecialchars; the markup is re-inserted as a text node, and the DOMDocument object escapes text node content itself when saveXML serializes it.
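To see that escaping in isolation, here is a tiny sketch (the sample string is just something I made up):

```php
<?php
$doc = new DOMDocument();
$code = $doc->createElement('code');
$doc->appendChild($code);

// The markup goes in as plain text; it is not parsed into child elements.
$code->appendChild($doc->createTextNode('<div> a & b </div>'));

// Serializing the element escapes <, > and & inside the text node.
$escaped = $doc->saveXML($code);
echo $escaped, "\n";
```

The `<code>` element comes out with its content entity-escaped, without any call to htmlspecialchars.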
I hope that this helps. :)