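Assuming you are using valid XHTML, it is simple to parse the HTML and make sure tags are handled properly. You only need to keep track of which tags have been opened so far, and make sure they are closed again on the way out.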
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;  // number of visible characters printed so far
    $position = 0;       // byte offset into $html
    $tags = array();     // stack of currently open tag names

    // Match a tag, a character entity, or (for UTF-8) a multibyte sequence.
    // For UTF-8, we need to count multibyte sequences as one character.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            // Either one counts as a single visible character.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags, innermost first.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}

printTruncated(10, '&lt;Hello&gt; world!'); print("\n");
printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");
printTruncated(10, "Hellow\xC3\xB8rld!"); print("\n");
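Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you can hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
(You should always be using UTF-8, though.)
Edit: Updated to handle character entities and UTF-8. Fixed a bug where the function would print one character too many if that character was a character entity.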
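The conversion workaround for other multibyte encodings could look roughly like the sketch below. It is only an illustration, not part of the function above: truncateInEncoding is a hypothetical helper name, it assumes the mbstring extension is loaded, and instead of converting inside every print statement it captures the output with output buffering and converts it back once at the end.

```php
<?php
// Hypothetical helper (an assumption, not part of the original function):
// runs a UTF-8-only printing callback on input given in another encoding
// and returns the result converted back to that encoding.
function truncateInEncoding(callable $printer, int $maxLength, string $html, string $encoding): string
{
    // Convert the input to UTF-8 so multibyte sequences are counted correctly.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // Capture everything the callback prints, rather than converting
    // at each individual print() call.
    ob_start();
    $printer($maxLength, $utf8);
    $out = ob_get_clean();

    // Convert the captured output back to the caller's encoding.
    return mb_convert_encoding($out, $encoding, 'UTF-8');
}
```

With printTruncated defined as above, a call such as truncateInEncoding('printTruncated', 10, $sjisHtml, 'SJIS') would then return the truncated markup in Shift_JIS, where $sjisHtml stands for whatever Shift_JIS input you have.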