php 包含截断,php – 截断包含HTML的文本,忽略标记

最新推荐文章于 2022-03-23 19:39:27 发布

weixin_39937447

最新推荐文章于 2022-03-23 19:39:27 发布

阅读量81

点赞数

文章标签： php 包含截断

假设您使用有效的XHTML，可以很容易地解析HTML并确保标记被正确处理。您只需要跟踪到目前为止已打开的标签，并确保再次“关闭”。

header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)

{

$printedLength = 0;

$position = 0;

$tags = array();

// For UTF-8, we need to count multibyte sequences as one character.

$re = $isUtf8

? '{?([a-z]+)[^>]*>|?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'

: '{?([a-z]+)[^>]*>|?[a-zA-Z0-9]+;}';

while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))

{

list($tag, $tagPosition) = $match[0];

// Print text leading up to the tag.

$str = substr($html, $position, $tagPosition - $position);

if ($printedLength + strlen($str) > $maxLength)

{

print(substr($str, 0, $maxLength - $printedLength));

$printedLength = $maxLength;

break;

}

print($str);

$printedLength += strlen($str);

if ($printedLength >= $maxLength) break;

if ($tag[0] == '&' || ord($tag) >= 0x80)

{

// Pass the entity or UTF-8 multibyte sequence through unchanged.

print($tag);

$printedLength++;

}

else

{

// Handle the tag.

$tagName = $match[1][0];

if ($tag[1] == '/')

{

// This is a closing tag.

$openingTag = array_pop($tags);

assert($openingTag == $tagName); // check that tags are properly nested.

print($tag);

}

else if ($tag[strlen($tag) - 2] == '/')

{

// Self-closing tag.

print($tag);

}

else

{

// Opening tag.

print($tag);

$tags[] = $tagName;

}

// Continue after the tag.

$position = $tagPosition + strlen($tag);

}

// Print any remaining text.

if ($printedLength < $maxLength && $position < strlen($html))

print(substr($html, $position, $maxLength - $printedLength));

// Close any open tags.

while (!empty($tags))

printf('%s>', array_pop($tags));

}

printTruncated(10, '<Hello> world!'); print("\n");

printTruncated(10, '

Heck,	throw
in a	table

'); print("\n");

printTruncated(10, "Hellow\xC3\xB8rld!"); print("\n");

编码注意：上面的代码假设XHTML是UTF-8编码。也支持ASCII兼容的单字节编码(例如Latin-1)，只是传递false作为第三个参数。不支持其他多字节编码，虽然您可以通过使用mb_convert_encoding在调用函数之前转换为UTF-8，然后在每个打印语句中再次转换来支持。

(你应该总是使用UTF-8，虽然。)

编辑：更新以处理字符实体和UTF-8。修正了一个函数打印一个字符太多的错误，如果该字符是一个字符实体。

weixin_39937447

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
php 包含截断,php – 截断包含HTML的文本,忽略标记

假设您使用有效的XHTML，可以很容易地解析HTML并确保标记被正确处理。您只需要跟踪到目前为止已打开的标签，并确保再次“关闭”。header('Content-type: text/plain; charset=utf-8');function printTruncated($maxLength, $html, $isUtf8=true){$printedLength = 0;$position...
复制链接

扫一扫