Extracting HTML Content from XML in PHP with SimpleXMLElement and DOMDocument

Latest recommended article published on 2021-07-01 18:24:10

IOsetting

Category: PHP. Tags: java php mysql xml html

Article link: https://blog.csdn.net/michaelchain/article/details/119630848

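simplexml_load_file in PHP handles standard XML well, but it has two shortcomings:
1. By default, CDATA content does not show up when the parsed tree is dumped (for example with print_r), although casting the node to a string still returns it.
2. HTML tags embedded in the XML are visible in the parent node, but in this case they could not be reached through xpath().
The first point can be solved by passing LIBXML_NOCDATA to simplexml_load_file. The second has no direct fix; the workaround is to extract the HTML node and then use DOMDocument to pull out the required content. Example code: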

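// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes
$xml = simplexml_load_file("embeded_html.xml", null, LIBXML_NOCDATA);
if ($xml === false) {
	exit("Failed to load XML" . PHP_EOL);
}
$node = $xml->xpath("/PathToHere/ContentItem/DataContent");
if (empty($node)) {
	exit("DataContent node not found" . PHP_EOL);
}
// Serialize the embedded HTML (the first child of DataContent) back to a string
$children = $node[0]->children();
$html = $children->asXML();
//print_r($html);
// Re-parse the string with DOMDocument's lenient HTML parser;
// collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
// Get content: print the text of every <div class="content-attr">
$items = $dom->getElementsByTagName('div');

foreach ($items as $item) {
	if ($item->getAttribute('class') == 'content-attr') {
		echo $item->nodeValue, PHP_EOL;
	}
}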
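
The CDATA point is easier to see on a tiny inline document. The following is a minimal sketch, not taken from the original feed: the `<root>` and `<title>` element names and the sample text are made up for illustration. It loads the same string with and without LIBXML_NOCDATA and compares what a string cast and an array cast (the same view print_r gives) report.

```php
<?php
// Hypothetical sample document; element names are assumptions for illustration.
$xmlString = '<root><title><![CDATA[Hello <b>world</b>]]></title></root>';

$plain  = simplexml_load_string($xmlString);
$merged = simplexml_load_string($xmlString, null, LIBXML_NOCDATA);

// A string cast returns the CDATA text with or without the flag.
echo (string) $plain->title, PHP_EOL;
echo (string) $merged->title, PHP_EOL;

// The array view (what print_r displays) differs: without the flag the
// CDATA-only element is an empty-looking object, with the flag it is a string.
$plainFields  = (array) $plain;
$mergedFields = (array) $merged;
var_dump(is_string($plainFields['title']));
var_dump(is_string($mergedFields['title']));
```

So the content is not lost without the flag; it is only hidden from dumps and array or JSON conversions, which is where LIBXML_NOCDATA helps.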

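Addendum: the other party's engineers later provided another solution, which walks down to the embedded HTML through plain property access: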

print_r($xml->NewsItem->NewsComponent->ContentItem->DataContent->html->body);
print_r($xml->NewsItem->NewsComponent->ContentItem->DataContent->html->body->div[1]);
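
To try the property-access approach without the original feed, here is a small self-contained sketch. The XML below is a hypothetical stand-in: the `NewsML` root name, the `title` class, and the text values are assumptions, and only the `NewsItem` to `body` path and the `content-attr` class come from the code above. It also assumes the embedded HTML carries no XML namespace.

```php
<?php
// Hypothetical sample; only the NewsItem ... html->body path mirrors the article.
$xmlString = <<<XML
<NewsML>
  <NewsItem>
    <NewsComponent>
      <ContentItem>
        <DataContent>
          <html>
            <body>
              <div class="title">Headline</div>
              <div class="content-attr">Body text</div>
            </body>
          </html>
        </DataContent>
      </ContentItem>
    </NewsComponent>
  </NewsItem>
</NewsML>
XML;

$xml = simplexml_load_string($xmlString);

// Walk the element hierarchy as object properties.
$body = $xml->NewsItem->NewsComponent->ContentItem->DataContent->html->body;

// Repeated child elements are indexed from 0, so div[1] is the second <div>.
$second = (string) $body->div[1];
// Attributes are read with array syntax on the element.
$class = (string) $body->div[1]['class'];

echo $class, ': ', $second, PHP_EOL;
```

Selecting by index is brittle if the number or order of the `<div>` elements changes; looping over `$body->div` and checking the `class` attribute is a sturdier variant of the same idea.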
