我想用PHP和CURL废弃一个中文网站.早些时候我遇到了压缩结果的问题,SO帮助我解决了问题.
现在我在通过PHP-DOMDocument解析内容时遇到了麻烦.
错误如下,
Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..
即使警告这是阻止进一步的结果.
我的代码如下:
$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,$url);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312'));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, ""); // handling all compressions
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($result,'utf-8','gb2312');
$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="test"]//a/@href');
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "
[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
我在目标网站上找到了内容类型,
所以我尝试将结果转换为utf-8.
由于输入转换在代码的’DOMDocument :: loadHTML()’行失败,我无法解析网页以获得结果.
我目前陷入困境,任何帮助或建议都将受到高度赞赏. Thanx提前.
(之前我曾经使用简单的HTML DOM解析器,这非常简单.但是后来在阅读SO中关于其用法的缺点.我计划切换到PHP的原生DOM解析器)
解决方法:
我今天看到了解决方案.
$html=new DOMDocument();
$html_source = get_html();
$html_source =mb_convert_encoding( $html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML( $html_source );
标签:php,dom,curl,parsing,web-scraping
来源: https://codeday.me/bug/20190612/1225638.html