PHP data scraping with curl + simple_html_dom: a summary

Latest recommended article published on 2021-05-31 09:02:33

dychen1026

Category: php+mysql

Article link: https://blog.csdn.net/cdy102688/article/details/26380347

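Background: a newly started project required data scraping. I had never done this before, so at first I expected it to be a lot of trouble. After googling "php data scraping", I settled on the approach that is in common use: curl + simple_html_dom. What follows is a summary of how I actually applied it.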

Step 1: Wrap curl in a helper function that sets all the curl options.

/**
 * Fetch a page with an HTTP GET request via curl.
 *
 * @param string $url     Address of the page to scrape
 * @param string $refer   Referer header to send with the request
 * @param int    $timeout Maximum time in seconds the whole request may take
 * @return string|false   Response body; false or an empty string if every attempt failed
 */
function getdatabycurl($url, $refer = "http://www.baidu.com", $timeout = 30) {
    // Declare the output charset, but only while headers can still be sent
    if (!headers_sent()) {
        header("Content-type: text/html; charset=utf-8");
    }
    // Temporary file used to store cookies between requests
    $cookiefile = realpath("./") . "/Application/Runtime/Temp/cookie.txt";
    if (!file_exists($cookiefile)) {
        @file_put_contents($cookiefile, "");
    }
    $ch = curl_init();
    // Set the options, including the URL
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_HEADER, 0); // do not include response headers in the output
    curl_setopt($ch, CURLOPT_NOBODY, 0); // do download the response body
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)');
    curl_setopt($ch, CURLOPT_MAXREDIRS, 300);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string instead of printing it
    //curl_setopt($ch, CURLOPT_AUTOREFERER, true); // set the Referer header automatically when following a redirect
    curl_setopt($ch, CURLOPT_REFERER, $refer); // the page the request claims to come from
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow "Location:" redirects; CURLOPT_MAXREDIRS caps how many are followed

    // Cookie handling
    //curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    curl_setopt($ch, CURLOPT_COOKIESESSION, true); // start a new cookie session, ignoring stored session cookies
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile); // read previously saved cookies from this file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); // when the handle is closed, save the cookies returned by the server to this file
    // Execute the request and fetch the HTML document.
    // If the response is empty or the request failed, try again: up to 6 attempts in total.
    for ($i = 0; $i <= 5; $i++) {
        $output = curl_exec($ch);
        if (!empty($output)) {
            break;
        }
    }
    // Release the curl handle
    curl_close($ch);
    //unlink($cookiefile);
    return $output;
}

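Notes: $url is the address of the page to scrape. $refer is the Referer that the request presents as its source; the domain of the page being scraped is a reasonable value. $timeout is passed to CURLOPT_TIMEOUT, which limits the total time of the request in seconds, not just the connection phase.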
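The retry loop inside getdatabycurl can also be factored out so that the number of attempts and the pause between them are explicit. The following is only a minimal sketch: retryUntilNonEmpty() and its parameters are hypothetical names introduced here, and the demo uses a stand-in closure instead of a real curl request.

```php
<?php
/**
 * Call $fetch until it returns a non-empty value or $maxAttempts is reached.
 * retryUntilNonEmpty() is a hypothetical helper, not part of the curl extension.
 *
 * @param callable $fetch       Function that performs one attempt and returns the body
 * @param int      $maxAttempts Total number of attempts, including the first one
 * @param int      $delayMs     Pause between attempts in milliseconds
 * @return mixed                The first non-empty result, or the last (empty) one
 */
function retryUntilNonEmpty(callable $fetch, int $maxAttempts = 3, int $delayMs = 0)
{
    $result = false;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if (!empty($result)) {
            break;
        }
        // Pause before the next attempt, but not after the last one
        if ($delayMs > 0 && $attempt < $maxAttempts) {
            usleep($delayMs * 1000);
        }
    }
    return $result;
}

// Demo with a stand-in fetcher that fails twice before succeeding.
$calls = 0;
$body = retryUntilNonEmpty(function () use (&$calls) {
    $calls++;
    return $calls < 3 ? false : '<html>ok</html>';
});
echo $calls, ' ', $body, PHP_EOL;
```

In real use the closure would wrap a single curl_exec call; a short delay between attempts is usually kinder to the target server than retrying immediately.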

Step 2: Include the simple_html_dom class file.

include_once 'simple_html_dom.php';

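Step 3: Use simple_html_dom's str_get_html function to parse the string returned by curl into an object whose elements can be picked out with selectors. str_get_html returns false when the input is empty or larger than the library's size limit, so check the result before using it.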

$html = str_get_html($goods_detail_html);

Step 4: Use simple_html_dom's find method to locate the elements to scrape, then read their content.

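// $i is the zero-based index of the matching <li>; find() returns null when no such element exists
$e = $html->find('.attributes-list li', $i);
$attr_name = $e !== null ? $e->plaintext : '';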
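To check a selector without loading simple_html_dom, the same selection can be approximated with PHP's bundled DOM extension. This is a swapped-in technique, not the approach used in this article: the sketch below assumes DOMDocument and DOMXPath are available, the helper name extractAttributeNames() is hypothetical, and the sample HTML is made up for illustration.

```php
<?php
/**
 * Return the trimmed text of every <li> inside an element with class "attributes-list".
 * extractAttributeNames() is a hypothetical helper built on the DOM extension,
 * not on simple_html_dom.
 */
function extractAttributeNames(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The XML prolog hints to the parser that the input is UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector ".attributes-list li"
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' attributes-list ')]//li";
    $names = [];
    foreach ($xpath->query($query) as $li) {
        $names[] = trim($li->textContent);
    }
    return $names;
}

// Made-up sample markup; the second list has no matching class and is skipped.
$sample = '<ul class="attributes-list"><li>Brand: Acme</li><li>Weight: 2kg</li></ul>'
        . '<ul><li>ignored</li></ul>';
print_r(extractAttributeNames($sample));
```

With simple_html_dom itself, calling find() without an index returns an array of all matching elements, which can be looped over in the same way.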

Step 5: Check whether the encoding of the extracted text is correct; if it is not, convert it.

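/**
 * Convert non-UTF-8 content to UTF-8.
 *
 * @param string $comment Text encoded as UTF-8, ASCII, GB2312 or GBK
 * @return string         The text encoded as UTF-8
 */
function changeChartset($comment){
    // Strict mode makes detection reject byte sequences that are invalid in a candidate encoding
    $encoding = mb_detect_encoding($comment, array("UTF-8","ASCII","GB2312","GBK"), true);
    // Nothing to convert if detection failed or the text is already valid UTF-8
    if ($encoding === false || $encoding === 'UTF-8' || $encoding === 'ASCII') {
        return $comment;
    }
    $comment = iconv($encoding, 'UTF-8', $comment);
    return $comment;
}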
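Encoding detection with mb_detect_encoding is a heuristic and can guess wrong on short strings. When the only realistic alternative to UTF-8 is GBK, one option is to validate first and convert only if validation fails. The sketch below makes that assumption explicit: toUtf8() and its $fallback parameter are hypothetical names, and CP936 is used as mbstring's name for GBK.

```php
<?php
/**
 * Return $text as UTF-8, assuming anything that is not valid UTF-8 is in $fallback.
 * toUtf8() is a hypothetical helper; the GBK (CP936) fallback is an assumption
 * that fits Chinese-language pages, not a general rule.
 */
function toUtf8(string $text, string $fallback = 'CP936'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$gbk = "\xD6\xD0\xCE\xC4";               // the two characters of "中文" encoded as GBK
echo bin2hex(toUtf8($gbk)), PHP_EOL;     // e4b8ade69687, the UTF-8 bytes of the same text
echo toUtf8('plain ascii'), PHP_EOL;     // returned unchanged
```

The trade-off is that text in any third encoding would be converted incorrectly, so this only suits sources whose possible encodings are known in advance.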

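That completes a simple PHP scraper. From what I found on Google, multi-threaded scraping requires first installing an extension that supports multithreading. My project runs on a single server that already hosts other projects, so I did not take the risk of installing one. When I get the chance, I will try multi-threaded development in PHP.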
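Fetching several pages at once does not strictly require a threading extension: the curl extension itself provides the curl_multi_* functions, which run multiple transfers concurrently within one PHP process. The following is only a sketch under simple assumptions: fetchAll() is a hypothetical helper name, it sets just a few options rather than everything getdatabycurl configures, and it has not been tuned for large URL lists.

```php
<?php
/**
 * Fetch several URLs concurrently with curl_multi.
 * fetchAll() is a hypothetical helper; it returns [url => body, or null on failure].
 * Duplicate URLs collapse into one entry because the URL is used as the array key.
 */
function fetchAll(array $urls, int $timeout = 30): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running
    do {
        $status = curl_multi_exec($mh, $running);
        // Wait for network activity; if select fails, sleep briefly to avoid a busy loop
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $results[$url] = ($body === null || $body === '') ? null : $body;
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

A call such as fetchAll(array('http://example.com/a', 'http://example.com/b')) would return the two bodies keyed by URL; the example.com addresses are placeholders. Each body can then go through str_get_html as in step 3.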