JQERUY方式筛选采集内容,相信很多大牛都知道这个类库,可自学出身的我还是找了N久,phpquery,Snoopy等一遍一遍尝试,最后才在无意中找到phpSimpleHtmlDom,更让人惊喜的是又找到了中文手册.
一个人的学习,漫长而又艰辛,真希望有时候能得到指点,不至于让时间无辜的流失.
基础代码获取网页建议用CURL,附加POST数据可以登陆后采集
<?php
require_once('./simple_html_dom.php');
$url='http://www.w3cschool.cc/';
$Curl=curl_init();//实例化cURL
curl_setopt($Curl, CURLOPT_URL, $url);//初始化路径
curl_setopt($Curl, CURLOPT_RETURNTRANSFER, 1);//0获取后直接打印出来
curl_setopt($Curl, CURLOPT_HEADER, 1);//0关闭打印相应头,直接打印需为1,
$result=curl_exec($Curl);//执行一个cURL会话
curl_close($Curl);//关闭cURL会话
$html = str_get_html($result);//创建DOM
foreach($html->find('#leftcolumn a') as $element) {
echo $element->href . '<br>';//获取URL
echo $element->plaintext . '<br>';//获取纯文本
}
$html->clear();
unset($html);
中文手册(作者: S.C. Chen):
http://www.ecartchina.com/php-simple-html-dom/index.htm
采集淘宝测试
require_once(‘simple_html_dom.php’);
ini_set(“time_limit”,”0”);
ini_set(“memory_limit”,”512M”);
memory=memorygetusage();echo‘memory:′.(
memory/1024).’KB
’;
echo ‘time:’.date(‘H:i:s’,time()).’
’;
function curl_get_content(
url)$Curl=curlinit();//实例化cURLcurlsetopt($Curl,CURLOPTURL,$url);//初始化路径curlsetopt($Curl,CURLOPTRETURNTRANSFER,1);//0获取后直接打印出来curlsetopt($Curl,CURLOPTHEADER,0);//0关闭打印相应头,直接打印需为1,$result=curlexec($Curl);//执行一个cURL会话curlclose($Curl);//关闭cURL会话return$result;
cateUrl=’http://the-seventh-sense.taobao.com/‘;
cateCon=curlgetcontent(
cateUrl);
cateHtml=strgethtml(
cateCon);//创建DOM
CateList=array();
i=0;
foreach(
cateHtml−>find(′.JTAllCatsTreeli.fst−cat−hda[href∗=category]′)as
element) {
CateList[
i][‘url’]=urldecode(
element−>href);//获取URL
CateList[
i][‘name′]=
element->plaintext;//获取纯文本
i++;
}cateHtml->clear();
unset(
cateHtml);
i=0;
foreach (
CateListas
goodsUrl) {
goodsCon=curlgetcontent(
goodsUrl[‘url’]);
goodsHtml=strgethtml(
goodsCon);//创建DOM
goodsBlock=
goodsHtml->find(‘.shop-hesper-bd .item’);
foreach(
goodsBlockas
goodsElement ) {
goodsList[
i][‘name’]=
goodsElement−>find(“.detail.item−name”,0)−>plaintext;
goodsList[
i][‘price′]=
goodsElement->find(“.detail .c-price”,0)->plaintext;
goodsList[
i][‘img’]=
goodsElement−>find(“.photoaimg”,0)−>src;
goodsList[
i][‘catename′]=
goodsUrl[‘name’];
i++;
}goodsHtml->clear();
unset($goodsHtml);
}
echo ‘
’;
n1=count( CateList);
n2=count( goodsList);
echo ‘采集’. n1.′条栏目′. n2.’个商品
’;
memory=memorygetusage();echo‘memory:′.( memory/1024).’KB
’;
echo ‘time:’.date(‘H:i:s’,time()).’
’;
复制代码
beginmemory:971.953125KB
begintime:05:30:19
overmemory:1352.890625KB
overtime:05:30:39
耗时20s,成功采集9个栏目127个商品