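Scraping is something you both love and hate. The lovable part: it can take you from a data slum to data riches, and in very little time. The hateful part is just as obvious: the people you scrape from, and your competitors, will be gnashing their teeth and losing sleep over it.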
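The scraping I do here is quite simple. It mainly relies on "PHP Simple HTML DOM".
For usage, see: http://www.cnphp.info/php-simple-html-dom-parser-intro.html
It is a very easy-to-use class, especially for anyone used to jQuery. Its selectors are largely the same as jQuery's, so scraping with it rarely requires writing regular expressions. jQuery is good enough that libraries in many languages imitate its design. In Java there is a package called jsoup that provides similar functionality and is just as easy to use; if your backend is Java, take a look at jsoup.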
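Here is a scraping example. The target is this site: http://www.dy9.net/nList/1.html. A screenshot of the site follows:
At a glance, it looks like a video playback site built around Baidu Yingyin (百度影音).
The main part of the list markup is as follows: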
<div class="main">
  <div class="nBox">
    <div class="head">
      <div class="status"><div class="ico"></div></div>
      <h3 class="title">最新动作片</h3>
      <div class="xpage"><span>共2494条数据 页次:1/179页</span><em class="nolink">首页</em><em class="nolink">上一页</em><em>1</em><a href="/nList/1_2.html">2</a><a href="/nList/1_3.html">3</a><a href="/nList/1_4.html">4</a><a href="/nList/1_5.html">5</a><a href="/nList/1_6.html">6</a><a href="/nList/1_7.html">7</a><a href="/nList/1_8.html">8</a><a href="/nList/1_2.html">下一页</a><a href="/nList/1_179.html">尾页</a><span><input type="input" name="page" size="4"><input type="button" value="跳转" onclick="getPageGoUrl(179,'page','/nList/1_<page>.html')" class="btn"></span></div>
    </div>
    <div class="border">

      <div class="tw w50p"> <a href="/movie/23660.html" class="imgBg1"><img src="/pic/uploadimg/2013-10/23660.jpg" alt="特殊身份/终极解码" title="特殊身份/终极解码" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">DVD
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/23660.html">特殊身份/终极解码</a></strong></p>
          <p class="actor">主演:甄子丹,景甜,安志杰,..</p>
          <p>地区:香港</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-28</p>
          <p><a href="/player/23660-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/1668.html" class="imgBg1"><img src="http://i3.ku6img.com/cms/jc/201009/25/16607v0ft_1.jpg" alt="猛龙" title="猛龙" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">全集
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/1668.html">猛龙</a></strong></p>
          <p class="actor">主演:洪金宝,迈克尔.比恩,..</p>
          <p>地区:香港</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-28</p>
          <p><a href="/player/1668-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/4495.html" class="imgBg1"><img src="/pic/uploadimg/2011-7/4495.jpg" alt="第三十九级台阶" title="第三十九级台阶" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/4495.html">第三十九级台阶</a></strong></p>
          <p class="actor">主演:鲁伯特·潘瑞-琼斯,L..</p>
          <p>地区:大陆</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-28</p>
          <p><a href="/player/4495-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/16510.html" class="imgBg1"><img src="/pic/uploadimg/2013-11/201311251248364079.jpg" alt="大洋深处" title="大洋深处" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">暂无
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/16510.html">大洋深处</a></strong></p>
          <p class="actor">主演:克里斯·海姆斯沃斯,汤..</p>
          <p>地区:欧美</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-25</p>
          <p><a href="/player/16510-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/23479.html" class="imgBg1"><img src="/pic/uploadimg/2013-9/23479.jpg" alt="狄仁杰之神都龙王" title="狄仁杰之神都龙王" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">TS抢先版
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/23479.html">狄仁杰之神都龙王</a></strong></p>
          <p class="actor">主演:赵又廷,冯绍峰,林更新..</p>
          <p>地区:大陆</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-24</p>
          <p><a href="/player/23479-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/24048.html" class="imgBg1"><img src="/pic/uploadimg/2013-11/24048.jpg" alt="新雌雄大盗" title="新雌雄大盗" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">DVD
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/24048.html">新雌雄大盗</a></strong></p>
          <p class="actor">主演:Eric,Robert..</p>
          <p>地区:欧美</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-24</p>
          <p><a href="/player/24048-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/19935.html" class="imgBg1"><img src="/pic/uploadimg/2013-11/201311240161448042.jpg" alt="四大名捕2" title="四大名捕2" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">预告
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/19935.html">四大名捕2</a></strong></p>
          <p class="actor">主演:邓超,刘亦菲,邹兆龙,..</p>
          <p>地区:大陆</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-24</p>
          <p><a href="/player/19935-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/21504.html" class="imgBg1"><img src="/pic/uploadimg/2013-4/21504.jpg" alt="雷神2:黑暗世界" title="雷神2:黑暗世界" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">首发
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/21504.html">雷神2:黑暗世界</a></strong></p>
          <p class="actor">主演:克里斯·海姆斯沃斯,汤..</p>
          <p>地区:欧美</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-23</p>
          <p><a href="/player/21504-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/23507.html" class="imgBg1"><img src="/pic/uploadimg/2013-9/23507.jpg" alt="逃出生天3D" title="逃出生天3D" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">TS粤语
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/23507.html">逃出生天3D</a></strong></p>
          <p class="actor">主演:古天乐,刘青云,李心洁..</p>
          <p>地区:香港</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-23</p>
          <p><a href="/player/23507-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/24031.html" class="imgBg1"><img src="/pic/uploadimg/2013-11/24031.jpg" alt="桎梏" title="桎梏" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">BD
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/24031.html">桎梏</a></strong></p>
          <p class="actor">主演:朴雅卡·乔普拉,拉姆·..</p>
          <p>地区:其它</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-22</p>
          <p><a href="/player/24031-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/24020.html" class="imgBg1"><img src="/pic/uploadimg/2013-11/24020.jpg" alt="悍战谍影/谍战马德拉斯" title="悍战谍影/谍战马德拉斯" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">BD
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/24020.html">悍战谍影/谍战马..</a></strong></p>
          <p class="actor">主演:约翰·亚伯拉罕,娜吉丝..</p>
          <p>地区:欧美</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-22</p>
          <p><a href="/player/24020-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/5076.html" class="imgBg1"><img src="/pic/uploadimg/2011-7/5076.jpg" alt="尼姆岛" title="尼姆岛" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/5076.html">尼姆岛</a></strong></p>
          <p class="actor">主演:阿比吉尔·布莱斯林,杰..</p>
          <p>地区:欧美</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-21</p>
          <p><a href="/player/5076-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/24008.html" class="imgBg1"><img src="/pic/uploadimg/2013-11/24008.jpg" alt="间谍/K先生" title="间谍/K先生" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">DVD+BD
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/24008.html">间谍/K先生</a></strong></p>
          <p class="actor">主演:薛景求,文素丽,高昌锡</p>
          <p>地区:韩国</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-21</p>
          <p><a href="/player/24008-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="tw w50p"> <a href="/movie/20032.html" class="imgBg1"><img src="/pic/uploadimg/2012-12/20032.jpg" alt="逃脱" title="逃脱" width="120" height="160" onerror="src='/template/skin/images/nopic.gif'"><span class="imgBg1Bg">DVD+BD
        </span></a>
        <div class="twC2">
          <p><strong><a href="/movie/20032.html">逃脱</a></strong></p>
          <p class="actor">主演:戴克斯·夏普德,克里斯..</p>
          <p>地区:欧美</p>
          <p>类型:动作片</p>
          <p>时间:2013-11-20</p>
          <p><a href="/player/20032-0.html" class="btn1">马上观看</a></p>
        </div>
      </div>

      <div class="page"><span>共2494条数据 页次:1/179页</span><em class="nolink">首页</em><em class="nolink">上一页</em><em>1</em><a href="/nList/1_2.html">2</a><a href="/nList/1_3.html">3</a><a href="/nList/1_4.html">4</a><a href="/nList/1_5.html">5</a><a href="/nList/1_6.html">6</a><a href="/nList/1_7.html">7</a><a href="/nList/1_8.html">8</a><a href="/nList/1_2.html">下一页</a><a href="/nList/1_179.html">尾页</a><span><input type="input" name="page" size="4"><input type="button" value="跳转" onclick="getPageGoUrl(179,'page','/nList/1_<page>.html')" class="btn"></span></div>
    </div>
  </div>
</div>
Now suppose our goal is to scrape each movie's name, region, genre, date, and the corresponding playback address.
The PHP code is as follows:
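// Print the detail-page URL of every movie on the list page.
public function dy9list(){
    // This run uses category page 13; judging by the output, it shares the markup shown above.
    $from = "http://www.dy9.net/nList/13.html";
    $html = file_get_html($from);
    if (!$html) {
        return; // file_get_html() returns false when the page cannot be loaded
    }
    // Each movie sits in its own <div class="tw w50p"> block.
    $info = $html->find("div[class=w50p]");
    foreach ($info as $v) {
        // The first <a> in the block links to the detail page.
        $href = $v->find("a", 0)->href;
        dump($href);
    }
}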
// Output:
string(17) "/movie/24007.html"
string(17) "/movie/23417.html"
string(17) "/movie/23535.html"
string(17) "/movie/24022.html"
string(17) "/movie/23611.html"
string(17) "/movie/23003.html"
string(17) "/movie/21791.html"
string(17) "/movie/23517.html"
string(17) "/movie/24058.html"
string(17) "/movie/23767.html"
string(17) "/movie/21790.html"
string(17) "/movie/22244.html"
string(17) "/movie/23943.html"
string(17) "/movie/23543.html"
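The hrefs printed above are root-relative paths, so they would need the site's scheme and host in front before a detail page could be fetched. A minimal sketch of that step follows; `absoluteUrl` is a hypothetical helper (not part of PHP Simple HTML DOM), and it only handles absolute URLs and root-relative paths, which are the two forms seen in the listing.

```php
<?php
// Hypothetical helper, not from the original post: turn a root-relative
// href such as "/movie/24007.html" into an absolute URL.
function absoluteUrl($base, $href)
{
    // Already absolute (e.g. an image hosted elsewhere): return unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    // Assumption: the href is root-relative ("/..."). Paths relative to the
    // current directory are not resolved by this sketch.
    return $origin . '/' . ltrim($href, '/');
}

echo absoluteUrl('http://www.dy9.net/nList/13.html', '/movie/24007.html'), "\n";
```

For the first href in the output this yields http://www.dy9.net/movie/24007.html.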
The code for scraping the key fields is as follows:
/**
 * www.dy9.net list scraping
 */
public function dy9list(){
    $from = "http://www.dy9.net/nList/13.html";
    $html = file_get_html($from);
    if (!$html) {
        return; // file_get_html() returns false when the page cannot be loaded
    }
    $info = $html->find("div[class=w50p]");
    foreach ($info as $v) {
        $movie = array();                               // start each record empty
        $movie['href'] = $v->find("a", 0)->href;        // detail-page link
        $movie['name'] = $v->find("p", 0)->plaintext;   // title
        $movie['star'] = $v->find("p", 1)->plaintext;   // cast
        $movie['area'] = $v->find("p", 2)->plaintext;   // region
        $movie['type'] = $v->find("p", 3)->plaintext;   // genre
        $movie['time'] = $v->find("p", 4)->plaintext;   // date
        dump($movie);
    }
}
array(6) {
["href"] => string(17) "/movie/24007.html"
["name"] => string(12) "漂亮男人"
["star"] => string(40) "主演:张根硕,李智恩,李章宇.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-29"
}
array(6) {
["href"] => string(17) "/movie/23417.html"
["name"] => string(18) "土豆星球2013.."
["star"] => string(40) "主演:李顺载,吕珍九,河妍秀.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-29"
}
array(6) {
["href"] => string(17) "/movie/23535.html"
["name"] => string(12) "多谢款待"
["star"] => string(40) "主演:杏,东出昌大,原田泰造.."
["area"] => string(15) "地区:日本"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-29"
}
array(6) {
["href"] => string(17) "/movie/24022.html"
["name"] => string(27) "来自风平浪静的明天"
["star"] => string(40) "主演:花江夏树,花泽香菜,石.."
["area"] => string(15) "地区:日本"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-29"
}
array(6) {
["href"] => string(17) "/movie/23611.html"
["name"] => string(12) "继承者们"
["star"] => string(40) "主演:李敏镐,朴信惠,金宇彬.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-29"
}
array(6) {
["href"] => string(17) "/movie/23003.html"
["name"] => string(15) "红宝石戒指"
["star"] => string(40) "主演:李素妍,林贞恩,郑东焕.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-29"
}
array(6) {
["href"] => string(17) "/movie/21791.html"
["name"] => string(15) "丑八怪警报"
["star"] => string(38) "主演:林周焕,姜索拉,申素率"
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/23517.html"
["name"] => string(24) "因为是你才喜欢 .."
["star"] => string(38) "主演:尹海英,李在皇,尹智敏"
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/24058.html"
["name"] => string(22) "欧若拉公主 国语"
["star"] => string(40) "主演:全素敏,孙昌锡,边熙峰.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/23767.html"
["name"] => string(10) "LEGALHIG.."
["star"] => string(9) "主演:"
["area"] => string(15) "地区:日本"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/21790.html"
["name"] => string(15) "欧若拉公主"
["star"] => string(40) "主演:全素敏,孙昌锡,边熙峰.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/22244.html"
["name"] => string(6) "恩熙"
["star"] => string(40) "主演:金恩熙,景秀珍,林成载.."
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/23943.html"
["name"] => string(19) "黄金时刻 国语"
["star"] => string(35) "主演:李善均,黄静茵,李圣"
["area"] => string(15) "地区:韩国"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
array(6) {
["href"] => string(17) "/movie/23543.html"
["name"] => string(6) "猫侍"
["star"] => string(28) "主演:北村一辉,平田"
["area"] => string(15) "地区:日本"
["type"] => string(18) "类型:日韩剧"
["time"] => string(19) "时间:2013-11-28"
}
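// Once organized, the data sits in an array, and adding it to the database is all that remains. You could of course also scrape the movie's poster image here.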
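The dumped fields still carry their on-page labels ("地区:", "类型:" and so on), which you would probably want to drop before saving. A small sketch of one way to do that is below. `stripLabel` is a hypothetical helper; it accepts both the ASCII colon and the full-width colon, since the byte counts in the dump suggest the page uses the full-width one. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: remove everything up to and including the first
// colon (ASCII ":" or full-width ":") at the start of a field.
function stripLabel($text)
{
    return trim(preg_replace('/^[^::]*[::]/u', '', $text, 1));
}

// One record copied from the dump above.
$movie = array(
    'href' => '/movie/24007.html',
    'name' => '漂亮男人',
    'star' => '主演:张根硕,李智恩,李章宇..',
    'area' => '地区:韩国',
    'type' => '类型:日韩剧',
    'time' => '时间:2013-11-29',
);

// Only the labelled fields are cleaned; the title may itself contain a colon.
foreach (array('star', 'area', 'type', 'time') as $key) {
    $movie[$key] = stripLabel($movie[$key]);
}

print_r($movie);
```

The title is deliberately left alone because names such as "雷神2:黑暗世界" contain a colon of their own.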
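That is the general rhythm of scraping. Site structures differ, but the principle is the same: wherever there is a regular pattern, you can scrape.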
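Pagination is one such pattern. In the pager markup shown earlier, page 1 of category 1 is /nList/1.html and later pages are /nList/1_2.html through /nList/1_179.html. A sketch that generates those URLs, with a hypothetical function name and the assumption that other categories follow the same scheme:

```php
<?php
// Build the list-page URLs of one category. The URL scheme is read off the
// pager links in the listing; listPageUrls() itself is a hypothetical name.
function listPageUrls($base, $category, $lastPage)
{
    $urls = array();
    for ($page = 1; $page <= $lastPage; $page++) {
        // Page 1 has no suffix; later pages use "<category>_<page>.html".
        $file   = ($page === 1) ? "{$category}.html" : "{$category}_{$page}.html";
        $urls[] = "{$base}/nList/{$file}";
    }
    return $urls;
}

$urls = listPageUrls('http://www.dy9.net', 1, 179);
echo count($urls), "\n";
echo $urls[0], "\n";
echo $urls[1], "\n";
```

Each of these URLs could then be fed to the list-scraping method in place of the hard-coded `$from`.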
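After scraping, add the data to the database and store each record's source URL along with it. The source can be used for deduplication, and later to point users to the original page.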
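The deduplication idea can be sketched without a real database. The class below is an in-memory stand-in for a table with a unique index on the source URL; `MovieStore` and its methods are hypothetical names used only for illustration.

```php
<?php
// In-memory stand-in for a table keyed by source URL (hypothetical names).
class MovieStore
{
    private $bySource = array();

    // Returns true if the record was stored, false if this source URL
    // had already been scraped.
    public function add(array $movie, $sourceUrl)
    {
        if (isset($this->bySource[$sourceUrl])) {
            return false;
        }
        $movie['source'] = $sourceUrl;   // keep the origin with the record
        $this->bySource[$sourceUrl] = $movie;
        return true;
    }

    public function count()
    {
        return count($this->bySource);
    }
}

$store  = new MovieStore();
$source = 'http://www.dy9.net/movie/24007.html';
var_dump($store->add(array('name' => '漂亮男人'), $source)); // first time: stored
var_dump($store->add(array('name' => '漂亮男人'), $source)); // second time: skipped
```

With a real database, the same effect would usually come from a unique constraint on the source column rather than a lookup in PHP.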
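Here I only scraped the data on the list page. A typical site has list pages and then detail pages, and most of the data lives on the detail page. In particular, the real playback URL of a video can only be scraped from the player page.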
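In the listing shown earlier, every item links both to /movie/<id>.html and to /player/<id>-0.html, so the player page path can be derived from the detail path. A sketch under that assumption follows; `playerPath` is a hypothetical helper, and the "-0" suffix is simply what the listing shows, not a documented rule of the site.

```php
<?php
// Hypothetical helper: derive the player-page path from a detail-page path,
// following the "/movie/<id>.html" and "/player/<id>-0.html" pairs seen in
// the listing. Returns null when the input does not match that shape.
function playerPath($detailPath)
{
    if (!preg_match('#^/movie/(\d+)\.html$#', $detailPath, $m)) {
        return null;
    }
    return "/player/{$m[1]}-0.html";
}

echo playerPath('/movie/23660.html'), "\n";
```

Reading the `a.btn1` link directly from each list item would avoid relying on this naming pattern at all.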
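Many sites process their video resource addresses. Some, for example, base64-encode the Baidu Yingyin URL and decode it at the point of use. Others concatenate several playback addresses with separator characters in between.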
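Undoing both kinds of processing might look like the sketch below. Everything specific in it is an assumption for illustration: the `$$$` separator, the order (base64 first, then split), the sample URLs, and the function name `decodePlayUrls`. A real site has to be inspected to find out what it actually does.

```php
<?php
// Hypothetical decoder: base64-decode a string, then split it into
// individual addresses on an assumed separator.
function decodePlayUrls($encoded, $delimiter = '$$$')
{
    // Strict mode makes base64_decode() return false on invalid input.
    $decoded = base64_decode($encoded, true);
    if ($decoded === false) {
        return array();
    }
    $urls = array();
    foreach (explode($delimiter, $decoded) as $part) {
        $part = trim($part);
        if ($part !== '') {
            $urls[] = $part;
        }
    }
    return $urls;
}

// Made-up sample standing in for a value scraped from a player page.
$sample = base64_encode('http://example.com/ep1$$$http://example.com/ep2');
print_r(decodePlayUrls($sample));
```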
---------------------------------------------------