A Scraper Program

A recent project needed a scraper to populate a website with data, so I read up on the topic and wrote a small one.

It consists of two files: (1) link discovery, and (2) scraping the data, packaging it, and storing it in the database.

1. xunzhi.php (link discovery)

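<?php
ini_set('max_execution_time', '0'); // no time limit: a full crawl takes a long time
header('content-type:text/html; charset=utf-8'); // must be sent before any output
require_once(dirname(__FILE__) . '/../application/initialize.php');
require_once 'index2.php';

$content = file_get_contents("http://****/beijing/districts");
preg_match_all('/<span class="district_info">[^.]{1,300}<\/span>/', $content, $district);

// Get the listing URL of each district
foreach ($district[0] as $districthtml) {
    if (!preg_match('/\/beijing\/[\w]*\/all/', $districthtml, $districturl)) {
        continue; // this block contains no listing link
    }
    $districtflag = true;
    $page = 1;

    echo $districthtml . "Start scraping " . $districturl[0] . "<br/>";
    while ($districtflag) {
        $content = file_get_contents("http://****" . $districturl[0] . "/p" . $page);
        if ($content === false) {
            break; // request failed: move on to the next district
        }
        preg_match_all('/<h1 class="name">[^.]{1,500}<\/h1>/', $content, $restaurant);
        if ($restaurant[0]) {
            foreach ($restaurant[0] as $value) {
                if (!preg_match('/restaurant\/[\w]*/', $value, $url)) {
                    continue; // heading without a restaurant link
                }
                $id = caijirestaurant("http://meican.com/" . $url[0]);
                $condition = array(
                    "restaurant" => $id,
                    "url" => "http://****/" . $url[0]
                );
                Restaurant::caiji($condition);
                echo "http://***/" . $url[0], "<br/>" . "Scraped";

                // Push progress to the browser: PHP's output buffer first, then the system buffer
                if (ob_get_level() > 0) {
                    ob_flush();
                }
                flush();
            }
            echo "Page " . $page;
            $page++;
        } else {
            $districtflag = false; // an empty page means this district is done
        }
    }
}
echo "Scraping finished hhhwoku";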
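
The paging loop above keeps requesting /p1, /p2, ... until a page contains no restaurant entries. Below is a minimal sketch of that idea as a reusable function. It is not the code used in the project: the names crawlListing, $fetch, $fakeFetch and $realFetch, the example.test host, and the $maxPages safety cap are all assumptions made for illustration. Passing the fetcher in as a callable lets the loop be tried offline with canned pages.

```php
<?php
/**
 * Crawl numbered listing pages until one comes back without entries.
 * $fetch is any callable that takes a URL and returns the page HTML,
 * or false on failure (hypothetical helper, not part of the original script).
 */
function crawlListing(callable $fetch, $baseUrl, $maxPages = 500)
{
    $found = array();
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetch($baseUrl . "/p" . $page);
        if ($html === false) {
            break; // request failed: stop instead of looping forever
        }
        // Same kind of pattern as in xunzhi.php: pick out restaurant/<slug> links
        preg_match_all('/restaurant\/\w+/', $html, $matches);
        if (!$matches[0]) {
            break; // an empty page marks the end of the listing
        }
        foreach ($matches[0] as $path) {
            $found[$path] = true; // array keys de-duplicate repeated links
        }
    }
    return array_keys($found);
}

// A real fetcher could wrap file_get_contents() with a timeout (defined here, not called).
$realFetch = function ($url) {
    $ctx = stream_context_create(array('http' => array('timeout' => 10)));
    return @file_get_contents($url, false, $ctx);
};

// Offline usage: a fake fetcher serving two canned pages, then nothing.
$pages = array(
    "http://example.test/beijing/demo/all/p1" =>
        '<h1 class="name"><a href="/restaurant/abc">A</a></h1>' .
        '<h1 class="name"><a href="/restaurant/def">B</a></h1>',
    "http://example.test/beijing/demo/all/p2" =>
        '<h1 class="name"><a href="/restaurant/abc">A</a></h1>',
);
$fakeFetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : "";
};
$links = crawlListing($fakeFetch, "http://example.test/beijing/demo/all");
print_r($links);
```

The page cap and the failed-request check are there because a site that always returns some entries, or a network error, would otherwise keep a while loop running indefinitely.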

2. Packaging the data and storing it in the database

<?php
function caijirestaurant($url){
    $restaurantid = "";
    $content = file_get_contents($url);
    //print_r($content);
    // split() was removed in PHP 7: explode() handles literal delimiters, preg_split() handles patterns
    $data = preg_split('/<body id="restaurant" class="[\s]*">/', $content);
    $dishes = explode('<ul class="dishes">', $data[1]);
    $dishes = explode('<div class="right">', $dishes[1]);
    $dishesdata = preg_split('/<li class="dish[^.]{4,100}is-section">[^.]{4,250}<\/li>/', $dishes[0]); // dish groups, cut at each category header
    $categorydata = preg_match_all('/<li class="dish[^.]{4,100}is-section">[^.]{4,250}<\/li>/', $dishes[0], $category); // category headers
    $rate = explode('<div class="restaurant_rate', $data[1]);
    preg_match_all('/rate[\S]{1}/', $rate[1], $ratedata);
    $ratedata = substr($ratedata[0][0], 4); // keep the character that follows "rate"
    $address = explode('<div id="content">', $content);
    $addressdata = explode('<ul class="restaurant_list not_index">', $address[1]);
    //print_r($addressdata[0]);
    $addressdata = explode("</span>", $addressdata[0]);
    //print_r($addressdata);
    $address = strip_tags($addressdata[2]) . strip_tags($addressdata[3]);
    $address = explode("&rsaquo;", $address);
    $address = trim($address[1]) . trim($address[2]);
    preg_match_all('/<h1 class="name">[\s\S]+<\/h1>/', $content, $restaurant);
    $name = strip_tags($restaurant[0][0]); // restaurant name
    preg_match_all('/<div class="tel">[\s\S]{1,200}<\/div>/', $content, $tel);
    preg_match_all('/[\d]{1,4}[\s\S]{1,50}[\d]{1,5}/', $tel[0][0], $phone);
    // phone number; a second match is appended only if one was found
    $phone = strip_tags($phone[0][0]) . (isset($phone[0][1]) ? strip_tags($phone[0][1]) : "");
    preg_match_all('/<input type="hidden" id="center_latitude"[\s\S]+\/>/', $content, $latlng);
    preg_match_all('/value="[\s\S]{1,20}"/', $latlng[0][0], $marker); // latitude and longitude
    $lat = $marker[0][0];
    $lng = $marker[0][1];
    preg_match_all('/<table class="restaurant_info_all">[\s\S]+<\/table>/', $content, $infor);
    preg_match_all('/<tr class="restaurant_info_item">[\s\S]{1,200}<\/tr>/', $infor[0][0], $infornation);
    // NOTE: $time is reassigned further down, when the info rows are split
    preg_match_all('/<td>[\s\S]+<\/td>/', $infornation[0][1], $time);
    $time = trim(strip_tags($time[0][0]));
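    // The caller has usually echoed output already, so only send the header while that is still possible
    if (!headers_sent()) {
        header("content-type:text/html; charset=utf-8");
    }

    $lat = explode("\"", $lat);
    $lng = explode("\"", $lng);

    // Info table rows, in order: ordering hours, delivery charge, minimum order, delivery scope
    $information = explode("</tr>", $infor[0][0]);
    $time = explode("<td>", $information[0]);
    $timedata = trim(strip_tags($time[1]));
    if (strlen($timedata) < 2) {
        $timedata = "";
    }
    $delivery = explode("<td>", $information[1]);
    $deliverydata = trim(strip_tags($delivery[1]));
    if (strlen($deliverydata) < 2) {
        $deliverydata = "";
    }
    $minimum = explode("<td>", $information[2]);
    $minmumdata = trim(strip_tags($minimum[1]));
    if (strlen($minmumdata) < 2) {
        $minmumdata = "";
    }
    $scope = explode("<td>", $information[3]);
    $scopedata = trim(strip_tags($scope[1]));
    if (strlen($scopedata) < 2) {
        $scopedata = "";
    }
    // A restaurant with no minimum order, no delivery scope and no ordering hours is treated as offline
    if (!$minmumdata && !$scopedata && !$timedata) {
        $state = "offline";
    } else {
        $state = "online";
    }
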
    $condition = array(
        "name" => trim($name),
        "minimum" => trim($minmumdata),
        "scope" => trim($scopedata),
        "deliverycharge" => trim($deliverydata),
        "order_time" => trim($timedata),
        "phone" => trim($phone),
        "address" => trim($address),
        "state" => trim($state),
        "lat" => trim($lat[1]),
        "lng" => trim($lng[1]),
        "star" => trim($ratedata)
    );
    $restaurant = Restaurant::Register($condition);
    $id = $restaurant['id'];
    $restaurantid = $id;
    $restaurantCategory = array(); // category ids of this restaurant, in page order
    foreach ($category[0] as $value) {
        preg_match('/<span class="name">[\s\S]+<\/span>/', $value, $categoryname);
        $categorydata = strip_tags($categoryname[0]);
        //echo ($categorydata);echo "<br/>";
        $condition = array(
            "name" => $categorydata,
            "restaurant" => $id,
        );
        $categoryid = Restaurant::categoryNew($id, $condition);
        $restaurantCategory[] = $categoryid;
    }
    $i = 0; // index of the category that the current dish group belongs to
    foreach ($dishesdata as $value) {
        $dish = explode("</li>", $value);
        // Only a group whose first item carries a name span counts as a dish group
        if (strpos($dish[0], '<span class="name">') !== false) {
            foreach ($dish as $item) {
                $dishinfo = explode('<span class="price_outer">', $item);
                $dishname = trim(strip_tags($dishinfo[0]));
                if ($dishname) {
                    $dishprice = isset($dishinfo[1]) ? strip_tags($dishinfo[1]) : "";
                    $condition = array(
                        "restaurant" => $id,
                        "category" => isset($restaurantCategory[$i]) ? $restaurantCategory[$i] : null,
                        "name" => trim($dishname),
                        "price" => trim($dishprice),
                        "state" => "online"
                    );
                    Restaurant::dishNew($condition);
                }
            }
            $i++;
        }
    }
    return $restaurantid;
}
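
The function above cuts the page apart with string splits and regular expressions, which breaks easily when the markup shifts. As an alternative (not what the original code does), the same name/category/dish structure can be read with PHP's DOM extension. This is only a sketch under assumptions: the sample markup is invented, only the class names (name, dishes, dish, is-section, price_outer) come from the patterns above, parseMenu is a hypothetical helper, and ext-dom must be enabled.

```php
<?php
// Hypothetical markup: only the class names are taken from the patterns used above.
$html = <<<HTML
<html><body id="restaurant">
<h1 class="name">Demo Kitchen</h1>
<ul class="dishes">
  <li class="dish is-section"><span class="name">Noodles</span></li>
  <li class="dish"><span class="name">Beef noodles</span><span class="price_outer">18</span></li>
  <li class="dish"><span class="name">Cold noodles</span><span class="price_outer">12</span></li>
  <li class="dish is-section"><span class="name">Drinks</span></li>
  <li class="dish"><span class="name">Soy milk</span><span class="price_outer">5</span></li>
</ul>
</body></html>
HTML;

// Hypothetical helper: returns the restaurant name and its dishes grouped by category.
function parseMenu($html)
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect libxml warnings instead of printing them
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $name = trim($xpath->evaluate('string(//h1[@class="name"])'));
    $menu = array();
    $current = null; // category that the following dishes belong to
    foreach ($xpath->query('//ul[@class="dishes"]/li') as $li) {
        $label = trim($xpath->evaluate('string(.//span[@class="name"])', $li));
        $classes = " " . $li->getAttribute("class") . " ";
        if (strpos($classes, " is-section ") !== false) {
            $current = $label; // a section item starts a new category
            $menu[$current] = array();
        } elseif ($current !== null && $label !== "") {
            $price = trim($xpath->evaluate('string(.//span[@class="price_outer"])', $li));
            $menu[$current][] = array("name" => $label, "price" => $price);
        }
    }
    return array("name" => $name, "menu" => $menu);
}

$result = parseMenu($html);
print_r($result);
```

Each dish is attached to the most recent section item, so categories and dishes stay aligned without a separate counter.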
