Scraping Pages with PHP

wtcsy

Article link: https://blog.csdn.net/wtcsy/article/details/8150738

Curl

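Scraping a page means fetching someone else's page, parsing it to extract the data you need, and storing the result in a database.

The simplest kind of scraping is a plain GET request that involves neither POST data nor a login.

The simplest possible scraper: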

	// Fetch a page with a plain GET request and return its HTML,
	// or false if the request fails.
	function getHtml($url){
		$opts = array(
			'http' => array(
				'method' => "GET",
				'header' => "Content-Type: text/html; charset=utf-8"
			)
		);
		$context = stream_context_create($opts);
		return file_get_contents($url, false, $context);
	}
	echo getHtml("http://www.baidu.com");
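
`file_get_contents` returns `false` when the request fails, and without a timeout a slow server can stall the script. Below is a minimal sketch of a more defensive variant. The name `fetchPage`, the 10-second default, and the user-agent string are illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: fetch a URL and return its body, or null on failure.
function fetchPage($url, $timeout = 10)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeout, // seconds to wait before giving up
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
        ),
    ));

    // @ suppresses the warning; failure is reported through the return value.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Run as: php fetch.php http://www.baidu.com
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $html = fetchPage($argv[1]);
    if ($html === null) {
        echo "fetch failed\n";
    } else {
        echo strlen($html) . " bytes fetched\n";
    }
}
```

Returning `null` instead of `false` is only a design choice here; the point is that the caller checks for failure before running any regex on the result.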

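What comes back is the page itself, that is, a mix of HTML, CSS, and JS.

To get the data you actually need, you generally have to analyze the page structure and then write regular expressions to extract it.

An example (fetching the "most clicked this week" list from the 163 finance channel, http://money.163.com/special/002526BH/rank.html):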

	function getHtml($url){
		$opts = array(
			'http' => array(
				'method' => "GET",
				'header' => "Content-Type: text/html; charset=utf-8"
			)
		);
		$context = stream_context_create($opts);
		return file_get_contents($url, false, $context);
	}

	$url  = 'http://money.163.com/special/002526BH/rank.html';
	$file = getHtml($url);
	// Step 1: capture the contents of every <div class="tabContents"> block.
	$reg = '/<div class="tabContents">([\s\S]*?)<\/div>/';

	if($file !== false && preg_match_all($reg,$file,$matches) && isset($matches[1][1])){
		$data = array();
		// $matches[1] holds the captured contents of every block that matched
		// the pattern; the second block (index 1) is the one used here.
		$news = $matches[1][1];
		// Step 2: pull the URL, title, and link text out of each <a> in that block.
		$reg = '/<a href="(?P<url>.*?)"\s*title="(?P<title>.*?)"\s*>(?P<subjet>.*?)<\/a>/';
		if(preg_match_all($reg,$news,$matches)){
			$len = count($matches["subjet"]);
			$subjet = $matches["subjet"];
			$url = $matches["url"];
			for($i=0;$i<$len;$i++){
				$data[$i] = array("subjet"=>$subjet[$i],"url"=>$url[$i]);
				print_r($data[$i]);
			}
		}
	}else{
		echo "Scrape failed";
	}

Scraped output:

Array
(
    [subjet] => 王石回应离婚传闻：我没有背叛家庭
    [url] => http://money.163.com/12/1029/10/8EVOHHRV00253B0H.html
)
Array
(
    [subjet] => 浙江楼市从领涨到领跌 温州炒房客被套转投实业
    [url] => http://money.163.com/12/1105/01/8FGR7QQK00253B0H.html
)
Array
(
    [subjet] => 《时代周刊》评选2012最佳发明：谷歌眼镜上榜
    [url] => http://money.163.com/12/1102/15/8FAM06V600253G87.html
)
Array
(
    [subjet] => 网络盛传万科董事长王石已离婚
    [url] => http://money.163.com/12/1029/00/8EUPDERR00253B0H.html
)
Array
(
    [subjet] => 中粮上海楼盘单价21.6万元/平 刷新汤臣一品纪
    [url] => http://money.163.com/12/1030/10/8F2CUD73002534NU.html
)
Array
(
    [subjet] => 中国近10年来涌现13位首富 有人已锒铛入狱
    [url] => http://money.163.com/12/1029/14/8F06QTD700253G87.html
)
Array
(
    [subjet] => 2012全球25大最佳国家品牌：瑞士排名第一
    [url] => http://money.163.com/12/1030/19/8F3CH5AJ00253B0H.html
)
........... (the remaining output is long and is omitted)
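
Because a live page changes over time, the two-step approach above (first isolate a container, then match the links inside it) is easier to experiment with on a fixed string. The sketch below uses made-up sample HTML and a hypothetical `extractLinks` helper; the markup only imitates the structure that the regexes above assume.

```php
<?php
// Hypothetical helper: return the links found inside the first "tabContents" block.
function extractLinks($html)
{
    $links = array();

    // Step 1: isolate the container (non-greedy, so it stops at the first </div>).
    if (!preg_match('/<div class="tabContents">([\s\S]*?)<\/div>/', $html, $block)) {
        return $links;
    }

    // Step 2: pull url/title/text out of every <a> inside that container.
    $reg = '/<a href="(?P<url>.*?)"\s*title="(?P<title>.*?)"\s*>(?P<text>.*?)<\/a>/';
    if (preg_match_all($reg, $block[1], $m, PREG_SET_ORDER)) {
        foreach ($m as $row) {
            $links[] = array('text' => $row['text'], 'url' => $row['url']);
        }
    }
    return $links;
}

// Made-up sample markup, not taken from the real page.
$sample = '<div class="tabContents">'
        . '<a href="http://example.com/a.html" title="First">First story</a>'
        . '<a href="http://example.com/b.html" title="Second">Second story</a>'
        . '</div>';

print_r(extractLinks($sample));
```

Note that the non-greedy container pattern stops at the first `</div>`, so a container with nested `<div>` elements would be cut short. That is a limit of the regex approach in general, not only of this sketch.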

The second kind of scraping is a bit more involved: you have to POST data, which calls for cURL.
Some introductory articles on cURL:

http://developer.51cto.com/art/200904/121739.htm

http://www.cmx8.cn/curl.html

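An example (scraping front-end articles from cnblogs; the request URL is http://www.cnblogs.com/mvc/AggSite/PostList.aspx):

Few pages are scraped successfully. cnblogs has a large number of blog skins and the regex below supports only a handful of them, so the failure rate is high. P.S. cnblogs content is really hard to scrape.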

	header("Content-type: text/html;charset=UTF-8");

	// POST the parameters given as a JSON string and return the raw response.
	function postHtml($json,$url){
		// JSON string -> associative array -> URL-encoded form body
		$curlPost = http_build_query(json_decode($json,true));
		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $url);
		curl_setopt($ch, CURLOPT_HEADER, 1);          // include the response headers in the result
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the response instead of printing it
		curl_setopt($ch, CURLOPT_POST, 1);
		curl_setopt($ch, CURLOPT_POSTFIELDS, $curlPost);
		$data = curl_exec($ch);
		curl_close($ch);
		return $data;
	}

	function getHtml($url){
		$opts = array(
			'http' => array(
				'method' => "GET",
				'header' => "Content-Type: text/html; charset=utf-8"
			)
		);
		$context = stream_context_create($opts);
		return file_get_contents($url, false, $context);
	}

	$url = 'http://www.cnblogs.com/mvc/AggSite/PostList.aspx';
	$json = '{"CategoryType":"TopSiteCategory","ParentCategoryId":0,"CategoryId":108703,"PageIndex":3,"ItemListActionName":"PostList"}';
	$html = postHtml($json,$url); // the raw list page
	$reg  = '/<a class="titlelnk" href="(?P<href>.*?)"[^>]+>(?P<subject>.*?)<\/a>[\s\S]*?<\/a>(?P<time>[\s\S]*?)<span class="article_comment">/';

	// The list page only provides the title, time, and URL of each post,
	// so each URL has to be fetched afterwards to get the article body.
	if(preg_match_all($reg,$html,$matches)){
		$data = array();
		foreach($matches["href"] as $i=>$href){
			$page = getHtml($href);
			$bodyReg = '/<div class="postText">(?P<content>[\s\S]*?)<\/div>\s*\r\s*<p class="postfoot">/';
			if(preg_match_all($bodyReg,$page,$rs)){
				echo $matches["href"][$i]."----------------success<br>";
				$data[$i] = array("href"=>$href,"subject"=>$matches["subject"][$i],"time"=>$matches["time"][$i],"body"=>$rs["content"][0]);
			}else{
				echo $matches["href"][$i]."----------------failed<br>";
			}
		}
	}
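
It can help to look at what `postHtml` actually sends. `json_decode($json, true)` turns the JSON string into an associative array, and `http_build_query` serializes that array as a URL-encoded form body. The snippet below only prints the body for the same parameters; it makes no request.

```php
<?php
$json = '{"CategoryType":"TopSiteCategory","ParentCategoryId":0,"CategoryId":108703,"PageIndex":3,"ItemListActionName":"PostList"}';

$params = json_decode($json, true);
if (!is_array($params)) {
    // json_decode returns null for malformed JSON
    throw new InvalidArgumentException('invalid JSON');
}

$body = http_build_query($params);
echo $body, "\n";
// CategoryType=TopSiteCategory&ParentCategoryId=0&CategoryId=108703&PageIndex=3&ItemListActionName=PostList
```

When `CURLOPT_POSTFIELDS` is given a string like this, cURL sends it as `application/x-www-form-urlencoded`; when it is given an array, the request is sent as `multipart/form-data` instead.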

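The third kind of scraping requires a login: some data is visible only after logging in, so the login has to be simulated.

HTTP is stateless, so the server relies on cookies to tell whether the client making a request is logged in. First, capture the login request with a packet sniffer to find the login URL and the POST parameters it expects. Then save the cookies from the login response to a file, and send those cookies along when requesting pages that sit behind the login. That is enough to simulate a logged-in user and fetch what you need. (P.S. some sites seem to have restrictions and return nothing no matter how you send the request.)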
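
To make the cookie round trip concrete, the sketch below does by hand what cURL's cookie file handling automates: it reads the `Set-Cookie` lines from a response header block and builds the `Cookie` header for the next request. The header text and the helper name `cookieHeaderFrom` are made up for illustration, and attributes such as `path` or `HttpOnly` are simply ignored.

```php
<?php
// Hypothetical helper: turn the Set-Cookie lines of a raw header block
// into the value of a Cookie request header.
function cookieHeaderFrom($rawHeaders)
{
    $pairs = array();
    // One match per Set-Cookie line; only the name=value part is kept.
    $reg = '/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi';
    if (preg_match_all($reg, $rawHeaders, $m, PREG_SET_ORDER)) {
        foreach ($m as $row) {
            $pairs[$row[1]] = $row[2]; // a later cookie with the same name wins
        }
    }

    $parts = array();
    foreach ($pairs as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('; ', $parts);
}

// Made-up response headers from an imaginary login request.
$headers = "HTTP/1.1 302 Found\r\n"
         . "Set-Cookie: SESSIONID=abc123; path=/; HttpOnly\r\n"
         . "Set-Cookie: lang=en; path=/\r\n"
         . "Location: /home\r\n\r\n";

echo 'Cookie: ' . cookieHeaderFrom($headers) . "\n";
// Cookie: SESSIONID=abc123; lang=en
```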

A login example:

<?php

$url = "http://passport.cnblogs.com/login.aspx";
$login = "xxxxx";     // account name
$password = "xxxxx";  // password

// Form fields taken from a login request captured with a packet sniffer.
$post_data = array(
	"__EVENTTARGET" => "",
	"__EVENTARGUMENT" => "",
	"__VIEWSTATE" => "/wEPDwULLTE1MzYzODg2NzZkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQtjaGtSZW1lbWJlcm1QYDyKKI9af4b67Mzq2xFaL9Bt",
	"__EVENTVALIDATION" => "/wEWBQLWwpqPDQLyj/OQAgK3jsrkBALR55GJDgKC3IeGDE1m7t2mGlasoP1Hd9hLaFoI2G05",
	"tbUserName" => $login,
	"tbPassword" => $password,
	"btnLogin" => "登++录",
	"txtReturnUrl" => "http://home.cnblogs.com/"
);
$cookie_jar = tempnam('./temp','cookie'); // file that stores the cookies

// Step 1: send the login request. Its only purpose is to obtain the cookies.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // keep the login response out of the output
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);  // save the cookies to the file
curl_exec($ch);
curl_close($ch);

// Step 2: use the cookies from the login to fetch a page that requires it.
$url = "http://home.cnblogs.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $url);            // spoof the Referer header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the data instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 0);                // 0 = omit response headers (the default), 1 = include them
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_jar);  // send the cookies stored in the file
$output2 = curl_exec($ch);                          // send the HTTP request
curl_close($ch);

echo $output2;
?>
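
The `__VIEWSTATE` and `__EVENTVALIDATION` values above were copied from a captured request, so they may stop working when the login page changes. One possible alternative is to fetch the login page first and read the values out of it, sketched here under the assumption that they appear as hidden `<input>` elements with double-quoted attributes. The helpers `tagAttr` and `hiddenFields` and the sample form are hypothetical.

```php
<?php
// Hypothetical helper: read one double-quoted attribute from a single tag.
function tagAttr($tag, $name)
{
    $reg = '/(?<![\w-])' . preg_quote($name, '/') . '\s*=\s*"([^"]*)"/i';
    if (preg_match($reg, $tag, $m)) {
        return html_entity_decode($m[1], ENT_QUOTES);
    }
    return null;
}

// Hypothetical helper: collect name => value for every hidden <input>.
function hiddenFields($html)
{
    $fields = array();
    if (preg_match_all('/<input\b[^>]*>/i', $html, $tags)) {
        foreach ($tags[0] as $tag) {
            $type = tagAttr($tag, 'type');
            $name = tagAttr($tag, 'name');
            if ($name !== null && $type !== null && strtolower($type) === 'hidden') {
                $fields[$name] = (string) tagAttr($tag, 'value');
            }
        }
    }
    return $fields;
}

// Made-up form markup standing in for the fetched login page.
$form = '<form method="post" action="login.aspx">'
      . '<input type="hidden" name="__VIEWSTATE" value="SAMPLE_STATE" />'
      . '<input type="hidden" name="__EVENTVALIDATION" value="SAMPLE_VALIDATION" />'
      . '<input type="text" name="tbUserName" />'
      . '</form>';

// Start from the hidden fields, then add the visible ones.
$post_data = hiddenFields($form);
$post_data['tbUserName'] = 'xxxxx';
$post_data['tbPassword'] = 'xxxxx';
print_r($post_data);
```

In a real run the `$form` string would be the HTML returned by a GET request to the login URL, made with the same cookie file so that the server sees one continuous session.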
