Fetching the Content of a Redirected Page with PHP cURL

The original address of the page to fetch is:

http://store.xx.com/shop/view_shop.htm?user_number_id=692420117

The initial cURL request:

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
$ret = curl_exec($ch); 
$info = curl_getinfo($ch); 

The response was just a 301.

Opening the address directly showed that it redirects to an HTTPS address, so the request was changed to:

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
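curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // do not verify the peer certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE); // do not check the host name in the certificate
curl_setopt($ch, CURLOPT_MAXREDIRS, 20);         // maximum number of redirects to follow
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects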
$ret = curl_exec($ch); 
$info = curl_getinfo($ch); 

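This time the page was printed straight to the output, so the result could not be processed: without CURLOPT_RETURNTRANSFER, curl_exec() outputs the response and returns only true or false.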

The final version:

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url); 
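curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // do not verify the peer certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE); // do not check the host name in the certificate
curl_setopt($ch, CURLOPT_AUTOREFERER, true);     // set the Referer header automatically when following a redirect
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the data from curl_exec() instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects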
$ret = curl_exec($ch); 
$info = curl_getinfo($ch); 
curl_close($ch);

/**
From here on, just work with $ret
**/

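PS: If the page goes through many redirects, you can fetch only the headers first, following the redirects until the status code is 200, read the last location address from $ret, and then fetch the content of that final address only, which can be faster. Alternatively, simply read $info['url'], which holds the final URL after all redirects. The options for a header-only request are: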

curl_setopt($ch, CURLOPT_HEADER, TRUE); // include the response headers in the result
curl_setopt($ch, CURLOPT_NOBODY, TRUE); // do not fetch the body
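
The two-step approach from the PS could be sketched roughly as follows. The function names resolveFinalUrl() and fetchBody() are hypothetical, not part of the original code, and the sketch leaves certificate verification at its defaults instead of disabling it. It also assumes the server answers header-only (HEAD) requests the same way it answers normal ones, which is not guaranteed.

```php
<?php
// Hypothetical helper: follow redirects with header-only requests
// and return the final URL, or null if the chain does not end in a 200.
function resolveFinalUrl(string $url, int $maxRedirects = 20): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // header-only request, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not print anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow each Location header
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);
    $ok = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // same value as $info['url']
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($ok !== false && $status === 200) ? $finalUrl : null;
}

// Hypothetical helper: fetch the body of a single URL without following redirects.
function fetchBody(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (performs network requests, so it is left commented out):
// $final = resolveFinalUrl($url);
// $ret = $final !== null ? fetchBody($final) : null;
```

Whether this is actually faster than a single request with CURLOPT_FOLLOWLOCATION depends on how large the intermediate redirect bodies are; for small ones the extra round trip may cost more than it saves.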

$info looks like this:

Array ( 
[url] => https://yy.tmall.com/shop/view_shop.htm?user_number_id=xxx
[content_type] => text/html;charset=GBK 
[http_code] => 200 
[header_size] => 3394 
[request_size] => 396 
[filetime] => -1 
[ssl_verify_result] => 20 
[redirect_count] => 3 
[total_time] => 0.640332 
[namelookup_time] => 0.051478 
[connect_time] => 0.20533 
[pretransfer_time] => 0.470629 
[size_upload] => 0 
...
[redirect_time_us] => 444840 
[starttransfer_time_us] => 639978 
[total_time_us] => 640332 
)
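
As a small illustration of reading these fields, the sketch below condenses a curl_getinfo() array into one line. The helper name summarizeTransfer() is made up for this example, and the sample values are copied from the output above.

```php
<?php
// Hypothetical helper: one-line summary of a curl_getinfo() array.
function summarizeTransfer(array $info): string
{
    return sprintf(
        '%d after %d redirect(s) in %.3fs: %s',
        $info['http_code'] ?? 0,       // final status code
        $info['redirect_count'] ?? 0,  // number of redirects followed
        $info['total_time'] ?? 0.0,    // total transfer time in seconds
        $info['url'] ?? ''             // final URL after redirects
    );
}

// Values copied from the sample $info shown above.
$info = [
    'url' => 'https://yy.tmall.com/shop/view_shop.htm?user_number_id=xxx',
    'http_code' => 200,
    'redirect_count' => 3,
    'total_time' => 0.640332,
];
echo summarizeTransfer($info), PHP_EOL;
```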

$ret looks like this; every redirect hop is clearly recorded:

HTTP/1.1 301 Moved Permanently
Date: Tue, 02 Jun 2020 10:17:02 GMT 
Content-Type: text/html 
Content-Length: 357 
Connection: keep-alive 
ufe-result: A6 
Location: https://store.xx.com/shop/view_shop.htm?user_number_id=xxx 
Server: Tengine/Aserver 
EagleEye-TraceId: 0b5218fa15910930221194685ed5bf 
Timing-Allow-Origin: * 

HTTP/2 302 
date: Tue, 02 Jun 2020 10:17:02 GMT 
content-type: text/html; charset=GBK 
content-length: 337 
location: https://shopXX.tb.com/shop/view_shop.htm?user_number_id=xxx 
ufe-result: A6 
set-cookie: cookie2=12012d6326c0ad6f4809f28835591ba9; Domain=.taobao.com; Path=/; HttpOnly
Timing-Allow-Origin: * 

HTTP/2 302 
date: Tue, 02 Jun 2020 10:17:02 GMT 
content-type: text/html; charset=GBK 
content-length: 337 
location: https://xx.tm.com/shop/view_shop.htm?user_number_id=xx 
ufe-result: A6 
url-hash: http://shopXX.tb.com/index.htm 
set-cookie: thw=cn; Path=/; Domain=.taobao.com; Expires=Wed, 02-Jun-21 10:17:02 GMT;
timing-allow-origin: * 

HTTP/2 200 
date: Tue, 02 Jun 2020 10:17:02 GMT 
content-type: text/html;charset=GBK 
vary: Accept-Encoding 
ufe-result: A6 
url-hash: http://xx.tm.com/index.htm 
eagleeye-traceid: 0b5205c015910930226692698e6f4d  
timing-allow-origin: * 
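
A header dump like the one above can be turned into a list of hops with a little string handling. The following is only a sketch: parseRedirectChain() is a hypothetical helper, and it assumes that header blocks are separated by a blank line and that each block starts with an HTTP status line. When $ret also contains a body, passing substr($ret, 0, $info['header_size']) keeps the body out of the parser.

```php
<?php
// Hypothetical helper: split a raw header dump into one entry per response.
function parseRedirectChain(string $rawHeaders): array
{
    $chain = [];
    // Header blocks are separated by a blank line.
    $blocks = preg_split('/\r?\n\s*\r?\n/', trim($rawHeaders));
    foreach ($blocks as $block) {
        // Status line, e.g. "HTTP/1.1 301 Moved Permanently" or "HTTP/2 302".
        if (!preg_match('#^HTTP/[\d.]+\s+(\d{3})#', $block, $status)) {
            continue; // not a header block
        }
        // Header names are case-insensitive ("Location" vs "location").
        $location = preg_match('/^location:\s*(\S+)/mi', $block, $m) ? $m[1] : null;
        $chain[] = ['status' => (int) $status[1], 'location' => $location];
    }
    return $chain;
}

// Shortened sample based on the dump above.
$sample = "HTTP/1.1 301 Moved Permanently\r\n"
    . "Location: https://store.xx.com/shop/view_shop.htm?user_number_id=xxx\r\n\r\n"
    . "HTTP/2 302 \r\n"
    . "location: https://shopXX.tb.com/shop/view_shop.htm?user_number_id=xxx\r\n\r\n"
    . "HTTP/2 200 \r\n"
    . "content-type: text/html;charset=GBK\r\n\r\n";

$chain = parseRedirectChain($sample);
foreach ($chain as $hop) {
    echo $hop['status'], ' ', $hop['location'] ?? '(no redirect)', PHP_EOL;
}
```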

 

 