The original URL of the page to scrape was:
http://store.xx.com/shop/view_shop.htm?user_number_id=692420117
The initial curl request:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
$ret = curl_exec($ch);
$info = curl_getinfo($ch);
The result was just a 301, because curl does not follow redirects by default.
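A minimal sketch for detecting, from the curl_getinfo() array, that a request stopped at a redirect instead of reaching the page. The helper name describe_response is hypothetical, and the sketch assumes the redirect_url key is present in the array (PHP 5.3.7 or later):

```php
<?php
// Hypothetical helper: summarize a curl_getinfo() array.
function describe_response(array $info): string
{
    $code = (int)($info['http_code'] ?? 0);
    if ($code >= 300 && $code < 400) {
        // redirect_url holds the target curl would have followed.
        $target = $info['redirect_url'] ?? '';
        return $target !== ''
            ? "redirect ($code) to $target"
            : "redirect ($code), target unknown";
    }
    return "status $code";
}

// Example with a hand-written array shaped like curl_getinfo() output.
echo describe_response([
    'http_code'    => 301,
    'redirect_url' => 'https://store.xx.com/shop/view_shop.htm?user_number_id=xxx',
]), "\n";
```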
Opening the URL directly in a browser showed that it redirects to an https address, so the code was changed to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 20); // maximum number of redirects to follow
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$ret = curl_exec($ch);
$info = curl_getinfo($ch);
This time the page was echoed straight to the output, so the returned result could not be processed.
The final version:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_AUTOREFERER, true); // set the Referer header automatically when following redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the data from curl_exec() instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$ret = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
/**
Now work with $ret as needed
**/
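The final version above can be wrapped in a function with basic error handling. This is only a sketch: the name fetch_following_redirects, the redirect limit, and the timeout values are assumptions, and unlike the snippet above it leaves certificate verification at curl's defaults.

```php
<?php
// Hypothetical wrapper around the final version above.
// Returns [body, info] on success, throws on a transport-level error.
function fetch_following_redirects(string $url, int $maxRedirects = 20): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow redirects
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many hops
        CURLOPT_AUTOREFERER    => true,          // set Referer on each hop
        CURLOPT_CONNECTTIMEOUT => 10,            // assumed value, in seconds
        CURLOPT_TIMEOUT        => 30,            // assumed value, in seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; curl_error() says why.
        $message = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("curl failed: $message");
    }
    $info = curl_getinfo($ch);
    curl_close($ch);
    return [$body, $info];
}
```

It would be called as `[$ret, $info] = fetch_following_redirects($url);`, after which `$info['http_code']` and `$info['url']` describe the final hop.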
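PS: If the target page sits behind many redirects, you can first request only the headers and follow them until the status code is 200, read the Location values from $ret, and then fetch the body of the last Location only, which can be faster. Alternatively, just read $info['url'], which holds the final URL.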
curl_setopt($ch, CURLOPT_HEADER, TRUE); // include the response headers in the result
curl_setopt($ch, CURLOPT_NOBODY, TRUE); // skip the body (curl sends a HEAD request)
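Reading the Location values out of a header-only $ret could look roughly like this. The helper name parse_redirect_chain is hypothetical, and the sample string is a shortened, hand-written header dump, not real output:

```php
<?php
// Hypothetical helper: split a header dump that contains one response per
// hop and return each hop's status code and Location header (if any).
function parse_redirect_chain(string $rawHeaders): array
{
    $hops = [];
    // Normalize line endings, then walk the dump line by line.
    $lines = preg_split('/\r\n|\n|\r/', $rawHeaders);
    foreach ($lines as $line) {
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            // A status line starts a new hop.
            $hops[] = ['status' => (int)$m[1], 'location' => null];
        } elseif ($hops && preg_match('/^location:\s*(.+)$/i', $line, $m)) {
            // Header names are case-insensitive (HTTP/2 sends them lowercase).
            $hops[count($hops) - 1]['location'] = trim($m[1]);
        }
    }
    return $hops;
}

$sample = "HTTP/1.1 301 Moved Permanently\r\n"
        . "Location: https://store.xx.com/shop/view_shop.htm?user_number_id=xxx\r\n"
        . "\r\n"
        . "HTTP/2 302\r\n"
        . "location: https://shopXX.tb.com/shop/view_shop.htm?user_number_id=xxx\r\n"
        . "\r\n"
        . "HTTP/2 200\r\n"
        . "content-type: text/html;charset=GBK\r\n"
        . "\r\n";

$hops = parse_redirect_chain($sample);
$last = end($hops);
echo count($hops), " hops, final status ", $last['status'], "\n";
```

The last hop with a non-null location gives the URL to fetch with a normal request.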
The contents of $info:
Array (
[url] => https://yy.tmall.com/shop/view_shop.htm?user_number_id=xxx
[content_type] => text/html;charset=GBK
[http_code] => 200
[header_size] => 3394
[request_size] => 396
[filetime] => -1
[ssl_verify_result] => 20
[redirect_count] => 3
[total_time] => 0.640332
[namelookup_time] => 0.051478
[connect_time] => 0.20533
[pretransfer_time] => 0.470629
[size_upload] => 0
...
[redirect_time_us] => 444840
[starttransfer_time_us] => 639978
[total_time_us] => 640332
)
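The header_size value reported by curl is the total size of all headers received, so when CURLOPT_HEADER is enabled and the body is also fetched, it can be used to separate the two. A small sketch, with the hypothetical helper name split_response and a hand-written sample response:

```php
<?php
// Hypothetical helper: separate headers from body in a response captured
// with CURLOPT_HEADER enabled, using the header_size reported by curl.
function split_response(string $raw, int $headerSize): array
{
    return [
        'headers' => (string)substr($raw, 0, $headerSize),
        'body'    => (string)substr($raw, $headerSize),
    ];
}

// Hand-written sample; with real data, pass $ret and $info['header_size'].
$headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$raw     = $headers . "<html>ok</html>";
$parts   = split_response($raw, strlen($headers));
echo $parts['body'], "\n";
```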
The contents of $ret; every redirect hop is clearly recorded:
HTTP/1.1 301 Moved Permanently
Date: Tue, 02 Jun 2020 10:17:02 GMT
Content-Type: text/html
Content-Length: 357
Connection: keep-alive
ufe-result: A6
Location: https://store.xx.com/shop/view_shop.htm?user_number_id=xxx
Server: Tengine/Aserver
EagleEye-TraceId: 0b5218fa15910930221194685ed5bf
Timing-Allow-Origin: *
HTTP/2 302
date: Tue, 02 Jun 2020 10:17:02 GMT
content-type: text/html; charset=GBK
content-length: 337
location: https://shopXX.tb.com/shop/view_shop.htm?user_number_id=xxx
ufe-result: A6
set-cookie: cookie2=12012d6326c0ad6f4809f28835591ba9; Domain=.taobao.com; Path=/; HttpOnly
Timing-Allow-Origin: *
HTTP/2 302
date: Tue, 02 Jun 2020 10:17:02 GMT
content-type: text/html; charset=GBK
content-length: 337
location: https://xx.tm.com/shop/view_shop.htm?user_number_id=xx
ufe-result: A6
url-hash: http://shopXX.tb.com/index.htm
set-cookie: thw=cn; Path=/; Domain=.taobao.com; Expires=Wed, 02-Jun-21 10:17:02 GMT;
timing-allow-origin: *
HTTP/2 200
date: Tue, 02 Jun 2020 10:17:02 GMT
content-type: text/html;charset=GBK
vary: Accept-Encoding
ufe-result: A6
url-hash: http://xx.tm.com/index.htm
eagleeye-traceid: 0b5205c015910930226692698e6f4d
timing-allow-origin: *