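I just tested cURL and found that it can fetch HTTPS pages, which makes scraping page content much simpler. As an example, let's scrape the table of contents of the PHP tutorial on w3school: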
1. Write a cURL request function
This function is called in the steps that follow.
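function getContent($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);     // return the body as a string instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3);     // give up connecting after 3 seconds
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);     // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);          // but no more than 3 of them
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // HTTPS: skip certificate verification (insecure, testing only)
    $html = curl_exec($ch);                          // page content, or false on failure
    curl_close($ch);
    return $html;
}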
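getContent() returns false when the request fails, and nothing tells you why. Below is a minimal sketch of a variant that reports the cURL error instead. The name getContentOrFail is hypothetical (it is not part of the original post), and this sketch deliberately leaves certificate verification at its default (enabled), which assumes the server presents a valid certificate.

```php
<?php
// Sketch only: getContentOrFail is a hypothetical name, not from the original post.
// It sends the same kind of request as getContent(), but throws on failure.
function getContentOrFail(string $url): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body as a string
        CURLOPT_CONNECTTIMEOUT => 3,    // seconds allowed for connecting
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 3,    // but no more than 3 of them
        // CURLOPT_SSL_VERIFYPEER is left at its default (verification on).
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        // curl_error() must be read before the handle is closed.
        $message = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $message");
    }

    curl_close($ch);
    return $html;
}
```

Wrapping the call in try/catch then lets you print the reason (timeout, DNS failure, certificate problem) rather than running the regex on false.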
2. Call the function
Pass in one argument: the full URL of the page.
$cnt = getContent("https://www.w3school.com.cn/php/index.asp");
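3. Match with a regular expression
You can write the pattern however you like. Mine starts at <li><a and ends at </a></li>, and matches whatever lies in between. The first .* matches the attributes inside the opening a tag; it is not captured and only appears as part of the full match ($0, stored in $match[0]), which we don't need. The (.*) between the two angle brackets > and < is captured as $1 (stored in $match[1]), and that is the actual link text.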
preg_match_all('/<li><a .*>(.*)<\/a><\/li>/', $cnt, $match);
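To see what ends up in $match[0] and $match[1], here is a small sketch that runs the same pattern on a made-up HTML fragment. The sample markup is hypothetical and only stands in for the downloaded page. It also shows one limitation: because .* is greedy, two list items on the same line collapse into a single match, which a stricter pattern (an assumption of mine, not the original author's) avoids.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$sample = "<ul>\n"
        . "<li><a href=\"/php/php_intro.asp\" title=\"intro\">PHP Intro</a></li>\n"
        . "<li><a href=\"/php/php_install.asp\">PHP Install</a></li>\n"
        . "</ul>\n";

// Same pattern as in the post: one list item per line works fine,
// because . does not match a newline by default.
preg_match_all('/<li><a .*>(.*)<\/a><\/li>/', $sample, $match);
print_r($match[0]); // full matches, including the tags
print_r($match[1]); // only the link text

// Two items on ONE line: the greedy .* swallows the first item.
$oneLine = '<li><a href="a">A</a></li><li><a href="b">B</a></li>';
preg_match_all('/<li><a .*>(.*)<\/a><\/li>/', $oneLine, $greedy);

// Stricter variant: [^>]* stays inside the opening tag, (.*?) stops early.
preg_match_all('/<li><a [^>]*>(.*?)<\/a><\/li>/', $oneLine, $lazy);
print_r($greedy[1]); // only "B"
print_r($lazy[1]);   // "A" and "B"
```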
4. Output the result:
print_r($match[1]);
Array
(
[0] => PHP 简介
[1] => PHP 安装
[2] => PHP 语法
[3] => PHP 变量
[4] => PHP Echo / Print
[5] => PHP 数据类型
[6] => PHP 字符串函数
[7] => PHP 常量
[8] => PHP 运算符
[9] => PHP If…Else
[10] => PHP Switch
[11] => PHP While 循环
[12] => PHP For 循环
[13] => PHP 函数
[14] => PHP 数组
[15] => PHP 数组排序
[16] => PHP 超全局
[17] => PHP 表单处理
[18] => PHP 表单验证
[19] => PHP 表单必填
[20] => PHP 表单 URL/E-mail
[21] => PHP 表单完成
[22] => PHP 多维数组
[23] => PHP 日期
[24] => PHP Include
[25] => PHP 文件
[26] => PHP 文件打开/读取
[27] => PHP 文件创建/写入
[28] => PHP 文件上传
[29] => PHP Cookies
[30] => PHP Sessions
[31] => PHP E-mail
[32] => PHP 安全 E-mail
[33] => PHP Error
[34] => PHP Exception
[35] => PHP Filter
[36] => MySQL 简介
[37] => MySQL Connect
[38] => MySQL Create
[39] => MySQL Insert
[40] => MySQL Select
[41] => MySQL Where
[42] => MySQL Order By
[43] => MySQL Update
[44] => MySQL Delete
[45] => PHP ODBC
[46] => XML Expat Parser
[47] => XML DOM
[48] => XML SimpleXML
[49] => AJAX 简介
[50] => XMLHttpRequest
[51] => AJAX Suggest
[52] => AJAX XML
[53] => AJAX Database
[54] => AJAX responseXML
[55] => AJAX Live Search
[56] => AJAX RSS Reader
[57] => AJAX Poll
[58] => PHP Array
[59] => PHP Calendar
[60] => PHP Date
[61] => PHP Directory
[62] => PHP Error
[63] => PHP Filesystem
[64] => PHP Filter
[65] => PHP FTP
[66] => PHP HTTP
[67] => PHP LibXML
[68] => PHP Mail
[69] => PHP Math
[70] => PHP MySQL
[71] => PHP MySQLi
[72] => PHP SimpleXML
[73] => PHP String
[74] => PHP XML
[75] => PHP Zip
[76] => PHP 杂项
[77] => PHP 时区
[78] => PHP 测验
[79] => 网站构建
[80] => 万维网联盟 (W3C)
[81] => 浏览器信息
[82] => 网站品质
[83] => 语义网
[84] => 职业规划
[85] => 网站主机
[86] => Array
[87] => Calendar
[88] => Date
[89] => Directory
[90] => Error
[91] => Filesystem
[92] => Filter
[93] => FTP
[94] => HTTP
[95] => LibXML
[96] => Mail
[97] => Math
[98] => MySQL
[99] => SimpleXML
[100] => String
[101] => XML Parser
[102] => Zip
[103] => 杂项函数
)
The scraped result is correct! Page scraping with PHP is now working.
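Garbled characters:
w3school.com.cn is encoded in GBK rather than the more common UTF-8. If the output shows up garbled, send a GBK Content-Type header before any other output: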
header("Content-type:text/html;charset=gbk");
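An alternative to changing the response header is to convert the fetched bytes to UTF-8 before using them. This is a minimal sketch, assuming the iconv extension is available (PHP enables it by default); toUtf8 is a hypothetical helper name, not something from the original post.

```php
<?php
// Sketch only: toUtf8 is a hypothetical helper, not from the original post.
// It converts a GBK-encoded string to UTF-8 using iconv().
function toUtf8(string $gbk): string
{
    $utf8 = iconv('GBK', 'UTF-8', $gbk);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid GBK');
    }
    return $utf8;
}

// The two characters 你好 as GBK bytes.
$gbkBytes = "\xC4\xE3\xBA\xC3";
echo toUtf8($gbkBytes), "\n"; // prints 你好 on a UTF-8 terminal
```

With the page content converted this way, the script can keep sending UTF-8 and the matched titles can be mixed freely with other UTF-8 text.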
The curl extension
Use phpinfo to check whether the curl extension is loaded. On my setup it is.
phpinfo();
If the curl extension is missing, you need to enable it. In php.ini, find this line (PHP 7.2 and later write it as ;extension=curl):
;extension=php_curl.dll
and change it to
extension=php_curl.dll
Then restart Apache for the change to take effect.
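As a quicker check than reading through the phpinfo() page, the extension can also be tested from code. A minimal sketch; curlStatus is a hypothetical function name used only for illustration.

```php
<?php
// Sketch only: curlStatus is a hypothetical helper, not from the original post.
// It reports whether the curl extension is loaded in the running PHP.
function curlStatus(): string
{
    if (extension_loaded('curl')) {
        $info = curl_version(); // array describing the linked libcurl
        return 'curl is enabled, libcurl version: ' . $info['version'];
    }
    // php_ini_loaded_file() returns false when no php.ini was loaded.
    $ini = php_ini_loaded_file() ?: '(none)';
    return 'curl is not enabled; php.ini in use: ' . $ini;
}

echo curlStatus(), "\n";
```

When the extension is missing, the reported php.ini path is the file to edit as described above.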