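Over the past couple of days a friend asked me for the PHP source code that uses cURL to fetch grades from the academic affairs office, for use in WeChat Official Account development. So, reluctantly, I am sharing it here. This is a hands-on example of simulating a login and scraping data with cURL in PHP: fetching student grades from the academic affairs site of Shenyang Institute of Technology.
First, logging in to the academic affairs site requires a CAPTCHA. We locate the CAPTCHA URL, http://218.61.108.163/ACTIONVALIDATERANDOMPICTURE.APPPROCESS, and request it to obtain the session cookie. Here is the main code, index.php: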
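<?php
// Request the CAPTCHA URL once, only to obtain a session cookie.
$ch = curl_init("http://218.61.108.163/ACTIONVALIDATERANDOMPICTURE.APPPROCESS");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 1);         // include the response headers in the result
$str = curl_exec($ch);
curl_close($ch);

// Split headers from body; the limit of 2 protects a binary body that contains a blank line.
list($header, $body) = explode("\r\n\r\n", $str, 2);
// Capture the session ID up to the next semicolon.
preg_match("/JSESSIONID=([^;]+)/i", $header, $matches);
$cookie = $matches[1];
?>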
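The cookie extraction step can be tried in isolation, without touching the network. The following is a minimal sketch: the helper name `extract_jsessionid` and the sample response are made up for illustration, and the real server's headers may look different.

```php
<?php
// Hypothetical helper: pull the JSESSIONID value out of a raw HTTP response
// (headers followed by body, as returned when CURLOPT_HEADER is enabled).
function extract_jsessionid(string $rawResponse): ?string
{
    // Headers end at the first blank line; limit 2 keeps the body in one piece.
    $parts = explode("\r\n\r\n", $rawResponse, 2);
    $header = $parts[0];

    // Match the Set-Cookie line and stop at the first semicolon or line end.
    if (preg_match('/^Set-Cookie:\s*JSESSIONID=([^;\r\n]+)/mi', $header, $m)) {
        return $m[1];
    }
    return null; // no session cookie in this response
}

// Made-up sample response, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: JSESSIONID=ABC123DEF456; Path=/\r\n"
    . "Content-Type: image/jpeg\r\n\r\n"
    . "binary-image-bytes";

$sessionId = extract_jsessionid($sample);
var_dump($sessionId);
```

Returning `null` when no cookie is found makes the failure explicit, instead of reading an undefined `$matches[1]`.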
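The login has to be sent with that same cookie, so we create a page called api.php that simulates the cookie and posts to the login URL that returns the grades, http://218.61.108.163/ACTIONLOGON.APPPROCESS. It reads the values submitted by the form on the home page, index.php: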
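<?php
if (isset($_POST['code'])) {
    $jwid  = $_POST['xuehao']; // student ID
    $jwpwd = $_POST['mima'];   // password
    $code  = $_POST['code'];   // CAPTCHA text typed by the user
    $ck    = $_POST['ck'];     // JSESSIONID obtained in index.php

    // Build the form body; http_build_query URL-encodes every value.
    $data = http_build_query(array(
        'WebUserNO' => $jwid,
        'Password'  => $jwpwd,
        'Agnomen'   => $code,
        'submit.x'  => 23,
        'submit.y'  => 9,
        'applicant' => 'ACTIONQUERYSTUDENTSCORE',
    ));

    $ch = curl_init("http://218.61.108.163/ACTIONLOGON.APPPROCESS");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIE, "JSESSIONID={$ck}"); // reuse the CAPTCHA's session
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
    $str = curl_exec($ch); // HTML of the grades page
    curl_close($ch);
}
?>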
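One detail worth checking on its own is how the POST body is encoded, because a password containing `&` or `=` would corrupt a body assembled by plain string interpolation. Below is a small sketch using PHP's built-in `http_build_query`; the function name `build_login_body` and the sample credentials are hypothetical, while the field names are the ones used in api.php.

```php
<?php
// Hypothetical helper: build the URL-encoded login body from user input.
function build_login_body(string $studentId, string $password, string $captcha): string
{
    return http_build_query([
        'WebUserNO' => $studentId,
        'Password'  => $password,
        'Agnomen'   => $captcha,
        'submit.x'  => 23,
        'submit.y'  => 9,
        'applicant' => 'ACTIONQUERYSTUDENTSCORE',
    ]);
}

// Made-up credentials; note the special characters in the password.
$body = build_login_body('20130001', 'p&ss=word', '7k2m');
echo $body, "\n";
```

The `&` and `=` in the password come out as `%26` and `%3D`, so the server still sees six separate fields.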
On the login page we can see that logging in requires a CAPTCHA. So we create a page called code.php that fetches the CAPTCHA image:
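<?php
// Fetch the CAPTCHA image within the session identified by ?ck=...
$ch = curl_init("http://218.61.108.163/ACTIONVALIDATERANDOMPICTURE.APPPROCESS");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIE, "JSESSIONID={$_GET['ck']}"); // same session as the login
$str = curl_exec($ch);
curl_close($ch);
echo $str; // pass the raw image bytes through to the browser
?>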
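To show how the three pages fit together, here is a rough sketch of the form that index.php could print. The field names `xuehao`, `mima`, `code` and `ck` are the ones api.php and code.php read; the function name `render_login_form` and the HTML layout are assumptions, not the original markup.

```php
<?php
// Hypothetical helper: render a login form bound to one session ID.
function render_login_form(string $sessionId): string
{
    // Escape the ID separately for the URL and for the HTML attribute.
    $src = 'code.php?ck=' . rawurlencode($sessionId);
    $ck  = htmlspecialchars($sessionId, ENT_QUOTES, 'UTF-8');

    return '<form method="post" action="api.php">' . "\n"
        . '  <input type="text" name="xuehao" placeholder="Student ID">' . "\n"
        . '  <input type="password" name="mima" placeholder="Password">' . "\n"
        . '  <img src="' . $src . '" alt="CAPTCHA">' . "\n"
        . '  <input type="text" name="code" placeholder="CAPTCHA">' . "\n"
        . '  <input type="hidden" name="ck" value="' . $ck . '">' . "\n"
        . '  <input type="submit" value="Query grades">' . "\n"
        . '</form>' . "\n";
}

// Made-up session ID, for illustration only.
$html = render_login_form('ABC123');
echo $html;
```

The image and the hidden field carry the same session ID, so the CAPTCHA the user reads belongs to the session that api.php later logs in with.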
The last step: take the returned page and use regular expressions to extract the data and lay it out.
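<?php
// Convert an HTML table into a two-dimensional array of cell texts.
function get_td_array($table) {
    // Drop the opening <table>, <tr> and <td> tags.
    $table = preg_replace("/<table[^>]*?>/is", "", $table);
    $table = preg_replace("/<tr[^>]*?>/si", "", $table);
    $table = preg_replace("/<td[^>]*?>/si", "", $table);
    // Turn the closing tags into row and cell delimiters.
    $table = str_replace("</tr>", "{tr}", $table);
    $table = str_replace("</td>", "{td}", $table);
    // Remove all remaining HTML tags.
    $table = preg_replace("'<[/!]*?[^<>]*?>'si", "", $table);
    // Remove line breaks together with the whitespace that follows them.
    $table = preg_replace('/[\r\n]\s+/', "", $table);
    // Remove non-breaking-space entities and plain spaces.
    $table = str_replace("&nbsp;", "", $table);
    $table = str_replace(" ", "", $table);

    $table = explode('{tr}', $table);
    array_pop($table); // discard whatever follows the last </tr>
    $td_array = array();
    foreach ($table as $key => $tr) {
        $td = explode('{td}', $tr);
        array_pop($td); // discard whatever follows the last </td>
        $td_array[] = $td;
    }
    return $td_array;
}
?>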
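For comparison, the same extraction can be written by capturing rows and cells directly instead of deleting tags step by step. This is an alternative sketch, not the code from the download: `table_to_rows` is a hypothetical name and the sample table is made up, since the markup of the real grades page is not shown here.

```php
<?php
// Hypothetical alternative: capture each row, then each cell inside it.
function table_to_rows(string $html): array
{
    $rows = [];
    preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $html, $trMatches);
    foreach ($trMatches[1] as $trHtml) {
        preg_match_all('/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/is', $trHtml, $tdMatches);
        $cells = [];
        foreach ($tdMatches[1] as $cellHtml) {
            // Strip nested tags, turn &nbsp; into a space, collapse whitespace.
            $text = str_replace('&nbsp;', ' ', strip_tags($cellHtml));
            $cells[] = trim(preg_replace('/\s+/', ' ', $text));
        }
        if ($cells) {
            $rows[] = $cells; // skip rows that contain no cells
        }
    }
    return $rows;
}

// Made-up sample table, for illustration only.
$html = '<table><tr><th>Course</th><th>Score</th></tr>'
      . '<tr><td> Calculus&nbsp;</td><td><b>85</b></td></tr></table>';
$rows = table_to_rows($html);
print_r($rows);
```

Unlike the version that removes every space, this one keeps spaces inside a cell, which matters if course names contain more than one word.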
The complete code can be downloaded from http://pan.baidu.com/share/link?shareid=3722188112&uk=1496266064. Password: a3eh