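Today I had a sudden idea: I wanted to try scraping information from a web page, and the teacher listing on my college's website became my target.
A quick plug while I'm at it: the faculty page of my college (College of Informatics, South China Agricultural University) is at http://info.scau.edu.cn/nav-contact.asp
I created a new PHP project, HtmlCatcher, containing three files: Teacher.php (the teacher entity), index.php (the entry point), and view.php (the page).
The code for Teacher.php is as follows: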
<?php
/**
 * Teacher entity class
 */
class Teacher {
    public $position;
    public $name;
    public $phone;
    public $email;
    public $office;

    function __construct() {
    }

    /**
     * Builds a Teacher from a table row.
     * Returns null if the row does not have exactly five cells.
     */
    public static function parse(DOMElement $row) {
        $cells_list = $row->getElementsByTagName('td');
        if ($cells_list->length != 5) {
            return null;
        }
        $teacher = new Teacher();
        foreach ($cells_list as $cell) {
            $div_list = $cell->getElementsByTagName('div');
            $div = $div_list->item(0);
            if ($div === null) {
                // No <div> in this cell: skip it rather than call a method on null
                continue;
            }
            // The class attribute of the <div> tells us which field this cell holds
            $class = $div->getAttribute('class');
            if (strpos($class, 'ofc') !== false) {
                $teacher->office = $div->nodeValue;
            }
            else if (strpos($class, 'name') !== false) {
                $teacher->name = $div->nodeValue;
            }
            else if (strpos($class, 'type') !== false) {
                $teacher->position = $div->nodeValue;
            }
            else if (strpos($class, 'max-col') !== false) {
                $teacher->email = $div->nodeValue;
            }
            else {
                // Any other class is treated as the phone number
                $teacher->phone = $div->nodeValue;
            }
        }
        return $teacher;
    }
}
?>
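The teacher entity class holds a teacher's position, name, office phone number, email, and office.
The parse method takes a DOM node (a table row) and picks out the teacher's information from it. If the row has the expected shape (exactly five cells), it returns a Teacher; otherwise it returns null.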
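To see the row check in isolation, here is a minimal sketch that runs the same cell-count and class-name logic against an inline HTML fragment. The markup in $html is hypothetical (the real page may be structured differently), the helper name parseRow is made up for this example, and the nullable return type assumes PHP 7.1 or later. Only the class names ofc, name, type and max-col come from Teacher.php.

```php
<?php
// Hypothetical markup: one header row (2 cells) and one teacher row (5 cells).
$html = <<<HTML
<table>
  <tr><td>Header</td><td>Header</td></tr>
  <tr>
    <td><div class="ofc">Room 101</div></td>
    <td><div class="phone">12345678</div></td>
    <td><div class="name">Alice</div></td>
    <td><div class="type">Lecturer</div></td>
    <td><div class="max-col">alice@example.com</div></td>
  </tr>
</table>
HTML;

/**
 * Hypothetical helper mirroring Teacher::parse(), but returning an array.
 * Returns null for rows that do not have exactly five cells.
 */
function parseRow(DOMElement $row): ?array
{
    $cells = $row->getElementsByTagName('td');
    if ($cells->length != 5) {
        return null;
    }
    $info = [];
    foreach ($cells as $cell) {
        $div = $cell->getElementsByTagName('div')->item(0);
        if ($div === null) {
            continue; // cell without a <div>: nothing to read
        }
        $class = $div->getAttribute('class');
        $value = trim($div->nodeValue);
        if (strpos($class, 'ofc') !== false) {
            $info['office'] = $value;
        } elseif (strpos($class, 'name') !== false) {
            $info['name'] = $value;
        } elseif (strpos($class, 'type') !== false) {
            $info['position'] = $value;
        } elseif (strpos($class, 'max-col') !== false) {
            $info['email'] = $value;
        } else {
            $info['phone'] = $value; // any other class is treated as the phone
        }
    }
    return $info;
}

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep only the rows that parse successfully; the header row is skipped.
$teachers = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $parsed = parseRow($row);
    if ($parsed !== null) {
        $teachers[] = $parsed;
    }
}
print_r($teachers);
```

The two-cell header row is rejected by the length check, so only the five-cell row ends up in $teachers.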
The code for index.php is as follows:
<?php
define('ROOT', dirname(__FILE__));
define('DS', DIRECTORY_SEPARATOR);
require_once(ROOT.DS."Teacher.php");

$url = "http://info.scau.edu.cn/nav-contact.asp";
// Optional filter: ?name=... keeps only the teacher with that name
$teacherName = isset($_GET['name']) ? $_GET['name'] : null;

$htmlDoc = new DOMDocument;
// Real-world HTML is rarely valid; keep parser warnings out of the output
libxml_use_internal_errors(true);
$htmlDoc->loadHTMLFile($url);
libxml_clear_errors();
$htmlDoc->normalizeDocument();

$arr = array();
$tables_list = $htmlDoc->getElementsByTagName('table');
$table = $tables_list->item(0); // the first table on the page
if ($table !== null) {
    $rows_list = $table->getElementsByTagName('tr');
    foreach ($rows_list as $row) {
        $teacher = Teacher::parse($row);
        if (is_null($teacher)) {
            continue; // not a teacher row
        }
        if (is_null($teacherName)) {
            array_push($arr, $teacher);
        }
        else if ($teacher->name == $teacherName) {
            array_push($arr, $teacher);
            break;
        }
    }
}
include_once(ROOT.DS."view.php");
?>
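The script first loads the page's DOM and takes the first table element. It then collects every tr element under that table (getElementsByTagName searches all descendants, so there is no need to step through tbody first) and parses each row into a Teacher, building the array $arr. If a name parameter is passed in the query string, only the teacher with that name is kept.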
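As an alternative to walking the node lists by hand, the same row selection can be written as a single XPath query. This is a swapped-in technique, not what index.php does, and the HTML string below is a hypothetical stand-in for the downloaded page. One difference to be aware of: count(td) counts only direct td children of a row, whereas getElementsByTagName('td') counts all descendants.

```php
<?php
// Hypothetical stand-in for the downloaded page: two tables,
// the first containing one five-cell row among other rows.
$html = '<html><body><table>'
      . '<tr><th>a</th></tr>'
      . '<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>'
      . '<tr><td>only one cell</td></tr>'
      . '</table><table>'
      . '<tr><td>x</td><td>x</td><td>x</td><td>x</td><td>x</td></tr>'
      . '</table></body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// (//table)[1] is the first table in document order;
// count(td) = 5 keeps only rows with exactly five direct <td> children.
$rows = $xpath->query('(//table)[1]//tr[count(td) = 5]');

echo $rows->length, "\n";
foreach ($rows as $row) {
    echo trim($row->textContent), "\n";
}
```

Each node in $rows is a DOMElement, so it could be passed straight to Teacher::parse().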
The code for view.php:
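<html>
<head>
<title>College of Informatics Faculty Information</title>
<meta charset="utf-8">
</head>
<body>
<?php
foreach ($arr as $teacher)
{
    $fields = array($teacher->office, $teacher->phone, $teacher->name,
                    $teacher->position, $teacher->email);
    // Escape the scraped text before echoing it into the page
    echo htmlspecialchars(implode(" ", $fields), ENT_QUOTES, 'UTF-8');
    echo "<br>";
}
?>
</body>
</html>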
This displays the information of each Teacher in $arr.
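Since the displayed values come from someone else's page, it is worth escaping them before they are echoed into HTML. A minimal sketch with PHP's built-in htmlspecialchars, using a made-up value in place of a scraped field:

```php
<?php
// Hypothetical value, as it might come back from a scraped cell.
$name = '<b>Alice</b> & Bob';

// Convert <, >, &, and quotes to HTML entities so the browser shows them as text.
$safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo $safe, "<br>\n";
```

Without the escaping step, any markup contained in the scraped text would be interpreted by the browser rather than shown as plain text.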
The result is displayed as follows.
That's all.