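I wrote another crawler using PHP's cURL functions, in two versions: one single-threaded and one multi-threaded. The difference between them is quite large. (Strictly speaking, the "multi-threaded" version uses the curl_multi functions, which drive many transfers concurrently from a single PHP thread.)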
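Single-threaded version:
<?php
$t1 = microtime(true); // start time
$path = dirname(__FILE__); // absolute path of the directory containing this script
$handle = fopen("$path/id.txt", "r"); // open the student ID list (one ID per line)
// Create the img directory that will hold the photos
if (!file_exists("$path/img")) {
    mkdir("$path/img");
}
if ($handle) {
    $ch = curl_init();
    // Fetch the photo for each student ID in turn
    while (($line = fgets($handle)) !== false) {
        $buffer = trim($line); // strip the newline and any surrounding whitespace
        if ($buffer === '') {
            continue; // skip blank lines, e.g. a trailing newline at the end of the file
        }
        $imgName = $buffer . ".jpg";
        $file = fopen("$path/img/$imgName", "wb");
        $url = "http://xssw.hnu.cn/Web/SWPT/W_swpt_readImage.aspx?sno=" . $buffer;
        curl_setopt($ch, CURLOPT_URL, $url);
        // CURLOPT_FILE writes the response body straight to the file,
        // so CURLOPT_RETURNTRANSFER is not needed here
        curl_setopt($ch, CURLOPT_FILE, $file);
        curl_exec($ch);
        fclose($file); // close each image file as soon as its download finishes
    }
    curl_close($ch);
    fclose($handle);
    echo "Done! ";
}
$t2 = microtime(true);
echo 'Used ' . round($t2 - $t1, 3) . ' seconds.'; // total running time
?>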
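Multi-threaded version:
<?php
$t1 = microtime(true); // start time
$path = dirname(__FILE__); // absolute path of the directory containing this script
$handle = fopen("$path/id.txt", "r"); // open the student ID list (one ID per line)
// Create the img directory that will hold the photos
if (!file_exists("$path/img")) {
    mkdir("$path/img");
}
if ($handle) {
    $urls = array();
    $files = array();
    $conn = array();
    $i = 0;
    // Build the list of output files and URLs, one pair per student ID
    while (($line = fgets($handle)) !== false) {
        $buffer = trim($line); // strip the newline and any surrounding whitespace
        if ($buffer === '') {
            continue; // skip blank lines
        }
        $imgName = $buffer . ".jpg";
        $file = fopen("$path/img/$imgName", "wb");
        $files[$i] = $file;
        $url = "http://xssw.hnu.cn/Web/SWPT/W_swpt_readImage.aspx?sno=" . $buffer;
        $urls[$i] = $url;
        $i++;
    }
    $mh = curl_multi_init();
    // Set up one easy handle per URL and attach it to the multi handle
    foreach ($urls as $i => $url) {
        $conn[$i] = curl_init($url);
        curl_setopt($conn[$i], CURLOPT_HEADER, 0);
        curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 5);
        // CURLOPT_FILE sends the body to the file; CURLOPT_RETURNTRANSFER is not needed
        curl_setopt($conn[$i], CURLOPT_FILE, $files[$i]);
        curl_multi_add_handle($mh, $conn[$i]);
    }
    // Run all transfers
    $active = 0;
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 1.0); // wait for socket activity instead of spinning the CPU
        }
    } while ($active && $status === CURLM_OK);
    // Clean up
    foreach ($urls as $i => $url) {
        curl_multi_remove_handle($mh, $conn[$i]);
        curl_close($conn[$i]);
        fclose($files[$i]);
    }
    curl_multi_close($mh);
    fclose($handle);
}
$t2 = microtime(true);
echo "Done! ";
echo 'Used ' . round($t2 - $t1, 3) . ' seconds.'; // total running time
?>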
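The multi-threaded script attaches every handle at once, so a 2000-line ID list means 2000 simultaneous requests and 2000 open files. Below is a minimal sketch of one alternative: keep only a fixed number of transfers in flight and start the next one whenever one finishes. Whether this reduces the damaged downloads was not measured here. The function name `fetchAll`, the `$jobs` layout (URL => destination path), the default cap of 20, and the 30-second timeout are all hypothetical choices for illustration.

```php
<?php
/**
 * Download each URL to its destination path with at most $maxConcurrent
 * transfers in flight. Returns URL => error message for every failed job.
 * Illustrative helper; not part of the original scripts.
 */
function fetchAll(array $jobs, int $maxConcurrent = 20): array
{
    $urls = array_keys($jobs);
    $next = 0;      // index of the next job to start
    $open = [];     // URL => open file handle, for transfers in flight
    $failed = [];   // URL => error message
    $mh = curl_multi_init();

    do {
        // Top up the window until the cap is reached or no jobs remain.
        while (count($open) < $maxConcurrent && $next < count($urls)) {
            $url = $urls[$next++];
            $fp = @fopen($jobs[$url], 'wb');
            if ($fp === false) {
                $failed[$url] = 'cannot open output file';
                continue;
            }
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_PRIVATE, $url);       // remember which job this is
            curl_setopt($ch, CURLOPT_FILE, $fp);           // write the body to the file
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);         // assumed overall limit
            curl_setopt($ch, CURLOPT_FAILONERROR, true);   // treat HTTP >= 400 as failure
            curl_multi_add_handle($mh, $ch);
            $open[$url] = $fp;
        }

        curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 1.0); // wait for activity instead of busy-looping
        }

        // Collect finished transfers, freeing slots in the window.
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            fclose($open[$url]);
            unset($open[$url]);
            if ($info['result'] !== CURLE_OK) {
                $failed[$url] = curl_strerror($info['result']);
                @unlink($jobs[$url]); // do not leave a partial image on disk
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
    } while ($active || $open || $next < count($urls));

    curl_multi_close($mh);
    return $failed;
}
```

To use it with the scripts above, build `$jobs` from id.txt, mapping each image URL to its path under img/, and call `fetchAll($jobs)`. The returned array lists the URLs worth retrying.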
Runtime comparison:
Single-threaded version (200 images):
Used 3.028 seconds.
Used 2.767 seconds.
Used 2.688 seconds.
Used 2.672 seconds.
Multi-threaded version (200 images):
Used 1.314 seconds.
Used 0.686 seconds.
Used 0.597 seconds.
Used 0.563 seconds.
Single-threaded version (2000 images):
Used 35.708 seconds.
Used 27.85 seconds.
Multi-threaded version (2000 images):
Used 7.233 seconds.
Used 7.241 seconds.
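To put a number on the gap, the short sketch below averages the runs listed above and divides the single-threaded mean by the multi-threaded mean. The helper name `mean` and the array layout are illustrative only. With so few runs this is a rough indication rather than a benchmark.

```php
<?php
// Timings in seconds, copied from the runs listed above.
$runs = [
    200 => [
        'single' => [3.028, 2.767, 2.688, 2.672],
        'multi'  => [1.314, 0.686, 0.597, 0.563],
    ],
    2000 => [
        'single' => [35.708, 27.85],
        'multi'  => [7.233, 7.241],
    ],
];

// Arithmetic mean of a non-empty list of numbers.
function mean(array $xs): float
{
    return array_sum($xs) / count($xs);
}

$speedups = [];
foreach ($runs as $images => $t) {
    $single = mean($t['single']);
    $multi = mean($t['multi']);
    $speedups[$images] = $single / $multi;
    printf(
        "%d images: single %.3f s, multi %.3f s, speedup %.2fx\n",
        $images, $single, $multi, $speedups[$images]
    );
}
```

On these figures the multi-threaded version comes out roughly 3.5 times faster for 200 images and roughly 4.4 times faster for 2000.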
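Compared with the single-threaded version, the multi-threaded one makes full use of the bandwidth (the download speed came close to the 10 MB/s cap). Both versions share one serious problem, though: when the number of images is large, quite a few of the downloaded images turn out to be corrupted.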
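One way to find the damaged images after a run, sketched under the assumption that the server returns JPEG data: a complete JPEG file starts with the bytes FF D8 and normally ends with FF D9, so a truncated or empty download usually fails this check. The function names `looksLikeCompleteJpeg` and `findDamagedImages` are hypothetical. The check only inspects the first and last two bytes, so it will not catch every kind of corruption.

```php
<?php
// Return true if the file begins with the JPEG start marker (FF D8)
// and ends with the JPEG end marker (FF D9).
function looksLikeCompleteJpeg(string $file): bool
{
    $size = @filesize($file);
    if ($size === false || $size < 4) {
        return false; // missing, empty, or too short to hold both markers
    }
    $fp = @fopen($file, 'rb');
    if ($fp === false) {
        return false;
    }
    $head = fread($fp, 2);
    fseek($fp, -2, SEEK_END);
    $tail = fread($fp, 2);
    fclose($fp);
    return $head === "\xFF\xD8" && $tail === "\xFF\xD9";
}

// Scan a directory of <id>.jpg files and return the IDs whose image
// fails the marker check, in the order glob() lists the files.
function findDamagedImages(string $dir): array
{
    $damaged = [];
    foreach (glob($dir . '/*.jpg') ?: [] as $file) {
        if (!looksLikeCompleteJpeg($file)) {
            $damaged[] = basename($file, '.jpg');
        }
    }
    return $damaged;
}
```

Calling `findDamagedImages("$path/img")` after a crawl gives a list of student IDs that could be written to a new ID file and fetched again.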