Advanced PHP: Fetching Web Pages in Parallel with a Single Thread
I have been digging into PHP for many years and wanted to write some of it down. Not knowing where to start, I'll begin with a few PHP techniques that even developers with years of experience may not know about.
You may rarely need them in practice, but when you do, they are exactly the kind of thing that shows your skill.
Here is the problem: a user on a Windows virtual server needs a page that, when opened, fetches the titles of about ten websites in parallel and displays them.
It's an unusual requirement, but a requirement is a requirement, and it has to be solved.
Fetching the sites serially would work, but the wait is far too long, so my first thought was to fetch them in parallel with cURL. If you are not familiar with that approach, see the earlier article:
PHP多线程(四) 内部多线程 (PHP Multithreading, Part 4: internal multithreading)
In the end, though, it turned out that the virtual server has no cURL extension, which was frustrating. So I changed tack: get the effect of multiple threads out of a single thread. Anyone with some network-programming background will know the concept of I/O multiplexing, and PHP supports it natively, in the core, with no extension required.
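The idea of I/O multiplexing can be shown in miniature before the real fetcher. This is a toy demonstration of my own (not from the original setup), using a local socket pair so it runs without any network access; `stream_socket_pair()` requires a Unix-like system or PHP 5.3+ on Windows:

```php
<?php
// Two connected in-process socket endpoints: whatever is written to
// one end becomes readable on the other.
list($a, $b) = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);

fwrite($a, "ping");          // make the other end readable

$reads = array($b);          // the set of streams we want to read from
$w = null;
$e = null;

// stream_select() blocks until at least one stream in $reads is readable
// (or the 5-second timeout expires) and narrows $reads down to those streams.
$n = stream_select($reads, $w, $e, 5);

$msg = fread($reads[0], 4);
echo $msg;                   // prints "ping"
```

One call watches any number of streams at once; that is the whole trick behind doing "parallel" downloads in a single thread.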
Even programmers with many years of experience may not know PHP's stream functions well. PHP wraps compressed-file access, ordinary file access, and TCP connections all as streams, so reading a local file and reading a network resource look exactly the same. With that said, the idea should be clear, so here is the code.
The code is fairly rough; if you want to use it for real, there are still details to take care of.
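Before the full listing, the "everything is a stream" point can be seen in a tiny sketch of my own. The same `fopen`/`fwrite`/`read` calls work on any wrapper; here an in-memory stream stands in for a file on disk:

```php
<?php
// php://temp is an in-memory stream, but the API is identical to a disk file.
$local = fopen("php://temp", "r+");
fwrite($local, "hello");
rewind($local);
$data = stream_get_contents($local);
fclose($local);
echo $data;    // prints "hello"

// An HTTP URL is just another wrapper -- no cURL extension needed
// (requires allow_url_fopen=On in php.ini):
// $page = fopen("http://www.example.com/", "r");
```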
```php
<?php
function http_get_open($url)
{
    $url = parse_url($url);
    if (empty($url['host'])) {
        return false;
    }
    $host = $url['host'];
    if (empty($url['path'])) {
        $url['path'] = "/";
    }
    $get = $url['path'] . (isset($url['query']) ? "?" . $url['query'] : "");
    $fp = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)<br />\n";
        return false;
    }
    // Send the request immediately; the response is read later via select().
    fwrite($fp, "GET {$get} HTTP/1.0\r\nHost: {$host}\r\nAccept: */*\r\n\r\n");
    return $fp;
}

function http_multi_get($urls)
{
    $result = array();
    $fps = array();
    foreach ($urls as $key => $url) {
        $fp = http_get_open($url);
        if ($fp === false) {
            $result[$key] = false;
        } else {
            $result[$key] = '';
            $fps[$key] = $fp;
        }
    }
    while (1) {
        $reads = $fps;
        if (empty($reads)) {
            break;    // every connection has finished
        }
        $w = null;
        $e = null;
        if (($num = stream_select($reads, $w, $e, 30)) === false) {
            echo "error";
            return false;
        } else if ($num > 0) { // at least one stream is readable
            foreach ($reads as $value) {
                $key = array_search($value, $fps);
                if (!feof($value)) {
                    $result[$key] .= fread($value, 8192);
                } else {
                    unset($fps[$key]);   // this connection is done
                }
            }
        } else { // timed out
            echo "timeout";
            return false;
        }
    }
    foreach ($result as $key => &$value) {
        if ($value) {
            // Split the raw response into [headers, body].
            $value = explode("\r\n\r\n", $value, 2);
        }
    }
    return $result;
}

$urls = array();
$urls[] = "http://www.qq.com";
$urls[] = "http://www.sina.com.cn";
$urls[] = "http://www.sohu.com";
$urls[] = "http://www.360.cn";

// parallel fetch
$t1 = microtime(true);
$result = http_multi_get($urls);
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);

// serial fetch, for comparison
$t1 = microtime(true);
foreach ($urls as $value) {
    file_get_contents($value);
}
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);
?>
```
In the final run, the parallel version took roughly half the time of the serial one. Sina turned out to be very slow, around 2.5s, and basically dragged the whole batch down with it, while 360 needed only about 0.2s.
If all the sites responded at similar speeds, and more of them were fetched in parallel, the speedup factor would be even larger.
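The original requirement was to display each site's title, so as a finishing touch, here is a small helper of my own (`extract_title` is a hypothetical name, not part of the listing above) that pulls the `<title>` out of the body half returned by `http_multi_get()`:

```php
<?php
// Extract the contents of the first <title> tag, or return false.
function extract_title($html)
{
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m)) {
        return trim($m[1]);
    }
    return false;
}

// Usage with the parallel fetcher ($r[1] is the response body):
// foreach (http_multi_get($urls) as $key => $r) {
//     if ($r !== false) echo extract_title($r[1]), "\n";
// }

echo extract_title("<html><head><title> QQ </title></head></html>"); // prints "QQ"
```

A real page may declare a non-UTF-8 charset in its headers, so for production use the title would still need charset conversion before display.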