[Recommended] Rolling cURL: PHP concurrency best practices (product price monitoring, the curl_multi function family)

Download the full rolling-curl library:

https://github.com/LionsAd/rolling-curl


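Concurrent data fetching with curl_multi (see the detailed introduction to the curl_multi functions)

Implementing cURL multi batch processing and avoiding the high CPU load that cURL multi can cause

Using curl_multi in PHP to request several URLs at the same time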

<?php
/*
authored by Josh Fraser (www.joshfraser.com)
released under Apache License 2.0

Maintained by Alexander Makarov, http://rmcreative.ru/

$Id$
*/

// a little example that fetches a bunch of sites in parallel and echos the page title and response info for each request

function request_callback($response, $info) {
	// parse the page title out of the returned HTML; fall back to a placeholder when none is found
	$title = "(no title)";
	if (preg_match("~<title>(.*?)</title>~i", $response, $out)) {
		$title = $out[1];
	}
	echo "<b>$title</b><br />";
	print_r($info);
	echo "<hr>";
}

require("RollingCurl.php");

// single curl request
$rc = new RollingCurl("request_callback");
$rc->request("http://www.msn.com");
$rc->execute();

// two requests on one object: execute() takes the rolling (curl_multi) path
$rc = new RollingCurl("request_callback");
// request() takes the URL as its first argument
$rc->request("http://www.google.com");
$rc->request("http://www.baidu.com");
$rc->execute();

echo "<hr>";

// top 20 sites according to alexa (11/5/09)
$urls = array("http://www.google.com",
              "http://www.facebook.com",
              "http://www.yahoo.com",
              "http://www.youtube.com",
              "http://www.live.com",
              "http://www.wikipedia.com",
              "http://www.blogger.com",
              "http://www.msn.com",
              "http://www.baidu.com",
              "http://www.yahoo.co.jp",
              "http://www.myspace.com",
              "http://www.qq.com",
              "http://www.google.co.in",
              "http://www.twitter.com",
              "http://www.google.de",
              "http://www.microsoft.com",
              "http://www.google.cn",
              "http://www.sina.com.cn",
              "http://www.wordpress.com",
              "http://www.google.co.uk");

$rc = new RollingCurl("request_callback");
$rc->window_size = 20;
foreach ($urls as $url) {
    $request = new RollingCurlRequest($url);
    // add() takes a RollingCurlRequest object
    $rc->add($request);
}
$rc->execute();





//Example 1 - Hello world:

// an array of URL's to fetch
$urls = array("http://www.google.com",
              "http://www.facebook.com",
              "http://www.yahoo.com");

// a function that will process the returned responses
function request_callback($response, $info) {
	// parse the page title out of the returned HTML; fall back to a placeholder when none is found
	$title = "(no title)";
	if (preg_match("~<title>(.*?)</title>~i", $response, $out)) {
		$title = $out[1];
	}
	echo "<b>$title</b><br />";
	print_r($info);
	echo "<hr>";
}

// create a new RollingCurl object and pass it the name of your custom callback function
$rc = new RollingCurl("request_callback");
// the window size determines how many simultaneous requests to allow
$rc->window_size = 20;
foreach ($urls as $url) {
    // add each request to the RollingCurl object
    $request = new RollingCurlRequest($url);
    $rc->add($request);
}
$rc->execute();





//Example 2 - Setting custom options:

//Set custom options for EVERY request:
$rc = new RollingCurl("request_callback");
$rc->options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
$rc->execute();

//Set custom options for A SINGLE request:
$rc = new RollingCurl("request_callback");
$request = new RollingCurlRequest($url);
$request->options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
$rc->add($request);
$rc->execute();




//Example 3 - Shortcuts: the get() method

$rc = new RollingCurl("request_callback");
// get() works like request(), with the method fixed to GET
$rc->get("http://www.google.com");
$rc->get("http://www.yahoo.com");
$rc->execute();

//Example 4 - Class callbacks:
class MyInfoCollector {
    private $rc;

    function __construct(){
        $this->rc = new RollingCurl(array($this, 'processPage'));
    }

    function processPage($response, $info){
      //...
    }

    function run($urls){
        foreach ($urls as $url){
            $request = new RollingCurlRequest($url);
            $this->rc->add($request);
        }
        $this->rc->execute();
    }
}

$collector = new MyInfoCollector();
$collector->run(array(
    'http://google.com/',
    'http://yahoo.com/'
));



<?php 
/*
Authored by Josh Fraser (www.joshfraser.com)
Released under Apache License 2.0

Maintained by Alexander Makarov, http://rmcreative.ru/

$Id$
*/

/**
 * Class that represent a single curl request
 */
class RollingCurlRequest {
	public $url = false;
	public $method = 'GET';
	public $post_data = null;
	public $headers = null;
	public $options = null;

    /**
     * @param string $url
     * @param string $method
     * @param  $post_data
     * @param  $headers
     * @param  $options
     * @return void
     */
    function __construct($url, $method = "GET", $post_data = null, $headers = null, $options = null) {
        $this->url = $url;
        $this->method = $method;
        $this->post_data = $post_data;
        $this->headers = $headers;
        $this->options = $options;
    }

    /**
     * @return void
     */
    public function __destruct() {
        unset($this->url, $this->method, $this->post_data, $this->headers, $this->options);
    }
}

/**
 * RollingCurl custom exception
 */
class RollingCurlException extends Exception {}

/**
 * Class that holds a rolling queue of curl requests.
 *
 * @throws RollingCurlException
 */
class RollingCurl {
    /**
     * @var int
     *
     * Window size is the max number of simultaneous connections allowed.
	 * 
     * REMEMBER TO RESPECT THE SERVERS:
     * Sending too many requests at one time can easily be perceived
     * as a DOS attack. Increase this window_size if you are making requests
     * to multiple servers or have permission from the receiving server admins.
     */
    private $window_size = 5;

    /**
     * @var float
     *
     * Timeout is the timeout used for curl_multi_select.
     */
    private $timeout = 10;

    /**
     * @var string|array
     *
     * Callback function to be applied to each result.
     */
    private $callback;

    /**
     * @var array
     *
     * Set your base options that you want to be used with EVERY request.
     */
    protected $options = array(
		CURLOPT_SSL_VERIFYPEER => 0,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_TIMEOUT => 30
	);
	
    /**
     * @var array
     */
    private $headers = array();

    /**
     * @var Request[]
     *
     * The request queue
     */
    private $requests = array();

    /**
     * @var RequestMap[]
     *
     * Maps handles to request indexes
     */
    private $requestMap = array();

    /**
     * @param  $callback
     * Callback function to be applied to each result.
     *
     * Can be specified as 'my_callback_function'
     * or array($object, 'my_callback_method').
     *
     * Function should take three parameters: $response, $info, $request.
     * $response is response body, $info is additional curl info.
     * $request is the original request
     *
     * @return void
     */
	function __construct($callback = null) {
        $this->callback = $callback;
    }

    /** Magic getter: returns a property by name
     * @param string $name
     * @return mixed
     */
    public function __get($name) {
        return (isset($this->{$name})) ? $this->{$name} : null;
    }

    /** Magic setter: headers and options are merged into the existing values
     * @param string $name
     * @param mixed $value
     * @return bool
     */
    public function __set($name, $value){
        // append the base options & headers
        if ($name == "options" || $name == "headers") {
            $this->{$name} = $value + $this->{$name};
        } else {
            $this->{$name} = $value;
        }
        return true;
    }

    /**
     * Add a request to the request queue
     *
     * @param Request $request
     * @return bool
     */
    public function add($request) {
         $this->requests[] = $request;
         return true;
    }

    /**
     * Create new Request and add it to the request queue
     *
     * @param string $url
     * @param string $method
     * @param  $post_data
     * @param  $headers
     * @param  $options
     * @return bool
     */
    public function request($url, $method = "GET", $post_data = null, $headers = null, $options = null) {
         $this->requests[] = new RollingCurlRequest($url, $method, $post_data, $headers, $options);
         return true;
    }

    /**
     * Perform GET request
     *
     * @param string $url
     * @param  $headers
     * @param  $options
     * @return bool
     */
    public function get($url, $headers = null, $options = null) {
        return $this->request($url, "GET", null, $headers, $options);
    }

    /**
     * Perform POST request
     *
     * @param string $url
     * @param  $post_data
     * @param  $headers
     * @param  $options
     * @return bool
     */
    public function post($url, $post_data = null, $headers = null, $options = null) {
        return $this->request($url, "POST", $post_data, $headers, $options);
    }

    /**
     * Execute the curl
     *
     * @param int $window_size Max number of simultaneous connections
     * @return string|bool
     */
    public function execute($window_size = null) {
        // rolling curl window must always be greater than 1
        if (sizeof($this->requests) == 1) {
            return $this->single_curl();
        } else {
            // start the rolling curl. window_size is the max number of simultaneous connections
            return $this->rolling_curl($window_size);
        }
    }

    /**
     * Performs a single curl request
     *
     * @access private
     * @return string
     */
    private function single_curl() {
        $ch = curl_init();
        $request = array_shift($this->requests);
        $options = $this->get_options($request);
        curl_setopt_array($ch, $options);
        $output = curl_exec($ch);
        $info = curl_getinfo($ch);

        // it's not necessary to set a callback for one-off requests
        if ($this->callback) {
            $callback = $this->callback;
            if (is_callable($callback)) {
                call_user_func($callback, $output, $info, $request);
            }
            return true;
        }
        // no callback was set: hand the raw response back to the caller
        return $output;
    }

    /**
     * Performs multiple curl requests
     *
     * @access private
     * @throws RollingCurlException
     * @param int $window_size Max number of simultaneous connections
     * @return bool
     */
    private function rolling_curl($window_size = null) {
        if ($window_size)
            $this->window_size = $window_size;

        // make sure the rolling window isn't greater than the # of urls
        if (sizeof($this->requests) < $this->window_size)
            $this->window_size = sizeof($this->requests);
        
        if ($this->window_size < 2) {
            throw new RollingCurlException("Window size must be greater than 1");
        }

        $master = curl_multi_init();        

        // start the first batch of requests
        for ($i = 0; $i < $this->window_size; $i++) {
            $ch = curl_init();

            $options = $this->get_options($this->requests[$i]);

            curl_setopt_array($ch,$options);
            curl_multi_add_handle($master, $ch);

            // Add to our request map. The (int) cast yields a unique id for both
            // resource handles (PHP < 8) and CurlHandle objects (PHP 8+); the
            // latter cannot be cast to string.
            $key = (int) $ch;
            $this->requestMap[$key] = $i;
        }

        do {
            while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
            if($execrun != CURLM_OK)
                break;
            // a request was just completed -- find out which one
            while($done = curl_multi_info_read($master)) {

                // get the info and content returned on the request
                $info = curl_getinfo($done['handle']);
                $output = curl_multi_getcontent($done['handle']);

                // send the return values to the callback function.
                $callback = $this->callback;
                if (is_callable($callback)){
                    $key = (int) $done['handle'];
                    $request = $this->requests[$this->requestMap[$key]];
                    unset($this->requestMap[$key]);
                    call_user_func($callback, $output, $info, $request);
                }

                // start a new request (it's important to do this before removing the old one)
                if (isset($this->requests[$i])) {
                    $ch = curl_init();
                    $options = $this->get_options($this->requests[$i]);
                    curl_setopt_array($ch,$options);
                    curl_multi_add_handle($master, $ch);

                    // Add to our request map
                    $key = (int) $ch;
                    $this->requestMap[$key] = $i;
                    $i++;
                }

                // remove the curl handle that just completed
                curl_multi_remove_handle($master, $done['handle']);

            }

	    // Block for data in / output; error handling is done by curl_multi_exec
	    if ($running)
                curl_multi_select($master, $this->timeout);

        } while ($running);
        curl_multi_close($master);
        return true;
    }


    /**
     * Helper function to set up a new request by setting the appropriate options
     *
     * @access private
     * @param Request $request
     * @return array
     */
    private function get_options($request) {
        // options for this entire curl object
        $options = $this->__get('options');
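		if (ini_get('safe_mode') == 'Off' || !ini_get('safe_mode')) {
			/*
				CURLOPT_FOLLOWLOCATION (follow redirects on the fetched URL) can only
				be used when safe_mode is off and open_basedir is not set, and
				relaxing those settings weakens server security.
				I ran into this when scraping pages behind 301/302 redirects; the fix
				was to move from shared hosting to a VPS and configure the environment myself.
			*/

            $options[CURLOPT_FOLLOWLOCATION] = 1;			// follow redirects on the fetched URL
			$options[CURLOPT_MAXREDIRS] = 5;					// cap the number of redirects followed
        }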
        $headers = $this->__get('headers');

		// append custom options for this specific request
		if ($request->options) {
            $options = $request->options + $options;
        }

		// set the request URL
        $options[CURLOPT_URL] = $request->url;

        // posting data w/ this request?
        if ($request->post_data) {
            $options[CURLOPT_POST] = 1;
            $options[CURLOPT_POSTFIELDS] = $request->post_data;
        }
        if ($headers) {
            $options[CURLOPT_HEADER] = 0;
            $options[CURLOPT_HTTPHEADER] = $headers;
        }

        return $options;
    }

    /**
     * @return void
     */
    public function __destruct() {
        unset($this->window_size, $this->callback, $this->options, $this->headers, $this->requests);
	}
}
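
The listing above merges options with PHP's array union operator ($value + $this->{$name} in __set, and $request->options + $options in get_options), so entries in the left operand take precedence. The following is a minimal sketch of that precedence, not part of the library: the helper name merge_curl_options and the sample values are assumptions made for illustration. It also shows why array_merge would not be a drop-in substitute: it renumbers integer keys, and the CURLOPT_* constants are integers.

```php
<?php
// Illustrative helper (hypothetical name): per-request options override the base options.
function merge_curl_options(array $base, array $override)
{
    // With "+", a key that exists in the left operand is kept and the
    // right operand only contributes keys that are missing on the left.
    return $override + $base;
}

// Base options shared by every request (values taken from the listing above).
$base = array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_CONNECTTIMEOUT => 30,
    CURLOPT_TIMEOUT        => 30,
);

// Hypothetical per-request overrides.
$per_request = array(
    CURLOPT_TIMEOUT => 5,
    CURLOPT_NOBODY  => true,
);

$merged = merge_curl_options($base, $per_request);
// $merged[CURLOPT_TIMEOUT] is 5 (the per-request value wins),
// $merged[CURLOPT_CONNECTTIMEOUT] is still 30 (inherited from the base).

// array_merge() renumbers integer keys, so the CURLOPT_* keys are lost:
$broken = array_merge($base, $per_request);
// array_keys($broken) is array(0, 1, 2, 3, 4)
```

This is also why assigning $rc->options = array(...) in the examples extends the defaults rather than replacing them.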






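In real projects, or when writing small tools of your own (a news aggregator, product price monitoring, price comparison), you often need to fetch data from third-party websites or API endpoints. When there is a queue of URLs to process, the curl_multi_* family of functions that cURL provides offers a simple way to run the requests concurrently and improve performance.

This article looks at two concrete implementations and gives a simple performance comparison between them.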




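curl_multi_select

(PHP 5)
curl_multi_select: wait for activity on any connection in a cURL multi handle

Description

int curl_multi_select ( resource $mh [, float $timeout = 1.0 ] )
Blocks until there is activity on any of the connections in the cURL multi handle.

Parameters

mh
A cURL multi handle returned by curl_multi_init().

timeout
Time, in seconds, to wait for a response.

Return values

On success, returns the number of descriptors contained in the descriptor sets. On failure, returns -1 on a select failure or timeout (from the underlying select system call).
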

--------------------------------------------------

while (($code = curl_multi_exec($queue, $active)) == CURLM_CALL_MULTI_PERFORM) ;


is equivalent to

do {
    $code = curl_multi_exec($queue, $active);
} while ($code == CURLM_CALL_MULTI_PERFORM);


==================================================================================




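1. The classic cURL concurrency mechanism and its problem

The classic cURL implementation is easy to find online; the following, for example, follows the approach shown in the online PHP manual: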

function classic_curl($urls, $delay) {
    $queue = curl_multi_init();
    $map = array();
 
    foreach ($urls as $url) {
        // create cURL resources
        $ch = curl_init();
 
        // set URL and other appropriate options
        curl_setopt($ch, CURLOPT_URL, $url);
 
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOSIGNAL, true);
 
        // add handle
        curl_multi_add_handle($queue, $ch);
        $map[$url] = $ch;
    }
 
    $active = null;
 
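    // execute the handles: call curl_multi_exec() again for as long as it returns
    // CURLM_CALL_MULTI_PERFORM; $active receives the number of transfers still running.
    // Once started, the transfers take some time to complete.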
    do {
        $mrc = curl_multi_exec($queue, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
 
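    while ($active > 0 && $mrc == CURLM_OK) {
        // curl_multi_select() can return -1 immediately; sleep briefly in that
        // case instead of spinning, then keep driving the transfers
        if (curl_multi_select($queue, 0.5) == -1) {
            usleep(100);
        }
        do {
            $mrc = curl_multi_exec($queue, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }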
 
    $responses = array();
    foreach ($map as $url=>$ch) {
        $responses[$url] = callback(curl_multi_getcontent($ch), $delay);
        curl_multi_remove_handle($queue, $ch);
        curl_close($ch);
    }
 
    curl_multi_close($queue);
    return $responses;
}


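2. The improved Rolling cURL approach

A closer look shows that the classic approach leaves room for optimization: it does not start processing any response until every request, including the slowest one, has returned. The improvement is to process each URL's response as soon as that request completes, handling finished responses while waiting for the remaining URLs, so that the CPU is not left idle. Here is the implementation: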


function rolling_curl($urls, $delay) {
    $queue = curl_multi_init();
    $map = array();
 
    foreach ($urls as $url) {
        $ch = curl_init();
 
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOSIGNAL, true);
 
        curl_multi_add_handle($queue, $ch);
        // (int) works as a key for both resource handles (PHP < 8) and
        // CurlHandle objects (PHP 8+), which cannot be cast to string
        $map[(int) $ch] = $url;
    }
 
    $responses = array();
    do {
        while (($code = curl_multi_exec($queue, $active)) == CURLM_CALL_MULTI_PERFORM) ;
 
        if ($code != CURLM_OK) { break; }
 
        // a request was just completed -- find out which one
        while ($done = curl_multi_info_read($queue)) {
 
            // get the info and content returned on the request
            $info = curl_getinfo($done['handle']);
            $error = curl_error($done['handle']);
            $results = callback(curl_multi_getcontent($done['handle']), $delay);
            $responses[$map[(int) $done['handle']]] = compact('info', 'error', 'results');
 
            // remove the curl handle that just completed
            curl_multi_remove_handle($queue, $done['handle']);
            curl_close($done['handle']);
        }
 
        // Block for data in / output; error handling is done by curl_multi_exec
        if ($active > 0) {
            curl_multi_select($queue, 0.5);
        }
 
    } while ($active);
 
    curl_multi_close($queue);
    return $responses;
}




The before-and-after performance comparison was run on a Linux host, using the following concurrent queue:

var_dump(rolling_curl(array(
    'http://item.jd.com/1120135.html',
    'http://item.jd.com/164112.html',
), 100));



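A brief note on the experimental design and the format of the results. To keep the results reliable, each experiment is repeated 20 times. In each run, given the same set of endpoint URLs, the elapsed time (in seconds) is measured for both Classic (the classic concurrency mechanism) and Rolling (the improved mechanism). The faster one is the Winner, and the time saved (Excellence, in seconds) and the relative improvement (Excel. %) are computed. To stay close to real requests while keeping the experiment simple, the response handling only performs a simple regular-expression match and nothing more complex. In addition, to see how the result-processing callback affects the comparison, usleep can be used to simulate more complex real-world processing (extraction, word segmentation, writing to a file or database, and so on).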
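
The Winner, Excellence and Excel. % columns of the result tables can be reproduced with a few lines of PHP. The sketch below is an assumption inferred from the published rows, not the author's benchmark script: the names time_call and compare_timings are hypothetical, and the percentage is taken relative to the slower of the two timings because that matches the table rows. Since the tables print rounded timings, recomputing from them may differ in the last digit for some rows.

```php
<?php
// Measure the wall-clock time of one call, in seconds (hypothetical helper).
function time_call(callable $fn)
{
    $start = microtime(true);
    $fn();
    return microtime(true) - $start;
}

// Derive the table columns from one pair of timings (hypothetical helper).
function compare_timings($classic, $rolling)
{
    // Positive when Rolling is faster, negative when Classic is faster.
    $saved = $classic - $rolling;
    // Assumption: the percentage is relative to the slower timing.
    $slower = max($classic, $rolling);
    $percent = $slower > 0 ? round($saved / $slower * 100, 2) : 0.0;

    return array(
        // A tie is reported as Rolling here; that choice is arbitrary.
        'winner'     => $saved >= 0 ? 'Rolling' : 'Classic',
        'excellence' => round($saved, 4),
        'percent'    => $percent,
    );
}

// Timings 0.1193 (Classic) and 0.0390 (Rolling): Rolling wins by 0.0803 s, 67.31%.
$row = compare_timings(0.1193, 0.0390);
```

In a full harness, time_call would wrap one classic_curl($urls, $delay) call and one rolling_curl($urls, $delay) call per run, and the two results would be passed to compare_timings.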

The callback used in the performance test:


function callback($data, $delay)
{
    preg_match_all('/<h3>(.+)<\/h3>/iU', $data, $matches);
    usleep($delay);
    return compact('data', 'matches');
}


With no delay in the data-processing callback: Rolling cURL is slightly faster on average, but the improvement is not significant.

------------------------------------------------------------------------------------------------
Delay: 0 micro seconds, equals to 0 milli seconds
------------------------------------------------------------------------------------------------
Counter         Classic         Rolling         Winner          Excellence      Excel. %
------------------------------------------------------------------------------------------------
1               0.1193          0.0390          Rolling         0.0803          67.31%
2               0.0556          0.0477          Rolling         0.0079          14.21%
3               0.0461          0.0588          Classic         -0.0127         -21.6%
4               0.0464          0.0385          Rolling         0.0079          17.03%
5               0.0534          0.0448          Rolling         0.0086          16.1%
6               0.0540          0.0714          Classic         -0.0174         -24.37%
7               0.0386          0.0416          Classic         -0.0030         -7.21%
8               0.0357          0.0398          Classic         -0.0041         -10.3%
9               0.0437          0.0442          Classic         -0.0005         -1.13%
10              0.0319          0.0348          Classic         -0.0029         -8.33%
11              0.0529          0.0430          Rolling         0.0099          18.71%
12              0.0503          0.0581          Classic         -0.0078         -13.43%
13              0.0344          0.0225          Rolling         0.0119          34.59%
14              0.0397          0.0643          Classic         -0.0246         -38.26%
15              0.0368          0.0489          Classic         -0.0121         -24.74%
16              0.0502          0.0394          Rolling         0.0108          21.51%
17              0.0592          0.0383          Rolling         0.0209          35.3%
18              0.0302          0.0285          Rolling         0.0017          5.63%
19              0.0248          0.0553          Classic         -0.0305         -55.15%
20              0.0137          0.0131          Rolling         0.0006          4.38%
------------------------------------------------------------------------------------------------
Average         0.0458          0.0436          Rolling         0.0022          4.8%
------------------------------------------------------------------------------------------------
Summary: Classic wins 10 times, while Rolling wins 10 times

With a 5 ms delay in the data-processing callback: Rolling cURL wins every run, improving performance by about 47% on average.

------------------------------------------------------------------------------------------------
Delay: 5000 micro seconds, equals to 5 milli seconds
------------------------------------------------------------------------------------------------
Counter         Classic         Rolling         Winner          Excellence      Excel. %
------------------------------------------------------------------------------------------------
1               0.0658          0.0352          Rolling         0.0306          46.5%
2               0.0728          0.0367          Rolling         0.0361          49.59%
3               0.0732          0.0387          Rolling         0.0345          47.13%
4               0.0783          0.0347          Rolling         0.0436          55.68%
5               0.0658          0.0286          Rolling         0.0372          56.53%
6               0.0687          0.0362          Rolling         0.0325          47.31%
7               0.0787          0.0337          Rolling         0.0450          57.18%
8               0.0676          0.0391          Rolling         0.0285          42.16%
9               0.0668          0.0351          Rolling         0.0317          47.46%
10              0.0603          0.0317          Rolling         0.0286          47.43%
11              0.0714          0.0350          Rolling         0.0364          50.98%
12              0.0627          0.0215          Rolling         0.0412          65.71%
13              0.0617          0.0401          Rolling         0.0216          35.01%
14              0.0721          0.0226          Rolling         0.0495          68.65%
15              0.0701          0.0428          Rolling         0.0273          38.94%
16              0.0674          0.0352          Rolling         0.0322          47.77%
17              0.0452          0.0425          Rolling         0.0027          5.97%
18              0.0596          0.0366          Rolling         0.0230          38.59%
19              0.0679          0.0480          Rolling         0.0199          29.31%
20              0.0657          0.0338          Rolling         0.0319          48.55%
------------------------------------------------------------------------------------------------
Average         0.0671          0.0354          Rolling         0.0317          47.24%
------------------------------------------------------------------------------------------------
Summary: Classic wins 0 times, while Rolling wins 20 times

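Based on the comparison above, Rolling cURL is the better choice when processing a queue of URLs concurrently. When the number of URLs is very large (1000+), cap the concurrency at a maximum window size, for example 20, and each time one URL has returned and been processed, immediately add one not-yet-requested URL to the queue. Code written this way is more robust and will not hang or crash because too many requests run at once. For a complete implementation, see: http://code.google.com/p/rolling-curl/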
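
The bounded window described above can be sketched as a plain function in the style of rolling_curl. This is a minimal sketch under assumptions, not the library's implementation: the name rolling_curl_window, its parameters, and the callback signature are hypothetical, and error handling is reduced to passing curl_error() to the callback.

```php
<?php
// Fetch $urls with at most $window transfers in flight at any time.
// $on_done is called as $on_done($url, $body, $info, $error) per finished URL.
function rolling_curl_window(array $urls, $window, callable $on_done)
{
    $queue   = curl_multi_init();
    $pending = array_values($urls);
    $next    = 0;        // index of the next URL that has not been started
    $map     = array();  // handle id => URL for transfers in flight
    $window  = max(1, (int) $window);

    // Start the next pending URL, if there is one left.
    $start = function () use (&$next, &$map, $pending, $queue) {
        if ($next >= count($pending)) {
            return;
        }
        $url = $pending[$next++];
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($queue, $ch);
        // (int) gives a usable key for resources (PHP < 8) and CurlHandle objects (PHP 8+)
        $map[(int) $ch] = $url;
    };

    // Fill the initial window.
    for ($i = 0; $i < $window; $i++) {
        $start();
    }

    do {
        while (($code = curl_multi_exec($queue, $active)) == CURLM_CALL_MULTI_PERFORM) ;
        if ($code != CURLM_OK) {
            break;
        }

        // Process every transfer that has finished, then refill the window.
        while ($done = curl_multi_info_read($queue)) {
            $ch  = $done['handle'];
            $url = $map[(int) $ch];
            unset($map[(int) $ch]);
            $on_done($url, curl_multi_getcontent($ch), curl_getinfo($ch), curl_error($ch));
            curl_multi_remove_handle($queue, $ch);
            $start();
        }

        // Wait for activity; back off briefly if select reports -1.
        if ($active > 0 && curl_multi_select($queue, 0.5) == -1) {
            usleep(100);
        }
        // Keep looping while transfers run or freshly added ones have not run yet.
    } while ($active > 0 || count($map) > 0);

    curl_multi_close($queue);
}
```

For example, rolling_curl_window($urls, 20, function ($url, $body, $info, $error) { /* process */ }) keeps at most 20 transfers in flight however long $urls is.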

5. References and further reading

Source: http://www.searchtb.com/2012/06/rolling-curl-best-practices.html
