PHP: what is the fastest way to validate a large number of URLs?

I have a content database of free text with roughly 11,000 rows of data and 87 columns per row, so there are (potentially) around 957,000 fields in which URLs need to be checked for validity.

I wrote a regular expression to extract everything that looks like a URL (http/https, etc.) and built an array called $urls. I then loop over it, passing each $url to my curl_exec() call.
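For reference, a minimal sketch of that extraction step, assuming the rows have already been loaded into memory; the regex and the $rows variable are illustrative only, not my exact code:

// Hypothetical sketch: collect URL-like strings from the free-text fields.
// $rows is assumed to be the 11,000 x 87 result set already fetched from the database.
$urls = [];
$pattern = '~\bhttps?://[^\s"\'<>]+~i'; // rough approximation of "looks like a URL"
foreach ($rows as $row) {
    foreach ($row as $field) {
        if (is_string($field) && preg_match_all($pattern, $field, $matches)) {
            foreach ($matches[0] as $match) {
                $urls[] = rtrim($match, '.,);'); // strip common trailing punctuation
            }
        }
    }
}
$urls = array_values(array_unique($urls)); // de-duplicate before doing any network work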

I have tried cURL (for each $url):

$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 250);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECT_ONLY, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPGET, 1);

foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $exec = curl_exec($ch);
    // Extra stuff here... it does add overhead, but not that much.
}

curl_close($ch);

As far as I can tell, this should run about as fast as possible, yet it still takes roughly 2-3 seconds per URL.

There must be a faster way?

I plan to run this from a cron job and check my local database first (skipping any URL that has already been checked in the last 30 days), only checking the rest, so the workload will shrink over time. I just want to know whether cURL is the best solution and whether there is a way to make it faster.
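A rough sketch of that 30-day cache check (the url_checks table, its columns, the PDO connection and MySQL date functions are all assumptions for illustration):

// Hypothetical cache lookup: only re-check URLs with no result from the last 30 days.
// Assumes a PDO connection $pdo and a url_checks(url, last_checked, is_valid) table.
function urls_needing_check(PDO $pdo, array $urls): array
{
    $stmt = $pdo->prepare(
        'SELECT is_valid FROM url_checks
         WHERE url = ? AND last_checked > DATE_SUB(NOW(), INTERVAL 30 DAY)'
    );
    $to_check = [];
    foreach ($urls as $url) {
        $stmt->execute([$url]);
        if ($stmt->fetchColumn() === false) {
            $to_check[] = $url; // no recent cached result, so it still needs a live check
        }
    }
    return $to_check;
}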

EDIT:

Based on Nick Zulu's comment below, I am now looking at this code:

function ODB_check_url_array($urls, $debug = true) {
    if (!empty($urls)) {
        $return = array();
        $mh = curl_multi_init();
        foreach ($urls as $index => $url) {
            $ch[$index] = curl_init($url);
            curl_setopt($ch[$index], CURLOPT_CONNECTTIMEOUT_MS, 10000);
            curl_setopt($ch[$index], CURLOPT_NOBODY, 1);
            curl_setopt($ch[$index], CURLOPT_FAILONERROR, 1);
            curl_setopt($ch[$index], CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch[$index], CURLOPT_CONNECT_ONLY, 1);
            curl_setopt($ch[$index], CURLOPT_HEADER, 1);
            curl_setopt($ch[$index], CURLOPT_HTTPGET, 1);
            curl_multi_add_handle($mh, $ch[$index]);
        }
        $running = null;
        do {
            curl_multi_exec($mh, $running);
        } while ($running);
        foreach ($ch as $index => $handle) {
            // key by $index: a cURL handle cannot be used as an array key
            $return[$index] = curl_multi_getcontent($handle);
            curl_multi_remove_handle($mh, $handle);
            curl_close($handle);
        }
        curl_multi_close($mh);
        return $return;
    }
}
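One observation about the do/while loop above (my own note, not part of the original edit): curl_multi_exec() only advances the transfers, so calling it in a tight loop spins the CPU at 100% while the requests are in flight. Adding curl_multi_select() lets the loop sleep until there is socket activity; roughly:

// Sketch: drain the multi handle without busy-waiting.
// $mh is a curl_multi handle with all easy handles already added.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        // block until any handle has activity, or 1 second passes
        curl_multi_select($mh, 1.0);
    }
} while ($running && $status === CURLM_OK);

The solution below uses the same idea inside its $work() closure.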

Solution:

Let's see...

> use the curl_multi API (it's the only sane choice for doing this in PHP)

> have a limit on the maximum number of simultaneous connections; don't just create a connection for every URL (you will get out-of-memory or out-of-resources errors if you create a million simultaneous connections at once, and I wouldn't even trust the timeout errors if you did)

> only fetch the headers, because downloading the body would be a waste of time and bandwidth

Here is my attempt:

// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing array(bool validated, int curl_error_code, string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    // note: $consider_http_300_redirect_as_error must be imported here so the closure can see it
    $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, $consider_http_300_redirect_as_error) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            $cms = curl_multi_select($mh, 10);
            //var_dump('sr: ' . $still_running . " c: " . count($workers) . " cms: " . $cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            //echo "NOT FALSE!";
            //var_dump($info);
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLM_OK) {
                if ($return_fault_reason) {
                    $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                }
            } else {
                $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                if ($code[0] === "3") {
                    if ($consider_http_300_redirect_as_error) {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                        }
                    } else {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                        } else {
                            $ret[] = $workers[(int)$info['handle']];
                        }
                    }
                } elseif ($code[0] === "2") {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                    } else {
                        $ret[] = $workers[(int)$info['handle']];
                    }
                } else {
                    // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etc.)
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                    }
                }
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int)$info['handle']]));
            unset($workers[(int)$info['handle']]);
            curl_close($info['handle']);
        }
        //echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            //echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        //echo "WAITING FOR WORKERS TO BECOME 0!";
        //var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

Here is some test code:

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];

var_dump(validate_urls($urls, 1000, 1, true, false));

returns

array(0) {
}

because they all timed out (the timeout was only 1 millisecond) and fault-reason reporting was disabled (that's the last argument), whereas

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];

var_dump(validate_urls($urls, 1000, 1, true, true));

returns

array(3) {
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
}

Increasing the timeout limit to 1000:

var_dump(validate_urls($urls, 1000, 1000, true, false));

=

array(3) {
  [0]=>
  string(14) "www.google.com"
  [1]=>
  string(22) "https://www.google.com"
  [2]=>
  string(15) "www.example.org"
}

var_dump(validate_urls($urls, 1000, 1000, true, true));

=

array(3) {
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
}

and so on :) The speed should depend on your bandwidth and on the $max_connections variable, which is configurable.
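As a closing illustration, here is a rough sketch of how a cron job could tie validate_urls() back into the 30-day cache idea from the question; the batch size, the url_checks table and the urls_needing_check() helper are the same hypothetical ones sketched earlier, and REPLACE INTO assumes MySQL:

// Hypothetical cron driver: validate only stale URLs, in batches, and cache the results.
$stale = urls_needing_check($pdo, $urls);        // skip anything checked in the last 30 days
$save  = $pdo->prepare(
    'REPLACE INTO url_checks (url, last_checked, is_valid) VALUES (?, NOW(), ?)'
);
foreach (array_chunk($stale, 500) as $batch) {   // keep each curl_multi run to a manageable size
    $results = validate_urls($batch, 100, 10000, true, true);
    foreach ($results as $url => [$ok, $errno, $reason]) {
        $save->execute([$url, $ok ? 1 : 0]);
    }
}

The second argument (100 here) is the $max_connections cap discussed above; tune it to what your server and bandwidth can handle.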

Tags: php, curl

Source: https://codeday.me/bug/20191013/1908455.html
