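For exchange and learning purposes only. Do not use this resource for any activity that violates the laws and regulations of your country (region); always comply with the Cybersecurity Law.
Tips: This only outlines an approach. A real project also has to handle details such as keeping the proxy pool usable.
Hands-on steps
- Deploy the framework and core libraries
- Process that periodically refreshes the proxy pool
- Process that periodically crawls list pages
- Main process that periodically reads list-page tasks from Redis and, if any exist, hands each item to an asynchronous task for execution
Environment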
- CentOS 7.2
- PHP 7.2
- Swoole 4.3.5
- Google Chrome 78.0.3904.108
- ChromeDriver 78.0.3904.105
Composer
- facebook/webdriver=1.7
- easyswoole/easyswoole=3.1.18
- easyswoole/curl=1.0.1
Deploying the framework and core libraries
1. Install EasySwoole 3.1.18
[root@ar414.com phpseleniumdemo]# composer require easyswoole/easyswoole=3.1.18
[root@ar414.com phpseleniumdemo]# php vendor/easyswoole/easyswoole/bin/easyswoole install
______ _____ _
| ____| / ____| | |
| |__ __ _ ___ _ _ | (___ __ __ ___ ___ | | ___
| __| / _` | / __| | | | | \___ \ \ \ /\ / / / _ \ / _ \ | | / _ \
| |____ | (_| | \__ \ | |_| | ____) | \ V V / | (_) | | (_) | | | | __/
|______| \__,_| |___/ \__, | |_____/ \_/\_/ \___/ \___/ |_| \___|
__/ |
|___/
install success,enjoy!
2. Install the core libraries facebook/webdriver and easyswoole/curl
[root@ar414.com phpseleniumdemo]# composer require facebook/webdriver=1.7
[root@ar414.com phpseleniumdemo]# composer require easyswoole/curl=1.0.1
3. Confirm that the server starts without errors
[root@ar414.com phpseleniumdemo]# php easyswoole start
| ____| / ____| | |
| |__ __ _ ___ _ _ | (___ __ __ ___ ___ | | ___
| __| / _` | / __| | | | | \___ \ \ \ /\ / / / _ \ / _ \ | | / _ \
| |____ | (_| | \__ \ | |_| | ____) | \ V V / | (_) | | (_) | | | | __/
|______| \__,_| |___/ \__, | |_____/ \_/\_/ \___/ \___/ |_| \___|
__/ |
|___/
main server SWOOLE_WEB
listen address 0.0.0.0
listen port 9501
sub server1 CONSOLE => SWOOLE_TCP@127.0.0.1:9500
....
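Process that periodically refreshes the proxy pool
Tips: You need to source proxies yourself. What follows is only an example and will not work as-is.
1. Create the project's main directory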
[root@ar414.com phpseleniumdemo]# mkdir App
# Map the App namespace in Composer's autoload configuration
[root@ar414.com phpseleniumdemo]# cat composer.json
{
    "autoload": {
        "psr-4": {
            "App\\": "App/"
        }
    },
    "require": {
        "easyswoole/easyswoole": "3.1.18",
        "facebook/webdriver": "^1.7",
        "easyswoole/curl": "1.0.1"
    }
}
# Regenerate the Composer autoloader
[root@ar414.com phpseleniumdemo]# composer dump-autoload
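As a rough illustration of what the `psr-4` entry in composer.json means, the sketch below resolves a class name to a file path the way a PSR-4 autoloader would. The function name `psr4Path` is hypothetical and exists only for this example; Composer's real autoloader covers more cases (multiple prefixes, fallback directories).

```php
<?php
// Hypothetical helper for illustration only; not part of Composer or EasySwoole.
// Maps a fully qualified class name to a relative file path under a PSR-4 prefix.
function psr4Path(string $class, string $prefix, string $baseDir): ?string
{
    // The class must live under the registered namespace prefix.
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    // Drop the prefix, then turn namespace separators into directory separators.
    $relative = substr($class, strlen($prefix));
    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

// prints App/Process/UpdateProxyPool.php
echo psr4Path('App\\Process\\UpdateProxyPool', 'App\\', 'App/'), PHP_EOL;
```

Under this mapping, a class declared in the `App\Process` namespace is expected to live in the `App/Process` directory.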
2. Create the process directory (the proxy-pool updater runs as a child process that starts together with the project)
[root@ar414.com phpseleniumdemo]# mkdir App/Process
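3. Fetch proxies on a schedule (a Redis List keeps the newest proxy IP at the head; the crawler always takes from the head, and each proxy IP is used only once)
Tips: You need to source proxies yourself. What follows is only an example and will not work as-is.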
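The head-of-list behaviour described in step 3 can be sketched without a Redis server. The class below is an in-memory stand-in, and both its name `InMemoryProxyList` and the proxy addresses are assumptions made for this example. Against a real server, the phpredis calls `$redis->lPush($key, $ip)` and `$redis->lPop($key)` would play the same roles.

```php
<?php
// In-memory stand-in for a Redis List, for illustration only (hypothetical class).
final class InMemoryProxyList
{
    /** @var string[] index 0 is the head of the list */
    private $items = [];

    // Like LPUSH: the newest proxy goes to the head.
    public function push(string $proxy): void
    {
        array_unshift($this->items, $proxy);
    }

    // Like LPOP: take the proxy at the head and remove it, so it is used only once.
    // Returns null when the pool is empty.
    public function pop(): ?string
    {
        return array_shift($this->items);
    }
}

$pool = new InMemoryProxyList();
$pool->push('10.0.0.1:1080'); // placeholder address, fetched earlier
$pool->push('10.0.0.2:1080'); // placeholder address, fetched later

// prints 10.0.0.2:1080 (the most recently pushed proxy)
echo $pool->pop(), PHP_EOL;
```

Because popping removes the entry, a proxy is never handed out twice, and stale entries sink toward the tail as fresher ones are pushed.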
<?php
/**
* Created by PhpStorm.
* User: ar414.com@gmail.com
* Date: 2019/12/7
* Time: 21:00
*/
namespace App\Process;

use App\Lib\Curl;
use App\Lib\Kv;
use EasySwoole\Component\Process\AbstractProcess;

class UpdateProxyPool extends AbstractProcess
{
    // The proxy IPs returned by this API only support the SOCKS5 protocol
    private $proxyListApi = "http://www.zdopen.com/ShortS5Proxy/GetIP/?api=%s&akey=%s&order=2&type=3";
const