Documentation:
#===================================================================#
# Brief documentation for the crawling framework
#
# @author blogdaren@163.com
# @modify 2014.07.07
# @version v1.0.0
#
#===================================================================#
Runtime environment:
Ubuntu 12.04 LTS
PHP 5.4.7
MySQL 5.5.27
Apache 2.4.3
(pear) Net_Gearman 0.2.3
gearmand 1.1.12
(pecl) gearman 1.1.2
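A quick way to see which of these tools are already present on a machine is sketched below. The helper name `missing_tools` is hypothetical, and the command names `php`, `pear`, `mysql`, and `gearmand` are assumed to be the binary names on your PATH. It only checks presence, not versions.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
# (Hypothetical helper, not part of the framework.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names; adjust them to match your installation.
missing=$(missing_tools php pear mysql gearmand)
if [ -n "$missing" ]; then
    echo "not found on PATH:"
    echo "$missing"
else
    echo "all required commands were found"
fi
```

Version numbers still need to be compared by hand, for example with `php -v` and `gearmand --version`.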
Installing gearman-job-server:
# pear channel-update pear.php.net
# pear install channel://pear.php.net/Net_Gearman-0.2.3
# wget https://launchpad.net/gearmand/1.2/1.1.12/+download/gearmand-1.1.12.tar.gz
# wget http://downloads.sourceforge.net/project/boost/boost/1.55.0/boost_1_55_0.zip
# unzip boost_1_55_0.zip
# sh bootstrap.sh
# ./bjam install --prefix=/usr/local/boost
# ln -s /usr/local/boost/include/boost/ /usr/include/boost
# cd /usr/local/boost && sudo find $PWD/lib/*.* -type f -exec ln -s {} /usr/lib/ \;
# sudo apt-get install gperf
# wget https://github.com/downloads/libevent/libevent/libevent-2.0.21-stable.tar.gz
# wget http://downloads.sourceforge.net/project/libuuid/libuuid-1.0.2.tar.gz
Installing the gearman extension for PHP:
# wget http://pecl.php.net/get/gearman-1.1.2.tgz
# ./configure --with-gearman=/usr/local/sbin/gearmand/ --with-php-config=/opt/lampp/bin/php-config
# sudo make && sudo make install
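After the build, it can help to confirm that PHP actually loads the extension. The sketch below assumes the extension has been enabled with a line such as `extension=gearman.so` in php.ini, a step the commands above do not show. The helper name `module_loaded` is hypothetical.

```shell
#!/bin/sh
# Succeed if the module name in $1 appears, case-insensitively and as a
# whole line, in the module list read from standard input.
# (Hypothetical helper, not part of the framework.)
module_loaded() {
    grep -qix -- "$1"
}

# Only query PHP when it is actually installed on this machine.
if command -v php >/dev/null 2>&1; then
    if php -m | module_loaded gearman; then
        echo "gearman extension is loaded"
    else
        echo "gearman extension is NOT loaded; check php.ini"
    fi
fi
```

If PHP was installed under a custom prefix (the configure line above points at /opt/lampp), call that `php` binary instead of the one on PATH.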
gearman-monitor management tool:
# wget https://github.com/yugene/Gearman-Monitor/archive/master.zip
# http://www.ub.gearman-monitor.com
Directory structure of the crawling framework:
=====================================
To save space, only the top two directory levels are shown
=====================================
|-- application (application-specific code)
|   |-- bootstrap.php
|   |-- config
|   |-- download
|   |-- extracter
|   |-- filter
|   |-- handler
|   |-- helper
|   `-- trigger
|-- core (framework core)
|   |-- assign.php
|   |-- check
|   |-- context.php
|   |-- crontab
|   |-- crontab.php
|   |-- gearman
|   |-- handler
|   |-- handler.php
|   |-- main.php
|   |-- obj
|   |-- oop
|   |-- resources.php
|   `-- sbin
|-- data (temporary crawl data)
|   `-- file
|-- library (third-party libraries)
|   `-- phpQuery-onefile.php
|-- README.md (framework documentation)
`-- sbin
    |-- cron.php (debugging helper)
    |-- debug.php (debugging helper)
    |-- killtornado.sh (debugging helper)
    |-- task.php (debugging helper)
    |-- tornado.php (startup entry point)
    `-- workers (gearman workers directory)
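Core processes:
Core_Sbin_Tornado: the daemon process; adds the node to the cluster, then initializes and manages the child processes
Tornado_Check_System: child process that checks whether the system is running normally and handles heartbeat detection with the cluster, failover, and master/slave self-nomination
Tornado_Crontab_Main (write operation): child process that starts scheduled tasks according to the table configuration
Tornado_Crontab_Task (write operation): child process that adds scheduled tasks
Tornado_Assign_Main (read operation): child process that dispatches scheduled tasks
Core_Handler: executes tasks and handles task processing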
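The parent/child layout of the processes listed above can be pictured with a small shell analogy. This is only a sketch under assumptions: the real framework does this in PHP, and the names `supervise` and `child_task` and the three child labels are hypothetical placeholders. The parent starts each child in the background, records its PID, and waits for all of them.

```shell
#!/bin/sh
# Hypothetical child body: a real child would check the system, run the
# crontab, or assign tasks. Here it only reports its name.
child_task() {
    echo "child $1 started"
}

# Start one background child per argument, then wait for every one.
supervise() {
    pids=""
    for name in "$@"; do
        child_task "$name" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid"
    done
    echo "all children exited"
}

supervise check crontab assign
```

The order of the "started" lines is not guaranteed, since the children run concurrently. A real daemon would also restart children that exit unexpectedly, which this sketch does not attempt.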
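Key points of the crawling framework:
Gearman multi-task distribution
PHP multi-process programming
The framework is heavily object-oriented and abstracts several major components: analyzer, downloader, extractor, filter, trigger, and so on. To crawl any website, you only need to adjust the configuration and then focus on writing the most common functions for each component, for example regular expressions and database writes.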
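The division of labor between the components can be illustrated with a shell pipeline. This is an analogy only, not the framework's PHP classes: the function names and the sample HTML are hypothetical, the analyzer stage is left out, and the downloader is a stand-in that emits fixed text instead of fetching a URL.

```shell
#!/bin/sh
# Downloader stand-in: a real one would fetch a page; this emits fixed HTML.
download() {
    printf '%s\n' \
        '<a href="/a.html">A</a>' \
        '<a href="/b.html">B</a>' \
        '<a href="/a.html">A again</a>'
}

# Extractor: pull the href value out of each line with a regular expression.
extract() {
    sed -n 's/.*href="\([^"]*\)".*/\1/p'
}

# Filter: drop duplicate links.
filter() {
    sort -u
}

# Trigger: act on each surviving item; here it only prints a record,
# where a real trigger might write to a database.
trigger() {
    while IFS= read -r link; do
        echo "saved $link"
    done
}

download | extract | filter | trigger
```

Each stage only knows about its own input and output, which is the property that lets you swap in a site-specific regular expression or storage step without touching the others.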
Starting the crawling framework:
/path/to/php /path/to/tornado.php -vvvv >> /tmp/tornado.log
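If the script does not detach from the terminal on its own, a small wrapper along these lines may be convenient. It is a sketch under assumptions: the function name `start_tornado` and the variables `PHP_BIN`, `TORNADO`, and `LOG` are hypothetical, and the use of `nohup`, the `2>&1` redirection, and the trailing `&` are additions that the original command does not include.

```shell
#!/bin/sh
# Placeholder paths taken from the command above; override via environment.
PHP_BIN=${PHP_BIN:-/path/to/php}
TORNADO=${TORNADO:-/path/to/tornado.php}
LOG=${LOG:-/tmp/tornado.log}

# Launch the entry script in the background, append stdout and stderr
# to the log file, and print the PID of the background job.
start_tornado() {
    nohup "$PHP_BIN" "$TORNADO" -vvvv >> "$LOG" 2>&1 &
    echo "$!"
}

# Usage (not run here because the paths above are placeholders):
#   pid=$(start_tornado)
#   tail -f /tmp/tornado.log
```

Capturing stderr in the same log means PHP warnings and fatal errors end up next to the framework's own output.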
Screenshots: