Deployment of demo.pyspider.org
===============================

[demo.pyspider.org](http://demo.pyspider.org/) runs on three VPSs connected over a private network built with [tinc](http://www.tinc-vpn.org/).

1vCore 4GB RAM | 1vCore 2GB RAM * 2
---------------|----------------
database<br>message queue<br>scheduler | phantomjs * 2<br>phantomjs-lb * 1<br>fetcher * 1<br>fetcher-lb * 1<br>processor * 2<br>result-worker * 1<br>webui * 4<br>webui-lb * 1<br>nginx * 1

All components run inside docker containers.

database / message queue / scheduler
------------------------------------

The database is postgresql and the message queue is redis.

The scheduler may perform a lot of database operations, so it's better to put it close to the database.

```bash
# database, bound to the private network address
docker run --name postgres -v /data/postgres/:/var/lib/postgresql/data -d -p $LOCAL_IP:5432:5432 -e POSTGRES_PASSWORD="" postgres
# message queue
docker run --name redis -d -p $LOCAL_IP:6379:6379 redis
# scheduler, on the same host as the database
docker run --name scheduler -d -p $LOCAL_IP:23333:23333 --restart=always binux/pyspider \
        --taskdb "sqlalchemy+postgresql+taskdb://binux@10.21.0.7/taskdb" \
        --resultdb "sqlalchemy+postgresql+resultdb://binux@10.21.0.7/resultdb" \
        --projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" \
        --message-queue "redis://10.21.0.7:6379/1" \
        scheduler --inqueue-limit 5000 --delete-time 43200
```

other components
----------------

fetcher, processor and result_worker run on two boxes with the same configuration, managed with [docker-compose](https://docs.docker.com/compose/).
```yaml
phantomjs:
  image: 'binux/pyspider:latest'
  command: phantomjs
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,23333,24444'
  expose:
    - '25555'
  mem_limit: 512m
  restart: always
phantomjs-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - phantomjs
  restart: always

fetcher:
  image: 'binux/pyspider:latest'
  command: '--message-queue "redis://10.21.0.7:6379/1" --phantomjs-proxy "phantomjs:80" fetcher --xmlrpc'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,25555,23333'
  links:
    - 'phantomjs-lb:phantomjs'
  mem_limit: 128m
  restart: always
fetcher-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - fetcher
  restart: always

processor:
  image: 'binux/pyspider:latest'
  command: '--projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" --message-queue "redis://10.21.0.7:6379/1" processor'
  cpu_shares: 512
  mem_limit: 256m
  restart: always

result-worker:
  image: 'binux/pyspider:latest'
  command: '--taskdb "sqlalchemy+postgresql+taskdb://binux@10.21.0.7/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://binux@10.21.0.7/resultdb" --message-queue "redis://10.21.0.7:6379/1" result_worker'
  cpu_shares: 512
  mem_limit: 256m
  restart: always

webui:
  image: 'binux/pyspider:latest'
  command: '--taskdb "sqlalchemy+postgresql+taskdb://binux@10.21.0.7/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://binux@10.21.0.7/resultdb" --message-queue "redis://10.21.0.7:6379/1" webui --max-rate 0.2 --max-burst 3 --scheduler-rpc "http://o4.i.binux.me:23333/" --fetcher-rpc "http://fetcher/"'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=24444,25555,23333'
  links:
    - 'fetcher-lb:fetcher'
  mem_limit: 256m
  restart: always
webui-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - webui
  restart: always

nginx:
  image: 'nginx'
  links:
    - 'webui-lb:HAPROXY'
  ports:
    - '0.0.0.0:80:80'
  volumes:
    - /home/binux/nfs/profile/nginx/nginx.conf:/etc/nginx/nginx.conf
    - /home/binux/nfs/profile/nginx/conf.d/:/etc/nginx/conf.d/
  restart: always
```

With this config, you can change the scale whenever you need to with `docker-compose scale phantomjs=2 processor=2 webui=4`.

#### load balance

phantomjs-lb, fetcher-lb and webui-lb are automatically configured haproxy instances that allow any number of upstreams.

#### phantomjs

phantomjs has a memory leak issue, so a memory limit is applied, and it's recommended to restart it every hour.

#### fetcher

fetcher is implemented with async IO and supports 100 concurrent connections. If the upstream queue is not choked, one fetcher should be enough.

#### processor

processor is a CPU-bound component. The recommended number of instances is the number of CPU cores + 1~2, or CPU cores * 10%~15% when you have more than 20 cores.

#### result-worker

If you didn't override result-worker, it only writes results into the database and should be very fast.
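The processor sizing guideline above can be turned into a small shell helper. This is only a sketch under stated assumptions: `suggest_processors` is a hypothetical function name, the rule is read as "cores + 1" up to 20 cores and "cores plus 10% of cores, rounded up" beyond that (the lower end of each range given above), and the core count is taken from `getconf _NPROCESSORS_ONLN`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: suggest a processor instance count for a core count.
# Assumed reading of the guideline (lower end of each range):
#   up to 20 cores     -> cores + 1
#   more than 20 cores -> cores + 10% of cores, rounded up
suggest_processors() {
    local cores=$1
    if [ "$cores" -le 20 ]; then
        echo $((cores + 1))
    else
        # integer ceiling of cores * 10 / 100
        echo $((cores + (cores * 10 + 99) / 100))
    fi
}

# Print the scale command for this machine instead of running it.
cores=$(getconf _NPROCESSORS_ONLN)
echo "docker-compose scale processor=$(suggest_processors "$cores")"
```

On a 1 vCore box this suggests 2 processors, which matches the `processor * 2` entry in the table at the top. Treat the result as a starting point and adjust it after watching CPU usage.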
Reposted from: https://my.oschina.net/sijinge/blog/1528256