RabbitMQ Crawler

Exchange types

RabbitMQ provides four exchange types: fanout, direct, topic, and headers. The first three are the most commonly used.

Direct

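  • Each message is published with a routing_key. The exchange delivers it to every queue whose binding key equals that routing_key exactly.
  • With the default (nameless) exchange, every queue is automatically bound under its own name, so the routing_key can be thought of simply as the name of the target queue and no explicit binding is needed. A named direct exchange, such as direct_logs in the code here, still requires a queue_bind call.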
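
The exact-match rule can be illustrated without a broker. The following is a minimal in-memory sketch, not pika or RabbitMQ code: the DirectExchange class and its bind/publish methods are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


class DirectExchange:
    """Toy in-memory model of a direct exchange (not a pika or RabbitMQ API)."""

    def __init__(self):
        # binding key -> names of the queues bound with that key
        self.bindings = defaultdict(set)
        # queue name -> messages delivered to it
        self.queues = defaultdict(list)

    def bind(self, queue, binding_key):
        self.bindings[binding_key].add(queue)

    def publish(self, routing_key, body):
        # Deliver only on an exact key match; with no matching
        # binding the message is dropped.
        targets = self.bindings.get(routing_key, set())
        for queue in targets:
            self.queues[queue].append(body)
        return sorted(targets)


exchange = DirectExchange()
exchange.bind("direct_key", "k1")
exchange.bind("direct_key", "k2")

delivered_k1 = exchange.publish("k1", "hello")    # matches a binding
delivered_k3 = exchange.publish("k3", "dropped")  # no binding for k3
print(delivered_k1, delivered_k3)
```

In this model a message published with an unbound key simply disappears, which is why a consumer on a named direct exchange should bind its queue to every key it expects.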

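Receiver

# -*- coding: utf-8 -*-
import pika

# Assumes pika 1.x and a RabbitMQ broker reachable on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()

channel.exchange_declare(exchange='direct_logs', exchange_type='direct')
result = channel.queue_declare(queue="direct_key", durable=True)
# Bind the queue to each routing key the sender uses; a named direct
# exchange drops messages whose key has no matching binding.
for key in ("k1", "k2"):
    channel.queue_bind(exchange='direct_logs', queue=result.method.queue, routing_key=key)


def callback(ch, method, properties, body):
    # body arrives as bytes under Python 3
    print(" [x] Received %s routing_key %s" % (body.decode(), method.routing_key))
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=result.method.queue, on_message_callback=callback)
channel.start_consuming()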

Sender

# -*- coding: utf-8 -*-
import pika

# Assumes pika 1.x and a RabbitMQ broker reachable on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='direct_logs', exchange_type='direct')

# Publish one persistent message per routing key.
for key in ('k1', 'k2'):
    channel.basic_publish(exchange='direct_logs',
                          routing_key=key,
                          body="22222222",
                          properties=pika.BasicProperties(
                              delivery_mode=2,  # 2 = persistent message
                          ))

# Close the connection so buffered frames are flushed to the broker.
connection.close()

Fanout

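  • Fanout ignores the routing_key: every message is copied to all queues bound to the exchange.
  • Queues must be bound to the exchange beforehand. One exchange can be bound to many queues, and one queue can be bound to many exchanges.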
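
As a rough illustration of the broadcast behavior, here is a small in-memory sketch. It is not pika or RabbitMQ code; FanoutExchange and the queue names queue_a and queue_b are hypothetical and exist only for this example.

```python
class FanoutExchange:
    """Toy in-memory model of a fanout exchange (not a pika or RabbitMQ API)."""

    def __init__(self):
        self.queues = {}  # queue name -> list of delivered messages

    def bind(self, queue):
        self.queues.setdefault(queue, [])

    def publish(self, routing_key, body):
        # The routing key is accepted but plays no part in routing:
        # every bound queue receives a copy of the message.
        for messages in self.queues.values():
            messages.append((routing_key, body))


exchange = FanoutExchange()
exchange.bind("queue_a")
exchange.bind("queue_b")
for key in ("k1", "k2", "k3"):
    exchange.publish(key, "22222222")

# Both queues hold the same three messages.
print(exchange.queues["queue_a"] == exchange.queues["queue_b"])
print(len(exchange.queues["queue_a"]))
```

The routing key still travels with each message, so a consumer can read it even though it played no part in routing.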

Receiver

# -*- coding: utf-8 -*-
import pika

# Assumes pika 1.x and a RabbitMQ broker reachable on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='fanout_logs', exchange_type='fanout')
# An empty queue name asks the broker to generate one.
result = channel.queue_declare(queue='', durable=True)
channel.queue_bind(exchange='fanout_logs', queue=result.method.queue)


def callback(ch, method, properties, body):
    # body arrives as bytes under Python 3
    print(" [x] Received %s routing_key %s" % (body.decode(), method.routing_key))
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=result.method.queue, on_message_callback=callback)
channel.start_consuming()

Sender

# -*- coding: utf-8 -*-
import pika

# Assumes pika 1.x and a RabbitMQ broker reachable on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='fanout_logs', exchange_type='fanout')

# The routing keys are ignored for routing but still reach the consumer.
for key in ('k1', 'k2', 'k3'):
    channel.basic_publish(exchange='fanout_logs',
                          routing_key=key,
                          body="22222222",
                          properties=pika.BasicProperties(
                              delivery_mode=2,  # 2 = persistent message
                          ))

# Close the connection so buffered frames are flushed to the broker.
connection.close()

Result

 [x] Received 22222222 routing_key k1
 [x] Received 22222222 routing_key k2
 [x] Received 22222222 routing_key k3

Topic

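  • Topic routing needs a routing_key, and queues must also be bound to the exchange beforehand.
  • Each binding supplies a pattern describing the topics the queue cares about. Words are separated by dots; * matches exactly one word and # matches zero or more words. For example, *.log.* matches any three-word key whose middle word is log, so a message with routing_key "a.log.error" is delivered to the queue.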
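
The wildcard rules can be checked with a small pure-Python matcher. This is a sketch of the matching semantics only; topic_matches is a hypothetical helper, not a function provided by pika or RabbitMQ, and the broker's own implementation will differ.

```python
def topic_matches(pattern, routing_key):
    """Return True if a dot-separated routing key matches a topic pattern.

    '*' stands for exactly one word, '#' for zero or more words.
    """
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' either consumes nothing, or consumes one word and stays active.
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


# Check a few routing keys against the two binding patterns.
for key in ("a.log.error", "user.log.success", "ad.db.cc", "log.error"):
    matched = [p for p in ("*.log.*", "*.db.cc") if topic_matches(p, key)]
    print(key, "->", matched)
```

Note that "log.error" matches neither pattern because * never matches an empty word; a binding such as #.log.# would be needed to catch keys of any length that contain log.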

Receiver

# -*- coding: utf-8 -*-
import pika

# Assumes pika 1.x and a RabbitMQ broker reachable on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange="topic_logs", exchange_type='topic')
# An empty queue name asks the broker to generate one.
result = channel.queue_declare(queue='', durable=True)
# One queue may carry several bindings; a message matching any of them is delivered.
channel.queue_bind(exchange="topic_logs", queue=result.method.queue, routing_key="*.log.*")
channel.queue_bind(exchange="topic_logs", queue=result.method.queue, routing_key="*.db.cc")


def callback(ch, method, properties, body):
    # body arrives as bytes under Python 3
    print(" [x] Received %s routing_key %s" % (body.decode(), method.routing_key))
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=result.method.queue, on_message_callback=callback)
channel.start_consuming()

Sender

# -*- coding: utf-8 -*-
import pika

# Assumes pika 1.x and a RabbitMQ broker reachable on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", virtual_host="/"))
channel = connection.channel()
channel.exchange_declare(exchange='topic_logs', exchange_type='topic')

# The first two keys match *.log.*, the last one matches *.db.cc.
for key in ('user.log.error', 'user.log.success', 'ad.db.cc'):
    channel.basic_publish(exchange='topic_logs',
                          routing_key=key,
                          body="22222222",
                          properties=pika.BasicProperties(
                              delivery_mode=2,  # 2 = persistent message
                          ))

# Close the connection so buffered frames are flushed to the broker.
connection.close()

A simple distributed crawler built on RabbitMQ

Architecture

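  1. The Download process downloads pages.
  2. ParseBase listens for the message that Download publishes when a page has finished downloading, then parses the page (URLs, email addresses, ...).

Processes are managed with supervisor.
Code is deployed with a fabfile.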
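
The parsing step might look roughly like the sketch below. It is not taken from the linked repository: parse_page and the two regular expressions are hypothetical, and the sketch assumes the downloaded HTML arrives as a plain string (for example, decoded from the message body).

```python
import re

# Deliberately simple patterns for illustration; real pages call for an
# HTML parser and stricter validation.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_page(html):
    """Return the unique URLs and email addresses found in a page, sorted."""
    return {
        "urls": sorted(set(URL_RE.findall(html))),
        "emails": sorted(set(EMAIL_RE.findall(html))),
    }


sample = (
    '<a href="https://example.com/a">A</a> '
    '<a href="http://example.org/b?x=1">B</a> '
    "Contact: admin@example.com"
)
result = parse_page(sample)
print(result)
```

In a crawler of this shape, the extracted URLs would typically be published back to the download queue, while the emails and other fields go to storage.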

Simple version of the code

https://github.com/neo-hu/rabbitmq-crawler

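Full version

Downloading: adjustable request rate, proxy settings (including proxies for bypassing network blocks)
Page parsing: keywords, word-segmentation statistics, and so on
Web admin page and other features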
