Crawling Article Data from a Mobile App with Scrapy

Brief note: I recently got a rough understanding of the Scrapy framework and ran a crawl of my own as a test, scraping data from a certain app (which one will stay undisclosed for now). The project covers data scraping, deduplication, and storage. Due to budget and skill constraints, it is not deployed on a server and does not use a distributed setup.

Preparation
1. Fiddler is used as the mobile packet-capture tool; see http://blog.csdn.net/wuzhiguo1314/article/details/49589227 for usage instructions.
2. Install the scraping framework Scrapy, the key-value database Redis, and the storage database MongoDB; the Python-side dependencies can be installed as shown below.
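Assuming pip is available, the three Python packages install like this (the Redis and MongoDB servers themselves are set up separately):

pip install scrapy redis pymongo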

Setting Up the Project
1. Run scrapy startproject tutorial to create a new project.
2. Create an ExampleSpider file under the spiders folder.
3. Use the Fiddler packet-capture tool mentioned above to obtain the URLs serving the data we want, and work out what each request needs to carry: headers, cookies, user agent, and so on.
4. Write ExampleSpider: extract the desired data fields, collect new data URLs, define the data structure in items.py, schedule further requests with yield Request(url, callback=self.parse), and hand each finished item to the pipelines with yield item. A minimal sketch follows.
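Here is one way items.py and ExampleSpider could look. The field names (mId, content), the start URL, and the JSON layout are placeholder assumptions standing in for the undisclosed app's API, not the real values:

# items.py
import scrapy

class ArticleItem(scrapy.Item):
    mId = scrapy.Field()      # unique article id, used later for deduplication
    content = scrapy.Field()  # article body

# spiders/ExampleSpider.py
import json
import scrapy
from scrapy import Request
from RichardScrapy.items import ArticleItem

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # placeholder URL; the real one comes from the Fiddler capture
    start_urls = ['http://api.example.com/articles?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for entry in data.get('articles', []):
            item = ArticleItem()
            item['mId'] = entry['id']
            item['content'] = entry['content']
            yield item  # hand the item to the pipelines
        # keep crawling while the API reports a next page
        next_url = data.get('next_page')
        if next_url:
            yield Request(next_url, callback=self.parse)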
5. Define the pipelines
(1) Deduplication pipeline
Deduplication uses Redis. The connection is initialized with self.r = redis.Redis(host='localhost', port=6379, db=0). If the key already exists in the database (self.r.exists('id:%s' % item['mId'])), a DropItem exception is raised and the item is not stored. Code below:

import redis
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.r = redis.Redis(host='localhost', port=6379, db=0)

    def process_item(self, item, spider):
        if self.r.exists('id:%s' % item['mId']):
            # the id has been seen before: log it and drop the item
            print('delete====' + item['content'])
            raise DropItem("Duplicate item found: %s" % item)
        else:
            # first time we see this id: mark it as seen
            self.r.set('id:%s' % item['mId'], 1)
        return item

(2) Data storage pipeline
Items that survive deduplication are written to MongoDB here. The server configuration lives in the settings file, covered below. The MongoDB operations are simple; an item is inserted with self.collection.insert_one(dict(item)). Code below:

import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class RichardscrapyPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # convert the item to a plain dict and insert it
        self.collection.insert_one(dict(item))
        return item

(3) Register the pipelines and configure the database in settings.py

ITEM_PIPELINES = {
   'RichardScrapy.pipelines.DuplicatesPipeline': 200,
   'RichardScrapy.pipelines.RichardscrapyPipeline': 300,
}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "ExampleDb"
MONGODB_COLLECTION = "Example_content"

6. Rotating the user agent dynamically
(1) Define middlewares.py

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a user agent at random for each request
        ua = random.choice(self.user_agent_list)
        print('===' + ua + '===')
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape;
    # more user agent strings can be found at http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]

(2) Register the downloader middleware in settings.py; setting the built-in UserAgentMiddleware to None disables it so the custom one takes over:

DOWNLOADER_MIDDLEWARES = {
   # 'RichardScrapy.middlewares.MyCustomDownloaderMiddleware': 543,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
   'RichardScrapy.middlewares.RotateUserAgentMiddleware':400,
}

Points to Note
1. Many sites use a robots.txt file to restrict crawlers. ROBOTSTXT_OBEY = True in settings.py tells Scrapy to respect a site's robots.txt rules, and it is enabled by default.
2. DOWNLOAD_DELAY sets the interval between requests.
3. COOKIES_ENABLED controls whether cookies are used. Points 2 and 3 effectively reduce the chance of the crawler being banned; a sample snippet follows this list.
4. The MongoDB data can be browsed with Robomongo.
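For illustration, those settings might look like this in settings.py; the values here are examples rather than the ones used in this project:

ROBOTSTXT_OBEY = False   # ignore robots.txt (check the site's terms before doing this)
DOWNLOAD_DELAY = 2       # wait 2 seconds between requests
COOKIES_ENABLED = False  # disable cookies so requests look less like one session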
