Scraping Data with Scrapy CrawlSpider

This post walks through a working example of Scrapy's CrawlSpider.

Target: technology news from China.com (中华网)

URL: https://tech.china.com/articles/

1. Create the project

scrapy startproject zhonghuawang
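startproject generates a project skeleton roughly like the one below (the exact file list varies a little between Scrapy versions):

zhonghuawang/
    scrapy.cfg
    zhonghuawang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py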

2. Create the CrawlSpider

cd zhonghuawang
scrapy genspider -t crawl china tech.china.com
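genspider with the crawl template produces a spider skeleton roughly like this (the exact template text varies between Scrapy versions); the rules and parse_item are the parts we fill in next:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # Extraction logic goes here
        return item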

3. Page analysis

List page:

Open the F12 developer tools and you will see that all the news entries sit inside the div[@class="m2left topborder"] element; each child div[@class="con_item"] holds one news item, including its link. Comparing a few of those links, the pattern of the detail-page URLs is easy to spot.

With that we can write the link-extraction rule for the detail pages. Since we also need to extract data from each detail page, the rule needs a callback to process the response:

Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False)

Next comes pagination. Clicking through to the next page a couple of times makes the URL pattern obvious.

We can also inspect the pagination links at the bottom of the list page with the F12 developer tools.

 

That gives us the second Rule. It only needs to extract and follow the pagination links, with no further processing, so no callback is required (for a Rule without a callback, follow defaults to True):

Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html'))

The complete rules tuple looks like this:

rules = (
    Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html')),
    Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False),
)

Detail page:

The fields to extract are: title, URL, article body, publish time, source, and site name. With the fields settled, we can write items.py:

import scrapy

class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Body text
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()

Next, use the F12 developer tools to find which tag each field lives in, then pull the values out with XPath in the parse_item callback:

def parse_item(self, response):
    item = ZhonghuawangItem()
    # Title
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
    # URL
    item['url'] = response.url
    # Body: join all text nodes under the article div
    # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
    # Publish time: first two space-separated tokens of the info line
    item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
    # Source
    # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
    item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)').strip()
    # Site name
    item['website'] = '中华网'
    yield item
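Note that .extract()[0] raises an IndexError, and calling .strip() on a None result raises an AttributeError, whenever a crawled page lacks the expected element, which drops that item with an error. Below is a more defensive sketch of the same callback, assuming a reasonably recent Scrapy where the .get(default=...) and .getall() shorthands are available:

def parse_item(self, response):
    item = ZhonghuawangItem()
    # Same fields as above, but with defaults so missing nodes don't raise
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').get(default='').strip()
    item['url'] = response.url
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').getall()).strip()
    info = response.xpath('//div[@id="chan_newsInfo"]/text()').getall()
    item['datetime'] = ' '.join(info[1].strip().split(' ')[:2]) if len(info) > 1 else ''
    item['source'] = (response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)') or '').strip()
    item['website'] = '中华网'
    yield item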

4. Save the data

Once extraction works, the last step is writing the items out to a JSON file (you could just as well save them to a database), which means writing pipelines.py:

import json

class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line, keeping Chinese characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()
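If you would rather store items in a database, here is a minimal sketch of a MongoDB pipeline. It assumes pymongo is installed; MONGO_URI and MONGO_DB are hypothetical setting names you would add to settings.py yourself:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DB are assumed settings, not part of the original project
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'zhonghuawang'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # One MongoDB document per news item
        self.db['news'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

To use it, register zhonghuawang.pipelines.MongoPipeline in ITEM_PIPELINES alongside (or instead of) the JSON pipeline.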

5. Edit settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}

6. Run the spider

scrapy crawl china
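If a JSON dump is all you need, Scrapy's built-in feed export can also write the scraped items straight to a file, without the custom pipeline:

scrapy crawl china -o china.json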

 

The complete code is listed below.

items.py

import scrapy


class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Body text
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()

 

china.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from zhonghuawang.items import ZhonghuawangItem


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles/']

    rules = (
        # Follow pagination links on the article list pages (no callback needed)
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html')),
        # Hand each article detail page to parse_item
        Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ZhonghuawangItem()
        # Title
        item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
        # URL
        item['url'] = response.url
        # Body: join all text nodes under the article div
        # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
        item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
        # Publish time: first two space-separated tokens of the info line
        item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
        # Source
        # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
        item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)').strip()
        # Site name
        item['website'] = '中华网'
        yield item

pipelines.py

import json


class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line, keeping Chinese characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()

settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}

 
