Scraping Data with Scrapy CrawlSpider

This post walks through a working example of Scrapy's CrawlSpider.

Target: technology news from China.com (中华网)

URL: https://tech.china.com/articles/

1. Create the project

scrapy startproject zhonghuawang
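startproject generates a project skeleton roughly like the one below (the exact file list varies a little between Scrapy versions):

zhonghuawang/
    scrapy.cfg
    zhonghuawang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py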

2. Create the CrawlSpider

cd zhonghuawang
scrapy genspider -t crawl china tech.china.com
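genspider with the crawl template produces a spider skeleton roughly like this (the exact template text varies between Scrapy versions); the rules and parse_item are the parts we fill in next:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # Extraction logic goes here
        return item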

3. Page analysis

List page:

Open the F12 developer tools and you will see that all the news entries sit inside the div[@class="m2left topborder"] element; each child div[@class="con_item"] holds one news item, including its link. Comparing a few of those links, the pattern of the detail-page URLs is easy to spot.

With that we can write the link-extraction rule for the detail pages. Since we also need to extract data from each detail page, the rule needs a callback to process the response:

Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False)

Next comes pagination. Clicking through to the next page a couple of times makes the URL pattern obvious.

We can also inspect the pagination links at the bottom of the list page with the F12 developer tools.

 

That gives us the second Rule. It only needs to extract and follow the pagination links, with no further processing, so no callback is required (for a Rule without a callback, follow defaults to True):

Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html'))

The complete rules tuple looks like this:

rules = (
    Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html')),
    Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False),
)

Detail page:

The fields to extract are: title, URL, article body, publish time, source, and site name. With the fields settled, we can write items.py:

import scrapy

class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Body text
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()

Next, use the F12 developer tools to find which tag each field lives in, then pull the values out with XPath in the parse_item callback:

def parse_item(self, response):
    item = ZhonghuawangItem()
    # Title
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
    # URL
    item['url'] = response.url
    # Body: join all text nodes under the article div
    # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
    # Publish time: first two space-separated tokens of the info line
    item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
    # Source
    # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
    item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)').strip()
    # Site name
    item['website'] = '中华网'
    yield item
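Note that .extract()[0] raises an IndexError, and calling .strip() on a None result raises an AttributeError, whenever a crawled page lacks the expected element, which drops that item with an error. Below is a more defensive sketch of the same callback, assuming a reasonably recent Scrapy where the .get(default=...) and .getall() shorthands are available:

def parse_item(self, response):
    item = ZhonghuawangItem()
    # Same fields as above, but with defaults so missing nodes don't raise
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').get(default='').strip()
    item['url'] = response.url
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').getall()).strip()
    info = response.xpath('//div[@id="chan_newsInfo"]/text()').getall()
    item['datetime'] = ' '.join(info[1].strip().split(' ')[:2]) if len(info) > 1 else ''
    item['source'] = (response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)') or '').strip()
    item['website'] = '中华网'
    yield item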

4. Save the data

Once extraction works, the last step is writing the items out to a JSON file (you could just as well save them to a database), which means writing pipelines.py:

import json

class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line, keeping Chinese characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()
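If you would rather store items in a database, here is a minimal sketch of a MongoDB pipeline. It assumes pymongo is installed; MONGO_URI and MONGO_DB are hypothetical setting names you would add to settings.py yourself:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DB are assumed settings, not part of the original project
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'zhonghuawang'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # One MongoDB document per news item
        self.db['news'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

To use it, register zhonghuawang.pipelines.MongoPipeline in ITEM_PIPELINES alongside (or instead of) the JSON pipeline.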

5. Edit settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}

6. Run the spider

scrapy crawl china
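If a JSON dump is all you need, Scrapy's built-in feed export can also write the scraped items straight to a file, without the custom pipeline:

scrapy crawl china -o china.json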

 

The complete code is listed below.

items.py

import scrapy


class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Body text
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()

 

china.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from zhonghuawang.items import ZhonghuawangItem


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles/']

    rules = (
        # Follow pagination links on the article list pages (no callback needed)
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+\.html')),
        # Hand each article detail page to parse_item
        Rule(LinkExtractor(allow=r'/article/\d+/\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ZhonghuawangItem()
        # Title
        item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
        # URL
        item['url'] = response.url
        # Body: join all text nodes under the article div
        # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
        item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
        # Publish time: first two space-separated tokens of the info line
        item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
        # Source
        # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
        item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)').strip()
        # Site name
        item['website'] = '中华网'
        yield item

pipelines.py

import json


class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line, keeping Chinese characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()

settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}

 
