A little bonus: using the CrawlSpider class from the Scrapy framework to build a crawler and scrape information


Hi everyone, I'm 天空之城. Today's little bonus: using the CrawlSpider class from the Scrapy framework to build a crawler and scrape information.

The Scrapy framework provides two kinds of spiders:

the Spider class and the CrawlSpider class.

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages in its start_urls list, whereas CrawlSpider defines a set of rules (Rule) that provide a convenient mechanism for following links, so it is better suited to extracting links from the crawled pages and continuing to crawl them.

A Spider implements its logic in a parse method.
A CrawlSpider reserves parse for its own link-following machinery, so your callbacks must not be named parse.

A CrawlSpider defines rules, a tuple (or list) containing Rule objects.
A Rule describes one crawling rule and takes parameters such as a LinkExtractor, callback and follow.
LinkExtractor is the link extractor; it can match links by regular expression, XPath or CSS rules.
callback names the function that handles the response of each URL the extractor picks up.
The key parameter is follow=True/False, which says whether the rules should keep being applied to the pages reached through this rule, i.e. whether to keep extracting at deeper levels.
If an extracted URL matches more than one Rule, the first matching Rule in rules is used.

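To make the Rule / LinkExtractor idea concrete before the real project code below, here is a minimal, self-contained sketch (the site, URL patterns and class name are invented purely for illustration):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    # Hypothetical spider, only to show how rules are declared
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/list?page=1']

    rules = (
        # Follow pagination links; no callback, just keep crawling them
        Rule(LinkExtractor(allow=r'/list\?page=\d+'), follow=True),
        # Hand every detail page to parse_item (note: not named parse)
        Rule(LinkExtractor(allow=r'/detail\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}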

Below is the project's directory structure:
[screenshot: project directory structure]
And here is a screenshot of the data obtained at the end:
[screenshot: scraped data]
First, the code in the cyg spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re
'''
1. Create a CrawlSpider with: scrapy genspider -t crawl <spider name> <domain>
2. When a CrawlSpider needs a callback, don't name the function parse
3. Rule objects: decide when you need follow and how the callback is implemented
'''

class CygSpider(CrawlSpider):
    name = 'cyg'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1']
    # Rules defining which URLs to extract and follow
    rules = (
        # LinkExtractor: link extractor for the URLs that should be picked up
        # callback: the response of each extracted URL is handed to this callback
        # follow=True: keep requesting the new URLs found on matched pages
        # List pages
        Rule(LinkExtractor(allow=r'http://wz.sun0769.com/political/index/politicsNewest\?id=\d+'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'http://wz.sun0769.com/political/politics/index\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        # Data from the detail page
        item['title'] = response.xpath("//div[@class='mr-three']/p/text()").extract_first()
        # item['author']=response.xpath("//html/body/div[3]/div[2]/div[2]/div[1]/span[1]/text()")
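        # NOTE: indexing the SelectorList with [1] returns a Selector object rather than
        # the text itself -- this is the issue discussed at the end of the post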
        item['author'] = response.xpath("//div[@class='mr-three']/div[1]/span[1]/text()")[1]
        item['content'] = response.xpath("//div[@class='details-box']/pre/text()").extract_first()
        #item['date'] = response.xpath("//div[@class='mr-three']/div/span[@class='fl']/text()").extract_first()

        print(item)
        return item


Next, the code in the settings file:

# -*- coding: utf-8 -*-

# Scrapy settings for yg project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yg'

SPIDER_MODULES = ['yg.spiders']
NEWSPIDER_MODULE = 'yg.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yg (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'


# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# LOG_LEVEL = 'WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yg.middlewares.YgSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yg.middlewares.YgDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'yg.pipelines.YgPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


FEED_URI='./storage/data/%(name)s.csv'
FEED_FORMAT='CSV'
FEED_EXPORT_ENCODING='ansi'
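
One note on the three FEED_* settings at the end: they write every spider's items to ./storage/data/<spider name>.csv. In Scrapy 2.1 and later the same export can also be configured through the FEEDS setting; a sketch (the 'utf-8-sig' encoding is my own substitution so the CSV opens cleanly in Excel, it is not from the original settings):

FEEDS = {
    './storage/data/%(name)s.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',
    },
}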

Next, the code in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YgItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
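
The spider yields plain dicts, which is why YgItem is left empty here. If you prefer declared fields, a sketch using the same keys as parse_item could look like this (the spider would then build YgItem(...) instead of a dict):

import scrapy

class YgItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()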


Next, the code in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class YgPipeline:
    def process_item(self, item, spider):
        return item
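
The generated pipeline simply passes items through. As an illustration of where cleanup could live, here is a hypothetical pipeline (not part of the original project) that strips the stray whitespace visible around the author value in the output below; it would also need its own entry in ITEM_PIPELINES to take effect:

class StripWhitespacePipeline:
    # Hypothetical example: trim leading/trailing whitespace from all string fields
    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item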


Finally, the code that launches the spider:



from scrapy import cmdline

# cmdline.execute("scrapy crawl db".split())
cmdline.execute(['scrapy','crawl','cyg'])


There is still one small problem: when extracting the author field, I could never get the poster's name directly. I tried several different XPath expressions and none of them worked, and I don't know why. If anyone knows, please help me out.

{'title': '石碣华润广场到处黑漆漆,强烈要求亮化', 'author': <Selector xpath="//div[@class='mr-three']/div[1]/span[1]/text()" data='米拉多 '>, 'content': '随着梧桐里生活广场的马上开业,目前华润广场那里一到晚上到处都黑漆漆的,而且广场那边灯已经坏了几个月了,也没人去修,华润广场后面全部黑漆漆,没任何灯,很不安全,东风中路的墙体也没安装夜景灯,目前到处是黑漆漆的\r\n♥强烈要求政府对整个华润广场安装夜景灯彩色灯亮化,然后把东风中路一侧的墙体进行亮化,提升石碣中心城区的夜景景观!随着梧桐里生活广场开业,除了合信广场,华润广场那边也会成为石碣另一个商业中心,强烈要求政府改善华润广场的环境,对华润广场安装好彩色灯,把东风中路的墙体跟东风南路一样,进行亮化!'}

It's this part: 'author': <Selector xpath="//div[@class='mr-three']/div[1]/span[1]/text()" data='米拉多 '>. I only want the '米拉多 ' at the end, but the Selector wrapper always shows up in front of it, and I don't understand why.

The page source looks like this:
[screenshot: page source of the detail page]
In it, "自由自在多好啊" is the poster's name, but I could never get it directly.
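
For what it's worth, the Selector wrapper itself comes from the indexing: response.xpath(...) returns a SelectorList, and [1] gives back a Selector object rather than a string. Calling .extract() (or .get()) on that selector, or taking .getall()[1], should leave just the text; a sketch of the line, with a .strip() added only to drop the trailing space seen in the output above:

item['author'] = response.xpath("//div[@class='mr-three']/div[1]/span[1]/text()")[1].extract().strip()
# or, equivalently:
# item['author'] = response.xpath("//div[@class='mr-three']/div[1]/span[1]/text()").getall()[1].strip()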
