A little bonus: using the CrawlSpider class from the Scrapy framework to build a crawler and scrape information


Hi everyone, I'm 天空之城. Today's little bonus: using the CrawlSpider class from the Scrapy framework to build a crawler and scrape information.

The Scrapy framework provides two kinds of spiders:

the Spider class and the CrawlSpider class.

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages in its start_urls list, whereas CrawlSpider defines a set of rules (Rule) that provide a convenient mechanism for following links, so it is better suited to extracting links from the crawled pages and continuing to crawl them.

A Spider implements its logic in a parse method.
A CrawlSpider reserves parse for its own link-following machinery, so your callbacks must not be named parse.

A CrawlSpider defines rules, a tuple (or list) containing Rule objects.
A Rule describes one crawling rule and takes parameters such as a LinkExtractor, callback and follow.
LinkExtractor is the link extractor; it can match links by regular expression, XPath or CSS rules.
callback names the function that handles the response of each URL the extractor picks up.
The key parameter is follow=True/False, which says whether the rules should keep being applied to the pages reached through this rule, i.e. whether to keep extracting at deeper levels.
If an extracted URL matches more than one Rule, the first matching Rule in rules is used.

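To make the Rule / LinkExtractor idea concrete before the real project code below, here is a minimal, self-contained sketch (the site, URL patterns and class name are invented purely for illustration):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    # Hypothetical spider, only to show how rules are declared
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/list?page=1']

    rules = (
        # Follow pagination links; no callback, just keep crawling them
        Rule(LinkExtractor(allow=r'/list\?page=\d+'), follow=True),
        # Hand every detail page to parse_item (note: not named parse)
        Rule(LinkExtractor(allow=r'/detail\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}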

Below is the project's directory structure:
[screenshot: project directory structure]
And here is a screenshot of the data obtained at the end:
[screenshot: scraped data]
First, the code in the cyg spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re
'''
1. Create a CrawlSpider with: scrapy genspider -t crawl <spider name> <domain>
2. When a CrawlSpider needs a callback, don't name the function parse
3. Rule objects: decide when you need follow and how the callback is implemented
'''

class CygSpider(CrawlSpider):
    name = 'cyg'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1']
    # Rules defining which URLs to extract and follow
    rules = (
        # LinkExtractor: link extractor for the URLs that should be picked up
        # callback: the response of each extracted URL is handed to this callback
        # follow=True: keep requesting the new URLs found on matched pages
        # List pages
        Rule(LinkExtractor(allow=r'http://wz.sun0769.com/political/index/politicsNewest\?id=\d+'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'http://wz.sun0769.com/political/politics/index\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        # Data from the detail page
        item['title'] = response.xpath("//div[@class='mr-three']/p/text()").extract_first()
        # item['author']=response.xpath("//html/body/div[3]/div[2]/div[2]/div[1]/span[1]/text()")
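        # NOTE: indexing the SelectorList with [1] returns a Selector object rather than
        # the text itself -- this is the issue discussed at the end of the post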
        item['author'] = response.xpath("//div[@class='mr-three']/div[1]/span[1]/text()")[1]
        item['content'] = response.xpath("//div[@class='details-box']/pre/text()").extract_first()
        #item['date'] = response.xpath("//div[@class='mr-three']/div/span[@class='fl']/text()").extract_first()

        print(item)
        return item


Next, the code in the settings file:

# -*- coding: utf-8 -*-

# Scrapy settings for yg project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yg'

SPIDER_MODULES = ['yg.spiders']
NEWSPIDER_MODULE = 'yg.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yg (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'


# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# LOG_LEVEL = 'WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yg.middlewares.YgSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yg.middlewares.YgDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'yg.pipelines.YgPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


FEED_URI='./storage/data/%(name)s.csv'
FEED_FORMAT='CSV'
FEED_EXPORT_ENCODING='ansi'
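
One note on the three FEED_* settings at the end: they write every spider's items to ./storage/data/<spider name>.csv. In Scrapy 2.1 and later the same export can also be configured through the FEEDS setting; a sketch (the 'utf-8-sig' encoding is my own substitution so the CSV opens cleanly in Excel, it is not from the original settings):

FEEDS = {
    './storage/data/%(name)s.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',
    },
}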

Next, the code in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YgItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
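
The spider yields plain dicts, which is why YgItem is left empty here. If you prefer declared fields, a sketch using the same keys as parse_item could look like this (the spider would then build YgItem(...) instead of a dict):

import scrapy

class YgItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()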


Next, the code in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class YgPipeline:
    def process_item(self, item, spider):
        return item
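
The generated pipeline simply passes items through. As an illustration of where cleanup could live, here is a hypothetical pipeline (not part of the original project) that strips the stray whitespace visible around the author value in the output below; it would also need its own entry in ITEM_PIPELINES to take effect:

class StripWhitespacePipeline:
    # Hypothetical example: trim leading/trailing whitespace from all string fields
    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item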


Finally, the code that launches the spider:



from scrapy import cmdline

# cmdline.execute("scrapy crawl db".split())
cmdline.execute(['scrapy','crawl','cyg'])


There is still one small problem: when extracting the author field, I could never get the poster's name directly. I tried several different XPath expressions and none of them worked, and I don't know why. If anyone knows, please help me out.

{'title': '石碣华润广场到处黑漆漆,强烈要求亮化', 'author': <Selector xpath="//div[@class='mr-three']/div[1]/span[1]/text()" data='米拉多 '>, 'content': '随着梧桐里生活广场的马上开业,目前华润广场那里一到晚上到处都黑漆漆的,而且广场那边灯已经坏了几个月了,也没人去修,华润广场后面全部黑漆漆,没任何灯,很不安全,东风中路的墙体也没安装夜景灯,目前到处是黑漆漆的\r\n♥强烈要求政府对整个华润广场安装夜景灯彩色灯亮化,然后把东风中路一侧的墙体进行亮化,提升石碣中心城区的夜景景观!随着梧桐里生活广场开业,除了合信广场,华润广场那边也会成为石碣另一个商业中心,强烈要求政府改善华润广场的环境,对华润广场安装好彩色灯,把东风中路的墙体跟东风南路一样,进行亮化!'}

It's this part: 'author': <Selector xpath="//div[@class='mr-three']/div[1]/span[1]/text()" data='米拉多 '>. I only want the '米拉多 ' at the end, but the Selector wrapper always shows up in front of it, and I don't understand why.

The page source looks like this:
[screenshot: page source of the detail page]
In it, "自由自在多好啊" is the poster's name, but I could never get it directly.
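
For what it's worth, the Selector wrapper itself comes from the indexing: response.xpath(...) returns a SelectorList, and [1] gives back a Selector object rather than a string. Calling .extract() (or .get()) on that selector, or taking .getall()[1], should leave just the text; a sketch of the line, with a .strip() added only to drop the trailing space seen in the output above:

item['author'] = response.xpath("//div[@class='mr-three']/div[1]/span[1]/text()")[1].extract().strip()
# or, equivalently:
# item['author'] = response.xpath("//div[@class='mr-three']/div[1]/span[1]/text()").getall()[1].strip()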
