Scrapy Framework Basics: Scraping Qiushibaike

Latest recommended article published on 2022-11-29 13:38:15

风清俊

Category: Crawlers. Tags: python

Article link: https://blog.csdn.net/weixin_43447957/article/details/105478894

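Basics:
I. Installing the Scrapy framework
pip install scrapy
If the download times out, raise pip's timeout so the install has longer to finish:
pip --default-timeout=2000 install -U scrapy
On Windows, you also need: pip install pypiwin32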

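II. Creating a project and a spider (spider names must be unique within a project)
Create a project: scrapy startproject <project name>
Create a basic spider: go to the project directory (Show in Explorer) and run: scrapy genspider [spider name] [spider domain] # e.g. "qiushibaike.com"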

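III. Project directory structure
items.py: defines the models for the scraped data
middlewares.py: holds the various middlewares
pipelines.py: saves the item models to local disk
settings.py: configuration for this crawler (request headers, how often to send requests, IP proxy pool, and so on)
Settings to change:
Set ROBOTSTXT_OBEY = False. When it is True, Scrapy downloads the site's robots.txt first and drops every request that the file disallows, so the spider may come back with nothing.
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36' # present the crawler as a browser
}
DOWNLOAD_DELAY = 1 # wait 1 s between requests to avoid overloading the site

scrapy.cfg: the project's configuration file
spiders package: every spider you write goes in here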
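
To make the effect of ROBOTSTXT_OBEY more concrete, here is a minimal sketch of the kind of check a robots.txt-aware crawler performs before sending a request. It uses Python's standard urllib.robotparser, not Scrapy's own robots.txt handling, and the robots.txt content, the example.com URLs and the is_allowed helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; this is not the real file of any site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the hypothetical robots.txt above permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("https://example.com/text/page/1/"))
    print(is_allowed("https://example.com/private/data"))
```

With ROBOTSTXT_OBEY = True, requests for which such a check fails are never sent, which is why a spider can finish without scraping anything.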

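IV. Running the spider
Option 1, from cmd: right-click the project, choose Show in Explorer, go into the project directory and run: scrapy crawl <spider name>
Option 2, create a .py file that imports cmdline:

from scrapy import cmdline
cmdline.execute("scrapy crawl spider_name".split())  # replace spider_name with your spider's name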
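
The split() call above only turns the command string into an argument list. As a small sketch, the same list can be built explicitly, which avoids surprises if an argument ever contains a space. The helper name build_crawl_argv and the output file name are hypothetical; -o is the crawl command's option for writing scraped items to a file.

```python
from typing import List, Optional


def build_crawl_argv(spider_name: str, output_file: Optional[str] = None) -> List[str]:
    """Build the argument list for `scrapy crawl` (hypothetical helper)."""
    argv = ["scrapy", "crawl", spider_name]
    if output_file is not None:
        # -o asks Scrapy to write the scraped items to the given file.
        argv += ["-o", output_file]
    return argv


if __name__ == "__main__":
    print(build_crawl_argv("qsbk0412"))
    print(build_crawl_argv("qsbk0412", "duanzi.json"))
```

With Scrapy installed, the returned list can be passed to cmdline.execute() in place of the split string.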

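Exercise
I. Scraping Qiushibaike (https://www.qiushibaike.com/text/page/1/)
Notes:
1. response is a scrapy.http.response.html.HtmlResponse object, so you can run xpath and css queries on it to extract data.
print(type(response)) # <class 'scrapy.http.response.html.HtmlResponse'>
2. The result of a query is a Selector or a SelectorList object. To get the strings out of it, call getall() or get().
3. getall(): returns the text of every match as a list.
4. get(): returns the text of the first match as a str, or None if nothing matched.
5. To hand parsed data to the pipeline, either yield each item as it is built, or collect all the items and return them together at the end.
6. item: define the model in items.py so that you no longer need to pass plain dicts around.
7. pipeline: this is where data gets saved. Three methods are used most often:
* open_spider(self, spider): called when the spider is opened
* process_item(self, item, spider): called for every item the spider passes on
* close_spider(self, spider): called when the spider is closed
To activate a pipeline, set ITEM_PIPELINES in settings.py, for example:
ITEM_PIPELINES = {
'scrapy200406.pipelines.Scrapy200406Pipeline': 300, # 300 is the order value; the lower the value, the earlier the pipeline runs
}
8. JsonItemExporter and JsonLinesItemExporter:
These two classes make saving JSON data simpler.
8.1. JsonItemExporter: writes all items into a single JSON array. The result is valid JSON, but only once finish_exporting() has written the closing bracket, and a reader usually has to load the whole array at once, which costs memory when the dataset is large.
8.2. JsonLinesItemExporter: writes each item to disk as its own line every time export_item is called. The file as a whole is not a single JSON document, but it is convenient to process line by line, uses little memory, and is relatively safe because the lines already written stay usable if the crawl is interrupted.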

# -*- coding: utf-8 -*-

# spider
import scrapy
from scrapy.http.response.html import HtmlResponse
from scrapy.selector.unified import SelectorList
from scrapyall0412.items import Scrapyall0412Item

class Qsbk0412Spider(scrapy.Spider):
    name = 'qsbk0412'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = "https://www.qiushibaike.com"
    page = 1
    all_counts = 0

    def parse(self, response):
        counts = 0
        # xpath() on the response returns a SelectorList
        duanZiDivs = response.xpath("//div[@class='col1 old-style-col1']/div")
        # Each element of the SelectorList is a Selector
        for duanZiDiv in duanZiDivs:
            # default="" avoids calling strip() on None when a div has no <h2>
            authors = duanZiDiv.xpath(".//h2/text()").get(default="").strip()
            contents = duanZiDiv.xpath(".//div[@class='content']//text()").getall()
            comments = duanZiDiv.xpath(".//div[@class='main-text']//text()").get()
            counts += 1
            item = Scrapyall0412Item(authors=authors, contents=contents, comments=comments)
            yield item
        print(f'Page {self.page}: {counts} jokes')
        self.page += 1
        self.all_counts += counts
        # The last pagination link holds the next page's URL; it has no href on the final page
        next_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
        if not next_url:
            print(f'Crawled {self.page - 1} pages, {self.all_counts} jokes in total')
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)

# settings.py (values to change)
ROBOTSTXT_OBEY = False  # do not apply the site's robots.txt rules

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}  # present the crawler as a browser

ITEM_PIPELINES = {
   'scrapyall0412.pipelines.Scrapyall0412Pipeline': 1,
}  # enable the pipeline; the lower the number, the earlier it runs

DOWNLOAD_DELAY = 3  # seconds to wait between requests, to avoid overloading the site

# items.py
import scrapy
class Scrapyall0412Item(scrapy.Item):
    authors = scrapy.Field()
    contents = scrapy.Field()
    comments = scrapy.Field()

# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

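# Use Scrapy's built-in exporter. JsonItemExporter writes all items into one JSON array,
# which is valid JSON once finish_exporting() runs but costly to load back when the dataset is large.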
from scrapy.exporters import JsonItemExporter

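class Scrapyall0412Pipeline(object):
    def __init__(self):
        # Raw string, so the backslashes in the Windows path are not read as escape sequences
        self.fp = open(r"D:\python\数据\爬虫\duanzi0407.json", 'wb')
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("Starting to crawl Qiushibaike...")

    def process_item(self, item, spider):
        # Export the item, then return it so any later pipeline still receives it
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Write the closing bracket of the JSON array before closing the file
        self.exporter.finish_exporting()
        self.fp.close()
        print("Finished crawling Qiushibaike...")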
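
To make the difference between the two output layouts from note 8 concrete, here is a minimal sketch that reproduces them with Python's standard json module. It does not use Scrapy's exporters, and the function names and sample items are hypothetical.

```python
import io
import json
from typing import Dict, List, TextIO


def write_json_array(items: List[Dict], fp: TextIO) -> None:
    """Single JSON array: only valid once the closing bracket is written."""
    fp.write("[")
    for index, item in enumerate(items):
        if index:
            fp.write(",\n")
        fp.write(json.dumps(item, ensure_ascii=False))
    fp.write("]")


def write_json_lines(items: List[Dict], fp: TextIO) -> None:
    """JSON Lines: one complete JSON object per line."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(text: str) -> List[Dict]:
    """Parse JSON Lines text one line at a time."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical items with the same fields as Scrapyall0412Item.
items = [
    {"authors": "Alice", "contents": ["first joke"], "comments": None},
    {"authors": "Bob", "contents": ["second joke"], "comments": "nice"},
]

array_buffer = io.StringIO()
write_json_array(items, array_buffer)

lines_buffer = io.StringIO()
write_json_lines(items, lines_buffer)

if __name__ == "__main__":
    print(array_buffer.getvalue())
    print(lines_buffer.getvalue())
```

The array layout has to be parsed as a whole, while each line of the JSON Lines layout can be parsed on its own, so a file cut off partway through still yields every complete line.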