1. Creating the Project
In a cmd window, enter:
scrapy startproject <project_name>
On success, Scrapy generates the project files in the corresponding directory.
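For a project named zongheng (the name used throughout this post), the generated layout typically looks like this:

zongheng/
    scrapy.cfg            # deploy configuration
    zongheng/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders go here
            __init__.py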
2. Creating the Spider
cd <project_name> to move into the folder just generated, then run:
scrapy genspider <spider_name> <target_domain>
This generates a <spider_name>.py file under the spiders directory.
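For the Zongheng example used in the rest of this post, that is:

cd zongheng
scrapy genspider zh zongheng.com

which creates spiders/zh.py.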
3. Inside the Spider File
import scrapy


class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['zongheng.com']
    start_urls = ['http://zongheng.com/']

    def parse(self, response):
        pass
The generated <spider_name>.py file contains roughly the content above:
- A spider class that inherits from scrapy.Spider
- The name attribute: the spider's name, which must be unique within the project; later crawls are started by referring to it
- The allowed_domains attribute: the domain(s) being crawled; the spider will only follow links within them
- The start_urls attribute: the initial URL(s) to request; these are not restricted by allowed_domains
- The parse method: parses the crawled page's data from the response
From there you can yield items to the pipeline for persistence, or construct new requests.
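For instance, a minimal sketch (example.com and the XPaths are placeholders, not the real target) of a parse method that yields both an item and a follow-up request:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # yielding a dict (or an Item) sends it to the pipelines
        yield {'title': response.xpath('//title/text()').get()}
        # yielding a Request schedules another page for crawling
        next_href = response.xpath('//a/@href').get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href),
                                 callback=self.parse)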
4. Writing the Code
This time we will scrape a novel from the Zongheng site.
Target page:
http://book.zongheng.com/showchapter/1022044.html
zh.py
import scrapy

from zongheng.items import ZonghengItem


class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['zongheng.com']
    start_urls = ['http://book.zongheng.com/showchapter/1022044.html']

    def parse(self, response):
        # the book's title, taken from the chapter-list page
        book_title = response.xpath('//div[@class="book-meta"]/h1/text()').get()
        # links to the individual chapter pages
        hrefs = response.xpath("//li[@class=' col-4']/a/@href").getall()
        for href in hrefs:
            item = ZonghengItem()
            item['book_title'] = book_title
            yield scrapy.Request(
                href,
                callback=self.detail_parse,
                meta={'item': item},
            )

    def detail_parse(self, response):
        # retrieve the partially filled item passed through meta
        item = response.meta['item']
        item['chapter_title'] = response.xpath("//div[@class='title_txtbox']/text()").get()
        item['info'] = response.xpath("//div[@class='content']/p/text()").getall()
        item['info'] = '\n'.join(item['info']).strip()
        yield item
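As an aside: on Scrapy 1.7+ the same hand-off can be done with cb_kwargs instead of meta, which delivers the data as a plain keyword argument. A sketch of how the two methods above would change:

    def parse(self, response):
        ...
        yield scrapy.Request(
            href,
            callback=self.detail_parse,
            cb_kwargs={'item': item},
        )

    def detail_parse(self, response, item):
        # item arrives directly as a keyword argument
        ...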
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ZonghengItem(scrapy.Item):
    book_title = scrapy.Field()
    chapter_title = scrapy.Field()
    info = scrapy.Field()
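An Item behaves like a dict, except that only declared fields may be assigned; a quick sketch:

item = ZonghengItem()
item['book_title'] = 'Some Book'  # OK: declared field
item.get('info')                  # dict-style access works too
item['author'] = 'X'              # KeyError: 'author' is not a declared field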
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os


class ZonghengPipeline:
    def process_item(self, item, spider):
        book_title = item['book_title']
        # one folder per book
        if not os.path.exists(book_title):
            os.mkdir(book_title)
        chapter_title = item['chapter_title']
        info = item['info']
        # write each chapter to its own file inside the book's folder
        with open(os.path.join(book_title, chapter_title + '.txt'), 'w',
                  encoding='utf-8') as fp:
            fp.write(info)
        return item
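Opening a new file per item is fine here, since every chapter goes to its own file. For a resource shared across items (one combined output file, a database connection), pipelines also provide open_spider/close_spider hooks; a minimal sketch (SingleFilePipeline and all_chapters.txt are made-up names):

class SingleFilePipeline:
    def open_spider(self, spider):
        # runs once when the spider starts
        self.fp = open('all_chapters.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['info'] + '\n')
        return item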
Also, enable the pipeline by uncommenting ITEM_PIPELINES in settings.py (the integer, 300 here, sets the pipeline's order; lower values run first):
ITEM_PIPELINES = {
    'zongheng.pipelines.ZonghengPipeline': 300,
}
5. A Few Notes
- In the spider file, whether you construct the next request with scrapy.Request or send an item to the pipelines, you must use yield; the callbacks are generator functions
- In scrapy.Request, callback specifies the next parse function, and meta passes a dict of data along to it
- Inside a parse function, response supports xpath directly for locating elements, but you need get() or getall() to actually extract the values (see the example after this list)
- items.py defines the structure of the scraped data; the Item class can be referenced from both the spider and the pipelines
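For the get()/getall() point, a quick illustration with a standalone Selector:

from scrapy import Selector

sel = Selector(text='<ul><li>a</li><li>b</li></ul>')
sel.xpath('//li/text()').get()      # 'a' -- the first match, or None
sel.xpath('//li/text()').getall()   # ['a', 'b'] -- every match, as a list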
6. Running the Spider
# scrapy crawl <spider_name>
scrapy crawl zh
The result: one .txt file per chapter, saved under a folder named after the book.
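Alternatively, the spider can be launched from a Python script rather than the command line; a minimal sketch, assuming it is run from the project root so settings.py is picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('zh')  # the spider's name attribute
process.start()      # blocks until the crawl finishes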