Using the Scrapy framework in a real project: a beginner-friendly walkthrough

zyc53

Category: Python. Tags: python

Article link: https://blog.csdn.net/weixin_44943394/article/details/103814221


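I. Workflow steps
Hello World with Scrapy

Create a project
- scrapy startproject XXX
Create a spider
- scrapy genspider YYY domain
  - domain: the domain of the site to crawl
Run the spider
- scrapy crawl YYY
Refine the spider
- Extract the targeted content
- The parse method
  - Argument: response
  - response
    - xpath
      - Write an XPath rule and pass it in
      - It returns the matched content as a list of Selector objects
        - get: returns the first match as a string
        - getall: returns every match as a list of strings
        - extract: the older name for getall
        - extract_first: the older name for get
    - re
    - css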
II. Hands-on code:
1. Install:

pip install scrapy

2. Create the project ZhouWu from the terminal

scrapy startproject ZhouWu

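3. Open the project in PyCharm, configure a virtual environment, and generate the spider file. We will crawl http://lab.scrapyd.cn/; running the command below generates lab.py in the project's spiders directory: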

scrapy genspider lab lab.scrapyd.cn

4. Run this spider

scrapy crawl lab

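5. How the Scrapy architecture works:
[Figure: Scrapy architecture diagram]
Crawl flow: you write the spider in Spiders and configure its start URLs. Those start requests are handed to the Scheduler, which keeps them in a request queue and hands them out one at a time. Each Request corresponds to a resource on the internet; the Downloader fetches it and turns it into a Response, which is passed back to the spider. Anything the spider wants to store is sent on through the Item Pipeline. (The Engine sits in the middle and coordinates these hand-offs.)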
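
The flow described above can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented (the real engine is asynchronous and far more capable); it only mirrors the hand-offs between queue, downloader, spider callback, and storage. The FAKE_SITE data and every function name here are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (quotes on the page, link to the next page)
FAKE_SITE = {
    "http://example.test/page/1": (["quote A", "quote B"], "http://example.test/page/2"),
    "http://example.test/page/2": (["quote C"], None),
}


def download(url):
    """Downloader stand-in: turn a request URL into a response."""
    quotes, next_url = FAKE_SITE[url]
    return {"url": url, "quotes": quotes, "next": next_url}


def parse(response):
    """Spider callback: yield items to store and follow-up requests."""
    for text in response["quotes"]:
        yield {"type": "item", "text": text}
    if response["next"]:
        yield {"type": "request", "url": response["next"]}


def crawl(start_urls):
    scheduler = deque(start_urls)  # the request queue
    seen = set(start_urls)         # simple duplicate filter
    stored = []                    # stands in for the item pipeline
    while scheduler:
        url = scheduler.popleft()          # scheduler hands out the next request
        response = download(url)           # downloader produces a response
        for result in parse(response):     # spider processes the response
            if result["type"] == "item":
                stored.append(result["text"])
            elif result["url"] not in seen:
                seen.add(result["url"])
                scheduler.append(result["url"])
    return stored


print(crawl(["http://example.test/page/1"]))
```

The loop ends when the queue is empty, which is also why a real spider stops once no callback yields further requests.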

6. lab.py, crawling the next page:

# -*- coding: utf-8 -*-
import scrapy


class LabSpider(scrapy.Spider):
    name = 'lab'
    allowed_domains = ['lab.scrapyd.cn']
    start_urls = ['http://lab.scrapyd.cn/']

    def parse(self, response):
        # XPath rule selecting each whole block that holds the quote text, author, and detail link
        quote_posts = response.xpath('//div[contains(@class, "quote post")]')
        # Loop over the blocks and pull out the text, author, and detail link
        for quote_post in quote_posts:
            text = quote_post.xpath('./span[contains(@class, "text")]/text()').get()
            author = quote_post.xpath('./span/small[contains(@class, "author")]/text()').get()
            detail = quote_post.xpath('./span/a/@href').get()
            print(text, author, detail)

        # Link to the next page; get() returns None when there is no such link
        next_url = response.xpath('//li[contains(@class, "next")]/a/@href').get()
        print(next_url)

        if next_url:
            # Build a new request: url is the link to crawl, and callback is the method
            # that receives the response once the download finishes.
            # urljoin also covers the case where the href is relative.
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
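
One detail worth knowing about the next-page request: scrapy.Request needs an absolute URL, so a relative href such as /page/2/ has to be resolved against the current page first, which is what response.urljoin is for. The sketch below shows the resolution itself using the standard library's urljoin, which behaves much the same way; the page URL and hrefs are example values, not output captured from the site.

```python
from urllib.parse import urljoin

# Assumed URL of the page currently being parsed
page_url = "http://lab.scrapyd.cn/page/1/"


def absolute(href):
    """Resolve an href found on the page against the page's own URL."""
    return urljoin(page_url, href)


# A root-relative href is resolved against the site root
print(absolute("/page/2/"))

# An href that is already absolute is returned unchanged
print(absolute("http://lab.scrapyd.cn/page/2/"))
```

Because already-absolute links pass through untouched, resolving every extracted href this way is a safe default.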


7. The data can now be extracted, so the next question is how to store it:
items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class ZhouwuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

# Our own item class, written by following the generated template above:
class LabItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    detail = scrapy.Field()

lab.py, now yielding items that include the detail-page link:

# -*- coding: utf-8 -*-
import scrapy

from ZhouWu.items import LabItem


class LabSpider(scrapy.Spider):
    name = 'lab'
    allowed_domains = ['lab.scrapyd.cn']
    start_urls = ['http://lab.scrapyd.cn/']

    def parse(self, response):
        # XPath rule selecting each whole block that holds the quote text, author, and detail link
        quote_posts = response.xpath('//div[contains(@class, "quote post")]')
        # Loop over the blocks and pull out the text, author, and detail link
        for quote_post in quote_posts:
            text = quote_post.xpath('./span[contains(@class, "text")]/text()').get()
            author = quote_post.xpath('./span/small[contains(@class, "author")]/text()').get()
            detail = quote_post.xpath('./span/a/@href').get()
            print(text, author, detail)

            # With the fields declared in items.py, fill in an item and hand it over with yield
            item = LabItem()
            item['text'] = text
            item['author'] = author
            item['detail'] = detail

            yield item

        # Link to the next page; get() returns None when there is no such link
        next_url = response.xpath('//li[contains(@class, "next")]/a/@href').get()
        print(next_url)

        if next_url:
            # Build a new request: url is the link to crawl, and callback is the method
            # that receives the response once the download finishes.
            # urljoin also covers the case where the href is relative.
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)

callback: once the request completes, the resulting response is passed to the callback function as its argument;
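
The items yielded by parse are handed to the item pipelines, which is where storage usually happens. As a minimal sketch (the class name JsonLinesPipeline and the file name quotes.jl are made up for illustration), a pipeline that writes one JSON object per line could look like this. It would live in the project's pipelines.py and only takes effect once it is listed in the ITEM_PIPELINES setting.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item to a JSON Lines file."""

    def __init__(self, path="quotes.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; ensure_ascii=False keeps
        # Chinese text readable in the output file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so that any later pipeline also receives it
        return item


# Assumed entry in settings.py to enable it (300 is just an ordering value):
# ITEM_PIPELINES = {"ZhouWu.pipelines.JsonLinesPipeline": 300}
```

Returning the item from process_item matters: a pipeline that returns nothing passes None to whatever pipeline runs after it.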
9. Run the command