【Web scraping with the Scrapy framework: Atguigu (尚硅谷) study notes, part two: scraping 电影天堂 (Movie Heaven), basic steps】

喜欢下雨t

Last modified 2024-05-14 19:47:11


Category column: Web scraping study. Article tags: web scraping, scrapy, study, python

First published 2024-05-14 19:45:37

Article link: https://blog.csdn.net/weixin_45753504/article/details/138862731


Scraping 电影天堂 (Movie Heaven) with the Scrapy framework: the logic behind the multi-page crawl functions

(1) Create the spider file


(2) Check that the URL is correct


(3) Check for anti-scraping measures

(3.1) Add a short print statement to check whether the site blocks the crawler


(3.2) Check the result

scrapy crawl mv


(4) Requirements analysis and writing the functions

Get the movie name
Get the movie image

(4.1) Define the data structure in items.py


(4.2) Analyze the site's XPath structure


Write the functions in mv.py:

(4.2.1) Get the name from the first page and the URL to request for the second page


(4.2.2) Complete the second-page URL and write the request function

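The href read from the list page is relative, so it has to be turned into an absolute URL before it can be requested. The sketch below compares plain string concatenation with urljoin from the standard library. The href values are hypothetical examples, not links copied from the site.

```python
from urllib.parse import urljoin

# URL of the list page, as given in start_urls.
list_page = "https://www.dyttcn.com/xijupian/list_4_1.html"

# Hypothetical href values; real links on the site may look different.
root_relative = "/xijupian/1001.html"
page_relative = "1001.html"

# String concatenation is only correct for root-relative links.
concatenated = "https://www.dyttcn.com" + root_relative

# urljoin resolves both kinds of link against the page URL.
joined_root = urljoin(list_page, root_relative)
joined_page = urljoin(list_page, page_relative)

print(concatenated)
print(joined_root)
print(joined_page)
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the response.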

(4.2.3) Complete code:

import scrapy
from scrapy_movie_99.items import ScrapyMovie99Item


class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["www.dyttcn.com"]
    # start_urls = ["https://www.dyttcn.com/"]
    start_urls = ["https://www.dyttcn.com/xijupian/list_4_1.html"]

    def parse(self, response):
        # XPath expression selecting the title links on the list page
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[3]')

        for a in a_list:
            # Movie name on the first page and the link to follow
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()

            # Absolute URL of the second (detail) page
            url = 'https://www.dyttcn.com' + href
            # Request the second page and pass the name along through meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        # Image URL on the detail page
        src = response.xpath('//div[@id="Zoom"]//div/img/@src').extract_first()
        # print(src)
        # Read back the name that was passed through the request's meta
        name = response.meta['name']

        # Wrap both values in the item class imported from items.py
        movie = ScrapyMovie99Item(src=src, name=name)
        yield movie

(5) Enable the pipeline

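Pipelines are switched on in the project's settings.py through the ITEM_PIPELINES setting. A sketch of the fragment, assuming the project package is named scrapy_movie_99 (as in the spider's import) and the pipeline class is the ScrapyMovie99Pipeline from these notes:

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # "package.module.ClassName": priority; lower numbers run first
    "scrapy_movie_99.pipelines.ScrapyMovie99Pipeline": 300,
}
```

The project template generated by Scrapy already contains this entry in commented-out form, so enabling the pipeline usually means uncommenting it.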

(6) Implement the pipeline (write the data)

The code is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyMovie99Pipeline:
    # Open the output file once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('movie.json', 'w', encoding='utf-8')

    # Write each item to the file
    def process_item(self, item, spider):
        # str(item) writes the item's Python representation back to back,
        # so the file is not strictly valid JSON despite its extension
        self.fp.write(str(item))
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.fp.close()
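If the output needs to be read back by a JSON parser, one option is to write one JSON object per line. The sketch below is a variation on the pipeline above, not the version from the original notes. The class name JsonLinesMoviePipeline and the file name movie.jsonl are hypothetical.

```python
import json


class JsonLinesMoviePipeline:  # hypothetical name, not from the original notes
    def open_spider(self, spider):
        # Assumed output file; one JSON object will be written per line
        self.fp = open("movie.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts;
        # ensure_ascii=False keeps Chinese titles readable in the file
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
```

To use a pipeline like this, its dotted path would have to be registered in ITEM_PIPELINES in the same way as the original one.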