Python: Scraping xiaohuar.com (校花网) and Batch-Downloading Images, a Classic Scrapy Primer

码点

Article link: https://blog.csdn.net/qq_31939617/article/details/85212997

Scraping xiaohuar.com and batch-downloading images: an introduction to the Scrapy framework
Source code download:
https://download.csdn.net/download/qq_31939617/10886020

First, the result:
[screenshot]

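1. I'll skip the environment setup.
Create the project: open cmd in your working directory and run scrapy startproject XiaoHua (the code below imports from xiao.items, so the project used there is evidently named xiao).
[screenshot]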

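2. With the project created, change into its directory and create the spider, for example with scrapy genspider xiaohuar www.xiaohuar.com (spider name and domain as in the code below).

[screenshot]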

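3. Open the project in PyCharm. The directory structure:
[screenshot]

5. Run it once first (scrapy crawl xiaohuar):

The response status is 200, so the request succeeded.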

6. A special note on the robots protocol, which is configured in settings.py:
[screenshot]

ROBOTSTXT_OBEY = True

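robots.txt is a file that follows the Robots Exclusion Protocol. It is stored on the website's server and tells crawlers which paths the site does not want crawled or indexed. With this setting enabled, Scrapy fetches the site's robots.txt as soon as it starts and uses it to decide which parts of the site it may crawl.

Of course, we are not building a search engine, and sometimes the content we want is exactly what robots.txt forbids. In those cases we set this option to False and stop obeying the robots protocol.

A newly created project obeys robots.txt by default, and with it enabled some requests may not go through. (Our request above returned 200, so it succeeded.) The suggestion here is to change it to: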

ROBOTSTXT_OBEY = False
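
To make the robots.txt check described above more concrete, here is a minimal sketch that uses Python's standard urllib.robotparser rather than Scrapy's own middleware. The rules in SAMPLE_ROBOTS_TXT and the helper name is_allowed are made up for illustration; they are not the real robots.txt of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not taken from a real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "http://example.com/hua/list-1-0.html"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "http://example.com/private/a.html"))
```

With ROBOTSTXT_OBEY = True, Scrapy performs an equivalent check for every request and drops the disallowed ones; with False, the check is skipped.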

7. Now look at the data again:

[screenshot]

[screenshot]
All the data is there.

8. The code:
xiaohuar.py

# -*- coding: utf-8 -*-
import scrapy

from xiao.items import XiaoItem


class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['www.xiaohuar.com']

    # Base URL used to build the link for each page
    url = 'http://www.xiaohuar.com/list-1-'
    # Current page number (the first page is 0)
    page = 0
    # URL of the first page to crawl
    start_urls = ['http://www.xiaohuar.com/hua/list-1-0.html']

    def parse(self, response):
        # Select every profile card on the page
        div_list = response.xpath('//div[@class="item masonry_brick"]')
        # Walk through the cards and extract the fields we need
        for div in div_list:
            # Create an item object (the class defined in items.py)
            item = XiaoItem()

            image_url = div.xpath('./div[@class="item_t"]/div[@class="img"]/a/img/@src').extract_first()
            # Skip cards that have no image
            if not image_url:
                continue
            # One entry's image URL (周半仙) ends in .php; swap the extension for .jpg
            if image_url.endswith('.php'):
                image_url = image_url.replace('.php', '.jpg')
            # Resolve relative URLs against the page URL; absolute URLs stay unchanged
            image_url = response.urljoin(image_url)

            # Name (on this page it sits in the span with class "price")
            name = div.xpath('./div[@class="item_t"]/div[@class="img"]/span[@class="price"]/text()').extract_first()

            # School
            school = div.xpath('./div[@class="item_t"]/div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
            # Number of likes
            like = div.xpath('./div[@class="item_b clearfix"]/div[@class="items_likes fl"]/em/text()').extract_first()


            # Store the extracted values on the item
            item['image_url'] = image_url
            item['name'] = name
            item['school'] = school
            item['like'] = like
            # Hand the item over to the pipeline
            yield item

        # After finishing one page, request the next one (pages 0 to 43)
        self.page += 1
        if self.page <= 43:
            url = self.url + str(self.page) + '.html'
            # Send the next request and handle its response with this same method
            yield scrapy.Request(url=url, callback=self.parse)

9. The pipeline:
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import os
import urllib.request


class XiaoPipeline(object):
    # Open the output file in the constructor
    def __init__(self):
        # Opening the file here means it happens only once
        self.fp = open('xiaohuar.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        # Convert the item to a plain dict
        obj = dict(item)

        # Download the image to local disk.
        # Folder for the downloaded images (adjust to your machine)
        img_dir_path = "E:/python/python_work/xiao/image"
        # Create the folder if it does not exist yet
        os.makedirs(img_dir_path, exist_ok=True)
        # File extension of the image, e.g. ".jpg"
        suffix = os.path.splitext(obj['image_url'])[-1]

        # Build the file name as like_school_name plus the extension;
        # str() guards against fields that came back as None
        filename = '_'.join(str(obj[key]) for key in ('like', 'school', 'name')) + suffix

        # Full path of the file to write
        filepath = os.path.join(img_dir_path, filename)
        # Download the image
        urllib.request.urlretrieve(obj['image_url'], filepath)

        # Serialize the dict and write it as one JSON line
        string = json.dumps(obj, ensure_ascii=False)
        self.fp.write(string + '\n')

        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.fp.close()
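
The pipeline builds each file name directly from scraped text, which can contain characters that are not valid in file names (a slash or a colon, for instance). A small sketch of one way to guard against that; build_filename and the "unknown" placeholder are hypothetical names, not part of the original project:

```python
import re

# Characters Windows rejects in file names, plus whitespace
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\s]+')


def build_filename(like, school, name, suffix):
    """Join the three fields with underscores and append the extension."""
    parts = []
    for value in (like, school, name):
        # A missing field becomes a fixed placeholder instead of raising
        text = "unknown" if value is None else str(value).strip()
        # Collapse each run of invalid characters into one underscore
        text = _INVALID_CHARS.sub("_", text)
        parts.append(text or "unknown")
    return "_".join(parts) + suffix


if __name__ == "__main__":
    print(build_filename("12", "Some School", None, ".jpg"))
```

In process_item this could stand in for the manual concatenation, for example filename = build_filename(obj['like'], obj['school'], obj['name'], suffix).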

10. Enable the pipeline:
[screenshot]
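
The screenshot presumably showed the ITEM_PIPELINES entry in settings.py. A sketch of what that entry would look like, assuming the package is named xiao (as in the spider's import) and using 300, the priority value from Scrapy's project template:

```python
# settings.py (fragment)
# Register the pipeline so Scrapy passes every yielded item to it.
# The dotted path assumes the package name "xiao"; adjust it to your project.
ITEM_PIPELINES = {
    'xiao.pipelines.XiaoPipeline': 300,
}
```

The number sets the order when several pipelines are enabled: lower values run first.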
Thanks to this blog post: https://blog.csdn.net/haeasringnar/article/details/82289095