Scrapy Crawler Framework in Practice, Part 2: Scraping Images

Author: 极客✌

Tags: python, xpath

Article link: https://blog.csdn.net/qq_43469111/article/details/104768532

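Target URL: http://desk.zol.com.cn/bizhi/8673_106969_2.html
(Inspecting the page structure in the browser is skipped here.)
1. Create the crawler project
In a terminal, run: scrapy startproject tupian
2. Generate a new Python file for the spider (run this inside the new tupian directory), then write your own spider in it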

scrapy genspider zol desk.zol.com.cn

3. Enable the image pipeline by editing the settings file

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'

ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'tupian.pipelines.ImagePipline': 300,  # use this one for now
}


IMAGES_STORE = 'E:/nginx-1.17.8/nginx-1.17.8/html/images'  # where downloaded images are saved; you need to add this setting yourself
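The IMAGES_STORE path above is specific to the author's machine. As a minimal sketch, assuming you would rather keep the images in an images folder next to the project, the path could be derived instead of hard-coded. The helper name default_images_store and the folder name images are hypothetical and are not part of Scrapy or of the original article.

```python
from pathlib import Path


def default_images_store(project_dir):
    """Return the path of an 'images' folder inside project_dir as a string.

    Hypothetical helper: the folder name 'images' is an assumption.
    """
    return str(Path(project_dir) / "images")


# Possible use in settings.py (an assumption, not the author's setup):
# IMAGES_STORE = default_images_store(Path(__file__).resolve().parent)
```

Any writable directory works for IMAGES_STORE; the point is only to avoid a drive letter that exists on one machine.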

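4. Open zol.py and change start_urls:
start_urls = ['http://desk.zol.com.cn/bizhi/8673_106969_2.html']
5. Use XPath to extract the image URL, the image name, and the URL of the next page. Pass the item to pipelines.py with yield, and check whether the last page has been reached.
Right-click the page, choose Inspect, and locate the "next page" link.

The code is as follows: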


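    # Method of the spider class in zol.py
    def parse(self, response):
        # Full-size image URL(s) and the page heading, which is used as the image name
        image_urls = response.xpath('//img[@id="bigImg"]/@src').extract()
        image_name = response.xpath('string(//h3)').extract_first()
        yield {
            "image_urls": image_urls,
            "image_name": image_name
        }
        # Follow the "next page" link; stop when it is missing or does not point to an .html page
        next_url = response.xpath('//a[@id="pageNext"]/@href').extract_first()
        if next_url and '.html' in next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)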
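The last-page check can be tried out on its own, without running a crawl. The following is a minimal sketch using only the standard library; the function name resolve_next_page and the sample href values are hypothetical, and what the site actually puts in the href on the last page is an assumption.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Return the absolute URL of the next page, or None on the last page.

    href is None when the XPath matched nothing; an href without '.html'
    is treated as "no next page", mirroring the check in parse().
    """
    if not href or ".html" not in href:
        return None
    return urljoin(current_url, href)


base = "http://desk.zol.com.cn/bizhi/8673_106969_2.html"
print(resolve_next_page(base, "/bizhi/8673_106970_2.html"))  # hypothetical href
print(resolve_next_page(base, None))
```

Inside the spider, response.urljoin does the same job as urljoin here, using the response URL as the base.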

6. Write pipelines.py
Create a new class that inherits from ImagesPipeline.
The code is as follows:

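import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImagePipline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request every image URL and carry the image name along in meta
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={"image_name": item['image_name']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # item is keyword-only and is passed by newer Scrapy versions
        file_name = request.meta['image_name'].strip().replace('\r\n\t\t', '') + '.jpg'  # remove unwanted whitespace characters
        return file_name.replace('/', '_')  # '/' is a path separator and would otherwise create many folders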
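The file name cleanup in file_path can be checked in isolation. Below is a minimal sketch of a slightly more general variant: it collapses any run of whitespace instead of removing only the exact sequence '\r\n\t\t', so its output is not identical to the pipeline above. The function name build_file_name and the sample title are hypothetical.

```python
import re


def build_file_name(raw_name, ext=".jpg"):
    """Turn a scraped heading into a flat file name.

    Collapses whitespace runs (spaces, \\r, \\n, \\t) into one space and
    replaces path separators so that no subfolders are created.
    """
    name = re.sub(r"\s+", " ", raw_name).strip()
    name = name.replace("/", "_").replace("\\", "_")
    return name + ext


print(build_file_name("  Sample wallpaper\r\n\t\t(1/8) "))
```

If every page of a gallery has the same heading, the files will overwrite each other, so the heading needs some per-page part (such as a page counter) to stay unique.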

7. Create a main file in the tupian directory and add the startup code:

from scrapy.cmdline import execute
execute("scrapy crawl zol".split())
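The scrapy crawl command only finds the zol spider when it runs from inside the project, that is, from a directory at or below the one containing scrapy.cfg. As a minimal sketch, assuming the main file might be started from a different working directory, the project root can be located first. The helper name find_scrapy_project_root is hypothetical and is not part of Scrapy.

```python
from pathlib import Path


def find_scrapy_project_root(start):
    """Walk up from start until a directory containing scrapy.cfg is found.

    Returns the directory as a Path, or None if no scrapy.cfg is found.
    """
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / "scrapy.cfg").is_file():
            return candidate
    return None


# Possible use in the main file (an assumption, not the author's code):
# import os
# from scrapy.cmdline import execute
# root = find_scrapy_project_root(__file__)
# if root is not None:
#     os.chdir(root)
# execute("scrapy crawl zol".split())
```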

8. Run the main file to scrape the images