Crawling all images on a website with Scrapy

wly2014

Category: Python. Tags: python, crawler, scrapy, images

Article link: https://blog.csdn.net/u014271114/article/details/53080447

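The main logic of the code is:

1. Starting from start_url, download the page, extract the image links in it with a regular expression, and extract the page links in <a> tags with XPath.
2. For each image link, first check whether it has already been collected (de-duplication). If not, expand it into a full URL and save it to img.txt, so that a separate download tool can fetch the images faster. (The images are deliberately not downloaded from Python; this makes it easier to debug and to check that the filtering rules are correct.)
3. For each extracted page link, de-duplicate it first, then check whether it is a complete URL and complete it if necessary, and finally issue a new request: yield Request(url, callback=self.parse). The extracted links and the URL of the current page typically come in the following combinations: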

Current page URL format	Extracted link format
http://www.meizu.com	http://www.meizu.com/index.html
http://www.meizu.com/	/index.html
http://www.meizu.com/index.html	index.html
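
The combinations in the table can be checked with the standard library. The following is a minimal sketch, not the code used in the article: it assumes Python 3 and uses `urllib.parse.urljoin` to resolve each extracted link against the page it was found on. The helper name `resolve` and the fourth test case (a page in a subdirectory) are made up for illustration.

```python
from urllib.parse import urljoin


def resolve(page_url, link):
    """Return the absolute URL for a link found on page_url."""
    return urljoin(page_url, link)


# (current page URL, extracted link) pairs; the first three come from the table.
cases = [
    ("http://www.meizu.com", "http://www.meizu.com/index.html"),   # already absolute
    ("http://www.meizu.com/", "/index.html"),                      # root-relative
    ("http://www.meizu.com/index.html", "index.html"),             # relative to the page
    ("http://www.meizu.com/products/list.html", "detail.html"),    # hypothetical example
]

for page_url, link in cases:
    print(page_url, "+", link, "->", resolve(page_url, link))
```

Inside a Scrapy callback, `response.urljoin(link)` does the same resolution with the response URL as the base, so the manual string handling for each case is not strictly necessary.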

Spider code

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request

from ImgInWebsite.items import ImgItem


class ImgSpider(scrapy.Spider):
    name = "ImgSpider"
    allowed_domains = ["www.meizu.com"]
    start_urls = ['http://www.meizu.com/']

    # Raw link strings that have already been seen (used for de-duplication).
    URL = set()
    IMG = set()

    # Matches absolute (http/https) or root-relative paths ending in .jpg/.png.
    # The hyphen is placed last in the character class so that it is a literal
    # hyphen rather than a range from '@' to '_'.
    IMG_PATTERN = re.compile(r'(?:https?://|/)[0-9a-zA-Z/.@_%-]*?\.(?:jpg|png)')

    def parse(self, response):
        self.logger.info(response.url)

        # Extract image links.
        for img in self.IMG_PATTERN.findall(response.text):
            if img in self.IMG:
                continue
            self.IMG.add(img)

            # urljoin handles absolute, root-relative and relative links,
            # whether or not the page URL ends with '/'.
            imgItem = ImgItem()
            imgItem['img_url'] = response.urljoin(img)
            yield imgItem

        # Extract the page links to crawl next.
        for url in response.xpath('//a/@href').extract():
            if url in self.URL:
                continue
            self.URL.add(url)

            # Build the full URL and skip non-HTTP links such as
            # "javascript:" or "mailto:", which cannot be requested.
            url = response.urljoin(url)
            if not url.startswith(('http://', 'https://')):
                continue

            yield Request(url, callback=self.parse)
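
One pitfall is worth knowing when writing an image pattern like the one in the spider: inside a character class, a hyphen between two characters, as in `@-_`, denotes a range (here from `@` to `_`), not a literal hyphen, so file names containing `-` are not matched. The sketch below illustrates the difference outside of Scrapy, assuming Python 3. The names `RANGE_PATTERN`, `LITERAL_PATTERN` and `extract_image_links`, and the sample HTML, are made up for illustration.

```python
import re

# '@-_' is a range (0x40 to 0x5F), which does not contain '-'.
RANGE_PATTERN = re.compile(r'(?:https?://|/)[0-9a-zA-Z/.@-_%]*?\.(?:jpg|png)')
# With the hyphen placed last, it is a literal character of the class.
LITERAL_PATTERN = re.compile(r'(?:https?://|/)[0-9a-zA-Z/.@_%-]*?\.(?:jpg|png)')


def extract_image_links(html):
    """Return the unique image links in html, in order of first appearance."""
    seen = set()
    links = []
    for link in LITERAL_PATTERN.findall(html):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


sample = (
    '<img src="/static/phone-1.png">'
    '<img src="http://www.meizu.com/img/a_b.jpg">'
    '<img src="/static/phone-1.png">'
)

print(RANGE_PATTERN.search('/static/phone-1.png'))   # None: the hyphen is not matched
print(extract_image_links(sample))
```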

Project address:

https://github.com/wly2014/ImageSpider
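
The article does not show how the yielded items end up in img.txt. One common way to do this in Scrapy is an item pipeline; the following is a minimal sketch under that assumption, not necessarily what the linked project does. The class name `ImgTxtPipeline` is hypothetical, and the only thing taken from the article is the `img_url` field and the img.txt file name.

```python
class ImgTxtPipeline:
    """Hypothetical item pipeline: write each item's img_url to a text file."""

    def __init__(self, path='img.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # One URL per line, so a download tool can read the file directly.
        self.file.write(item['img_url'] + '\n')
        return item
```

To take effect, such a class would have to be registered in the project's `ITEM_PIPELINES` setting, for example under a module path like `ImgInWebsite.pipelines.ImgTxtPipeline` (the module path is an assumption).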
