Image scraping with Scrapy and Splash

Latest recommended article published on 2021-05-27 03:13:07

pp_lan

Category: python

Article link: https://blog.csdn.net/pp_lan/article/details/102620672

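Environment:

The crawler is built on the Scrapy framework and uses Splash to render pages. Without rendering, content that many sites load asynchronously never reaches the crawler and cannot be scraped.

Scrapy installation (downloading the packages and installing offline is enough): https://blog.csdn.net/pp_lan/article/details/90642614

Splash installation (a more involved process): https://blog.csdn.net/pp_lan/article/details/90692510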
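
For Scrapy to route requests through Splash, the project settings also need to point at the Splash instance and enable the scrapy-splash middlewares. The fragment below is a minimal sketch of such a settings.py, following the layout recommended in the scrapy-splash README. The Splash address `http://localhost:8050` is an assumption (a local instance on the default port), not something stated in this post; adjust it to wherever Splash is actually running.

```python
# settings.py (sketch): wiring scrapy-splash into a Scrapy project.
# Assumption: a Splash instance is listening on localhost:8050.
SPLASH_URL = "http://localhost:8050"

# Middlewares that send requests to Splash and handle its cookies.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes the Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```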

Scraping the images

Core code:

# Scrape images
import os
import re
import urllib.request  # a bare "import urllib" does not guarantee urllib.request is loaded

import scrapy
from scrapy_splash import SplashRequest


class AutoHomeImgSpider(scrapy.Spider):
    name = 'autohome_img_spider'
    # The target site is served from autohome.com.cn, not autohome.com
    allowed_domains = ['autohome.com.cn']

    def start_requests(self):
        img_url = "https://club.autohome.com.cn/bbs/thread/b25da065245156d4/83662885-1.html#pvareaid=2592101"
        # Render the page through Splash: wait 2 s for async content, time out after 10 s
        yield SplashRequest(img_url,
                            callback=self.parse_config,
                            args={'wait': 2, 'timeout': 10})

    def parse_config(self, response):
        img_sum = 0
        bad_img = 0
        # Capture the src attribute of every <img> tag whose name is "F06"
        href_cmp = re.compile(r'<img.*?name="F06".*?src="(.*?)".*?>')
        href_list = href_cmp.findall(response.text)
        drive = "F:\\pyworkspace\\mySpider\\img"
        os.makedirs(drive, exist_ok=True)
        for href in href_list:
            # Protocol-relative URLs (//host/path) need an explicit scheme
            if not href.startswith(("http://", "https://")):
                href = 'http:' + href
            # The file name is everything after the last "/"
            image_name = href.rsplit("/", 1)[-1]
            try:
                urllib.request.urlretrieve(href, os.path.join(drive, image_name))
                img_sum += 1
                print(image_name + "    OK")
            except Exception:
                # Count every failed download, whichever form the URL had
                bad_img += 1
                print("cannot download this image:" + image_name)
                print(href)
        print("Success:", img_sum, "    Failed:", bad_img)
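
The URL handling in the callback can be tried out without Scrapy or Splash by pulling it into plain functions. The sketch below is one way to do that, not part of the original spider: `extract_image_urls`, `image_name`, and the `img.example.com` sample HTML are hypothetical names and data introduced for illustration. It differs from the spider in two labeled ways: it only adds a scheme to `//`-prefixed URLs and skips anything else that is not http(s), and it drops the query string from the file name.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the spider: src of <img> tags whose name attribute is "F06".
IMG_PATTERN = re.compile(r'<img.*?name="F06".*?src="(.*?)".*?>')


def extract_image_urls(html):
    """Return absolute image URLs found in html (hypothetical helper)."""
    urls = []
    for src in IMG_PATTERN.findall(html):
        # Protocol-relative URL: prepend a scheme, as the spider does.
        if src.startswith("//"):
            src = "http:" + src
        # Assumption: anything still lacking an http(s) scheme is skipped.
        if src.startswith(("http://", "https://")):
            urls.append(src)
    return urls


def image_name(url):
    """Return the last path segment of url, ignoring any query string."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


# Made-up sample markup standing in for a rendered page.
sample_html = (
    '<img name="F06" src="//img.example.com/a/b/photo1.jpg">'
    '<img name="F06" src="https://img.example.com/c/photo2.png?x=1">'
    '<img name="other" src="https://img.example.com/skip.gif">'
)

for url in extract_image_urls(sample_html):
    print(image_name(url), url)
```

Keeping the parsing separate from the download step makes it easier to check the regular expression against a saved copy of the rendered page before running the full crawl.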

Launch:

from scrapy.cmdline import execute

execute(["scrapy", "crawl", "autohome_img_spider"])

Sample result:
