Scraping Google Images

Wenweno0o

Category: Data Analysis and Mining. Tags: python, web scraping

Article link: https://blog.csdn.net/wenweno0o/article/details/121487706

Python Web Scraping: Scraping Google Images

Preface

For work, I needed to collect a large number of images from the web. (Using Google Images requires your own VPN or proxy.)

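Libraries used

Install the third-party packages with pip (selenium and beautifulsoup4); time, urllib.request, and os ship with Python.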

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import urllib.request
from bs4 import BeautifulSoup as bs
import os
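
Of the imports above, selenium and bs4 are the two that do not ship with Python, so it can help to confirm they are installed before launching the crawler. The following is a minimal sketch of such a check; `missing_packages` is a hypothetical helper name introduced here for illustration and is not part of the original script.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be located for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "selenium" and "bs4" are the import names used by the script.
    missing = missing_packages(["selenium", "bs4"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If anything is reported missing, install the corresponding package with pip (the bs4 module is provided by the beautifulsoup4 package).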

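Required driver

Chrome browser driver: chromedriver
Download: http://chromedriver.storage.googleapis.com/index.html
Download the driver version that matches your browser. You can find the browser version in the browser's settings.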
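
The version check described above can be sketched as a small comparison. This sketch assumes that agreement on the leading (major) version number is what matters when pairing a driver with a browser; that rule, the helper names `major_version` and `versions_match`, and the version strings are all illustrative assumptions rather than something stated in the original post.

```python
def major_version(version):
    """Return the leading (major) component of a dotted version string as an int."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version, driver_version):
    """Assumption: a driver suits a browser when their major versions are equal."""
    return major_version(browser_version) == major_version(driver_version)


if __name__ == "__main__":
    # Illustrative version strings only; substitute the ones you see locally.
    print(versions_match("96.0.4664.45", "96.0.4664.35"))
    print(versions_match("96.0.4664.45", "95.0.4638.69"))
```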

Implementation

- Code walkthrough

# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import os

# ****************************************************
base_url_part1 = 'https://www.google.com/search?q='
base_url_part2 = '&source=lnms&tbm=isch'                  # base_url_part1 and base_url_part2 are fixed; no need to change them
search_query = '******'                                   # search keyword; replace with the term you want to search for
location_driver = r'***\***\chromedriver'                 # path to the chromedriver executable on your machine


class Crawler:
    def __init__(self):
        self.url = base_url_part1 + search_query + base_url_part2

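    # Start the Chrome browser driver
    def start_brower(self):
        chrome_options = Options()
        chrome_options.add_argument("--disable-infobars")
        # Launch Chrome. Note: executable_path and chrome_options are the Selenium 3 keyword
        # arguments; Selenium 4 uses service=Service(location_driver) and options=chrome_options.
        driver = webdriver.Chrome(executable_path=location_driver, chrome_options=chrome_options)
        # Maximize the window, because each pass only sees the images inside the viewport
        driver.maximize_window()
        # Open the search results page in the browser
        driver.get(self.url)
        return driver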

    def downloadImg(self, driver):
        picpath = r'E:\data'  # local directory to download into
        # Create the directory if it does not exist
        if not os.path.exists(picpath):
            os.makedirs(picpath)

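        # Record the image URLs already downloaded, to avoid downloading duplicates
        img_url_dic = {}
        x = 0
        # Scroll down a fixed number of times, collecting images after each scroll
        for i in range(50):  # adjust this range to control how far the page is scrolled
            pos = i * 500  # scroll down another 500 px each time
            js = "document.documentElement.scrollTop=%d" % pos
            driver.execute_script(js)
            time.sleep(1)
            # Get the page source
            html_page = driver.page_source
            # Create a soup object with BeautifulSoup4 and parse the page
            soup = bs(html_page, "html.parser")
            # Extract the image tags; this class name comes from Google's markup and may change
            imglist = soup.find_all('img', {'class': 'rg_i Q4LuWd'})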

            for imgurl in imglist:
                try:
                    print(x)
                    if imgurl['src'] not in img_url_dic:
                        target = '{}/{}.jpg'.format(picpath, str(x))
                        # print ('Downloading image to location: ' + target + '\nurl=' + imgurl['src'])
                        img_url_dic[imgurl['src']] = ''
                        urllib.request.urlretrieve(imgurl['src'], target)
                        time.sleep(1)
                        x += 1
                except KeyError:
                    print("ERROR!")
                    continue

    def run(self):
        print(
            '\t\t\t**************************************\n\t\t\t**\t\tWelcome to Use Spider\t\t**\n\t\t\t**************************************')
        driver = self.start_brower()
        self.downloadImg(driver)
        driver.quit()  # quit() closes the window and also ends the chromedriver session
        print("Download has finished.")


if __name__ == '__main__':
    craw = Crawler()
    craw.run()
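
The script builds its search URL by concatenating search_query directly between the two fixed URL parts. A keyword containing spaces, `&`, or non-ASCII characters is safer when percent-encoded first. The following is a minimal sketch using the standard library's `urllib.parse.quote_plus`; `build_search_url` is a hypothetical helper name introduced here, and the two URL parts are copied from the script above.

```python
from urllib.parse import quote_plus

# Fixed URL parts, copied from the script above.
base_url_part1 = 'https://www.google.com/search?q='
base_url_part2 = '&source=lnms&tbm=isch'


def build_search_url(query):
    """Return the image-search URL with the query percent-encoded."""
    # quote_plus turns spaces into '+' and escapes reserved characters such as '&'.
    return base_url_part1 + quote_plus(query) + base_url_part2


if __name__ == "__main__":
    print(build_search_url("golden retriever"))
```

With a helper like this, Crawler.__init__ could set self.url = build_search_url(search_query) instead of concatenating the raw keyword.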