A simple Google Images crawler (adapt the source as needed to crawl Baidu)

疯狂的大山鸡

Category: Crawlers

Article link: https://blog.csdn.net/hlpower/article/details/103515547

A simple Google Images crawler

Background
Key code
Other notes
- Crawling Baidu Images
- Adding a timeout

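Background

I needed to crawl images from Google to train a network that generates comic-style images. I came across a source project that is written in an easy-to-follow way, and since I will very likely reuse it for other crawlers, I am recording it here. The source is on GitHub: GoogleImagesDownloader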

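Key code

Requirements

First, install the following:

python 3.5
selenium 3.6.0
Firefox
geckodriver

Selenium and geckodriver open a Firefox browser and drive it from Python: they request pages and simulate scrolling and button clicks. This matters because a single Google request displays only 100 images, so simulated scrolling and clicking is the key to getting more images out of Google.

The core code is below, with comments added on the important parts.

Link-collection module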

import json
import time
from urllib.parse import quote

from selenium import webdriver

# main_keyword, supplemented_keywords and number_of_scrolls come from the surrounding code
img_urls = set()

# Start a Firefox instance controlled by Selenium
driver = webdriver.Firefox()
for i in range(len(supplemented_keywords)):
    search_query = quote(main_keyword + ' ' + supplemented_keywords[i])
    url = "https://www.google.com/search?q=" + search_query + "&source=lnms&tbm=isch"
    driver.get(url)  # request the results page

    for _ in range(number_of_scrolls):
        for __ in range(10):
            # multiple scrolls needed to show all 400 images
            driver.execute_script("window.scrollBy(0, 1000000)")
            time.sleep(2)
        # to load next 400 images
        time.sleep(5)
        try:
            # The button label depends on the browser language (Portuguese here).
            # Open the page in the browser's developer tools, find the
            # "more results" button, and use its actual value attribute.
            # find_element_by_xpath is the Selenium 3 API; Selenium 4 uses
            # driver.find_element(By.XPATH, ...) instead.
            driver.find_element_by_xpath("//input[@value='Mais resultados']").click()
        except Exception:
            print("Process-{0} reach the end of page or get the maximum number of requested images".format(main_keyword))
            break

    # Each rg_meta div holds JSON metadata; its "ou" field is the image URL
    images = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta")]')
    for img in images:
        img_url = json.loads(img.get_attribute('innerHTML'))["ou"]
        # Keep the links in a set so the same image is not downloaded twice
        img_urls.add(img_url)
    print('Process-{0} add keyword {1} , got {2} image urls so far'.format(main_keyword, supplemented_keywords[i], len(img_urls)))
print('Process-{0} totally get {1} images'.format(main_keyword, len(img_urls)))
driver.quit()
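
The two pieces of pure logic in the module above, building the search URL and pulling the "ou" field out of each metadata blob, can be tried without a browser. The following is a minimal sketch under assumptions: the helper names `build_search_url` and `collect_image_urls` are hypothetical and not part of the original project, and skipping malformed blobs is my own choice rather than what the original code does.

```python
import json
from urllib.parse import quote


def build_search_url(main_keyword, supplemented_keyword):
    """Build the image-search URL the same way the crawler does (hypothetical helper)."""
    query = quote(main_keyword + ' ' + supplemented_keyword)
    return "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"


def collect_image_urls(meta_blobs):
    """Extract the "ou" field from JSON strings, de-duplicating with a set.

    meta_blobs stands in for the innerHTML of each rg_meta element.
    Blobs that are not valid JSON or lack "ou" are skipped (an assumption).
    """
    urls = set()
    for blob in meta_blobs:
        try:
            urls.add(json.loads(blob)["ou"])
        except (ValueError, KeyError, TypeError):
            continue
    return urls
```

Because the result is a set, a URL that appears under several supplementary keywords is only stored once.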

Download module

import time
import urllib.request
from urllib.parse import urlparse, quote

from user_agent import generate_user_agent

# Body of the per-link download loop; link, img_dir, count and main_keyword
# come from the surrounding code.
headers = {}
o = urlparse(link)
# Use the image's own site as the referer
ref = o.scheme + '://' + o.hostname
#ref = 'https://www.google.com'
ua = generate_user_agent()  # random User-Agent string
headers['User-Agent'] = ua
headers['referer'] = ref
print('\n{0}\n{1}\n{2}'.format(link.strip(), ref, ua))
req = urllib.request.Request(link.strip(), headers=headers)
response = urllib.request.urlopen(req)
data = response.read()
# Images are saved as <count>.jpg inside img_dir
file_path = img_dir + '{0}.jpg'.format(count)
with open(file_path, 'wb') as wf:
    wf.write(data)
print('Process-{0} download image {1}/{2}.jpg'.format(main_keyword, main_keyword, count))
count += 1
# Pause after every 10 images
if count % 10 == 0:
    print('Process-{0} is sleeping'.format(main_keyword))
    time.sleep(5)
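
The header logic above can be pulled into small functions, which makes it easier to test and reuse. This is only a sketch under assumptions: `build_headers`, `download_image` and the fixed `DEFAULT_UA` string are hypothetical names introduced here (the original uses `generate_user_agent()` from the third-party user_agent package), and the `timeout` argument is an addition that the original code does not pass.

```python
import urllib.request
from urllib.parse import urlparse

# Fixed User-Agent used only to keep this sketch free of third-party packages.
DEFAULT_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"


def build_headers(link, user_agent=DEFAULT_UA):
    """Return request headers whose referer is the image's own site."""
    o = urlparse(link.strip())
    return {"User-Agent": user_agent, "referer": o.scheme + "://" + o.hostname}


def download_image(link, file_path, timeout=10):
    """Download one image to file_path and return the number of bytes written."""
    req = urllib.request.Request(link.strip(), headers=build_headers(link))
    # timeout bounds each blocking socket operation, not the whole download
    with urllib.request.urlopen(req, timeout=timeout) as response:
        data = response.read()
    with open(file_path, "wb") as wf:
        wf.write(data)
    return len(data)
```

A call such as `download_image(link, img_dir + '0.jpg')` would replace the request, read and write steps of the loop body. Since the `urlopen` timeout does not cap the total time of a slow transfer, a hard limit still needs something like a signal-based approach.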

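Other notes

Crawling Baidu Images

To crawl Baidu images instead, modify the link-collection module accordingly.

Adding a timeout

Use the signal library.
For background on the signal library, see a reference on the signal module,
or go straight to the demo code I wrote below.
The demo uses signal to set a timer that fires after 2 seconds and then sleeps for 3 seconds, so the sleep is interrupted by the signal, which calls handler and raises an exception. To skip downloads that take too long, catch that exception in the download loop and continue.
The code: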

import signal
import time

class TimeLimitError(Exception):
    def __init__(self, value):
        Exception.__init__(self)
        self.value = value

    def __str__(self):
        return self.value


def handler(signum, frame):
    raise TimeLimitError('Time limit exceeded')


signal.signal(signal.SIGALRM, handler)
try:
    signal.alarm(2) # set a timeout(alarm)
    time.sleep(3)
except TimeLimitError as e:
    print('TimeLimitError: {0}'.format(e.value))
finally:
    signal.alarm(0) # disable the alarm
print("over")
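
To apply the demo to a download loop, the alarm can be wrapped in a context manager so that each slow item is skipped with `continue`. What follows is a sketch under assumptions, not the original project's code: `time_limit` and `run_with_limit` are hypothetical names, and each task is modelled as a plain callable standing in for one download. Note that `signal.alarm` is only available on Unix and signal handlers can only be installed from the main thread.

```python
import signal
import time
from contextlib import contextmanager


class TimeLimitError(Exception):
    """Raised when a guarded block runs longer than allowed."""


def _handler(signum, frame):
    raise TimeLimitError('Time limit exceeded')


@contextmanager
def time_limit(seconds):
    """Raise TimeLimitError if the with-block runs longer than `seconds` (an int)."""
    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)  # arm the timer
    try:
        yield
    finally:
        signal.alarm(0)  # disarm the timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def run_with_limit(tasks, seconds):
    """Call each task; skip any that exceeds the limit and keep going."""
    results = []
    for task in tasks:
        try:
            with time_limit(seconds):
                results.append(task())
        except TimeLimitError:
            continue  # too slow: move on to the next task
    return results
```

For example, `run_with_limit([lambda: 'fast', lambda: time.sleep(3)], 1)` keeps the first result and drops the second task after one second.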