Crawling Google Images with Python, Continued: Web Automation with Selenium

Ice星空

Last modified: 2023-08-25 11:27:28

Category: python. Tags: python, crawler, selenium

First published: 2020-07-20 16:46:22

Article link: https://blog.csdn.net/Lyn_B/article/details/107461556

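The previous post, Crawling Google Images with Python, showed how to scrape images from the dynamic Google Images results page. That approach required us to open Inspect Element by hand and copy the relevant page markup. Clearly we would prefer the script to do this as well, so that we only have to tell it which elements we want. This raises two questions:

How to simulate the interaction between a user and the browser
How to inspect elements

For both we need a toolset: Selenium.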

Introduction

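Selenium is a toolset, not just a Python package; there are implementations for Java, Ruby, and other languages as well. It lets you control a browser remotely and simulate the interaction between a user and the browser.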

Installation

The Python installation steps are given here.

libraries

pip install selenium

To install from source instead, download the source archive, unpack it, and run:

python setup.py install

drivers

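Install the driver for your browser. Most browser drivers are a separate executable that Selenium must be able to locate in order to talk to the browser. On Windows, for example:

Create a folder such as C:\WebDriver\bin and place the driver executable in it
Add that folder to the Path environment variable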
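
A quick way to confirm that the Path change took effect is to look the driver up from Python. This is only a sketch using the standard library; the helper name find_driver is hypothetical, and the executable names chromedriver and geckodriver are assumed to be the drivers you installed.

```python
import shutil
from typing import Optional


def find_driver(executable: str = "chromedriver") -> Optional[str]:
    """Return the full path of a driver executable found on PATH, or None."""
    # shutil.which searches the directories listed in the PATH environment variable
    return shutil.which(executable)


if __name__ == "__main__":
    # Executable names are assumptions; adjust them to the driver you installed
    for name in ("chromedriver", "geckodriver"):
        path = find_driver(name)
        print(f"{name}: {path if path else 'not found on PATH'}")
```

If the driver is reported as not found, open a new terminal first, since a running shell usually keeps its old environment variables.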

WebDriver

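WebDriver is an API and a protocol available across languages. It handles the communication between Selenium and the browser and so controls the browser's behavior. Almost every mainstream browser has a corresponding implementation.
Browser support as of 2020-07-19: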

Browser	Maintainer	Versions Supported
Chrome	Chromium	All versions
Firefox	Mozilla	54 and newer
Internet Explorer	Selenium	6 and newer
Opera	Opera Chromium / Presto	10.5 and newer
Safari	Apple	10 and newer
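
To make the "API plus protocol" idea more concrete, the sketch below builds, but does not send, three requests in the shape described by the W3C WebDriver specification: start a session, navigate to a URL, and find an element. The helper names and the session id are hypothetical, and the endpoints are given on the assumption that the driver speaks the W3C dialect; in practice the selenium package issues these requests for you.

```python
import json


def new_session(browser_name):
    """Describe the request that asks a driver to start a browser session."""
    body = {"capabilities": {"alwaysMatch": {"browserName": browser_name}}}
    return ("POST", "/session", body)


def navigate(session_id, url):
    """Describe the request behind driver.get(url)."""
    return ("POST", f"/session/{session_id}/url", {"url": url})


def find_element(session_id, using, value):
    """Describe the request behind driver.find_element(using, value)."""
    return ("POST", f"/session/{session_id}/element", {"using": using, "value": value})


if __name__ == "__main__":
    # Hypothetical id; a real one is returned in the new-session response
    session_id = "example-session-id"
    requests_to_show = (
        new_session("chrome"),
        navigate(session_id, "https://www.example.com"),
        find_element(session_id, "css selector", "div"),
    )
    for method, path, body in requests_to_show:
        print(method, path, json.dumps(body))
```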

Find Element

By

As the name suggests, By specifies the criterion used to search for an element:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver.find_element(s)

The element search functions:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")

# Get the first div element
div = driver.find_element(By.TAG_NAME, "div")

# Get all paragraphs and print their text
ps = driver.find_elements(By.TAG_NAME, "p")
for p in ps:
	print(p.text)

driver.switch_to.active_element

Switch to the element that currently has focus (the active element) and read its attributes:

# Typing into the search box makes it the active element
driver = webdriver.Chrome()
driver.get("https://www.google.com")
driver.find_element(By.CSS_SELECTOR, '[name="q"]').send_keys("webElement")

# Switch to the active element and read its title attribute
attr = driver.switch_to.active_element.get_attribute("title")

Keyboard

Simulating keyboard input:

send_keys

Types text into an element; special keys can be sent as well:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Open a browser and navigate to the page
driver = webdriver.Chrome()
driver.get("https://www.google.com")

# Type "webdriver" and then press Enter
driver.find_element(By.NAME, "q").send_keys("webdriver" + Keys.ENTER)

key_down

Simulates holding down a modifier key such as Shift, Ctrl, or Alt. For example, Ctrl + A selects the whole page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

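# Open a browser and navigate to the page
driver = webdriver.Chrome()
driver.get("https://www.google.com")

# Select all: hold Ctrl, press "a", then release Ctrl
webdriver.ActionChains(driver).key_down(Keys.CONTROL).send_keys("a").key_up(Keys.CONTROL).perform()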

key_up

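Used together with key_down. For example, when the input alternates between upper and lower case, key_down and key_up on Shift switch between the two; the key presses are again chained through ActionChains.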
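
As an illustration of pairing key_down with key_up, here is a minimal sketch that types mixed-case text by holding Shift only for the uppercase runs. The helper type_mixed_case is hypothetical and not part of Selenium. It expects an ActionChains-like object and the Shift key value to be passed in, and it relies only on the key_down, send_keys, and key_up methods that ActionChains provides.

```python
from itertools import groupby


def type_mixed_case(actions, shift_key, text):
    """Queue key presses on an ActionChains-like object, holding Shift for uppercase runs."""
    # groupby splits the text into consecutive runs of uppercase / non-uppercase characters
    for is_upper, run in groupby(text, key=str.isupper):
        chunk = "".join(run)
        if is_upper:
            # Hold Shift, type the lowercase letters, then release Shift
            actions.key_down(shift_key).send_keys(chunk.lower()).key_up(shift_key)
        else:
            actions.send_keys(chunk)
    return actions


# Possible usage with a real driver (assumes the target input already has focus):
#   from selenium.webdriver import ActionChains
#   from selenium.webdriver.common.keys import Keys
#   type_mixed_case(ActionChains(driver), Keys.SHIFT, "webDRIVER").perform()
```

Nothing is sent to the browser until perform() is called on the returned chain.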

clear

Clears the content of an input element:

text = driver.find_element(By.TAG_NAME, "input")
text.send_keys("pokemon")
# Clear the content
text.clear()

Mouse

Simulating mouse operations

Click

Clicks are also performed through ActionChains:

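from selenium.webdriver import ActionChains

# Locate a menu and an item that stays hidden until the menu is hovered, e.g. a drop-down
menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")

actions = ActionChains(driver)
# Move the mouse over the menu
actions.move_to_element(menu)
# Click the submenu item
actions.click(hidden_submenu)
actions.perform()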

Drag-and-drop

Implemented with the drag_and_drop method of ActionChains, which drags one element onto another:

source = driver.find_element(By.ID, "source")
target = driver.find_element(By.ID, "target")
ActionChains(driver).drag_and_drop(source, target).perform()

Http proxies

Route the browser's traffic through an HTTP proxy:

from selenium import webdriver

PROXY = "<HOST:PORT>"
webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "proxyType": "MANUAL",
}

with webdriver.Firefox() as driver:
    # Open URL
    driver.get("https://selenium.dev")

Page loading strategy

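Sets the page loading strategy, i.e. how long driver.get waits before returning:

normal: wait until the page has fully loaded, including its resources (the load event)
eager: wait only until the initial HTML document has been loaded and parsed (the DOMContentLoaded event); stylesheets, images, and subframes may still be loading
none: wait only until the initial HTML document has been downloaded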

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'eager'  # one of: 'normal', 'eager', 'none'
driver = webdriver.Chrome(options=options)
# Navigate to url
driver.get("http://www.google.com")
driver.quit()

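Crawling Google Images with Selenium

Now to the main topic: crawling Google Images entirely from a Python script. We need to:

Search automatically and navigate to the results
Find the page element that contains all the images and collect the url from every image tag
Download the images

Automatic search

We simulate the mouse and keyboard input with ActionChains: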

# search on google
# navigate to url
self.driver.get(self.url)
# locate input field
search_input = self.driver.find_element(By.NAME, 'q')
# emulate user input and enter to search
webdriver.ActionChains(self.driver).move_to_element(search_input).send_keys("pokemon" + Keys.ENTER).perform()

Page navigation

Click through to the image search results:

# navigate to google image
# find the navigation link; '图片' is the label of the Images tab in the Chinese UI
self.driver.find_element(By.LINK_TEXT, '图片').click()

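Getting all image elements

This step can be combined with BeautifulSoup, which we used previously; in that case we read the corresponding part of the HTML shown in the element inspector. In most cases BeautifulSoup is the more efficient choice, because WebDriver has to traverse all the DOM elements. The advantage of WebDriver is that it can scroll the page automatically and so reach all of the search results. Here we show the Selenium implementation:

Scrolling to the bottom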

# load more images as many as possible
# scroll down by having the driver execute a JavaScript snippet
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# locate the "show more results" button (the value is its Chinese UI label)
show_more_button = self.driver.find_element(By.CSS_SELECTOR, "input[value='显示更多搜索结果']")
try:
	while True:
		# act on the status message shown by the page (Chinese UI strings)
		message = self.driver.find_element(By.CSS_SELECTOR, 'div.OuJzKb.Bqq24e').get_attribute('textContent')
		# print(message)
		# "Loading more content, please wait"
		if message == '正在加载更多内容，请稍候':
			self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
		# "New content loaded. Scroll down to see more."
		elif message == '新内容已成功加载。向下滚动即可查看更多内容。':
			# scrolling to bottom
			self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
			# click the "show more" button when it appears
			if show_more_button.is_displayed():
				show_more_button.click()
		# "Looks like you have seen everything": nothing more to load, so exit
		elif message == '看来您已经看完了所有内容':
			break
		# "Unable to load more content, click to retry" (this branch has not been tested)
		elif message == '无法加载更多内容，点击即可重试。':
			show_more_button.click()
		else:
			self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
except Exception as err:
	print(err)
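
The loop above depends on status messages in the Chinese UI and on class names that Google may change. As a locale-independent alternative, here is a rough sketch that keeps scrolling until the page height stops growing. The helper scroll_until_stable and its pause and max_rounds parameters are hypothetical, and it does not click the "show more results" button, so it may stop earlier than the message-driven loop.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazily loaded content some time to arrive
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height did not change after scrolling: assume nothing more was loaded
            return round_no
        last_height = new_height
    return max_rounds


# Possible usage inside the crawler: scroll_until_stable(self.driver, pause=2.0)
```

The pause also prevents the loop from busy-waiting; a longer pause is safer on slow connections.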

Collecting the image elements

# find all image elements in google image result page
imgs = self.driver.find_elements(By.CSS_SELECTOR, "img.rg_i.Q4LuWd")

Downloading the images

This is the same as in the BeautifulSoup version:

img_count = 0
for img in imgs:
	try:
		# image per second
		time.sleep(1)
		print('\ndownloading image ' + str(img_count) + ': ')
		img_url = img.get_attribute("src")
		path = os.path.join(imgs_dir, str(img_count) + "_img.jpg")
		request.urlretrieve(url = img_url, filename = path, reporthook = progress_callback, data = None)
		img_count = img_count + 1
	except error.HTTPError as http_err:
		print(http_err)
	except Exception as err:
		print(err)
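
The value returned by img.get_attribute("src") may be an ordinary http(s) URL or an inline data: URI that carries the thumbnail as base64 text. The sketch below handles the two cases explicitly. The helper save_image is hypothetical, and it deliberately supports only base64-encoded data URIs.

```python
import base64
from urllib import request


def save_image(src, path):
    """Save an image given its src attribute; return the number of bytes written."""
    if src.startswith("data:"):
        # A data URI looks like "data:image/jpeg;base64,<payload>"
        header, _, payload = src.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("only base64 data URIs are handled in this sketch")
        data = base64.b64decode(payload)
    else:
        # Ordinary URL: download the bytes
        with request.urlopen(src) as resp:
            data = resp.read()
    with open(path, "wb") as fh:
        fh.write(data)
    return len(data)
```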

Complete code

The complete code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

from urllib import error
from urllib import request
import os
import time
import sys

# default url
# replace for yours
url = "https://www.google.com"
explorer = "Chrome"
# directory
imgs_dir = "./images"


# report hook with three parameters passed
# count_of_blocks  The number of blocks transferred
# block_size The size of block
# total_size Total size of the file
def progress_callback(count_of_blocks, block_size, total_size):
    # determine current progress
    progress = int(50 * (count_of_blocks * block_size) / total_size)
    if progress > 50:
        progress = 50
    # update progress bar
    sys.stdout.write("\r[%s%s] %d%%" % ('█' * progress, '  ' * (50 - progress), progress * 2))
    sys.stdout.flush()


class CrawlSelenium:

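	def __init__(self, explorer="Chrome", url="https://www.google.com"):
		self.url = url
		self.explorer = explorer
		# default options, so crawl() also works if set_loading_strategy() is never called
		self.options = Options()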

	def set_loading_strategy(self, strategy="normal"):
		self.options = Options()
		self.options.page_load_strategy = strategy


	def crawl(self):
		# instantiate driver according to corresponding explorer
		if self.explorer == "Chrome":
			self.driver = webdriver.Chrome(options=self.options)
		if self.explorer == "Opera":
			self.driver = webdriver.Opera(options=self.options)
		if self.explorer == "Firefox":
			self.driver = webdriver.Firefox(options=self.options)
		if self.explorer == "Edge":
			self.driver = webdriver.Edge(options=self.options)

		# search on google
		# navigate to url
		self.driver.get(self.url)
		# locate input field
		search_input = self.driver.find_element(By.NAME, 'q')
		# emulate user input and enter to search
		webdriver.ActionChains(self.driver).move_to_element(search_input).send_keys("pokemon" + Keys.ENTER).perform()
		

		# navigate to google image
		# find the navigation link; '图片' is the label of the Images tab in the Chinese UI
		self.driver.find_element(By.LINK_TEXT, '图片').click()

		# load more images as many as possible
		# scrolling to bottom
		self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
		# get button
		show_more_button = self.driver.find_element(By.CSS_SELECTOR, "input[value='显示更多搜索结果']")
		try:
			while True:
				# do according to message
				message = self.driver.find_element(By.CSS_SELECTOR, 'div.OuJzKb.Bqq24e').get_attribute('textContent')
				# print(message)
				if message == '正在加载更多内容，请稍候':
					self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
				elif message == '新内容已成功加载。向下滚动即可查看更多内容。':
					# scrolling to bottom
					self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
					if show_more_button.is_displayed():
						show_more_button.click()
				elif message == '看来您已经看完了所有内容':
					break
				elif message == '无法加载更多内容，点击即可重试。':
					show_more_button.click()
				else:
					self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
		except Exception as err:
			print(err)

		# find all image elements in google image result page
		imgs = self.driver.find_elements(By.CSS_SELECTOR, "img.rg_i.Q4LuWd")
		
		img_count = 0
		for img in imgs:
			try:
				# image per second
				time.sleep(1)
				print('\ndownloading image ' + str(img_count) + ': ')
				img_url = img.get_attribute("src")
				if img_url is None:
					continue
				path = os.path.join(imgs_dir, str(img_count) + "_img.jpg")
				request.urlretrieve(url = img_url, filename = path, reporthook = progress_callback, data = None)
				img_count = img_count + 1
			except error.HTTPError as http_err:
				print(http_err)
			except Exception as err:
				print(err)



def main():
	# setting
	crawl_s = CrawlSelenium(explorer, url)
	crawl_s.set_loading_strategy("normal")
	# make directory
	if not os.path.exists(imgs_dir):
		os.mkdir(imgs_dir)
	# crawling
	crawl_s.crawl()


if __name__ == "__main__":
	main()
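
One caveat about progress_callback: urlretrieve may report a total size of -1 when the size is unknown, and the callback divides by it. Below is a sketch of a more defensive variant that returns the text to display instead of writing it. The name render_progress and the width parameter are hypothetical.

```python
def render_progress(count_of_blocks, block_size, total_size, width=50):
    """Return a progress string, falling back to a byte count when the size is unknown."""
    downloaded = count_of_blocks * block_size
    if total_size <= 0:
        # Unknown (or zero) size: a percentage cannot be computed
        return "%d bytes" % downloaded
    # Cap at full width, since the last block can overshoot the total size
    filled = min(width, int(width * downloaded / total_size))
    percent = filled * 100 // width
    return "[%s%s] %d%%" % ("█" * filled, " " * (width - filled), percent)


# Possible usage as a report hook:
#   import sys
#   def hook(count, size, total):
#       sys.stdout.write("\r" + render_progress(count, size, total))
#       sys.stdout.flush()
```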

Results
