Scraping Unsplash Images with Selenium + Chrome: A Practical Walkthrough

Latest recommended article published on 2020-12-07 11:32:39

alex_mist

Article link: https://blog.csdn.net/weixin_40710708/article/details/105243902


This post borrows the approach from 阿里波特 (Albert-Lee):
https://www.cnblogs.com/Albert-Lee/p/6238866.html

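Selenium is a tool for testing web applications by simulating user actions such as clicking buttons and dragging the scroll bar, just as a real user would. Early versions (Selenium RC) simulated these actions with JavaScript injected into the page; Selenium WebDriver instead drives the browser through a browser-specific driver, which for Chrome is chromedriver.
(To me, Selenium + Chrome feels like simulating everything a user can do in the Chrome browser.)

When a test script runs, the browser automatically clicks, types, opens pages, and verifies results as the script dictates, so the application is tested from the end user's point of view.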

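Selenium used to be paired with PhantomJS:
PhantomJS is a headless browser. It has a complete browser engine like Chrome or IE, including a JavaScript engine, a rendering engine, and request handling, but no window for display or user interaction.
Chrome and Firefox later introduced their own headless modes, and PhantomJS is no longer maintained.
So Chrome is now the usual choice as Selenium's webdriver. The workflow: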

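1. Create a webdriver that runs Chrome in headless mode. This is equivalent to opening a Chrome browser, just without the graphical window.
2. Use the webdriver to get a URL; the driver's page_source is then the HTML of that page.
3. Parse the HTML with BeautifulSoup (using html.parser as the parser) and extract the image URLs.
4. Convert each image URL to its MD5 hash and use that as the file name, which makes deduplication easy (although different URLs may still point to identical image content).
5. Use the requests library to get each image URL and write the response's content to a file.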
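Step 4 notes that naming files by the MD5 of the URL cannot catch two different URLs that serve the same image. One possible way around this, shown here only as a sketch and not as the approach this post uses, is to hash the downloaded bytes instead. The helper names `url_name`, `content_name`, and `save_if_new`, and the example.com URLs, are hypothetical.

```python
import hashlib
import os
import tempfile


def url_name(img_url: str) -> str:
    """File name derived from the URL, as in step 4."""
    return hashlib.md5(img_url.encode("utf-8")).hexdigest() + ".jpg"


def content_name(data: bytes) -> str:
    """File name derived from the image bytes, so identical images share a name."""
    return hashlib.md5(data).hexdigest() + ".jpg"


def save_if_new(directory: str, data: bytes) -> bool:
    """Write data under its content hash; return False if that file already exists."""
    path = os.path.join(directory, content_name(data))
    if os.path.exists(path):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True


if __name__ == "__main__":
    # Two hypothetical URLs that happen to serve the same bytes.
    same_bytes = b"fake image bytes"
    names_match = url_name("https://example.com/a.jpg") == url_name("https://example.com/b.jpg")
    print(names_match)  # False: URL-based names differ, so both copies would be kept
    with tempfile.TemporaryDirectory() as d:
        print(save_if_new(d, same_bytes))  # True: first copy is written
        print(save_if_new(d, same_bytes))  # False: second copy is skipped
```

The trade-off is that every image has to be downloaded before it can be recognized as a duplicate, whereas the URL-based name lets the script skip the download entirely.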

The code is as follows:

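from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def main():
    chrome_options = Options()
    chrome_options.add_argument('--headless')     # run Chrome without a window
    chrome_options.add_argument('--disable-gpu')
    # Selenium 4 style; Selenium 3 took executable_path= and chrome_options= instead
    driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
    return driver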
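As a rough sketch of steps 1 and 2 together, the driver can be wrapped in a function that always closes the browser, even when loading the page fails. This assumes Selenium 4 (the `Service` class); `headless_arguments` and `fetch_page_source` are hypothetical helper names, and the window size is an arbitrary choice.

```python
def headless_arguments(width=1920, height=1080):
    """Chrome command-line flags for a headless session (hypothetical helper)."""
    return [
        "--headless",
        "--disable-gpu",
        f"--window-size={width},{height}",
    ]


def fetch_page_source(url, driver_path="./chromedriver"):
    """Open url in headless Chrome and return the rendered HTML."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in headless_arguments():
        options.add_argument(arg)
    driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always shut down the browser process
```

Calling `fetch_page_source('https://unsplash.com/s/photos/...')` with a real search URL would return the HTML after Chrome has run the page's JavaScript, which is the part a plain requests call cannot provide.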

Scrolling down to scrape Unsplash images:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import requests
from bs4 import BeautifulSoup
import os
import time
import hashlib


class Picture():
    def __init__(self):
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
        self.web_url = 'https://unsplash.com/s/photos/sexy'
        self.dir = 'D:/python-pic'

    def mkdir(self, path):
        # Create the folder for the images, checking first whether it already exists
        path = path.strip()
        if os.path.exists(path):
            print("The folder already exists.")
        else:
            print("Creating " + path)
            os.mkdir(path)
            print("Created successfully!")

    def save_pic(self, url, pic_name):
        # Download the image at url and store it as pic_name
        print("Saving the picture, this may take some time...")
        # headers must be passed by keyword; the second positional argument is params
        img = requests.get(url, headers=self.header)
        # 'wb' instead of 'ab', so a rerun never appends to an existing file
        with open(pic_name, 'wb') as f:
            f.write(img.content)
        print("Saved picture " + pic_name + " successfully!")

    def scroll_down(self, driver, height):
        # Scroll the page with selenium + chrome, then wait for new images to load
        driver.execute_script("window.scrollTo(0,%d);" % height)
        time.sleep(30)

    def trans_md5(self, img_url):
        # Convert the image url to MD5, so saved files can be checked for duplicates
        m = hashlib.md5()
        m.update(img_url.encode("utf8"))
        return m.hexdigest()

    def get_pic(self, times):
        # Main routine
        print('Sending the GET request...')
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        # Use headless chrome. The raw string keeps the backslashes of the Windows path literal.
        # Selenium 4 style; Selenium 3 took executable_path= directly.
        driver = webdriver.Chrome(service=Service(r"D:\python3.8.0\chromedriver.exe"), options=chrome_options)
        driver.get(self.web_url)
        self.mkdir(self.dir)
        os.chdir(self.dir)
        for i in range(0, times):
            print('Collecting all the picture urls...')
            # '_2zEKz' is a generated class name taken from the page at the time of writing; it may differ now
            all_img = BeautifulSoup(driver.page_source, 'html.parser').find_all('img', class_='_2zEKz')
            print('Got all the urls successfully!')
            print('Saving new pictures...')
            for img_tag in all_img:
                img = img_tag['src']
                img_name = self.trans_md5(img) + '.jpg'
                file_list = os.listdir(self.dir)
                if img_name not in file_list:
                    # Deduplicate: only save images that are not on disk yet
                    self.save_pic(img, img_name)
            self.scroll_down(driver, (i+1)*7000)
        driver.quit()


hhh = Picture()
hhh.get_pic(20)
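The script above scrolls a fixed 7000 pixels per round and sleeps 30 seconds each time. A possible alternative, sketched here under the assumption that the page grows as more images load, is to scroll to the bottom and stop once the page height no longer changes. `scroll_until_stable` and `FakeDriver` are hypothetical names; `FakeDriver` only imitates the `execute_script` method of a real webdriver so the loop can be tried without a browser.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded images time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


class FakeDriver:
    """Stand-in for a webdriver: each scroll moves on to the next page height."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # Any other script is treated as a scroll: advance to the next height, if any.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


if __name__ == "__main__":
    # The page grows twice and then stays the same, so the third scroll ends the loop.
    print(scroll_until_stable(FakeDriver([7000, 14000, 21000]), pause=0))  # 3
```

With a real driver, `scroll_until_stable(driver)` could replace the fixed-offset call in the loop, with the image extraction done after each scroll or once at the end; `max_rounds` keeps an endlessly growing page from looping forever.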
