Scraping images from Unsplash with selenium, an automated testing tool

Latest recommended article published 2023-11-28 17:43:21

killeri

Category: Python web scraping (excluding the Scrapy framework). Tags: automated testing, selenium, image scraper

Article link: https://blog.csdn.net/killeri/article/details/79987068


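When I first started learning web scraping, one of my projects was to scrape a single image from a website ([site link](https://unsplash.com/)).
The site came to mind again today while I was looking for project ideas. This time I use selenium to scroll the page down, so that content from several pages can be scraped. In total I scraped 160 images (after scrolling down 10 pages).
Code size: about 50 lines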

# coding: utf-8

from selenium import webdriver
import time
import requests
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

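def get_page(driver):  # open the site's home page
    driver.get('https://unsplash.com/')  # driver.get() returns None, so nothing is stored

def sroll_page(driver):  # scroll the page and return the page source
    get_page(driver)
    # JavaScript snippet that scrolls down to the bottom of the page
    js = 'window.scrollTo(0, document.body.scrollHeight);'
    for i in range(10):  # scroll the page ten times
        # run the JavaScript snippet in the browser
        driver.execute_script(js)
        # After reaching the bottom, the page needs time to load more content before
        # the next scroll. My connection is slow, so I wait ten seconds each time.
        time.sleep(10)
    return driver.page_source  # return the page source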
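# Function that parses the page
def parser_page(html):
    url_list = []
    soup = BeautifulSoup(html, 'lxml')
    # These class names come from the site's markup and stop matching if the markup changes
    div = soup.find_all('div', class_="_1OvAL _2T3hc _27nWV")
    for i in div:
        # I usually add error handling so that one or two misbehaving elements cannot stop
        # the whole program. This site is well structured, so no errors were caught here.
        try:
            x = i.find_all('a', itemprop="contentUrl")
            try:
                for z in x:
                    url_raw = z.get('href')
                    # strip any leading slash from the href to avoid a double slash in the URL
                    url = 'https://unsplash.com/' + url_raw.lstrip('/') + '/download?force=true'
                    url_list.append(url)
            except Exception as e:
                print('inner link error', e)
        except Exception as f:
            print('outer link error', f)

    return url_list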
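# Download the images
def download_pic(urls):
    # print(urls, len(urls))
    for i, url in enumerate(urls):
        address = 'D://图片/{0}.png'.format(i)  # the D://图片 folder must already exist
        # verify=False disables TLS certificate verification
        html = requests.get(url, verify=False)
        # 'wb' overwrites a file left by an earlier run; 'ab' would append to it and corrupt the image
        with open(address, 'wb') as f:
            print('Downloading image {0}'.format(i + 1))
            f.write(html.content)
            print('Image {0} written successfully'.format(i + 1))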


def main(driver):
    try:
        html = sroll_page(driver)
    finally:
        driver.quit()  # close the browser once the page source has been captured
    urls_list = parser_page(html)
    download_pic(urls_list)


if __name__ == '__main__':
    main(driver)
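
Two parts of the script above could probably be made a little more robust: the fixed ten-second wait after every scroll, and the string concatenation that builds each download link. The block below is only a minimal sketch of one way to do that, not part of the original script. The names `scroll_until_stable`, `build_download_url`, `BASE_URL`, `pause` and `max_rounds` are made up for illustration. It assumes the page grows taller whenever new images load, and that each `href` is a photo path such as `/photos/abc`. The only selenium call it relies on is `driver.execute_script`.

```python
import time
from urllib.parse import urljoin

BASE_URL = 'https://unsplash.com/'  # same site as in the script above


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with selenium's `execute_script` method.
    Returns the number of scrolls after which the page had grown.
    """
    get_height = 'return document.body.scrollHeight'
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # the page did not grow, so assume nothing more will load
        last_height = new_height
        loaded += 1
    return loaded


def build_download_url(href):
    """Turn a photo link such as '/photos/abc' into its download URL."""
    # urljoin handles hrefs with or without a leading slash
    page_url = urljoin(BASE_URL, href).rstrip('/')
    return page_url + '/download?force=true'
```

In `sroll_page`, the `for` loop could then be swapped for a call such as `scroll_until_stable(driver, pause=10)`. On a slow connection `pause` has to stay generous, because a page that has not finished loading looks exactly like a page with nothing left to load.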