Fixing the pitfalls of scraping dynamic web pages with Selenium (2)

eye123456789

Last modified: 2022-03-11 17:15:12

Tags: selenium, web scraping, testing tools

First published: 2022-03-11 16:43:32

Article link: https://blog.csdn.net/eye123456789/article/details/123428617

Pitfall 1: getting the scrollbar to load all the way to the bottom.

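In the previous crawl, the page still did not load to the bottom even with driver.execute_script("scroll(0,100000)"). Watching it run, I noticed that every attempt stopped at the same place, which pointed to the target offset passed to scroll as the limit: scroll(0, y) moves the page to at most y pixels, so a page taller than y never reaches its end. Sure enough, with a very small value only 30-odd images were retrieved. So I raised the value by an order of magnitude, from 100000 to 1000000, and watched whether the scrollbar would reach the end. It did. Only then can the complete element code be obtained.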

# Scroll all the way down; only then is the complete element code available.
all_window_height = []  # records the maximum page height after each scroll
all_window_height.append(driver.execute_script("return document.body.scrollHeight;"))  # height of the page as first loaded
while True:
    driver.execute_script("scroll(0,1000000)")  # scroll down; with the offset raised to 1000000 the loading no longer stops halfway
    time.sleep(3)
    check_height = driver.execute_script("return document.body.scrollHeight;")
    if check_height == all_window_height[-1]:  # height unchanged since the last scroll: we are at the bottom
        break
    else:
        all_window_height.append(check_height)  # height changed: record it and scroll again
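
A fixed offset such as 1000000 only works while the page stays shorter than that value. As a rough sketch of one alternative (not the code used in this post), the loop can scroll to the page's current scrollHeight on every round, so no magic number is needed. The names scroll_to_bottom, pause and max_rounds are made up for this illustration and are not part of Selenium; the only thing assumed about driver is that it has an execute_script method.

```python
import time


def scroll_to_bottom(driver, pause=3.0, max_rounds=50):
    """Scroll until document.body.scrollHeight stops growing.

    `driver` is any object with an execute_script(js) method, such as a
    Selenium WebDriver. Returns the list of page heights that were observed.
    """
    heights = [driver.execute_script("return document.body.scrollHeight;")]
    for _ in range(max_rounds):  # hypothetical safety cap against endless feeds
        # Scroll to the current bottom instead of a hard-coded offset.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == heights[-1]:
            break  # height unchanged: nothing more was loaded
        heights.append(new_height)
    return heights
```

With a real driver this would be called as scroll_to_bottom(driver) right after driver.get(url). The 3-second default pause mirrors the time.sleep(3) above and may need tuning for slower pages.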

Pitfall 2: the problem that appears when fetching with requests.get

import asyncio
import os
import re
import time

import aiofiles
import aiohttp
from bs4 import BeautifulSoup
# Selenium drives a real browser, which gets past the anti-scraping check; a plain requests.get call returned 403 (access denied).
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Pitfall log: for the first two days everything failed because the site blocks scrapers; switching to Selenium fixed that.
# Different browsers can behave differently on the same site: one loads quickly while another barely loads at all. I hit this today, when the page hardly moved under webdriver.Chrome() (Google Chrome) but loaded quickly in another browser.
# Then it failed again, and the browser turned out not to be the cause. Sites now check the browser's webdriver property (navigator.webdriver) before the page is rendered. Normally this property is undefined, but as soon as Selenium drives the browser it is set to true.


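# Set excludeSwitches (plus the AutomationControlled flag) so that Selenium is not caught by the anti-scraping check. I was stuck here for a long time.
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument("--disable-blink-features")
option.add_argument("--disable-blink-features=AutomationControlled")



driver = webdriver.Chrome(options=option)  # start a browser instance
url = 'https://www.pexels.com/zh-tw/'
# Note: this dict is changed after the driver has been created and is never passed to it, so the page load strategy stays at its default.
desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "none"
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(30)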


# Scroll all the way down; only then is the complete element code available.
all_window_height = []  # records the maximum page height after each scroll
all_window_height.append(driver.execute_script("return document.body.scrollHeight;"))  # height of the page as first loaded
while True:
    driver.execute_script("scroll(0,1000000)")  # scroll down; with the offset raised to 1000000 the loading no longer stops halfway
    time.sleep(3)
    check_height = driver.execute_script("return document.body.scrollHeight;")
    if check_height == all_window_height[-1]:  # height unchanged since the last scroll: we are at the bottom
        break
    else:
        all_window_height.append(check_height)  # height changed: record it and scroll again



# Parse the data
soup = BeautifulSoup(driver.page_source, 'html.parser')  # page_source is already a str and now holds the complete element code
body = soup.find('div', attrs={'class': 'l-container home-page'})
body = body.find('div', attrs={'class': 'photos'})




async def download(url, pic_path):
    async with aiohttp.ClientSession() as session:  # aiohttp.ClientSession() plays the role that requests plays in synchronous code
        async with session.get(url) as resp:  # the GET request is where the time goes
            async with aiofiles.open(pic_path, mode="wb") as f:  # the with block closes the file automatically
                await f.write(await resp.content.read())
# Save the data
count = 0
path = '//nas/LargeSave/高清图像数据/httpswww.pexels.comzh-tw/'
pattern = re.compile(r'\d+')  # matches runs of digits
loop = asyncio.new_event_loop()  # created once, before the loop, so loop.close() below is always valid
f = open('D:/photos_related_tags.txt', 'w')  # text file that collects the tag-page URLs
for column in body.find_all('div', attrs={'class': 'photos__column'}):
    for img in column.find_all('a', attrs={'class': 'js-photo-link photo-item__link'}):

        # Get the image URL
        img_label = img.find('img')  # the <img> tag inside the link
        src_string = img_label['data-big-src']  # value of the data-big-src attribute
        img_url = src_string.split('?', 1)[0]  # drop everything from '?' onward; what remains is the URL of the original-size image
        count += 1
        print("Image {}:".format(count))
        print(img_url)

        # Save the tag-page URL
        img_id = pattern.findall(img_url)[0]  # first run of digits in the URL
        txt_url = 'https://www.pexels.com/zh-tw/photo/' + img_id
        f.write(txt_url + '\n')

        # Download the image, now with a coroutine instead of requests.get
        document_path = path + str(count)
        os.makedirs(document_path, exist_ok=True)  # one folder per image
        pic_path = document_path + '/' + str(count) + '.jpg'  # '/' is used to build the path
        loop.run_until_complete(download(img_url, pic_path))
        # For sites that need SSL verification, wait about 250 ms for the underlying connection to close
        loop.run_until_complete(asyncio.sleep(0.25))

f.close()  # close the text file
loop.close()
print('Total images scraped:', count)
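
One thing to note about the script above: each download is awaited to completion before the next one starts, so the images are still fetched one at a time even though a coroutine is used. A minimal sketch of how the downloads could overlap, assuming the (url, path) pairs are collected first and handed over together, is shown below. The names download_all, jobs, fetch and limit are hypothetical and exist only for this illustration; fetch stands for any coroutine with the same (url, path) signature as download.

```python
import asyncio


async def download_all(jobs, fetch, limit=5):
    """Run fetch(url, path) for every (url, path) pair, at most `limit` at once.

    Returns a list of (url, error) pairs for the downloads that failed.
    """
    semaphore = asyncio.Semaphore(limit)  # caps the number of requests in flight

    async def guarded(url, path):
        async with semaphore:
            try:
                await fetch(url, path)
                return None
            except Exception as exc:  # one failed image should not stop the rest
                return (url, exc)

    results = await asyncio.gather(*(guarded(u, p) for u, p in jobs))
    return [r for r in results if r is not None]
```

In the script this would mean appending (img_url, pic_path) to a list inside the for loop and calling asyncio.run(download_all(jobs, download)) once after it. The limit of 5 is an arbitrary starting point, not a value tested against pexels.com.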
