[Python Learning] Day-029 Day-030 Day-031: Using selenium (driver download, page clicks, switching tabs, page scrolling, browser launch options); using cookies with requests and selenium

ChenAi140

Last modified 2022-08-18 17:35:46

Category: Python Learning. Tags: python, learning, selenium, web crawler

First published 2022-08-17 20:39:04

Article link: https://blog.csdn.net/ChenAi_140/article/details/126393883

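This article explains how to download the browser driver and use Selenium to control a browser. It covers basic operations, switching tabs, page scrolling, automatic login by sending cookies with requests and selenium, and proxy IP configuration.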

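1. Downloading the selenium driver

Download
Edge driver download ---------- Chrome driver download
Download the driver that matches your browser version from the Selenium official site or a domestic mirror. If no exact match is available, download the closest version.
Usage
Put the downloaded driver in the directory that contains python.exe.
If Edge cannot start the driver, rename msedgedriver.exe to MicrosoftWebDriver.exe.
Installing the selenium third-party library
- This article uses selenium 3.14.1: pip install selenium==3.14.1
- For Edge, using the options configuration requires one more third-party library
  pip install msedge.selenium_tools
  This library connects selenium and Edge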
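The "closest version" rule above can be made concrete with a small helper. This is only a sketch of one reasonable interpretation: it compares version fields from left to right, so a driver with the same major version always wins. The function names and the version numbers are made up for illustration and are not part of selenium.

```python
def parse_version(version):
    """Turn '104.0.1293.54' into (104, 0, 1293, 54)."""
    return tuple(int(part) for part in version.split('.'))


def closest_driver_version(browser_version, available_versions):
    """Return the available driver version nearest to the browser version.

    Candidates are compared field by field (major version first), so a
    driver with the same major version beats one from another major version.
    """
    target = parse_version(browser_version)

    def distance(candidate):
        fields = parse_version(candidate)
        return tuple(abs(a - b) for a, b in zip(target, fields))

    return min(available_versions, key=distance)


# Hypothetical version numbers, for illustration only.
print(closest_driver_version(
    '104.0.1293.54',
    ['103.0.1264.77', '104.0.1293.47', '105.0.1343.25'],
))
```

The browser version itself can be read from the browser's "About" page before looking for a driver.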

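2. Basic usage of selenium

# from selenium.webdriver import Chrome
from selenium.webdriver import Edge

# 1. Create the browser object (if it is a global variable, the browser will not close automatically)
# web = Chrome()
web = Edge()

# 2. Open the page (open whichever page contains the data you want to crawl)
web.get('https://movie.douban.com/top250')

# 3. Get the page source (this is always the content actually loaded in the page)
print(web.page_source)

# 4. Close the browser
web.close()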
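If an exception is raised between creating the browser and closing it, the browser window stays open. One way to guard against that is a try/finally wrapper, sketched below. The name fetch_page_source and the FakeDriver class are hypothetical; FakeDriver exists only so the sketch runs without a real browser.

```python
def fetch_page_source(make_driver, url):
    """Open url in a fresh browser and return its rendered HTML.

    make_driver is any zero-argument callable that returns a driver,
    for example selenium.webdriver.Edge or selenium.webdriver.Chrome.
    """
    web = make_driver()
    try:
        web.get(url)
        return web.page_source
    finally:
        # quit() closes every window and stops the driver process,
        # even if get() raised an exception.
        web.quit()


class FakeDriver:
    """Stand-in used only so this sketch runs without a browser."""

    def __init__(self):
        self.page_source = ''
        self.closed = False

    def get(self, url):
        self.page_source = f'<html><body>{url}</body></html>'

    def quit(self):
        self.closed = True


print(fetch_page_source(FakeDriver, 'https://movie.douban.com/top250'))
```

With a real browser the call would be fetch_page_source(Edge, 'https://movie.douban.com/top250').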

3. Basic browser control with selenium

from selenium.webdriver import Edge
# from selenium.webdriver import Chrome
from time import sleep

web = Edge()
# web = Chrome()
web.get('https://www.jd.com')

# 1. Type into an input box
# 1) Locate the input box
input_tag = web.find_element_by_id('key')
# 2) Type the text ('电脑' means "computer"); the trailing \n acts like pressing Enter
input_tag.send_keys('电脑\n')

sleep(2)
print(web.page_source)

# 2. Click a button
# 1) Locate the element to click
btn = web.find_element_by_css_selector('#navitems-group2 .b')
# 2) Click it
btn.click()

# End manually
input('Finish? ')
web.close()

4. Switching browser tabs with selenium

from selenium.webdriver import Edge
# from selenium.webdriver import Chrome
from time import sleep
from bs4 import BeautifulSoup as bs
# 1. Basic operations
web = Edge()
# web = Chrome()                        # create the browser

web.get('https://www.cnki.net/')      # open CNKI
search_tag = web.find_element_by_id('txt_SearchText')     # get the search box
search_tag.send_keys('数据分析\n')      # type '数据分析' ("data analysis"), then press Enter
sleep(1)        # after the page changes, it is best to wait briefly

# Get all the elements to click. An element that will be clicked or typed into must be obtained through the browser object
all_result = web.find_elements_by_css_selector('.result-table-list .name>a')
# Click the first result (this opens a new tab)
all_result[0].click()
sleep(1)

# 2. Switch tabs
# Note: in selenium the browser object (web) points to the first tab opened; it keeps pointing there until the code switches it
# 1) Get all windows (tabs) currently open in the browser: browser.window_handles
# 2) Switch to a tab
web.switch_to.window(web.window_handles[-1])

# 3) Parse the content
soup = bs(web.page_source, 'lxml')
result = soup.select_one('#ChDivSummary').text
print(result)

web.close()           # close the window currently in focus (the last one); closing it does not change what the browser object points to


# Go back to the first window and click the next search result
web.switch_to.window(web.window_handles[0])
all_result[1].click()
sleep(1)

web.switch_to.window(web.window_handles[-1])

soup = bs(web.page_source, 'lxml')
result = soup.select_one('#ChDivSummary').text
print(result)

web.close()
# .
# .
# .
input('Finish: ')
web.quit()            # the focused tab was already closed above, so end the session with quit() instead of close()
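The switch, parse, close, switch back sequence repeats for every search result, so it can be wrapped in a context manager. The sketch below is one possible way to do it and relies only on the driver attributes window_handles, current_window_handle, switch_to.window() and close(). The names newest_tab and FakeTabs are hypothetical; FakeTabs is a stand-in so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def newest_tab(web):
    """Switch to the most recently opened tab, then close it and switch back."""
    original = web.current_window_handle
    web.switch_to.window(web.window_handles[-1])
    try:
        yield web
    finally:
        web.close()                       # closes the tab currently in focus
        web.switch_to.window(original)    # return to the starting tab


class FakeTabs:
    """Minimal stand-in for a driver, so the sketch runs without a browser."""

    def __init__(self):
        self.window_handles = ['list-page', 'detail-page']
        self.current_window_handle = 'list-page'
        self.switch_to = self   # real drivers expose switch_to.window(handle)

    def window(self, handle):
        self.current_window_handle = handle

    def close(self):
        self.window_handles.remove(self.current_window_handle)


web = FakeTabs()
with newest_tab(web):
    print(web.current_window_handle)   # focus is on the newly opened tab
print(web.current_window_handle, web.window_handles)
```

With a real driver, the body of the with block would hold the BeautifulSoup parsing step. The helper assumes the click really opened a new tab; if it did not, the last handle is the original tab and that tab would be closed.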

5. Scrolling the page with selenium

from selenium.webdriver import Edge
# from selenium.webdriver import Chrome
from time import sleep
from bs4 import BeautifulSoup as bs

web = Edge()
# web = Chrome()
web.get('https://www.jd.com')
web.find_element_by_id('key').send_keys('电脑\n')
sleep(1)

# 1. Scroll the page by running JavaScript scrolling code: window.scrollBy(x offset, y offset)
# web.execute_script('window.scrollBy(0, 1800)')
# Scroll in several steps, pausing after each one so lazily loaded items have time to appear
for x in range(6):
    web.execute_script('window.scrollBy(0, 1000)')
    sleep(1)

soup = bs(web.page_source, 'lxml')
goods_li = soup.select('#J_goodsList>ul>li')
print(len(goods_li))

input('Close: ')
web.close()
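A fixed number of scroll steps only works when the page height is known in advance. As an alternative, the sketch below keeps scrolling until the viewport reaches the bottom of the document. The function name, the default values and the FakePage class are hypothetical; FakePage simulates a page of fixed height so the sketch runs without a browser.

```python
from time import sleep

# JavaScript that reports whether the viewport has reached the page bottom.
AT_BOTTOM_JS = ('return window.innerHeight + window.scrollY'
                ' >= document.body.scrollHeight - 1')


def scroll_to_bottom(web, step=1000, pause=1.0, max_steps=50):
    """Scroll down in steps until the bottom is reached; return the step count."""
    for done in range(1, max_steps + 1):
        web.execute_script(f'window.scrollBy(0, {step})')
        sleep(pause)   # give lazily loaded items time to extend the page
        if web.execute_script(AT_BOTTOM_JS):
            return done
    return max_steps


class FakePage:
    """Stand-in page of fixed height, so the sketch runs without a browser."""

    def __init__(self, height, viewport=800):
        self.height, self.viewport, self.y = height, viewport, 0

    def execute_script(self, script):
        if script.startswith('window.scrollBy'):
            dy = int(script.split(',')[1].strip(' )'))
            self.y = min(self.y + dy, self.height - self.viewport)
            return None
        return self.viewport + self.y >= self.height - 1


print(scroll_to_bottom(FakePage(3000), pause=0))
```

Because the bottom check runs after the pause, content that loads during the pause makes the page taller and the loop continues. The max_steps limit keeps endlessly loading pages from looping forever.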

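6. Logging in by sending cookies with requests and selenium

How automatic login works: log in manually in the browser, obtain the cookies set after login (the login information), and then send those cookies along with the requests made from code.

6.1 requests

import requests
headers = {
    'user-agent':'*****',
    'cookie':'******'
}
rsp = requests.get('https://www.zhihu.com/', headers=headers)
print(rsp.text)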
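The cookie string copied from the browser can also be split into a dictionary, which makes individual cookies easier to inspect or change. A minimal sketch follows; the function name and the sample cookie values are hypothetical.

```python
def parse_cookie_header(header):
    """Split a 'name=value; name2=value2' cookie string into a dict."""
    cookies = {}
    for pair in header.split(';'):
        # Only the first '=' separates name and value; values may contain '='.
        name, sep, value = pair.strip().partition('=')
        if sep:                 # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Made-up cookie string, for illustration only.
print(parse_cookie_header('session=abc123; theme=dark; token=x=y'))
```

The resulting dictionary can be passed to requests through the cookies parameter, as in requests.get(url, headers=headers, cookies=cookie_dict), instead of putting the raw string into the headers.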

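6.2 selenium

Getting the cookies

from selenium.webdriver import Edge
# from selenium.webdriver import Chrome
from json import dumps

web = Edge()
# web = Chrome()

# 1. Open the site that needs automatic login (the site whose cookies you want)
web.get('https://www.taobao.com/')

# 2. Leave enough time to log in by hand and to refresh the page into its logged-in state
# Important: the first page must be refreshed so that it shows the logged-in state
# After logging in, press Enter to let the code continue
input('Login finished: ')

# 3. Get the cookies set after login and save them to a local file
cookies = web.get_cookies()
print(cookies)


# The ./files directory must already exist
with open('./files/tb_ck.txt', 'w', encoding='utf-8') as f:
    f.write(dumps(cookies))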

Logging in with the cookies

from selenium.webdriver import Edge
# from selenium.webdriver import Chrome
from json import load
from time import sleep

web = Edge()
# web = Chrome()

# 1. Open the page that needs automatic login
web.get('https://www.taobao.com/')
sleep(1)

# 2. Add the cookies
with open('./files/tb_ck.txt', encoding='utf-8') as f:
    ck_lst = load(f)

for i in ck_lst:
    # Drop the 'expiry' field if present, then add every cookie
    if 'expiry' in i:
        del i['expiry']
    web.add_cookie(i)

# 3. Reopen the page that needs login (refresh the page)
web.refresh()
# web.get('https://www.taobao.com/')
sleep(1)
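The cookies saved by selenium can be reused with requests as well, which ties sections 6.1 and 6.2 together. The sketch below assumes the list-of-dicts shape returned by web.get_cookies(), where each entry has 'name' and 'value' keys. The function name and the sample data are hypothetical.

```python
import json


def selenium_cookies_to_dict(cookies):
    """Reduce selenium's list of cookie dicts to a {name: value} mapping."""
    return {c['name']: c['value'] for c in cookies}


# Example data in the shape returned by web.get_cookies(); values are made up.
saved = json.dumps([
    {'name': 'token', 'value': 'abc', 'domain': '.example.com', 'expiry': 1700000000},
    {'name': 'lang', 'value': 'zh', 'domain': '.example.com'},
])
print(selenium_cookies_to_dict(json.loads(saved)))
```

The mapping can then be sent with requests.get(url, headers=headers, cookies=cookie_dict). Whether the site accepts the session this way depends on the site, since some also check other request details.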

7. Proxy IP

7.1 requests

import requests
from time import sleep

def get_ip():
    url = 'URL of the API that generates proxy IPs'   # placeholder
    while True:
        rsp = requests.get(url)
        # A response that starts with '{' is a JSON error message, not an IP
        if rsp.text[0] == '{':
            print('Failed to fetch a proxy IP, retrying!')
            sleep(1)
            continue
        # Strip surrounding whitespace such as a trailing newline
        return rsp.text.strip()

def get_douban_film():
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}

    ip = get_ip()
    proxies = {
        'http': ip,
        'https': ip
    }
    rsp = requests.get('https://movie.douban.com/top250', headers=headers, proxies=proxies)
    print(rsp.text)

if __name__ == '__main__':
    get_douban_film()
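Checking only the first character of the API response is a loose test. A stricter check, sketched below, validates the text against an ip:port pattern before building the proxies dictionary. This assumes the API returns one ip:port address per response, which the original does not state; the names PROXY_PATTERN and build_proxies are hypothetical.

```python
import re

# Four dot-separated number groups, a colon, then a port number.
PROXY_PATTERN = re.compile(r'^\d{1,3}(\.\d{1,3}){3}:\d{1,5}$')


def build_proxies(api_text):
    """Return a requests-style proxies dict, or None if the text is not ip:port."""
    address = api_text.strip()
    if not PROXY_PATTERN.match(address):
        return None
    proxy_url = f'http://{address}'
    return {'http': proxy_url, 'https': proxy_url}


# Made-up responses, for illustration only.
print(build_proxies('1.2.3.4:8080\n'))
print(build_proxies('{"code": 1, "msg": "too many requests"}'))
```

A None result can be treated the same way as the failure branch in get_ip(): wait briefly and request a new address.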

7.2 selenium

from msedge.selenium_tools import Edge,EdgeOptions
# from selenium.webdriver import Chrome, ChromeOptions

# Get the proxy IP
ip = '*******'   # placeholder: replace with the proxy address

options  = EdgeOptions()
# options = ChromeOptions()
options.add_argument(f'--proxy-server=http://{ip}')
web = Edge(options=options)
# web = Chrome(options=options)
web.get('https://movie.douban.com/top250')

8. Other selenium options

from msedge.selenium_tools import Edge,EdgeOptions
# from selenium.webdriver import Chrome, ChromeOptions

# Get the proxy IP
ip = '*******'   # placeholder: replace with the proxy address

options  = EdgeOptions()
# options = ChromeOptions()

# Set the proxy IP
options.add_argument(f'--proxy-server=http://{ip}')

# Remove the enable-automation switch (hides the automated-test notice)
options.add_experimental_option('excludeSwitches', ['enable-automation'])

# Disable image loading
options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})

web = Edge(options=options)
# web = Chrome(options=options)

web.get('https://www.jd.com')

# Implicit wait: element lookups keep retrying for up to 5 s
web.implicitly_wait(5)

# Explicit wait
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
# Timeout of 100 s
wait = WebDriverWait(web, 100)
# Block until the value of the search box contains '电脑', then continue
wait.until(ec.text_to_be_present_in_element_value((By.ID, 'key'), '电脑'))
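To make the idea behind an explicit wait concrete, here is a simplified polling loop in plain Python. It is not selenium's implementation, only an illustration of the principle: call a condition repeatedly until it returns a truthy value or the timeout expires. The name wait_until and the demo condition are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} s')
        time.sleep(poll)


# Demo condition that becomes true on the third call.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(ready, timeout=2, poll=0.01), len(attempts))
```

Compared with a fixed sleep(), this returns as soon as the condition holds and fails loudly when it never does, which is the same trade-off WebDriverWait offers.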