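Web Scraping Basics, Part 3 (bs4 module, traversing the document tree, searching the document tree, CSS selectors, selenium introduction and installation, headless browser, finding elements and other operations, waiting for elements, executing JS code, switching tabs, browser back and forward, selenium login)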

0Jchen

Last modified 2024-02-25 15:10:20

Category: web scraping. Tags: web scraping, python

First published 2024-02-20 21:40:50

Article link: https://blog.csdn.net/achen_m/article/details/136198111

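Contents

1. The bs4 module
2. Traversing the document tree
3. Searching the document tree
4. CSS selectors
5. Introducing and installing selenium
6. Simulating a login with selenium (Baidu)
7. Headless browser
8. Finding elements
9. Element attributes, position, size, and text
10. Waiting for elements
11. Executing JS code
12. Switching tabs
13. Browser back and forward
14. Logging in to cnblogs with selenium and saving the cookies
15. Logging in to cnblogs with the saved cookies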

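1. The bs4 module

beautifulsoup4 is a Python library for extracting data from HTML or XML files. Here it is used to parse the HTML or XML that the crawler fetches.

	1. Installation
		pip install beautifulsoup4  # installs the bs4 module
		pip install lxml  # parser library
	2. Usage
		# First argument: the markup to parse, as a str
		# Second argument: the parser to use: html.parser (built in, nothing extra to install, somewhat slower) or lxml (needs pip install lxml)
		from bs4 import BeautifulSoup
		soup = BeautifulSoup('markup to parse, as a str', 'html.parser')  # or 'lxml'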
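
As a minimal sketch of the constructor call described above, parsing a small string with the built-in parser might look like this. The sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page
markup = '<html><head><title>Demo page</title></head><body><p class="intro">Hello</p></body></html>'

# 'html.parser' ships with Python; swap in 'lxml' if it is installed
soup = BeautifulSoup(markup, 'html.parser')

title_text = soup.title.string   # text of the <title> tag
intro_classes = soup.p['class']  # class is multi-valued, so this is a list
print(title_text, intro_classes)
```

Either parser accepts the same calls afterwards, so the choice mostly affects speed and how broken markup gets repaired.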

2. Traversing the document tree

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id='id_xx' xx='zz'>lqz <b>The Dormouse's story <span>彭于晏</span></b>  xx</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

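if __name__ == '__main__':
    # soup = BeautifulSoup(html_doc, 'html.parser')
    soup = BeautifulSoup(html_doc, 'lxml')  # pip install lxml
    # print(soup.find_all(name='html'))

    # 1. Fault tolerance: the parser repairs the unclosed tags in html_doc
    res = soup.prettify()
    print(res)

    # 2. Traversing the document tree (from the opening <html> to the closing </html>, with many tags in between)
    # Attribute access with "." returns only the first matching tag
    print(soup.html)
    print(soup.html.body.p)  # walk down level by level to the target tag
    print(soup.p)  # skip levels and look the tag up directly

    # 3. Get a tag's name
    print(soup.body.name)

    # 4. Get a tag's attributes
    p = soup.html.body.p
    print(p.attrs)  # all attributes of the p tag
    print(p.attrs['class'])  # one attribute; class can hold several values, so it comes back as a list: ['title']
    print(p.attrs.get('xx'))
    print(soup.a.attrs['href'])

    # 5. Get a tag's content
    # tag.text: all the text inside the tag, nested tags included
    print(soup.p.b.text)
    # tag.string: returns the text only when the tag has a single child; otherwise it is None
    print(soup.p.b.string)  # None, because b holds both text and a child tag
    print(soup.p.b.span.string)  # 彭于晏
    # tag.strings: a generator that yields the text of every descendant
    print(soup.p.b.strings)  # similar to text, but lazy, so it uses less memory
    print(list(soup.p.b.strings))  # ["The Dormouse's story ", '彭于晏']

    # 6. Nested selection
    print(soup.html.head.title)

    '''------ Optional material ------'''
    # 7. Children and descendants
    print(soup.p.contents)  # list of the direct children of the first p tag
    print(soup.p.children)  # the same direct children, but as an iterator
    for i,child in enumerate(soup.p.children):
        print(i,child)
    print(soup.p.descendants)  # generator over every descendant of p: nested tags and their text
    for i,child in enumerate(soup.p.descendants):
        print(i,child)


    # 8. Parent and ancestors
    print(soup.a.parent)  # parent of the first a tag
    print(soup.a.parents)  # all ancestors of the a tag, as a generator
    print(list(soup.a.parents))


    # 9. Siblings (text between tags counts as a sibling node too)
    print(soup.a.next_sibling)  # next sibling node
    print(soup.a.previous_sibling)  # previous sibling node
    print(list(soup.a.next_siblings))  # all following siblings (a generator, turned into a list here)
    print(soup.a.previous_siblings)  # all preceding siblings (prints the generator object)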
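
One detail of sibling navigation is easy to miss: the whitespace between two tags is itself a node, so next_sibling often returns a text node rather than the next tag. The sketch below uses a cut-down version of the sample markup to show the difference, and uses find_next_sibling to skip to the next tag. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Two links separated by a newline, as in the sample document
html = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
text_node = first.next_sibling           # the newline between the two tags
next_tag = first.find_next_sibling('a')  # skips text nodes and returns the next <a>

print(repr(text_node))
print(next_tag['id'])
```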

3. Searching the document tree

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
"""Five kinds of filters: string, regular expression, list, True, function"""
# find returns the first match, find_all returns every match
# 1. String: the condition is a plain string
res = soup.find(id='my_p')
res=soup.find(class_='boldest')
res=soup.find(href='http://example.com/elsie')
res=soup.find(name='a',href='http://example.com/elsie',id='link1') # several conditions, combined with AND
# The same kind of query can use attrs, but name cannot go inside the attrs dict
res = soup.find(attrs={'class':'sister','href':'http://example.com/elsie'})
print(res)


# 2. Regular expression
import re
res = soup.find_all(href=re.compile('^http'))  # every tag whose href starts with http
res = soup.find_all(class_=re.compile('^s'))  # every tag with a class that starts with s
print(res)

# 3. List
res = soup.find_all(name=['a','b']) # all a and b tags
res = soup.find_all(class_=['sister','boldest']) # tags whose class is sister or boldest
print(res)


# 4. Boolean
res = soup.find_all(id=True) # every tag that has an id
res = soup.find_all(href=True)  # every tag that has an href
res = soup.find_all(class_=True)  # every tag that has a class
print(res)


# 5. Function
def has_class_but_no_id(tag):
    # Match tags that have a class attribute but no id
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))


# 6. Searching can be combined with traversal
print(soup.html.body.find_all('p')) # narrows the search to body, which is a little faster


# 7. The limit and recursive parameters (defaults: recursive=True, no limit)
print(soup.find_all(name='p',limit=2)) # stop after the first two p tags
print(soup.find_all(name='p',recursive=False)) # direct children only; the p tags are nested deeper, so this prints []
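
To make the filter types easier to compare, here is a small self-contained sketch that runs a function filter, a regex filter, and a list filter against a shortened copy of the sample markup and collects only tag names or ids. The shortened markup and the result variable names are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<p id="my_p" class="title"><b id="bbb" class="boldest">Story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')

def has_class_but_no_id(tag):
    # True for tags that carry a class attribute but no id
    return tag.has_attr('class') and not tag.has_attr('id')

by_function = [t.name for t in soup.find_all(has_class_but_no_id)]
by_regex = [t['id'] for t in soup.find_all(id=re.compile('^link'))]
by_list = [t.name for t in soup.find_all(['a', 'b'])]

print(by_function, by_regex, by_list)
```

Results come back in document order, which is why the b tag is listed before the two a tags in the last result.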

4. CSS selectors

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title">asdfasdf<b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
'''select takes a CSS selector'''
res = soup.select('a.sister')
res = soup.select('#link1')
res = soup.select('p#my_p b')
print(res)

'''In the browser's developer tools, right-click an element and choose Copy selector to get its CSS selector'''
import requests
header={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
}
res=requests.get('https://www.zdaye.com/free/',headers=header)
# print(res.text)
soup=BeautifulSoup(res.text,'lxml')
# Note: browsers insert <tbody> into tables on their own, so a copied selector may contain a tbody that the raw HTML lacks
res = soup.select('#ipc > tbody > tr:nth-child(2) > td.mtd')
print(res[0].text)
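
A selector copied from the browser does not always match the HTML that requests downloads, because the browser may have added elements such as tbody while building the page. The sketch below reproduces that situation on a made-up table (the id and class names only mirror the ones above, and the IP is a documentation address) and shows a looser selector that still matches.

```python
from bs4 import BeautifulSoup

# Made-up table without <tbody>, as raw HTML is often written
html = """
<table id="ipc">
<tr><th>IP</th></tr>
<tr><td class="mtd">192.0.2.1</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

strict = soup.select('#ipc > tbody > tr > td.mtd')  # copied-style selector with tbody
loose = soup.select('#ipc tr td.mtd')               # descendant selector, no tbody needed
first = soup.select_one('#ipc td.mtd')              # select_one returns a single tag or None

print(len(strict), len(loose), first.text)
```

Checking that the result list is non-empty before indexing it avoids an IndexError when the selector misses.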

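5. Introducing and installing selenium

selenium started out as a browser automation tool for testing. Crawlers use it mainly because requests cannot execute JavaScript, so it cannot see content that a page renders with scripts.

selenium drives a real browser and simulates what a user does (navigating, typing, clicking, scrolling, and so on) to get the page as it looks after rendering. It supports several browsers.

Documentation: https://selenium-python.readthedocs.io/

Installation and usage

	1. Install the selenium module
		pip install selenium
		
	# Recent selenium versions fetch a matching driver automatically, so step 2 is only needed if that does not work for you
	2. Download a browser driver (the driver version must match the version of the browser, e.g. chrome or edge)
		- chrome driver downloads: https://googlechromelabs.github.io/chrome-for-testing/
		- then add the driver's location to the PATH environment variable, or put the driver in the project root

	3. Basic usage
		from selenium import webdriver
		import time
		bro = webdriver.Chrome()  # launches a browser window
		bro.get('https://www.bing.com')  # enter the URL in the browser and load it
		time.sleep(5)
		print(bro.page_source)  # HTML of the page
		# page_source reflects the page after its JS has run, so it can be handed to bs4 for parsing
		bro.close()  # close the browser window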
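
The hand-off from selenium to bs4 mentioned in the basic usage can be sketched as a small helper that takes the rendered HTML as a string. The function name extract_links and the sample string are hypothetical; with a real browser you would pass bro.page_source instead.

```python
from bs4 import BeautifulSoup

def extract_links(page_source):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(page_source, 'html.parser')
    return [(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]

# Stand-in for bro.page_source so the sketch runs without a browser
sample_source = '<html><body><a href="/a">First</a><a>no href</a><a href="/b"> Second </a></body></html>'
links = extract_links(sample_source)
print(links)
```

Keeping the parsing in a function that only needs a string makes it easy to test without starting a browser.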

6. Simulating a login with selenium (Baidu)

	from selenium import webdriver
	from selenium.webdriver.common.by import By
	import time
	bro = webdriver.Chrome()  # browser
	bro.get('https://www.baidu.com/')
	# Implicit wait: element lookups keep retrying for up to 10 s before raising an error
	bro.implicitly_wait(10)  # gives the page time to load
	bro.maximize_window()    # maximize the browser window
	time.sleep(3)  # pause 3 s before continuing
	# Locate the login button
	submit_btn = bro.find_element(by=By.ID,value='s-top-loginbtn')
	# submit_btn = bro.find_element(by=By.LINK_TEXT,value='登录')  # find by the a tag's link text
	# Click it
	submit_btn.click()
	time.sleep(5)
	# Switch to SMS login, then back to account login
	sms_submit = bro.find_element(by=By.ID,value='TANGRAM__PSP_11__changeSmsCodeItem') # keyword arguments
	sms_submit.click()
	time.sleep(2)
	username_submit = bro.find_element(By.ID,'TANGRAM__PSP_11__changePwdCodeItem') # positional arguments
	username_submit.click()
	time.sleep(2)
	# Enter the username and password, tick the agreement box, then click login
	# Username
	username = bro.find_element(By.ID,'TANGRAM__PSP_11__userName')
	username.send_keys('123456789963') # type into the input box
	time.sleep(1)
	# Password
	password = bro.find_element(By.ID,'TANGRAM__PSP_11__password')
	password.send_keys('abcdefjhijkl')
	time.sleep(1)
	# Tick the "accept agreement" checkbox
	accept= bro.find_element(By.ID,'TANGRAM__PSP_11__isAgree')
	accept.click()
	time.sleep(1)
	# Click login
	submit = bro.find_element(By.ID,'TANGRAM__PSP_11__submit')
	submit.click()
	
	bro.close()

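7. Headless browser

When scraping, you usually do not want a browser window to pop up. Chrome supports headless mode: it runs in the background without a graphical (GUI) window.

	import time
	from selenium import webdriver
	from selenium.webdriver.chrome.options import Options
	
	options = Options()
	options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, which speeds things up
	options.add_argument('--headless')  # no visible window; on Linux without a display, Chrome fails to start without this
	bro = webdriver.Chrome(options=options)
	bro.get('https://www.baidu.com') # target URL
	print(bro.page_source) # the page content as the browser sees it
	time.sleep(2)
	bro.close()  # close the current tab
	bro.quit()  # quit the browser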

8. Finding elements

	'''############ 1. Finding elements ##############
	# By.ID  # find by id
	# By.NAME  # find by the name attribute
	# By.TAG_NAME  # find by tag name
	# By.CLASS_NAME # find by class name
	# By.LINK_TEXT # find by the text of an a tag
	# By.PARTIAL_LINK_TEXT # find by the text of an a tag, partial match
	--------- the above are selenium's own strategies --------
	
	# By.CSS_SELECTOR # find by CSS selector
	# By.XPATH  # find by XPath
	'''
	
	import time
	from selenium import webdriver
	from selenium.webdriver.common.by import By
	bro = webdriver.Chrome()
	bro.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')
	# bro.implicitly_wait(10)
	bro.maximize_window()
	
	# bro.find_element(by, value)   returns the first match
	# bro.find_elements(by, value)  returns a list of all matches
	# The numbered blocks below are separate demos; try them one at a time, since clicking a link may leave the page
	
	# 1. By id: find the like counter and click it
	number=bro.find_element(By.ID,'digg_count')
	number.click()
	
	# 2. By tag name: find every a tag on the page
	a_list=bro.find_elements(By.TAG_NAME,'a')
	print(len(a_list))
	
	# 3. By class name
	dig=bro.find_element(By.CLASS_NAME,'diggit')
	dig.click()
	
	# 4. By the text of an a tag: By.LINK_TEXT
	res=bro.find_element(By.LINK_TEXT,'分布式爬虫')
	print(res.text)
	print(res.get_attribute('href'))
	res.click()
	
	# 5. By the text of an a tag, partial match: By.PARTIAL_LINK_TEXT
	res=bro.find_element(By.PARTIAL_LINK_TEXT,'分布式')
	print(res.text)
	print(res.get_attribute('href'))
	res.click()
	
	# 6. By CSS selector
	res=bro.find_element(By.CSS_SELECTOR,'a#cb_post_title_url>span')
	res=bro.find_element(By.CSS_SELECTOR,'#cb_post_title_url > span')
	print(res.get_attribute('role'))
	print(res.text)
	
	# 7. By XPath (XPath syntax is not covered here)
	res=bro.find_element(By.XPATH,'//*[@id="cb_post_title_url"]/span')
	print(res.get_attribute('role'))
	print(res.text)
	
	time.sleep(5)
	bro.close()

9. Element attributes, position, size, and text

	'''
	############ 2. Getting an element's attributes, text, id (selenium's internal id, not the id attribute), size, and position ##############
	print(tag.get_attribute('src'))
	print(tag.text)
	print(tag.id)  # selenium's internal element id, not the HTML id; can be ignored
	print(tag.location)
	print(tag.tag_name)
	print(tag.size)
	# Position and size are used later to cut a captcha out of a screenshot so it can be solved
	'''
	import time
	
	from selenium import webdriver
	from selenium.webdriver.common.by import By
	
	bro = webdriver.Chrome()
	bro.get('https://kyfw.12306.cn/otn/resources/login.html')
	bro.implicitly_wait(10)
	bro.maximize_window()
	a = bro.find_element(By.LINK_TEXT, '扫码登录')
	a.click()
	time.sleep(1)
	bro.save_screenshot('main.png')  # screenshot of the whole page
	
	# Print the element's position and size
	img=bro.find_element(By.ID,'J-qrImg')
	print(img.location)
	print(img.size)
	
	time.sleep(5)
	bro.close()
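
Since location and size are meant for cutting an element out of the full-page screenshot, here is a sketch of the arithmetic involved. The helper name crop_box and its scale parameter are assumptions (scale is there for high-DPI screens, where screenshot pixels may not match page coordinates). The resulting tuple has the (left, top, right, bottom) shape that an image library such as Pillow expects for cropping.

```python
def crop_box(location, size, scale=1.0):
    """Turn selenium-style location/size dicts into a (left, top, right, bottom) box."""
    left = location['x'] * scale
    top = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    bottom = (location['y'] + size['height']) * scale
    return tuple(int(round(v)) for v in (left, top, right, bottom))

# Hypothetical values shaped like img.location and img.size
box = crop_box({'x': 100, 'y': 200}, {'width': 150, 'height': 150})
print(box)
```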

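10. Waiting for elements

	# 1. Implicit wait
	bro.implicitly_wait(10)
	# find_element may run before the element has loaded:
	# the code runs much faster than the page, and a lookup that finds nothing raises an error
	
	'''With this line, when an element is not ready yet, the lookup
	waits up to 10 s and continues as soon as the element appears.'''
	
	
	# 2. Explicit wait: more verbose
	#	- every element lookup needs its own wait, which gets tedious
	#	- this tutorial skips it
	
	    
	# From now on, add this line right after opening a page (it stays in effect for the rest of the session)
	bro.implicitly_wait(10)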

11. Executing JS code

	import time
	from selenium import webdriver
	from selenium.webdriver.common.by import By
	
	bro = webdriver.Chrome()
	bro.get('https://www.pearvideo.com/category_1')
	bro.implicitly_wait(10)
	bro.maximize_window()
	
	# The numbered snippets are separate demos; an open alert blocks later commands until it is dismissed
	# 1. Basic usage
	bro.execute_script('alert("hello")')
	
	# 2. Read variables defined by the page's own scripts (urlMap here)
	bro.execute_script('console.log(urlMap)')
	bro.execute_script('alert(JSON.stringify(urlMap))')

	# 3. Open a new tab
	bro.execute_script('open()')
	
	# 4. Scroll to the bottom of the page
	bro.execute_script('scrollTo(0,document.documentElement.scrollHeight)')
	
	# 5. Show the current URL, then navigate by assigning to location
	bro.execute_script('alert(location)')
	bro.execute_script('location="http://www.baidu.com"')
	
	# 6. Show the cookies
	bro.execute_script('alert(document.cookie)')
	
	
	time.sleep(10)
	bro.close()
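
The scrollTo call above is often repeated on pages that load more content as you scroll. A sketch of that loop follows. The helper name scroll_to_bottom and the pause and max_rounds parameters are assumptions, and it relies on execute_script returning the value of a JS return statement.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing; return the final height."""
    height_js = 'return document.documentElement.scrollHeight'
    last_height = driver.execute_script(height_js)
    for _ in range(max_rounds):
        driver.execute_script('scrollTo(0, document.documentElement.scrollHeight)')
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height

# Usage with a real driver (not executed here):
# scroll_to_bottom(bro)
```

The max_rounds cap keeps the loop from running forever on pages that never stop loading.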

12. Switching tabs

	from selenium import webdriver
	import time
	
	bro = webdriver.Chrome()
	bro.get('https://www.pearvideo.com/')
	bro.implicitly_wait(10)
	print(bro.window_handles)  # handles of all open tabs
	# Open a new tab
	bro.execute_script('window.open()')
	
	bro.switch_to.window(bro.window_handles[1]) # switch to the new tab
	bro.get('http://www.taobao.com')
	
	time.sleep(2)
	bro.switch_to.window(bro.window_handles[0]) # switch back to the first tab
	bro.get('http://www.baidu.com')
	
	time.sleep(2)
	bro.execute_script('window.open()')
	bro.execute_script('window.open()')
	bro.close() # close the current tab
	
	time.sleep(2)
	bro.quit()  # quit the browser, closing every tab

13. Browser back and forward

	from selenium import webdriver
	import time
	bro = webdriver.Chrome()
	bro.get('https://www.pearvideo.com/')
	bro.implicitly_wait(10)
	
	# Visit a few pages to build up history
	time.sleep(2)
	bro.get('http://www.taobao.com')
	
	time.sleep(2)
	
	bro.get('http://www.baidu.com')
	time.sleep(2)
	bro.back()  # go back one page
	time.sleep(2)
	bro.back()
	time.sleep(2)
	bro.forward()  # go forward one page
	bro.quit()  # quit the browser

14. Logging in to cnblogs with selenium and saving the cookies

	import json
	import time
	from selenium import webdriver
	from selenium.common.exceptions import NoSuchElementException
	from selenium.webdriver.common.by import By
	from selenium.webdriver.chrome.options import Options
	# Make it harder for the site to detect that the browser is automated
	options = Options()
	options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag
	bro = webdriver.Chrome(options=options)
	bro.get('https://www.cnblogs.com/')
	bro.implicitly_wait(10)
	bro.maximize_window()
	
	
	login_btn = bro.find_element(By.LINK_TEXT,'登录')
	login_btn.click()
	time.sleep(2)
	
	# Locate the username and password inputs
	username = bro.find_element(By.CSS_SELECTOR, '#mat-input-0')
	password = bro.find_element(By.ID, 'mat-input-1')
	username.send_keys('@qq.com')  # account
	time.sleep(2)
	password.send_keys('#')  # password
	time.sleep(2)
	
	submit_btn = bro.find_element(By.CSS_SELECTOR,'body > app-root > app-sign-in-layout > div > div > app-sign-in > app-content-container > div > div > div > form > div > button')
	time.sleep(2)
	submit_btn.click()  # either the login succeeds right away, or a captcha pops up
	
	# If the captcha appears, click it; if it does not, the lookup fails and is skipped
	try:
	    code=bro.find_element(By.ID,'rectBottom')
	    code.click()
	except NoSuchElementException:
	    pass
	time.sleep(2)
	time.sleep(5)
	# Get the cookies
	cookies = bro.get_cookies() # all cookies
	# bro.get_cookie(name) # a single cookie by name
	# Save the cookies locally as JSON; later they could be kept in redis instead
	
	with open('cnblogs.json', 'wt', encoding='utf-8') as f:
	    json.dump(cookies, f)
	    
	bro.close()

15. Logging in to cnblogs with the saved cookies

	import time
	from selenium import webdriver
	from selenium.webdriver.common.by import By
	from selenium.webdriver.chrome.options import Options
	# Make it harder for the site to detect that the browser is automated
	options = Options()
	options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag
	bro = webdriver.Chrome(options=options)
	bro.get('https://www.cnblogs.com/')
	bro.implicitly_wait(10)
	bro.maximize_window()
	
	import json
	# Read the saved cookies and add them to the browser
	with open('cnblogs.json','rt',encoding='utf-8')as f:
	    cookies=json.load(f)
	# cookies is a list of dicts
	for cookie in cookies:
	    bro.add_cookie(cookie)
	time.sleep(2)  # pause 2 s
	
	bro.refresh() # reload the page, now with the cookies applied
	time.sleep(5)
	bro.close()
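
The save and load steps from the last two sections can be folded into two small helpers. This is a sketch under assumptions: the function names save_cookies and load_cookies are made up, and the driver argument only needs the get_cookies and add_cookie methods used above.

```python
import json

def save_cookies(driver, path):
    """Write driver.get_cookies() (a list of dicts) to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(driver.get_cookies(), f)

def load_cookies(driver, path):
    """Read cookies from a JSON file, add each one to the driver, return how many were added."""
    with open(path, 'r', encoding='utf-8') as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)

# Usage with a real driver (not executed here):
# save_cookies(bro, 'cnblogs.json')
# load_cookies(bro, 'cnblogs.json')  # call after bro.get(...) on the same site
```

As in the code above, the browser has to be on the site already when the cookies are added, and a refresh afterwards makes the page pick them up.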
