Python / Web scraping / BeautifulSoup: using bs4

Latest recommended article published on 2024-09-27 16:14:00

ededabo

Tags: python, web scraping, beautifulsoup, css, regular expressions

Article link: https://blog.csdn.net/ededabo/article/details/142580660

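1. The first step, of course, is installation

Installation: install the beautifulsoup4 package, plus the lxml parser that the examples in this article pass to BeautifulSoup.

pip install beautifulsoup4 lxml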

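2. Preparing the data

Creating the soup object parses the document and gives you a handle for querying it. The parser also fills in missing HTML structure, such as unclosed tags.

from bs4 import BeautifulSoup

# Argument 1: the target string (the HTML to parse)
# Argument 2: the parser to use
soup = BeautifulSoup(html, 'lxml')

Here html is a variable that holds the prepared data.

That text can also be obtained by other means, for example by fetching it from a web page.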

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
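
To make the "fills in missing structure" point concrete, here is a minimal sketch that parses a deliberately truncated fragment. It is an illustration, not part of the original article: the `broken` string is made up, and it uses Python's built-in `html.parser` rather than `lxml` so that it runs without extra dependencies (an assumption on my part; different parsers may repair the same markup slightly differently).

```python
from bs4 import BeautifulSoup

# Hypothetical input: <p>, <body> and <html> are never closed.
broken = '<html><body><p class="story">Once upon a time'

# "html.parser" ships with Python; the article itself uses "lxml".
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags that were added.
repaired = str(soup)
print(repaired)

# The text is still reachable through the repaired <p> tag.
story_text = soup.p.get_text()
print(story_text)
```

The sample document above also leaves `<body>` and `<html>` unclosed, so it is repaired in the same way when parsed.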

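3. Methods for finding elements

find(): finds the first matching element.
find_all() (legacy alias findAll()): finds all matching elements.
find_parent(): finds the nearest ancestor of the current element that matches.
find_next(): finds the next matching element after the current element in document order; it need not be a sibling.
find_next_sibling(): finds the next matching sibling of the current element.
find_previous(): finds the closest matching element before the current element in document order; it need not be a sibling.
find_previous_sibling(): finds the previous matching sibling of the current element.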
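
The difference between the sibling variants and the document-order variants is easiest to see side by side. The following is a small sketch using a shortened version of the sample document; the variable names are illustrative, and `html.parser` is used only so the snippet runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                       # first <a> in the document
all_links = soup.find_all("a", class_="sister")   # list of every matching <a>

second = soup.find("a", id="link2")
parent_p = second.find_parent("p")                # nearest enclosing <p>
next_sib = second.find_next_sibling("a")          # sibling after link2
prev_sib = second.find_previous_sibling("a")      # sibling before link2

# Document-order search: the <a> tags are not siblings of <b>,
# but find_next() still reaches the first one.
next_any = soup.find("b").find_next("a")
# Searching backwards from link2 reaches the <b> in the previous paragraph.
prev_any = second.find_previous("b")

print(first_link["id"], len(all_links), parent_p["class"])
print(next_sib["id"], prev_sib["id"], next_any["id"], prev_any.name)
```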

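4. Attribute access

Accessing a tag name as an attribute (for example soup.title) returns the first matching element, or None if the document has no such tag.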
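
A short sketch of attribute access, assuming a trimmed-down version of the sample document. It also shows that the shortcut gives the same result as `find()`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a></body></html>"""
soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed

title_tag = soup.title    # first <title>
first_link = soup.a       # first <a>, even though there are two
missing = soup.table      # no <table> in the document, so this is None

# The attribute shortcut and find() return the same element.
same_as_find = soup.a == soup.find("a")

print(title_tag.name, first_link["id"], missing, same_as_find)
```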

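5. String access

string: returns the element's text when the element contains exactly one text node, either directly or through a single nested child; otherwise it is None.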
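
Because `string` only works when there is a single text node, it helps to compare it with `get_text()` and `stripped_strings`. This is a hedged sketch on a made-up fragment. The last lines show that an HTML comment inside a tag, as in the `link1` element of the sample data, comes back as a `Comment` object.

```python
from bs4 import BeautifulSoup, Comment

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# <p> has one child <b>, which has one text node, so .string resolves.
single = title_p.string
# The story paragraph mixes text and tags, so .string is None.
mixed = story_p.string

# get_text() joins every text node; stripped_strings yields them one by one.
full_text = story_p.get_text()
pieces = list(story_p.stripped_strings)

# A tag whose only child is a comment returns a Comment from .string.
comment_link = BeautifulSoup("<a><!-- Elsie --></a>", "html.parser").a
is_comment = isinstance(comment_link.string, Comment)

print(single, mixed, full_text, pieces, is_comment)
```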

6. CSS selectors

select(): finds all elements that match a CSS selector.
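
A few common selector forms, sketched on a made-up fragment. `select_one()` is included as the single-result counterpart of `select()`; the extra `outside` link exists only to show that the selectors filter correctly.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<a href="http://example.com/other" id="outside">Other</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.sister")            # tag + class
children = soup.select("p.story > a")         # direct children of p.story
by_id = soup.select_one("#link2")             # first match only (or None)
by_attr = soup.select('a[href$="tillie"]')    # href ends with "tillie"
no_match = soup.select("div.missing")         # empty list, not None

print([a["id"] for a in by_class])
print(by_id.get_text(), [a["id"] for a in by_attr], len(no_match))
```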

7. Navigation

head: the document's <head> section.
body: the document's <body> section.
title: the document's <title> tag.
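
These shortcuts can be chained to walk down the tree one tag name at a time. A minimal sketch, assuming a small complete document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Each step returns the first tag with that name inside the previous one.
head_children = [child.name for child in soup.head.children]
title_via_head = soup.head.title.string
bold_text = soup.body.p.b.string

print(head_children, title_via_head, bold_text)
```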

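8. Other navigation attributes

contents: a list of the element's direct children (tags and text nodes).
children: an iterator over the element's direct children.
descendants: a generator over all of the element's descendants, at every depth.
next_siblings: a generator over all siblings after the element.
previous_siblings: a generator over all siblings before the element, nearest first.
parent: the element's parent.
parents: a generator over all of the element's ancestors, from the parent up to the document root.
next_element (written as next in older code): the node parsed immediately after the element.
previous_element (written as previous in older code): the node parsed immediately before the element.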
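
The sketch below exercises these attributes on a tiny, whitespace-free fragment so that the results are easy to predict. The markup and variable names are made up for illustration. Note that text nodes count as children and descendants too.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>one<b>bold</b></p><p>two</p><p>three</p></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.div

first_p, second_p, third_p = box.contents          # list of direct children
child_names = [child.name for child in box.children]

# Descendants of the first <p>: the text "one", the <b> tag, the text "bold".
descendant_count = len(list(first_p.descendants))

after_first = [sib.get_text() for sib in first_p.next_siblings]
before_third = [sib.get_text() for sib in third_p.previous_siblings]

parent_id = first_p.parent.get("id")
ancestor_names = [anc.name for anc in soup.b.parents]

# next_element / previous_element follow parse order, not tree level.
after_b = soup.b.next_element       # the text node inside <b>
before_b = soup.b.previous_element  # the text node just before <b>

print(child_names, descendant_count, after_first, before_third)
print(parent_id, ancestor_names, after_b, before_b)
```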

9. Helper methods

prettify(): returns the HTML or XML as a nicely indented string.
encode(): returns the document serialized as bytes (UTF-8 by default).
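
A brief sketch of the two helpers on a made-up fragment containing a non-ASCII character, to show that `prettify()` gives back a `str` while `encode()` gives back `bytes`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café <b>menu</b></p>", "html.parser")

# prettify() puts each tag and text node on its own indented line.
pretty = soup.prettify()
print(pretty)

# encode() serializes the tree to bytes in the requested encoding.
raw = soup.encode("utf-8")
print(raw)
```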

10. Complete reference example

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>
    <p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# Using find()
title_tag = soup.find('title')
print(title_tag)

# Using find_all()
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.get('href'))

# Using a CSS selector
css_a_tags = soup.select('a.sister')
for tag in css_a_tags:
    print(tag.get('href'))

# Getting an element's text
print(soup.title.string)

# Using the navigation shortcuts
print(soup.head)
print(soup.body)

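# Getting an element's attributes: .attrs is a dict of all attributes
print(soup.title.attrs)      # {} because <title> has no attributes
print(soup.find('a').attrs)  # href, class and id of the first link

# Getting an element's children: .contents belongs to a single tag,
# so pick one <p> from the list returned by find_all()
p_tags = soup.find_all('p')
for child in p_tags[1].contents:
    # Text nodes are str subclasses; print only those
    if isinstance(child, str):
        print(child.strip())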

# Getting an element's direct children
children_of_p = p_tags[0].children
for child in children_of_p:
    print(child)

# Getting all of an element's descendants
for descendant in p_tags[0].descendants:
    if descendant.string:
        print(descendant.string)

# Getting the parent element
print(a_tags[0].parent)

# Getting all ancestor elements
for parent in a_tags[0].parents:
    print(parent)