How to Use BeautifulSoup

HoneyGrapefruit

Tags: python, web scraping, pycharm

Article link: https://blog.csdn.net/lemon_review/article/details/121755162

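BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers, and it lets you extract information from a page without writing regular expressions.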

Installation

Run pip install beautifulsoup4, or search for beautifulsoup4 on PyCharm's third-party package page and install it from there.

Usage

Parsers

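Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Built into Python, moderate speed, tolerant of malformed documents	Poor tolerance of malformed documents in old Python versions
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast, tolerant of malformed documents	Requires the lxml C library to be installed
lxml XML parser	BeautifulSoup(markup, 'xml')	Fast, the only parser that supports XML	Requires the lxml C library to be installed
html5lib	BeautifulSoup(markup, 'html5lib')	Best tolerance of malformed documents, parses pages the way a browser does, produces valid HTML5	Slow; does not depend on a C extension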
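Since lxml and html5lib are separate installs, it can help to check which parsers are actually usable before picking one. The following is only a sketch: the helper name available_parsers is made up for illustration, and it relies on bs4 raising FeatureNotFound when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("html.parser", "lxml", "xml", "html5lib")):
    """Return the parser names from `candidates` that can be loaded here."""
    usable = []
    for name in candidates:
        try:
            BeautifulSoup("<p>test</p>", name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is not installed
            continue
        usable.append(name)
    return usable


print(available_parsers())
```

html.parser ships with Python, so it should always appear in the result; the other names appear only if their packages are installed.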

Basic usage

Create a parser object: BeautifulSoup(html_text, parser)

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
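soup = BeautifulSoup(html, 'lxml')    # parse with the lxml HTML parser (requires lxml)
print(soup.prettify())                # the repaired document, indented
print(soup.title.string)              # The Dormouse's story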

Tag selectors

soup.tag_name (returns only the first tag with that name)

# Get the title tag
print(soup.title)
print(type(soup.title))

# Get the head tag
print(soup.head)

# Get the first p tag
print(soup.p)
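One detail that is easy to miss: attribute-style access returns only the first matching tag, and a tag that does not exist gives None instead of raising. A small sketch on a made-up two-paragraph document (not the sample above):

```python
from bs4 import BeautifulSoup

# Hypothetical mini document, parsed with the built-in parser
doc = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")

first_p = doc.p              # only the first <p>
missing = doc.table          # no <table> in the document, so this is None
all_p = doc.find_all("p")    # every <p>, as a list

print(first_p, missing, len(all_p))
```

Because a missing tag yields None, chained access such as doc.table.tr would fail with AttributeError, so it is worth checking for None first.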

Getting the name

tag.name

print(soup.title.name)    # title

Getting attributes

tag.attrs[attribute_name]

print(soup.a.attrs['href'])   # http://example.com/elsie
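Besides attrs, a tag can be indexed like a dictionary. The sketch below uses a made-up single-link snippet to show the shorthand, the safe .get() lookup, and how a multi-valued attribute such as class comes back.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a two-value class attribute
snippet = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

href = link["href"]           # shorthand for link.attrs["href"]; KeyError if absent
missing = link.get("title")   # .get() returns None instead of raising
classes = link["class"]       # class is multi-valued, so it is returned as a list

print(href, missing, classes)
```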

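Getting content

tag.string (the tag's single text child, or None if the tag has several children)
tag.get_text() (all nested text joined into one string)
tag.contents (a list of the tag's direct children)

print(soup.p.string)    # The Dormouse's story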
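The three accessors behave differently once a tag mixes text with nested tags. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that mixes a nested tag with plain text
doc = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")

single = doc.b.string          # <b> has exactly one text child
mixed = doc.p.string           # <p> has two children, so .string is None
full_text = doc.p.get_text()   # all nested text joined together
parts = doc.p.contents         # direct children: the <b> tag and a text node

print(repr(single), repr(mixed), repr(full_text), len(parts))
```

In practice get_text() is the safer choice when you are not sure how a tag is nested.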

Nested selection

soup.tag1.tag2

print(soup.head.title.string)

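Children and descendants

Children: tag.children (an iterator over the direct children)
Descendants: tag.descendants (an iterator over every nested node)

print(soup.p.contents)

# The sample document has no <div>, so iterate over <body> instead
for x in soup.body.children:
    print('x:', x)

for x in soup.body.descendants:
    print('x:', x)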
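To make the difference concrete, here is a small sketch on a made-up fragment: children stops at the first level, while descendants walks all the way down and includes text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <div> holding two paragraphs
doc = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")

children = list(doc.div.children)         # the two <p> tags only
descendants = list(doc.div.descendants)   # tags and text nodes at every depth

print(len(children), len(descendants))
```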

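Parent and ancestors

Parent: tag.parent
Ancestors: tag.parents (an iterator, nearest ancestor first)

# The sample document has no <span>, so start from the first <a> instead
print(soup.a.parent)

for x in soup.a.parents:
    print('x:', x)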
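Printing whole ancestor tags gets noisy, so a quick way to see the chain is to print only the names. A sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document
doc = BeautifulSoup("<html><body><p><b>hi</b></p></body></html>", "html.parser")
bold = doc.b

parent_name = bold.parent.name                # the direct parent
ancestor_names = [a.name for a in bold.parents]   # nearest ancestor first

print(parent_name, ancestor_names)
```

The last entry in the chain is the BeautifulSoup object itself, whose name is '[document]'.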

Siblings

tag.next_siblings (an iterator over the siblings after the tag)
tag.previous_siblings (an iterator over the siblings before the tag)

print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
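Sibling iterators also yield the text between tags (commas, whitespace and so on), which is often not what you want. One possible way to keep only tags, sketched on a made-up paragraph:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical paragraph with three links separated by text
doc = BeautifulSoup(
    '<p><a id="1">A</a>, <a id="2">B</a> and <a id="3">C</a></p>',
    "html.parser",
)
first = doc.a

all_next = list(first.next_siblings)                      # text nodes and tags
tag_next = [s for s in all_next if isinstance(s, Tag)]    # tags only

print(len(all_next), [t["id"] for t in tag_next])
```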

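Standard selectors

Find tags by name: soup_or_tag.find_all(tag_name)
Find tags by attribute value: soup_or_tag.find_all(attrs={attribute_name: value})
Find by text content: soup_or_tag.find_all(string=content), which returns the matching strings rather than tags (text= is the older, deprecated name of this argument)

find_all returns every match as a list. find takes the same arguments but returns only the first match, or None when nothing matches.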

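# These calls assume soup was built from a page that contains <ul> lists,
# for example <ul id="list-1" name="elements"> with <li class="element"> items.
# The Dormouse sample above has none, so build soup from such a page first.
uls = soup.find_all('ul')
print(uls)
print(type(uls[0]))    # <class 'bs4.element.Tag'>
for ul in uls:
    print(ul.find_all('li'))

# Match on attribute values through attrs
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# Common attributes can be passed as keywords; class needs a trailing underscore
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

# Match text nodes; string= replaces the deprecated text= argument
print(soup.find_all(string='Foo'))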
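The find_all calls here target a page with lists, which the Dormouse sample does not have. The HTML below is an assumed stand-in, reconstructed only from the names used in the calls (list-1, elements, element, Foo), so treat it as an illustration rather than the original page.

```python
from bs4 import BeautifulSoup

# Assumed sample page; the structure is guessed from the selectors in the text
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all("ul")                             # by tag name
by_id = soup.find_all(id="list-1")                    # by keyword attribute
by_name = soup.find_all(attrs={"name": "elements"})   # by attrs dictionary
items = soup.find_all(class_="element")               # by CSS class
foo = soup.find_all(string="Foo")                     # text nodes equal to "Foo"
first_li = soup.find("li")                            # first match only

print(len(uls), len(by_id), len(by_name), len(items), foo, first_li)
```

The name attribute has to go through attrs because name is already the first positional parameter of find_all.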

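find_parents() returns all matching ancestors; find_parent() returns the nearest one (the direct parent when no filter is given).
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the current node; find_next() returns the first such node.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such node.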
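These methods all start from a tag you already have and search outward from it. A short sketch on a made-up three-item list, starting from the middle item:

```python
from bs4 import BeautifulSoup

# Hypothetical list; the search starts from the middle <li>
doc = BeautifulSoup(
    '<ul id="menu"><li>one</li><li class="hit">two</li><li>three</li></ul>',
    "html.parser",
)
hit = doc.find("li", class_="hit")

parent_id = hit.find_parent("ul")["id"]                        # nearest <ul> ancestor
next_text = hit.find_next_sibling("li").string                 # first <li> after it
prev_text = hit.find_previous_sibling("li").string             # first <li> before it
later = [li.string for li in hit.find_all_next("li")]          # every <li> after it
earlier = [li.string for li in hit.find_all_previous("li")]    # every <li> before it

print(parent_id, prev_text, next_text, earlier, later)
```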

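CSS selectors

tag.select(css_selector)

# These selectors assume soup was built from a page with .panel blocks
# and <ul> lists of .element items, not from the Dormouse sample
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

select_one returns only the first tag matched by the selector.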
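Here is a compact sketch of select and select_one side by side. The two-list document is made up for illustration, and the example assumes a bs4 version recent enough to bundle CSS selector support.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two short lists
doc = BeautifulSoup(
    '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul id="list-2"><li class="element">Jay</li></ul>',
    "html.parser",
)

all_items = [li.get_text() for li in doc.select("ul li")]                # every <li>
second_list = [li.get_text() for li in doc.select("#list-2 .element")]   # scoped by id
first = doc.select_one(".element")                                       # first match only
nothing = doc.select_one("#missing")                                     # None if no match

print(all_items, second_list, first.get_text(), nothing)
```

select always returns a list, even for a single match, while select_one returns a tag or None.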

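Summary

Prefer the lxml parser; fall back to html.parser when necessary.
Tag selection (soup.tag_name) offers little filtering but is fast.
Use find() to match a single result and find_all() to match several.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attributes and text values.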
