Basic Usage of BeautifulSoup, a Handy Parsing Library for Python Web Scraping

Latest recommended article published 2024-08-12 23:17:41

九瓜


Category: Python Web Scraping. Tags: web scraping, BeautifulSoup

Article link: https://blog.csdn.net/weixin_43796109/article/details/88746068


First, an introduction to the BeautifulSoup library

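Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.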
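To make "navigating, searching, and modifying" concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration, and it uses the built-in 'html.parser' so nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny illustrative document (not taken from the article)
html = "<html><head><title>Old title</title></head><body><p id='intro'>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate: walk down the tree by tag name
title_tag = soup.head.title

# Search: find the first <p> whose id is "intro"
intro = soup.find("p", id="intro")

# Modify: replace the title text and add an attribute to the paragraph
title_tag.string = "New title"
intro["class"] = "lead"

print(title_tag)  # <title>New title</title>
print(intro)
```

The changes live in the parsed tree, so printing `soup` afterwards shows the modified document.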

BeautifulSoup can work with several different parsers


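'''
BeautifulSoup() is usually called with two arguments:
    the first is the downloaded page or an HTML string
    the second is the parser to use
Several parsers are available:
Python standard library    'html.parser'
lxml HTML parser           'lxml'
lxml XML parser            ['lxml', 'xml'] or 'xml'
html5lib                   'html5lib'
'''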
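Since lxml and html5lib are third-party packages that may not be installed, one possible approach is to try a preferred parser and fall back to the built-in one. The sketch below assumes this fallback order suits your project; `make_soup` is a hypothetical helper, not part of bs4. `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) using the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p class='title'><b>Hello</b></p>")
print(parser_used)    # 'lxml' if it is installed, otherwise 'html.parser'
print(soup.b.string)  # Hello
```

Keep in mind that different parsers can build different trees from malformed HTML, so a silent fallback is only reasonable when the input is well formed.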

Now for the code

soup = BeautifulSoup(html_doc, 'lxml')  # here I use the lxml parser

Full code

# Import the library
from bs4 import BeautifulSoup
# HTML string
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

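'''
BeautifulSoup() takes the downloaded page or an HTML string as its first argument
    and the parser to use as its second argument
Several parsers are available:
Python standard library    'html.parser'
lxml HTML parser           'lxml'
lxml XML parser            ['lxml', 'xml'] or 'xml'
html5lib                   'html5lib'
'''

soup = BeautifulSoup(html_doc, 'lxml')  # use the lxml parser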

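# Print the HTML string above as a tree with standard indentation
print(soup.prettify())

# Simple ways to browse the structured data

# The <title> tag itself, printed as markup
print(soup.title)   # prints: <title>The Dormouse's story</title>

# The text inside the <title> tag
print(soup.title.string)   # prints: The Dormouse's story

# The class attribute of the first <p> tag (returned as a list)
print(soup.p['class'])      # prints: ['title']

# The name of the tag
print(soup.title.name)      # prints: title

# .parent selects the parent tag
print(soup.title.parent.name)   # prints: head

# Print all <a> tags (find_all returns a list)
print(soup.find_all('a'))

# Print the tags whose id is "link3"
print(soup.find_all(id="link3"))

# Extract the link from every <a> tag in the document
for i in soup.find_all('a'):
    print(i.get('href'))

# Get all the text content
print(soup.get_text())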
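The calls above can be combined into a small extraction routine. The following is only a sketch: it reuses part of the sample HTML, switches to the built-in 'html.parser' so it runs without lxml, and the dictionary layout and variable names are illustrative choices rather than anything prescribed by the library. It also shows the difference between find_all(), which returns a list, and find(), which returns the first match or None.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns a list of every matching tag; class_ filters on the CSS class
links = [
    {"id": a.get("id"), "text": a.get_text(strip=True), "href": a.get("href")}
    for a in soup.find_all("a", class_="sister")
]
for link in links:
    print(link)

# find() returns only the first match, or None when nothing matches
first_link = soup.find("a")
missing = soup.find("table")
print(first_link["id"])  # link1
print(missing)           # None
```

Checking for None before using the result of find() avoids an AttributeError when a page lacks the expected tag.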

Finally
Reference documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html