Introduction to Python BeautifulSoup

Latest recommended article published on 2024-08-30 10:01:19

尘世风

Tags: python, beautifulsoup, programming languages, web scraping

Article link: https://blog.csdn.net/shifengboy/article/details/127236643

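BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a document tree that is easy to traverse and search. With its search methods and CSS selectors, developers can extract web page content efficiently. The article also covers how to handle the class attribute and how to use CSS selectors to scrape images.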

1. Introduction to BeautifulSoup

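BeautifulSoup is a Python library for extracting data from HTML and XML files. Working with a parser of your choice, it provides idiomatic ways to navigate, search, and modify the document.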

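BeautifulSoup is not itself a parser built on re (regular expressions): it wraps an underlying parser such as Python's built-in html.parser, lxml, or html5lib, and exposes powerful parsing features on top of it. Using BeautifulSoup can make both data extraction and crawler development more efficient.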
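
As a minimal sketch of the navigate / search / modify workflow described above, the snippet below parses a small hand-written HTML fragment. The fragment, the variable names, and the choice of Python's built-in html.parser (so that no extra parser package is needed) are assumptions for illustration, not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for this illustration.
sample_html = '<div><p id="intro">Hello <b>world</b></p><a href="/next">next</a></div>'

# "html.parser" ships with Python; "lxml" or "html5lib" could be passed
# instead if those packages are installed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Navigate: attribute-style access returns the first matching tag.
bold_text = sample_soup.p.b.string

# Search: find() looks up a tag by name (and optionally by attributes).
link_href = sample_soup.find("a")["href"]

# Modify: assigning to .string replaces the tag's text content.
sample_soup.p.b.string = "BeautifulSoup"
intro_text = sample_soup.find("p", id="intro").get_text()

print(bold_text)
print(link_href)
print(intro_text)
```

Swapping "html.parser" for "lxml" should leave this particular example unchanged, although the parsers can differ in how they repair badly formed markup.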

2. BeautifulSoup Overview

Building the document tree

BeautifulSoup parses a document by working on a document tree, and that tree is built from four kinds of BeautifulSoup objects.

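Document tree object	Description
Tag	A tag; access: soup.tag; attributes: tag.name (tag name), tag.attrs (tag attributes)
NavigableString	A navigable string (the text inside a tag); access: soup.tag.string
BeautifulSoup	The whole document; can be treated as a Tag object; attributes: soup.name (tag name), soup.attrs (tag attributes)
Comment	A comment inside a tag (a special type of NavigableString); access: soup.tag.string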
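
Because both NavigableString and Comment are reached through soup.tag.string, a type check is one way to tell them apart. The following is a minimal sketch; the HTML fragment and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical fragment: one link holds a comment, the other plain text.
fragment = '<a id="c"><!--Elsie--></a><a id="t">Lacie</a>'
doc = BeautifulSoup(fragment, "html.parser")

comment_node = doc.find("a", id="c").string
text_node = doc.find("a", id="t").string

# The BeautifulSoup object itself can be treated as a Tag.
print(isinstance(doc, Tag))
# Comment is a subclass of NavigableString, so test for Comment first.
print(isinstance(comment_node, Comment))
print(isinstance(text_node, Comment))
print(isinstance(text_node, NavigableString))
```

Checking for Comment before using .string helps keep commented-out markup from being mistaken for visible page text.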

import lxml
import requests
from bs4 import BeautifulSoup

html =  """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# 1. BeautifulSoup object
soup = BeautifulSoup(html,'lxml')
print(type(soup))

# 2. Tag object
print(soup.head,'\n')
print(soup.head.name,'\n')
print(soup.head.attrs,'\n')
print(type(soup.head))

# 3. NavigableString object
print(soup.title.string,'\n')
print(type(soup