Understanding and Using BeautifulSoup4

闲鱼!!!

Category: Web crawlers

Article link: https://blog.csdn.net/weixin_43158056/article/details/96325025

BeautifulSoup4

What is BeautifulSoup4
Three features of Beautiful Soup 4
Installation and setup
Basic usage of BeautifulSoup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

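What is BeautifulSoup4

Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages, and it offers a rich API together with a choice of underlying parsers.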
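
To make the idea of "extracting data from a web page" concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration, and the built-in html.parser is used so that nothing beyond beautifulsoup4 has to be installed.

```python
from bs4 import BeautifulSoup

# A small, made-up page used only for this illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<h1 id="main">Hello</h1>
<a href="/first">First link</a>
<a href="/second">Second link</a>
</body></html>
"""

# Parse the markup with the parser from the Python standard library.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                      # text of the <title> tag
heading = soup.find("h1", id="main").get_text()     # text of the <h1> tag
links = [a["href"] for a in soup.find_all("a")]     # href of every <a> tag

print(page_title)
print(heading)
print(links)
```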

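Three features of Beautiful Soup 4:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade speed for flexibility.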
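
The encoding behaviour described above can be sketched as follows. The sample text and the choice of Latin-1 as the source encoding are assumptions made for the demonstration; from_encoding is the constructor argument for telling Beautiful Soup the original encoding when it is not declared in the document.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding, with no encoding declared in the markup.
raw = "<p>Café Zoë</p>".encode("latin-1")

# Tell Beautiful Soup what the original encoding is.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string               # already a Unicode string
utf8_bytes = soup.encode("utf-8")  # the document serialized as UTF-8 bytes

print(text)
print(soup.original_encoding)      # the encoding Beautiful Soup ended up using
print(utf8_bytes)
```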

Installation and setup

Beautiful Soup 4 is published on PyPI, so it can be installed with a package manager such as pip. The package name is beautifulsoup4.

pip install beautifulsoup4

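Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports third-party parsers such as lxml.

Parsers supported by Beautiful Soup, with their advantages and disadvantages: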

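Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	1. Built into Python 2. Moderate speed 3. Tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	1. Fast 2. Tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	1. Fast 2. The only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	1. Best tolerance of malformed documents 2. Parses documents the way a browser does 3. Produces valid HTML5	Slow; requires an external Python package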

Installing the parsers:

pip install lxml
pip install html5lib

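lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.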
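
The differences between parsers are easiest to see on malformed markup. The sketch below feeds the same broken fragment to each parser and skips any parser that is not installed. The fragment and the results dictionary are illustrative choices; FeatureNotFound is the exception Beautiful Soup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid HTML: an <a> tag that is closed by a stray </p>.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml or html5lib is not installed in this environment.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

Each parser repairs the fragment differently, so code that depends on the exact tree shape should state explicitly which parser it uses.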

Basic usage of BeautifulSoup

Beautiful Soup provides several search methods, such as find_all() and find().

Method selectors

find_all

Source code

    def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):
        """Extracts a list of Tag objects that match the given
        criteria.  You can specify the name of the Tag and any
        attributes you want the Tag to have.

        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""

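find_all(name, attrs, recursive, text, limit, **kwargs): returns all elements that match the given criteria. The parameters are:
name: the tag name to match. Besides a string, it can be a regular expression object, a list, a filter function, or True (which matches every tag).
attrs: attribute filters given as a dictionary, for example attrs={'id': '123'}. Common attributes can also be passed as keyword arguments, such as id='123'. Because class is a reserved word in Python, the keyword form needs a trailing underscore: class_='element'. The result is a list of Tag objects.
text: matches against a node's text. It accepts a string or a regular expression object. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.)
recursive: by default all descendants are searched. Set recursive=False to search direct children only.
limit: caps the number of results returned, much like the LIMIT keyword in SQL.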

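Example

import re
from bs4 import BeautifulSoup

# html_doc is assumed to hold the HTML source as a string; the tag names and
# ids used below match the "three sisters" sample from the official documentation
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
print(soup.find_all('span'))  # search by tag name
print(soup.find_all('a', id='link1'))  # tag name plus attribute filter
print(soup.find_all('a', attrs={'class': 'sister', 'id': 'link3'}))  # several attributes
print(soup.find_all('p', class_='title'))  # class needs the underscore; passed via **kwargs
print(soup.find_all(text=re.compile('Tillie')))  # filter by text
print(soup.find_all('a', limit=2))  # limit the number of results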
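
For readers who want to run the calls above as they stand, here is a self-contained sketch. It assumes html_doc is a shortened copy of the "three sisters" sample from the official documentation, uses html.parser instead of lxml to avoid the extra dependency, and uses the newer string argument (Beautiful Soup 4.4.0 or later) in place of text. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Shortened "three sisters" sample, assumed here as the input document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")                                   # by tag name
by_id = soup.find_all("a", id="link1")                           # tag name + keyword attribute
by_attrs = soup.find_all("a", attrs={"class": "sister", "id": "link3"})  # attrs dictionary
by_class = soup.find_all("p", class_="title")                    # class_ keyword
by_text = soup.find_all(string=re.compile("Tillie"))             # matches strings, not tags
first_two = soup.find_all("a", limit=2)                          # stop after two results
direct_only = soup.html.find_all("title", recursive=False)       # direct children of <html> only
nested_ok = soup.html.find_all("title")                          # all descendants of <html>

print(len(all_links), by_id, by_attrs, by_class, sep="\n")
print(by_text, first_two, direct_only, nested_ok, sep="\n")
```

Note that direct_only is empty because title is a child of head, not a direct child of html.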

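Notes:
1. find_all returns a list.
2. The name argument specifies which tags to match. It can be:

a string
a regular expression
a list, to match several tags at once
3. Keyword-argument queries
Use class_ instead of class, because class is a reserved word in Python.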
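
The kinds of values that name accepts can be compared side by side. The markup and the helper has_href below are hypothetical and exist only for this illustration; results are reduced to tag names to keep the output short.

```python
import re
from bs4 import BeautifulSoup

html = ("<html><head><title>T</title></head>"
        "<body><p><b>bold</b> and <a href='#'>link</a></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# A string matches tags with exactly that name.
by_string = [t.name for t in soup.find_all("b")]

# A regular expression is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# A list matches any of the listed names.
by_list = [t.name for t in soup.find_all(["a", "b"])]

# True matches every tag (but no text nodes).
every_tag = [t.name for t in soup.find_all(True)]

# A function receives each tag and returns True to keep it.
def has_href(tag):
    return tag.has_attr("href")

by_func = [t.name for t in soup.find_all(has_href)]

print(by_string, by_regex, by_list, every_tag, by_func, sep="\n")
```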

find

Source code:

    def find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs):
        """Return only the first child of this Tag matching the given
        criteria."""
        r = None
        l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
        if l:
            r = l[0]
        return r

find(name, attrs, recursive, text, **kwargs): returns a single element, namely the first match, which is still a Tag. It returns None when nothing matches.
Its parameters are the same as those of find_all(), except that there is no limit.
There are several more search methods. They take the same kinds of filters as find_all() and differ only in which part of the tree they search (none of them has a recursive parameter):
find_parents() and find_parent(): the former returns all matching ancestors, the latter returns the nearest matching ancestor.
find_next_siblings() and find_next_sibling(): iterate over the siblings that come after the current tag. The former returns all matching later siblings, the latter returns the first one.
find_previous_siblings() and find_previous_sibling(): iterate over the siblings that come before the current tag. The former returns all matching earlier siblings, the latter returns the first one.
find_all_next() and find_next(): iterate over the tags and strings that come after the current tag. The former returns all matching nodes, the latter returns the first one.
find_all_previous() and find_previous(): iterate over the tags and strings that come before the current tag. The former returns all matching nodes, the latter returns the first one.
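
The directional search methods listed above can be sketched on a small tree. The markup, the ids, and the variable two are invented for the example; every call starts from the middle list item.

```python
from bs4 import BeautifulSoup

html = """<div id="box"><ul>
<li id="one">1</li><li id="two">2</li><li id="three">3</li>
</ul><p>after</p></div>"""
soup = BeautifulSoup(html, "html.parser")

two = soup.find("li", id="two")  # starting point for every search below

nearest_div = two.find_parent("div")                            # nearest matching ancestor
ancestors = [t.name for t in two.find_parents(["ul", "div"])]   # nearest ancestor first
next_li = two.find_next_sibling("li")                           # first later sibling
prev_li = two.find_previous_sibling("li")                       # first earlier sibling
later_lis = [t["id"] for t in two.find_next_siblings("li")]     # all later siblings
next_p = two.find_next("p")                                     # first <p> later in the document
earlier_lis = [t["id"] for t in two.find_all_previous("li")]    # every <li> earlier in the document

print(nearest_div["id"], ancestors, next_li["id"], prev_li["id"])
print(later_lis, next_p.get_text(), earlier_lis)
```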

Returns a single element (the first match):

soup.find('a')
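
One practical difference between find() and find_all() is what they return when nothing matches. A small sketch, with made-up markup and variable names:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first_p = soup.find("p")              # the first <p>, not a list
missing = soup.find("table")          # None, because there is no <table>
missing_all = soup.find_all("table")  # an empty list

# Guard against None before calling methods on the result of find().
safe_text = missing.get_text() if missing is not None else ""

print(first_p.get_text())
print(missing)
print(list(missing_all))
```

Calling a method on the result of find() without this check raises AttributeError whenever the tag is absent, which is a common source of crashes in scrapers.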

CSS selectors

    def select(self, selector, namespaces=None, limit=None, **kwargs):
        """Perform a CSS selection operation on the current element.

        This uses the SoupSieve library.

        :param selector: A string containing a CSS selector.

        :param namespaces: A dictionary mapping namespace prefixes
        used in the CSS selector to namespace URIs. By default,
        Beautiful Soup will use the prefixes it encountered while
        parsing the document.

        :param limit: After finding this number of results, stop looking.

        :param kwargs: Any extra arguments you'd like to pass in to
        soupsieve.select().
        """

Note: select() returns a list.

Example

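# soup is assumed to be a BeautifulSoup object built from the page being scraped
print(soup.select('img'))  # select by tag name
print(soup.select('.easyList'))  # select by class name
print(soup.select('#easyList'))  # select by id
print(soup.select('img #easyList'))  # combined (descendant) selector
# finds elements with id="easyList" nested inside an img tag
# returns an empty list when nothing matches

print(soup.select('li[id="easyList"]'))  # select by attribute

a_obj = soup.select('li[id="easyList"] a')[0]  # first <a> inside the matching <li>
print(a_obj)
print('attribute:', a_obj.attrs['href'])
print('text:', a_obj.get_text())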
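
The selectors above can be tried on a small, self-contained document. The markup here is an assumption made for the example (the original page is not shown), and select_one() is used as a shortcut for taking the first match.

```python
from bs4 import BeautifulSoup

# Made-up markup that contains the names used in the selectors above.
html = """
<ul class="menu">
  <li id="easyList" class="item"><a href="/easy">Easy list</a></li>
  <li class="item"><a href="/hard">Hard list</a></li>
</ul>
<img src="logo.png" alt="logo">
"""
soup = BeautifulSoup(html, "html.parser")

imgs = soup.select("img")                      # by tag name
items = soup.select(".item")                   # by class name
by_id = soup.select("#easyList")               # by id
nested = soup.select("ul.menu li#easyList a")  # descendant combinators
by_attr = soup.select('li[id="easyList"]')     # by attribute value
nothing = soup.select("img #easyList")         # nothing is nested inside <img>
first_link = soup.select_one("li.item a")      # first match, or None

print(imgs[0]["src"], len(items), by_id[0]["id"])
print(nested[0].attrs["href"], nested[0].get_text())
print(list(nothing), first_link["href"])
```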
