Python web scraping and data analysis for beginners: a short introduction to BeautifulSoup4

Article link: https://blog.csdn.net/qq_43475705/article/details/113307563

BeautifulSoup4

BeautifulSoup4 and lxml extract data in a similar way: both locate content through the document's tags.
Analyzing a page with BeautifulSoup4 involves the following steps:

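The HTML that a crawler fetches is just a string, and a plain string cannot be searched by tag. It therefore has to be parsed first, which turns the HTML string into a BeautifulSoup object (a parsed document tree).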

Step 1: convert the document type

from bs4 import BeautifulSoup

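# The HTML document returned by the crawler is a plain string

html = "I am an HTML document held in a string"

# Pass in the document to parse and the parser to use; here lxml does the parsing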
soup = BeautifulSoup(html, 'lxml')
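
The conversion step can be tried on a small piece of real markup. The sketch below is only an illustration: the sample HTML is made up, and it uses Python's built-in `html.parser` so that it runs without lxml installed (pass `'lxml'` instead, as above, if lxml is available).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a page fetched by a crawler.
html = "<html><body><p class='intro'>Hello</p><a href='/next'>Next</a></body></html>"

# 'html.parser' ships with Python; swap in 'lxml' if that parser is installed.
soup = BeautifulSoup(html, "html.parser")

print(type(html).__name__)  # the input is an ordinary str
print(type(soup).__name__)  # the result is a BeautifulSoup object
print(soup.p.get_text())    # text of the first <p> tag
```

Once the string has been parsed, tags can be reached as attributes (`soup.p`, `soup.a`) or through the search functions covered in step 2.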

Step 2: learn how to extract data (the search functions)

soup.prettify() # returns the document as a str, with missing closing tags filled in and the markup indented
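
A quick way to see what `prettify()` gives back is to call it on deliberately incomplete markup. This is a minimal sketch with a made-up fragment, again using the built-in `html.parser` as an assumption so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment with the closing </p> and </div> left out on purpose.
soup = BeautifulSoup("<div><p>Hi", "html.parser")

pretty = soup.prettify()

print(type(pretty).__name__)  # prettify() returns a plain str
print(pretty)                 # indented markup, closing tags included
```

Because the result is an ordinary string, it is useful for inspecting a page but not for searching; searches should be run on `soup` itself.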

'''
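Types of bs4 objects

Note: the objects bs4 produces are its own data types, not plain Python strings.

1.  Tag: corresponds to an HTML tag such as div, p, a, or span,
	and carries the same name and attributes as that tag

The operations below all revolve around tags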
'''
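
The Tag type described above can be inspected directly. The sketch below uses a made-up one-line document and the built-in `html.parser`; it also touches `NavigableString`, the bs4 type that holds the text inside a tag, which the note above does not list.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up one-tag document for illustration.
soup = BeautifulSoup('<p class="note">hello</p>', "html.parser")

tag = soup.p

print(isinstance(tag, Tag))                     # soup.p is a Tag
print(tag.name)                                 # the tag name, 'p'
print(tag.attrs)                                # attributes as a dict
print(isinstance(tag.string, NavigableString))  # the text inside the tag
```

Note that `class` can hold several values in HTML, so bs4 stores it as a list even when there is only one.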

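# Four basic search functions

# find_all() finds every element in the document that matches the filters and returns them as a list (a ResultSet)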

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    """Look in the children of this PageElement and find all
    PageElements that match the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet
    """



# Find every a tag in the document; attrs takes a dict that specifies the attribute values the tag must have

# Method 1
alist = soup.find_all('a', attrs={'class': 'abc'})


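# Method 2
# The argument is a CSS selector, for example .a (class) or #a (id); pseudo-elements such as ::after generally do not apply, because they are not real elements in the parsed tree
select_list = soup.select('a.abc')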
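
To compare the two methods, the sketch below runs a few CSS selectors against a made-up fragment and checks that a class selector finds the same tag as the `find_all` call from method 1. The markup and the `nav` id are assumptions for illustration, and the built-in `html.parser` is used.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
html = '<div id="nav"><a class="abc" href="/1">one</a><a href="/2">two</a></div>'
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.abc")   # <a> tags with class "abc"
by_id = soup.select("#nav")       # the element with id "nav"
children = soup.select("#nav > a")  # <a> tags that are direct children of #nav

# The same query written with find_all, as in method 1.
same = soup.find_all("a", attrs={"class": "abc"})

print([a["href"] for a in by_class])
print([a["href"] for a in same])
print(len(children))
```

Which form to use is mostly a matter of taste: `select` is compact for nested conditions, while `find_all` keeps each filter as a separate argument.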

# The result is a list; to process each element, loop over it

for a in alist:
	# Read an attribute of a, for example href
	href = a['href']
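
One thing to watch in the loop above: `a['href']` raises `KeyError` when a tag has no `href` attribute. A slightly more defensive sketch, using made-up markup and the built-in `html.parser`, reads the attribute with `get()` instead:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <a> has no href attribute.
html = '<a class="abc" href="/1">one</a><a class="abc">no link</a>'
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a", attrs={"class": "abc"}):
    href = a.get("href")  # returns None instead of raising when href is missing
    if href is not None:
        links.append((a.get_text(), href))

print(links)
```

Tags without a link are simply skipped, so a single malformed anchor does not stop the crawl.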