Beautiful Soup 01: Node Selectors

去追风，去看海

Article link: https://blog.csdn.net/weixin_40959890/article/details/109565718

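Beautiful Soup automatically converts input documents to Unicode and encodes output documents as UTF-8, so in most cases you do not need to think about encodings.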
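The encoding behaviour can be checked with a small experiment. The following is a minimal sketch, not part of the original article: the sample markup, the Latin-1 input and the use of the built-in "html.parser" (chosen so the snippet runs without lxml) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Bytes in Latin-1, standing in for a page fetched from the network (assumed sample data).
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                 # text inside the tree is Unicode (a str subclass)
utf8_bytes = soup.p.encode("utf-8")  # serialising the node produces UTF-8 bytes

print(text)
print(soup.original_encoding)        # the encoding Beautiful Soup used to decode the input
print(utf8_bytes)
```

If from_encoding is omitted, Beautiful Soup tries to detect the encoding itself, which can guess wrong on very short inputs.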

Installing Beautiful Soup: https://blog.csdn.net/weixin_40959890/article/details/109565842

Node selectors

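Using a node's name directly as an attribute selects that node element, and its string attribute then returns the text inside the node. This style of selection is very fast. When the node you want sits in a clearly layered structure, it is a suitable way to parse the document.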

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
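from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# Print the selected title node: the output is the title node together with its text.
print(soup.title)
# Its type is bs4.element.Tag, an important data structure in Beautiful Soup.
print(type(soup.title))
# Every result of this kind of selection is a Tag. A Tag has several attributes; for example, string returns the node's text content.
print(soup.title.string)
# Select the head node: the result is again the node plus everything inside it.
print(soup.head)
# Select the p node: the result is the first p node only; the later p nodes are not selected.
# When several nodes match, this style returns only the first match and ignores the rest.
print(soup.p)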

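Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>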
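Two details of this selection style are worth checking by hand: attribute access returns only the first match, and string returns text only when a node has exactly one child. The sketch below is an illustration under assumptions, not the author's code: it uses a shortened copy of the sample HTML, the built-in "html.parser" instead of lxml, and find_all, which the article has not introduced at this point.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened version of the sample document (assumed for illustration).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a> and
<a href="http://example.com/lacie" id="link2">Lacie</a>.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # attribute access: first matching node only
all_p = soup.find_all("p")  # find_all: every matching node
print(len(all_p))
print(first_p is all_p[0])

# string returns the text when the node has a single child...
print(soup.b.string)
# ...returns None when the node has several children...
print(all_p[1].string)
# ...and returns a Comment object when the only child is an HTML comment.
print(type(soup.a.string))
```

When a node mixes text and child tags, get_text() is usually the safer way to collect its text.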

To see which attributes and methods the soup object provides:

print(dir(soup))

Output:

['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'NO_PARSER_SPECIFIED_WARNING', 
'ROOT_TAG_NAME', '__bool__', '__call__', '__class__', '__contains__', '__copy__', 
'__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__',
 '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', 
'__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', 
'__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', 
'__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 
'__weakref__', '_all_strings', '_check_markup_is_url', '_feed', '_find_all', '_find_one', 
'_is_xml', '_lastRecursiveChild', '_last_descendant', '_linkage_fixer', 
'_most_recent_element', '_namespaces', '_popToTag', '_should_pretty_print', 'append', 
'attrs', 'builder', 'can_be_empty_element', 'cdata_list_attributes', 'childGenerator', 
'children', 'clear', 'contains_replacement_characters', 'contents', 'currentTag', 
'current_data', 'declared_html_encoding', 'decode', 'decode_contents', 'decompose', 
'descendants', 'encode', 'encode_contents', 'endData', 'extend', 'extract', 
'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 
'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 
'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 
'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 
'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 
'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 
'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 
'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag', 
'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 
'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name', 
'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling', 
'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 
'object_was_parsed', 'original_encoding', 'parent', 'parentGenerator', 'parents', 
'parse_only', 'parserClass', 'parser_class', 'popTag', 'prefix', 
'preserve_whitespace_tag_stack', 'preserve_whitespace_tags', 'prettify', 'previous', 
'previousGenerator', 'previousSibling', 'previousSiblingGenerator', 'previous_element',
 'previous_elements', 'previous_sibling', 'previous_siblings', 'pushTag', 
'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 
'replace_with', 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 
'smooth', 'string', 'strings', 'stripped_strings', 'tagStack', 'text', 'unwrap', 'wrap']
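A listing this long is easier to read once the names that start with an underscore are filtered out. The snippet below is a small sketch for that purpose; the one-line sample markup and the "html.parser" choice are assumptions made so it runs on its own.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(soup) if not name.startswith("_")]
print(public_names)
```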

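Names wrapped in double underscores ('__') are generally not used directly. The list also contains 'attrs', which gives access to attribute values. A node may have several attributes, such as id and class; once a node element is selected, attrs returns all of them: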

print(soup.p.attrs)
print(soup.p.attrs['name'])

Output:

{'class': ['title'], 'name': 'dromouse'}
dromouse
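Note that class comes back as a list while name is a plain string, because class is a multi-valued attribute in HTML. The following sketch shows a few equivalent ways of reading attributes; the sample tag with two class values and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample tag with two class values (assumed for illustration).
html = '<p class="title intro" name="dromouse"><b>text</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.attrs)      # all attributes as a dict
print(p["name"])    # shorthand for p.attrs["name"]
print(p["class"])   # multi-valued attribute: a list of strings
print(p.get("id"))  # get() returns None for a missing attribute instead of raising KeyError
print(p.name)       # the tag name itself, not the "name" attribute
```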

Nested selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
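Because every selection returns a Tag, selections can be chained to any depth. The chain breaks, however, as soon as one step finds nothing, since that step returns None. The sketch below illustrates both cases; the extra body content, the table/td lookup and the "html.parser" choice are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Sample document with a slightly deeper structure (assumed for illustration).
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>Hi</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Each step returns a Tag, so the next step can select inside it.
bold = soup.body.p.b
print(bold.string)

# A step that matches nothing returns None.
missing = soup.head.table
print(missing)

# Guard before chaining further; missing.td would raise AttributeError on None.
cell = missing.td if missing is not None else None
print(cell)
```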
