BeautifulSoup Study Notes 1


Beautiful Soup transforms a complex HTML document into a tree of Python objects. Every node of the tree is a Python object, and all of these objects fall into four kinds:
Tag;
NavigableString;
BeautifulSoup;
Comment.

The following HTML snippet, a passage from Alice in Wonderland, will serve as the running example for introducing the Beautiful Soup objects (the library takes its name from the poem of the same name in Lewis Carroll's Alice's Adventures in Wonderland).

>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

1 BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc,"html.parser")
>>> soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>>> 

The BeautifulSoup object represents the parsed document as a whole.
Most of the time you can treat it like a Tag object: it supports most of the methods described in Navigating the tree and Searching the tree.
Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name or attributes of its own. It is sometimes handy to look at its .name, though, so it has been given the special .name value "[document]":

>>> soup.name
'[document]'
>>> 
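Since the soup can be treated like a Tag, the tree-searching methods work directly on it. A quick sketch, continuing with the soup parsed above:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.find_all("a", class_="sister")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]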

2 Tag

A Tag object corresponds to an XML or HTML tag in the original document:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc,"html.parser")
>>> tag1 = soup.a
>>> tag1
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> type(tag1)
<class 'bs4.element.Tag'>
>>> 

A Tag has a great many methods and attributes:

>>> dir(tag1)
['HTML_FORMATTERS', 'XML_FORMATTERS', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_attr_value_as_string', '_attribute_checker', '_find_all', '_find_one', '_formatter_for_name', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_select_debug', '_selector_combinators', '_should_pretty_print', '_tag_name_matches_and', 'append', 'attribselect_re', 'attrs', 'can_be_empty_element', 'childGenerator', 'children', 'clear', 'contents', 'decode', 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'get', 'getText', 'get_attribute_list', 'get_text', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'isSelfClosing', 'is_empty_element', 'known_xml', 'name', 'namespace', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'parent', 'parentGenerator', 'parents', 'parserClass', 'parser_class', 'prefix', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'quoted_colon', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'select', 'select_one', 'setup', 'string', 'strings', 'stripped_strings', 'tag_name_re', 'text', 'unwrap', 'wrap']
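Day-to-day work only uses a handful of these. A brief sketch of some common ones with the same soup (the exact names listed by dir() vary between bs4 versions):

>>> tag1.get_text()
'Elsie'
>>> tag1.has_attr("id")
True
>>> soup.find_all("a")
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find("p", class_="title").b.string
"The Dormouse's story"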

2.1 .name

Every tag has a name, accessible as .name:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc,"html.parser")
>>> tag1 = soup.a
>>> tag1.name
'a'
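The name can also be assigned to, and the change is reflected in the markup Beautiful Soup generates. A quick sketch (the name is set back to "a" afterwards so the later examples still work on an anchor tag):

>>> tag1.name = "blockquote"
>>> tag1
<blockquote class="sister" href="http://example.com/elsie" id="link1">Elsie</blockquote>
>>> tag1.name = "a"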

2.2 .attrs

A tag may have any number of attributes.
You can access a tag’s attributes by treating the tag like a dictionary:

>>> tag1.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
>>> 
>>> tag1["class"]
['sister']
>>> tag1["class"] = "SISTER"
>>> tag1
<a class="SISTER" href="http://example.com/elsie" id="link1">Elsie</a>
>>> del tag1['id']
>>> tag1
<a class="SISTER" href="http://example.com/elsie">Elsie</a>
>>> tag1['id']
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    tag1['id']
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'id'
>>> 
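If an attribute may be missing, .get() is the safer accessor: it returns None (or a default you supply) instead of raising KeyError. A small sketch continuing from the state above, where id has just been deleted (the 'no-id' default is just an illustrative value):

>>> tag1.get('id')
>>> tag1.get('id', 'no-id')
'no-id'
>>> tag1.get('href')
'http://example.com/elsie'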

2.3 Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc,"html.parser")
>>> tag2 = soup.a
>>> tag2
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2['class'] = ['body','strikeout']
>>> tag2
<a class="body strikeout" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2['class']
['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

>>> tag2['class'] = "body strikeout"
>>> tag2
<a class="body strikeout" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2['class']
'body strikeout'

(If you parse a document as XML, there are no multi-valued attributes.)
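A quick sketch of that last point, assuming the lxml library is installed (the "xml" parser requires it): when parsing as XML, class comes back as a plain string rather than a list.

>>> xml_soup = BeautifulSoup('<p class="body strikeout"></p>', "xml")
>>> xml_soup.p['class']
'body strikeout'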

3 NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
>>> tag2 = soup.a
>>> tag2
<a rel="index">homepage</a>
>>> tag2.string
'homepage'
>>> type(tag2.string)
<class 'bs4.element.NavigableString'>
>>> 

Strings matter, of course: most of the useful information in a web page is text sitting inside tags, and .string above is only the simplest way of pulling that text out of a tag.
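When a tag holds more than one piece of text, .string returns None; .strings, .stripped_strings, or get_text() are the usual ways to collect all of it. A short sketch with the same small document:

>>> soup.p.string is None
True
>>> list(soup.p.strings)
['Back to the ', 'homepage']
>>> soup.p.get_text()
'Back to the homepage'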

4 Comments and other special strings

A Comment object is just a special type of NavigableString:

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup3 = BeautifulSoup(markup, "html.parser")
>>> comment = soup3.b.string
>>> comment
'Hey, buddy. Want to buy a used parser?'
>>> type(comment)
<class 'bs4.element.Comment'>
>>> 
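Since Comment subclasses NavigableString (which itself subclasses str), type checks are the usual way to tell comments apart from ordinary text when walking the tree; a minimal sketch:

>>> from bs4 import Comment, NavigableString
>>> isinstance(comment, Comment)
True
>>> isinstance(comment, NavigableString)
True
>>> isinstance(comment, str)
True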

When it turns up as part of an HTML document, the Comment is displayed with special formatting:

>>> soup3.b.prettify()
'<b>\n <!--Hey, buddy. Want to buy a used parser?-->\n</b>'

Beautiful Soup defines a few other classes that may turn up in XML documents: CData, ProcessingInstruction, Declaration and Doctype. Like Comment, they are all subclasses of NavigableString that add a little extra to the plain string. Here is an example that replaces the comment with a CDATA block:

>>> from bs4 import CData
>>> cdata = CData("Paranoia")
>>> comment.replace_with(cdata)
'Hey, buddy. Want to buy a used parser?'
>>> soup3
<b><![CDATA[Paranoia]]></b>
>>> soup3.b.prettify()
'<b>\n <![CDATA[Paranoia]]>\n</b>'

I should have written up these notes a couple of days ago, but I wasn't in the mood.
Also, there will be no more winter or summer breaks from now on, so I am treating this as one extra summer holiday.
Learn what needs to be learned, spend more time with family, and then go looking for a job.
