Beautiful Soup Documentation

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

The examples in this documentation should work the same way in Python 2.7 and Python 3.2.

You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed, and that Beautiful Soup 4 is recommended for all new projects. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.

Getting help

If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
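
A rough sketch of running that diagnostic yourself (it assumes your markup is already in a string; diagnose() lives in the bs4.diagnose module and prints a report on how each installed parser handles the document):

from bs4.diagnose import diagnose

data = open("bad.html").read()  # hypothetical file containing the problem markup
diagnose(data)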

Quick Start

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Another common task is extracting all the text from a page:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup

If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.)

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it’s automatically converted to Python 3 code. If you don’t install the package, the code won’t be converted. There have also been reports on Windows machines of the wrong version being installed.

If you get the ImportError “No module named HTMLParser”, your problem is that you’re running the Python 2 version of the code under Python 3.

If you get the ImportError “No module named html.parser”, your problem is that you’re running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.
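
For example, if you installed with pip, the cleanup might look like this (adjust to match however you originally installed the package):

$ pip uninstall beautifulsoup4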

If you get the SyntaxError “Invalid syntax” on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python’s 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4

Installing a parser

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

Python’s html.parser: BeautifulSoup(markup, "html.parser"). Batteries included, with decent speed, but not very lenient before Python 2.7.3 or 3.2.2.
lxml’s HTML parser: BeautifulSoup(markup, "lxml"). Very fast and lenient, but it has an external C dependency.
lxml’s XML parser: BeautifulSoup(markup, "xml"). Very fast, and currently the only supported XML parser, but it has an external C dependency.
html5lib: BeautifulSoup(markup, "html5lib"). Extremely lenient, parses pages the same way a web browser does, and creates valid HTML5, but it is very slow and has an external Python dependency.

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib; Python’s built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
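
For example (a small sketch of the kind of divergence that section covers; exact output depends on your parser versions), the invalid markup "<a></p>" comes out three different ways:

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

BeautifulSoup("<a></p>", "html.parser")
# <a></a>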

Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
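
For example, a minimal way to ask for XML parsing explicitly (this sketch assumes lxml is installed, since lxml provides Beautiful Soup’s XML support):

soup = BeautifulSoup("<document><content>data</content></document>", "xml")
# The markup is parsed as XML: tag names are case-sensitive and no
# <html>/<body> wrapper is added around the document.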

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects.

Tag

A Tag object corresponds to an XML or HTML tag in the original document:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name

Every tag has a name, accessible as .name:

tag.name
# u'b'
If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

tag['class']
# u'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {u'class': u'boldest'}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'
When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
If you parse a document as XML, there are no multi-valued attributes:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

You can’t edit a string in place, but you can replace one string with another, using replace_with():
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
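
A minimal sketch of that conversion (Python 2 shown, to match the examples above; on Python 3 you would call str() instead):

detached = unicode(tag.string)  # a plain unicode string, with no reference back to the parse tree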

BeautifulSoup

The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

soup.name
# u'[document]'

Comments and other special strings

Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
The Comment object is just a special type of NavigableString:
comment
# u'Hey, buddy. Want to buy a used parser?'
But when it appears as part of an HTML document, a Comment is displayed with special formatting:
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>

Navigating the tree

Here’s the “Three sisters” HTML document again:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
I’ll use this as an example to show you how to move from one part of a document to another.

Going down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.
Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:
soup.body.b
# <b>The Dormouse's story</b>
Using a tag name as an attribute will give you only the first tag by that name:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag’s children are available in a list called .contents:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
A string does not have .contents, because it can’t contain anything:
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in title_tag.children:
    print(child)
# The Dormouse's story

.descendants

The .contents and .children attributes only consider a tag’s direct children. For instance, the <head> tag has a single direct child–the <title> tag:

head_tag.contents
# [<title>The Dormouse's story</title>]
But the <title> tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:
len(list(soup.children))
# 1
len(list(soup.descendants))
# 25

.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'
If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:
head_tag.contents
# [<title>The Dormouse's story</title>]
head_tag.string
# u'The Dormouse's story'
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
print(soup.html.string)
# None

.strings and stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.

Going up

Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.

.parent

You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
The title string itself has a parent: the <title> tag that contains it:
title_tag.string.parent
# <title>The Dormouse's story</title>
The parent of a top-level tag like <html> is the BeautifulSoup object itself:
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
And the .parent of a BeautifulSoup object is defined as None:
print(soup.parent)
# None

.parents

You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
# None

Going sideways

Consider a simple document like this:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>
The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>
The <b> tag has a .next_sibling, but no .previous_sibling, because there’s nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
The strings “text1” and “text2” are not siblings, because they don’t have the same parent:
sibling_soup.b.string
# u'text1'

print(sibling_soup.b.string.next_sibling)
# None
In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# u',\n'
The second <a> tag is actually the .next_sibling of the comma:
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings and .previous_siblings

You can iterate over a tag’s siblings with .next_siblings or .previous_siblings:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# u'; and they lived at the bottom of a well.'
# None

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# u' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# u'Once upon a time there were three little sisters; and their names were\n'
# None

Going back and forth

Take a look at the beginning of the “three sisters” document:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
An HTML parser takes this string of characters and turns it into a series of events: “open an <html> tag”, “open a <head> tag”, “open a <title> tag”, “add a string”, “close the <title> tag”, “open a <p> tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.

Here’s the final <a> tag in the “three sisters” document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# u'; and they lived at the bottom of a well.'
But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it’s the word “Tillie”:
last_a_tag.next_element
# u'Tillie'
That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an <a> tag, then the word “Tillie”, then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word “Tillie” was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:
last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None

Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: find() and find_all(). The other methods take almost exactly the same arguments, so I’ll just cover them briefly.

Once again, I’ll be using the “three sisters” document as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

By passing in a filter to a method like find_all(), you can zoom in on the parts of the document you’re interested in.

Kinds of filters

Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:
soup.find_all('b')
# [<b>The Dormouse's story</b>]
If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
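
A quick sketch of the distinction (both calls match the same <b> tag; the byte string is simply decoded as UTF-8 before matching):

soup.find_all(b'b')   # byte string: assumed to be UTF-8
# [<b>The Dormouse's story</b>]

soup.find_all(u'b')   # Unicode string: no decoding step involved
# [<b>The Dormouse's story</b>]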


