Scraping Webpages in Python With Beautiful Soup: Search and DOM Modification

Latest recommended article published on 2023-06-20 04:28:30

cunjie3951

Article tags: string list, python, java, regular expressions

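In the previous tutorial, you learned the basics of the Beautiful Soup library. Besides navigating the DOM tree, you can also search for elements with a given class or id. You can also use this library to modify the DOM tree.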

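In this tutorial, you will learn about the different methods that help you with searching and modification. We will scrape the same Wikipedia page about Python that we used in the previous tutorial. The examples assume that soup is the BeautifulSoup object created from that page.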

Filters for Searching the Tree

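Beautiful Soup has many methods for searching the DOM tree. These methods are very similar and accept the same kinds of filters as arguments. It therefore makes sense to understand the different filters properly before reading about the methods. I will use the same find_all() method throughout to explain the differences between the filters.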

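The simplest filter that you can pass to any search method is a string. Beautiful Soup will then search the document for tags whose name matches the string exactly.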

for heading in soup.find_all('h2'):
    print(heading.text)
    
# Contents
# History[edit]
# Features and philosophy[edit]
# Syntax and semantics[edit]
# Libraries[edit]
# Development environments[edit]
# ... and so on.

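You can also pass a regular expression object to the find_all() method. This time, Beautiful Soup will filter the tree by matching every tag name against the given regular expression.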

import re

for heading in soup.find_all(re.compile("^h[1-6]")):
    print(heading.name + ' ' + heading.text.strip())
    
# h1 Python (programming language)
# h2 Contents
# h2 History[edit]
# h2 Features and philosophy[edit]
# h2 Syntax and semantics[edit]
# h3 Indentation[edit]
# h3 Statements and control flow[edit]
# ... and so on.

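This code looks for all tags whose name begins with "h" followed by a digit from 1 to 6. In other words, it looks for all the heading tags in the document.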

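Instead of using a regular expression, you can achieve the same result by passing a list of all the tag names that you want Beautiful Soup to match against the document.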

for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name + ' ' + heading.text.strip())

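You can also pass True as an argument to the find_all() method. The code will then return all the tags in the document. The output below shows that, at the time of writing, the Wikipedia page we are parsing contained 4,339 tags.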

len(soup.find_all(True))
# 4339

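If you still cannot find what you are looking for with any of the filters above, you can define your own function that takes an element as its only argument. The function must return True if the element matches and False otherwise. It can be as complicated as the job requires. Here is a very simple example: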

def big_lists(tag):
    return len(tag.contents) > 20 and tag.name == 'ul'
    
len(soup.find_all(big_lists))
# 13

The function above goes through the same Wikipedia Python page and looks for unordered lists with more than 20 children.
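To make the five filter types easier to compare side by side, here is a minimal, self-contained sketch. It assumes a tiny inline HTML document (a hypothetical stand-in for the Wikipedia page) and Python's built-in html.parser, so the counts are small enough to check by hand. The name long_lists is made up for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature document standing in for the Wikipedia page.
html = (
    "<html><body>"
    "<h1>Title</h1>"
    "<h2>First</h2><p>Intro</p>"
    "<h3>Detail</h3>"
    "<ul><li>a</li><li>b</li><li>c</li></ul>"
    "<ul><li>d</li></ul>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# String filter: tags whose name is exactly "h2".
by_string = [tag.name for tag in soup.find_all("h2")]

# Regular expression filter: tag names h1 through h6.
by_regex = [tag.name for tag in soup.find_all(re.compile("^h[1-6]$"))]

# List filter: any tag whose name appears in the list.
by_list = [tag.name for tag in soup.find_all(["h1", "h3"])]

# True: every tag in the document (text nodes are not included).
all_tags = soup.find_all(True)

# Function filter: unordered lists with more than two children.
def long_lists(tag):
    return tag.name == "ul" and len(tag.contents) > 2

by_function = soup.find_all(long_lists)

print(by_string)         # ['h2']
print(by_regex)          # ['h1', 'h2', 'h3']
print(by_list)           # ['h1', 'h3']
print(len(all_tags))     # 12
print(len(by_function))  # 1
```

The regular expression here ends with $ so that only the six heading names match. Beautiful Soup searches for the pattern anywhere in the tag name, so an unanchored pattern can match more than intended.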

Searching the DOM Tree Using Built-In Functions

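One of the most popular methods for searching the DOM is find_all(). It goes through all of a tag's descendants and returns a list of every descendant that matches your search criteria. This method has the following signature: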

find_all(name, attrs, recursive, string, limit, **kwargs)

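The name argument is the name of the tag that you want this function to search for while going through the tree. You are free to supply a string, a list, a regular expression, a function, or the value True as the name.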

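You can also filter the elements in the DOM tree by different attributes such as id, href, and so on. By using attribute=True, you can get all the elements that have a specific attribute regardless of its value. Searching for elements with a specific class is different from searching for regular attributes. Since class is a reserved keyword in Python, you have to use the class_ keyword argument when looking for elements with a specific class.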

import re

len(soup.find_all(id=True))
# 425

len(soup.find_all(class_=True))
# 1734

len(soup.find_all(class_="mw-headline"))
# 20

len(soup.find_all(href=True))
# 1410

len(soup.find_all(href=re.compile("python")))
# 102

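You can see that the document has 1,734 tags with a class attribute and 425 tags with an id attribute. If you only need the first few of these results, you can pass a number to the method as the value of limit. Passing this value tells Beautiful Soup to stop looking for more elements once it has found that many. Here is an example: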

soup.find_all(class_="mw-headline", limit=4)

# <span class="mw-headline" id="History">History</span>
# <span class="mw-headline" id="Features_and_philosophy">Features and philosophy</span>
# <span class="mw-headline" id="Syntax_and_semantics">Syntax and semantics</span>
# <span class="mw-headline" id="Indentation">Indentation</span>
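The signature above also lists attrs and string, which the listings so far do not exercise. The following is a small sketch of how the attribute filters, attrs, string, and limit might be combined. It assumes a hypothetical inline snippet rather than the Wikipedia page, and the data-kind attribute is invented for the illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet; the data-kind attribute is made up for this example.
html = (
    '<div id="top">'
    '<a class="ext link" href="https://www.python.org/">Python</a>'
    '<a class="link" href="/wiki/Guido">Guido</a>'
    '<a href="/wiki/Python_syntax" data-kind="internal">Syntax</a>'
    '<span class="link">not an anchor</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Any element that has an id attribute, whatever its value.
with_id = soup.find_all(id=True)

# class_ matches each class of a multi-class element individually.
link_class = soup.find_all(class_="link")

# Combine a tag name with a regular expression on an attribute value.
python_links = soup.find_all("a", href=re.compile("python", re.IGNORECASE))

# Attribute names that are not valid Python identifiers go in attrs.
internal = soup.find_all(attrs={"data-kind": "internal"})

# string restricts the match to tags whose text is exactly "Guido".
guido = soup.find_all("a", string="Guido")

# limit stops the search after two matches.
first_two = soup.find_all(class_="link", limit=2)

print(len(with_id), len(link_class), len(python_links))  # 1 3 2
print(internal[0]["href"])                  # /wiki/Python_syntax
print(guido[0]["href"])                     # /wiki/Guido
print([tag["href"] for tag in first_two])   # ['https://www.python.org/', '/wiki/Guido']
```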

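When you use the find_all() method, you are telling Beautiful Soup to go through all the descendants of a given tag to find what you are looking for. Sometimes, you want to look for an element only among the direct children of a tag. This can be achieved by passing recursive=False to the find_all() method.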

len(soup.html.find_all("meta"))
# 6

len(soup.html.find_all("meta", recursive=False))
# 0

len(soup.head.find_all("meta", recursive=False))
# 6
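The effect of recursive=False is easier to see on a document small enough to read at a glance. This sketch assumes a hypothetical page with a single meta tag and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page with one <meta> tag inside <head>.
html = (
    '<html><head><meta charset="utf-8"><title>T</title></head>'
    '<body><p>x</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Default search: all descendants of <html> are examined.
deep = soup.html.find_all("meta")

# Direct children of <html> only: <head> and <body>, so no <meta>.
shallow_html = soup.html.find_all("meta", recursive=False)

# <meta> is a direct child of <head>, so it is found here.
shallow_head = soup.head.find_all("meta", recursive=False)

# Passing True with recursive=False lists the direct child tags.
children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(len(deep), len(shallow_html), len(shallow_head))  # 1 0 1
print(children)  # ['head', 'body']
```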

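If you only want one result for a particular search query, you can use the find() method instead of passing limit=1 to find_all(). The only difference between the results returned by these two methods is that find_all() returns a list with a single element, while find() returns just the result. When nothing matches, find_all() returns an empty list and find() returns None.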

soup.find_all("h2", limit=1)
# [<h2>Contents</h2>]

soup.find("h2")
# <h2>Contents</h2>
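Because the two methods differ most visibly when nothing matches, it can help to guard against a missing result before using it. A minimal sketch, assuming a two-heading inline document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2>Contents</h2><h2>History</h2>", "html.parser")

first_as_list = soup.find_all("h2", limit=1)  # list with one Tag
first = soup.find("h2")                       # the Tag itself

missing_list = soup.find_all("h4")  # empty list when nothing matches
missing = soup.find("h4")           # None when nothing matches

# Check for None before touching attributes of a find() result.
heading_text = first.get_text() if first is not None else ""
missing_text = missing.get_text() if missing is not None else ""

print(first_as_list)         # [<h2>Contents</h2>]
print(first)                 # <h2>Contents</h2>
print(missing_list, missing) # [] None
print(repr(heading_text), repr(missing_text))  # 'Contents' ''
```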

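The find() and find_all() methods search through all the descendants of a given tag. There are ten other, very similar methods that you can use to traverse the DOM tree in different directions.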

find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)

find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)

find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)

find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)

find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)

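The find_parent() and find_parents() methods travel up the DOM tree, through the ancestors of the current element, to find the given element. The find_next_sibling() and find_next_siblings() methods iterate over all the siblings that come after the current element. Similarly, find_previous_sibling() and find_previous_siblings() iterate over all the siblings that come before the current element.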

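The find_next() and find_all_next() methods iterate over all the tags and strings that come after the current element. Similarly, find_previous() and find_all_previous() iterate over all the tags and strings that come before the current element.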
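The directional methods described above can be sketched on a small document. The HTML and the ids below are assumptions made up for this illustration. Note that find_next() walks the document in parse order, so it can return a descendant of the starting element as well as something that follows it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the ids are invented for this example.
html = (
    '<div id="main">'
    '<h2>Intro</h2>'
    '<p id="first">One</p>'
    '<p id="second">Two <a href="#x">link</a></p>'
    '<p id="third">Three</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")
second = soup.find(id="second")

# Upwards: nearest matching ancestor, then all matching ancestors.
parent_div = link.find_parent("div")["id"]
ancestors = [tag.name for tag in link.find_parents(["p", "div"])]

# Sideways: siblings after and before the current element.
next_p = second.find_next_sibling("p")["id"]
prev_tags = [tag.name for tag in second.find_previous_siblings(True)]

# Document order: everything parsed after or before the current element.
next_link = second.find_next("a")["href"]
earlier_ps = [tag["id"] for tag in link.find_all_previous("p")]

print(parent_div, ancestors)  # main ['p', 'div']
print(next_p, prev_tags)      # third ['p', 'h2']
print(next_link, earlier_ps)  # #x ['second', 'first']
```

The last result includes the paragraph with id second because its opening tag comes before the link in the document, even though the link sits inside it.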

You can also search for elements using CSS selectors with the help of the select() method. Here are a few examples:

len(soup.select("p a"))
# 411

len(soup.select("p > a"))
# 291

soup.select("h2:nth-of-type(1)")
# [<h2>Contents</h2>]

len(soup.select("p > a:nth-of-type(2)"))
# 46

len(soup.select("p > a:nth-of-type(10)"))
# 6

len(soup.select("[class*=section]"))
# 80

len(soup.select("[class$=section]"))
# 20
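The same selectors are easier to reason about on a document whose matches can be counted by hand. This sketch assumes a hypothetical two-section snippet, and also uses select_one(), which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two sections and four links.
html = (
    '<div class="toc-section">'
    '<p><a href="/a">A</a> and <b><a href="/b">B</a></b></p>'
    '</div>'
    '<div class="section-note">'
    '<p><a href="/c">C</a><a href="/d">D</a></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: every <a> anywhere inside a <p>.
descendants = [a["href"] for a in soup.select("p a")]

# Child combinator: only <a> tags that are direct children of a <p>.
children = [a["href"] for a in soup.select("p > a")]

# Second <a> among its siblings, and a direct child of a <p>.
second_links = [a["href"] for a in soup.select("p > a:nth-of-type(2)")]

# Attribute selectors: class contains "section" vs. class ends with "section".
contains = len(soup.select("[class*=section]"))
ends_with = len(soup.select("[class$=section]"))

# select_one() returns the first match (or None) instead of a list.
first_child_link = soup.select_one("p > a")["href"]

print(descendants)          # ['/a', '/b', '/c', '/d']
print(children)             # ['/a', '/c', '/d']
print(second_links)         # ['/d']
print(contains, ends_with)  # 2 1
print(first_child_link)     # /a
```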

Modifying the Tree

You can not only search the DOM tree to find an element but also modify it. It is very easy to rename a tag and modify its attributes.

heading_tag = soup.select("h2:nth-of-type(2)")[0]

heading_tag.name = "h3"
print(heading_tag)
# <h3><span class="mw-headline" id="Features_and_philosophy">Feat...

heading_tag['class'] = 'headingChanged'
print(heading_tag)
# <h3 class="headingChanged"><span class="mw-headline" id="Feat...

heading_tag['id'] = 'newHeadingId'
print(heading_tag)
# <h3 class="headingChanged" id="newHeadingId"><span class="mw....

del heading_tag['id']
print(heading_tag)
# <h3 class="headingChanged"><span class="mw-headline"...
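The listing above depends on the Wikipedia page. As a self-contained sketch of the same operations, the following assumes a single hypothetical heading and also shows two details worth knowing: a parsed class attribute is returned as a list, and get() returns None for an attribute that has been deleted.

```python
from bs4 import BeautifulSoup

# Hypothetical heading with two classes and an id.
soup = BeautifulSoup('<h2 class="a b" id="x">Title</h2>', "html.parser")
tag = soup.h2

# Renaming changes the tag in place; the tree now contains an <h3>.
tag.name = "h3"

# A parsed class attribute is a list of class names.
original_classes = list(tag["class"])

# Assign a list to keep several classes, then delete the id attribute.
tag["class"] = original_classes + ["changed"]
del tag["id"]

print(original_classes)  # ['a', 'b']
print(tag)               # <h3 class="a b changed">Title</h3>
print(tag.get("id"))     # None
print(soup.h2)           # None
```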

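Continuing from the last example, you can replace a tag's contents with a given string using the .string attribute. If you don't want to replace the contents but add something extra at the end of the tag, you can use the append() method.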

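Similarly, if you want to insert something at a specific position inside a tag, you can use the insert() method. The first parameter of this method is the position or index at which you want to insert the content, and the second parameter is the content itself. You can remove all the content inside a tag using the clear() method. This leaves you with just the tag itself and its attributes.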

heading_tag.string = "Features and Philosophy"
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy</h3>

heading_tag.append(" [Appended This Part].")
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy [Appended This Part].</h3>

print(heading_tag.contents)
# ['Features and Philosophy', ' [Appended This Part].']

heading_tag.insert(1, ' Inserted this part ')
print(heading_tag)
# <h3 class="headingChanged">Features and Philosophy Inserted this part  [Appended This Part].</h3>

heading_tag.clear()
print(heading_tag)
# <h3 class="headingChanged"></h3>
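The content passed to append() and insert() does not have to be a plain string. As a rough sketch, assuming a one-heading document, the following inserts a new tag created with new_tag() and then shows that assigning to .string discards every existing child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h3>Features</h3>", "html.parser")
heading = soup.h3

# append() adds content after the existing children.
heading.append(" and philosophy")

# insert() also accepts a Tag; index 0 puts it before everything else.
note = soup.new_tag("em")
note.string = "draft"
heading.insert(0, note)

after_insert = str(heading)
child_count = len(heading.contents)

# Assigning to .string replaces all children, including the <em> tag.
heading.string = "Reset"
after_string = str(heading)

# clear() removes the remaining content but keeps the tag itself.
heading.clear()
after_clear = str(heading)

print(after_insert)  # <h3><em>draft</em>Features and philosophy</h3>
print(child_count)   # 3
print(after_string)  # <h3>Reset</h3>
print(after_clear)   # <h3></h3>
```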

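At the beginning of this section, you selected a level-two heading from the document and changed it to a level-three heading. Using the same selector again will now show you the next level-two heading that came after the original. This makes sense because the original heading is no longer a level-two heading.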

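The original heading can now be selected using h3:nth-of-type(2). If you want to completely remove an element or tag, along with all the content inside it, from the tree, you can use the decompose() method.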

soup.select("h3:nth-of-type(2)")[0]
# <h3 class="headingChanged"></h3>

soup.select("h3:nth-of-type(3)")[0]
# <h3><span class="mw-headline" id="Indentation">Indentation</span>...

soup.select("h3:nth-of-type(2)")[0].decompose()
soup.select("h3:nth-of-type(2)")[0]
# <h3><span class="mw-headline" id="Indentation">Indentation</span>...

Once you have decomposed or removed the original heading, the heading in the third spot takes its place.

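If you want to remove a tag and its contents from the tree but don't want to completely destroy the tag, you can use the extract() method. This method returns the tag that it extracted. You now have two different trees that you can parse. The root of the new tree is the tag that you just extracted.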

heading_tree = soup.select("h3:nth-of-type(2)")[0].extract()

len(heading_tree.contents)
# 2
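The difference between decompose() and extract() can be sketched on a document with three headings. The HTML below is a hypothetical stand-in: one heading is destroyed, while another is detached and remains usable as the root of its own small tree.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three sibling headings.
soup = BeautifulSoup(
    "<div><h3>Keep</h3><h3>Drop</h3><h3>Move <b>me</b></h3></div>",
    "html.parser",
)
headings = soup.find_all("h3")

# decompose() removes the tag from the tree and destroys it.
headings[1].decompose()

# extract() removes the tag from the tree and hands it back intact.
moved = headings[2].extract()

remaining = str(soup)
moved_markup = str(moved)
moved_children = len(moved.contents)
bold_text = moved.find("b").get_text()

print(remaining)       # <div><h3>Keep</h3></div>
print(moved_markup)    # <h3>Move <b>me</b></h3>
print(moved.parent)    # None
print(moved_children)  # 2
print(bold_text)       # me
```

The extracted heading has no parent any more, but it can still be searched like any other tag.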

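You can also replace a tag inside the tree with something else of your choice using the replace_with() method. This method returns the tag or string that it replaced. This can be helpful if you want to put the replaced content somewhere else in the document.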

soup.h1
# <h1 class="firstHeading">Python (programming language)</h1>

bold_tag = soup.new_tag("b")
bold_tag.string = "Python"

soup.h1.replace_with(bold_tag)

print(soup.h1)
# None
print(soup.b)
# <b>Python</b>

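In the code above, the main heading of the document has been replaced with a b tag. The document no longer has an h1 tag, which is why print(soup.h1) now prints None.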
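Since replace_with() returns the element it replaced, that element can be kept and placed somewhere else. The following is a minimal sketch of that idea, assuming a hypothetical document in which the heading and a paragraph share a div. The variable name old_heading is made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a heading and a paragraph inside one <div>.
soup = BeautifulSoup(
    '<div><h1 class="firstHeading">Python</h1><p>Body</p></div>',
    "html.parser",
)

bold_tag = soup.new_tag("b")
bold_tag.string = "Python"

# Keep the return value: it is the <h1> that was swapped out.
old_heading = soup.h1.replace_with(bold_tag)
heading_after_replace = soup.h1  # None, the tree has no <h1> now

# Re-insert the old heading at the end of the <div>.
soup.div.append(old_heading)
final_markup = str(soup)

print(heading_after_replace)  # None
print(final_markup)
# <div><b>Python</b><p>Body</p><h1 class="firstHeading">Python</h1></div>
```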