Python Web Scraping: BeautifulSoup

Latest recommended article published on 2023-07-06 21:52:05

苦涩2020

Category: Python. Tags: BeautifulSoup, Python, Python爬虫开发与项目实战

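Introduction

BeautifulSoup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.

Installing BeautifulSoup and lxml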

pip install beautifulsoup4
pip install lxml

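BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Because lxml parses much faster than the standard-library parser, we install it and use it as the parser here.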
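
As a small illustration of the parser choice, the sketch below prefers lxml and falls back to the standard-library parser when lxml is not installed. The helper name make_soup is hypothetical and not part of bs4; the sample markup is made up.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is available, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Either parser yields the same result for well-formed markup like this; they can differ in how they repair broken HTML.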

Using BeautifulSoup

First, import the BeautifulSoup class from the bs4 package:

from bs4 import BeautifulSoup

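Then create a BeautifulSoup object. There are two ways to do this. The first is to build it directly from a string:

soup = BeautifulSoup(html_str, "lxml")

Here html_str is a string containing the HTML source. The from_encoding argument only applies to bytes input; for a str it is ignored (with a warning) and can be omitted.

The second is to build it from a file:

soup = BeautifulSoup(open("index.html", encoding="utf-8"), "lxml")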

The document is converted to Unicode, and HTML entities are converted to Unicode characters. To print the content of the soup object as formatted output:

print( soup.prettify() )

from bs4 import BeautifulSoup

html_str = """
<html>
<head><title>小说搜索网</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="新版小说搜索小说网" />
<link rel="shortcut icon" href="/favicon.ico" type="image/vnd.microsoft.icon" />
<script type="text/javascript" src="/js/jquery.js?2.0.0.1115"></script>
</head>
<body onload="window.status='小说搜索网';return true"> 
<span class="home-box" id="homeBox">hhhh</span>
<div style="width:760px;height:60px;margin:auto;border:px solid #ccc;">
<a href="http://hhhhh.cc/"><!--HHHH--></a>
</div>
</body>
</html>
"""

# from_encoding only applies to bytes input, so it is not needed for a str
soup = BeautifulSoup(html_str, "lxml")

print(soup.prettify())

"""
---------------Output--------------
<html>
 <head>
  <title>
   小说搜索网
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="新版小说搜索小说网" name="description"/>
  <link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
  <script src="/js/jquery.js?2.0.0.1115" type="text/javascript">
  </script>
 </head>
 <body onload="window.status='小说搜索网';return true">
  <span class="home-box" id="homeBox">hhhh
  </span>
  <div style="width:760px;height:60px;margin:auto;border:px solid #ccc;">
  <a href="http://hhhhh.cc/"><!--HHHH--></a>
  </div>
 </body>
</html>
"""

Kinds of objects

BeautifulSoup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

Tag

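A Tag object corresponds to a tag in the original XML or HTML document. For example, in <title>小说搜索网</title> or <a href="http://hhhhh.cc/"><!--HHHH--></a>, the title and a tags together with their contents are Tag objects. How do we extract a Tag from html_str? For example: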

# Extract title:
print(soup.title)
# Extract span:
print(soup.span)
# Extract p (html_str has no <p>, so this would print None):
# print(soup.p)
>++++++++++++++++++++++++++
<title>小说搜索网</title>
<span class="home-box" id="homeBox">hhhh</span>

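Note that this approach returns only the first matching tag in the document.

A Tag has two most important properties: its name and its attributes. Every Tag has a name, available through .name:

print(soup.name)
print(soup.title.name)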
>++++++++++++++++++++
[document]
title

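The soup object itself is special: its name is [document]. For any other tag, the value is the tag's own name.

A Tag's name can be changed as well as read. The change affects all HTML output generated from the current BeautifulSoup object:

# Rename the title tag to mytitle
soup.title.name = "mytitle"
print(soup.title)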
print(soup.mytitle)
>++++++++++++++++++++++
None
<mytitle>小说搜索网</mytitle>

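Tags also have attributes. For example, <span class="home-box" …> has a class attribute whose value is home-box. Tag attributes are accessed the same way as dictionary entries (class is multi-valued, so its value comes back as a list):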

print(soup.span["class"])
print(soup.span.get("class"))
print(soup.a["href"])
>+++++++++++++++++++++++++++
['home-box']
['home-box']
http://hhhhh.cc/

All attributes of a Tag are also available as a dictionary through .attrs:

print(soup.span.attrs)
print(soup.a.attrs)
>++++++++++++++++++++++++++++
{'class': ['home-box'], 'id': 'homeBox'}
{'href': 'http://hhhhh.cc/'}

As with name, the attributes of a tag can be modified:

soup.span["class"] = "myclass"
soup.a["href"] = "www.github.com"

print(soup.span)
print(soup.a)
>++++++++++++++++++++++++++++++++++++++
<span class="myclass" id="homeBox">hhhh</span>
<a href="www.github.com"><!--HHHH--></a>
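
Since attributes behave like dictionary entries, they can also be added, deleted, and read with a default. The following is a minimal sketch on a one-tag document; it uses the built-in html.parser so that it does not depend on lxml, and the title attribute is an invented example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="home-box" id="homeBox">hhhh</span>', "html.parser")
span = soup.span

span["title"] = "greeting"   # add a new attribute
del span["id"]               # remove an existing attribute

print(span.get("id"))             # missing attributes give None with get()
print(span.get("id", "missing"))  # or a default of your choice
print(span.attrs)
```

Using get() avoids the KeyError that span["id"] would raise once the attribute is gone.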

NavigableString
We now have the tag itself. To get the text inside it, use .string:

print(soup.title.string)
print(type(soup.title.string))

>++++++++++++++++++++++++++++++++++++
小说搜索网
<class 'bs4.element.NavigableString'>

BeautifulSoup wraps the strings inside tags in the NavigableString class. A NavigableString behaves like a Python Unicode string, and it can be converted to a plain string with str() (unicode() in Python 2):

unicode_string = str(soup.title.string)

BeautifulSoup

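The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special kind of Tag. Because it does not correspond to a real HTML or XML tag, it has no real name or attributes. To keep the interface uniform with Tag, however, it still exposes .name and .attrs: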

print(type(soup.name))
print(soup.name)
print(soup.attrs)
>+++++++++++++++++++++++++++++++
<class 'str'>
[document]
{}

Comment

The comment parts of the document

print(soup.a.string)
print(type(soup.a.string))
>++++++++++++++++++++++++++++++++
HHHH
<class 'bs4.element.Comment'>

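The content of the a tag is actually a comment, but .string returns it with the comment delimiters stripped. Printing its type shows that it is a Comment.
If we do not know what kind of object a tag's .string holds, comment text can get mixed into the extracted data. It is therefore worth checking the type when extracting strings:

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string)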
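
The same type check can be turned around to skip comments during extraction. This is a minimal sketch, assuming a made-up fragment and a hypothetical helper named visible_string; it uses html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<div><a href="#"><!--HHHH--></a><span>hhhh</span></div>'
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Return tag.string as a plain str, or None if it is absent or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_string(soup.a))
print(visible_string(soup.span))
```

isinstance is used rather than comparing type names because Comment is a subclass of NavigableString.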

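Traversing the document tree

BeautifulSoup converts the HTML into a document tree for searching. Since it is a tree, the notion of a node is essential.

Child nodes
Let's start with direct children. The .contents and .children attributes of a Tag are the important ones here.

The .contents attribute returns a Tag's children as a list: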

print(soup.head.contents)
>++++++++++++++++++++++++++
['\n', <title>小说搜索网</title>, '\n', <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, '\n', <meta content="新版小说搜索小说网" name="description"/>, '\n', <link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>, '\n', <script src="/js/jquery.js?2.0.0.1115" type="text/javascript"></script>, '\n']

Since the result is a list, we can take its length and index into it:

print(len(soup.head.contents))
print(soup.head.contents[1].string)

>++++++++++++++++++++++++++++++++++++
11
小说搜索网

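One thing to note: strings have no .contents attribute, because a string cannot have children.

The .children attribute returns an iterator that can be used to loop over a Tag's direct children: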

for child in soup.head.children:
    print(child)
    
>+++++++++++++++++++++++++++++++++
<title>小说搜索网</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="新版小说搜索小说网" name="description"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<script src="/js/jquery.js?2.0.0.1115" type="text/javascript"></script>

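.contents and .children only cover a Tag's direct children. For example, head has title as a direct child, but title itself has a child, the string "小说搜索网", which is therefore a descendant of head. The .descendants attribute iterates recursively over all descendants of a Tag: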

for child in soup.head.descendants:
    print(child)
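
To make the difference between direct children and descendants concrete, here is a minimal sketch on a cut-down head element. It uses html.parser so that it does not depend on lxml; the fragment is a simplified stand-in for html_str.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>小说搜索网</title></head>", "html.parser")
head = soup.head

children = list(head.children)        # direct children only
descendants = list(head.descendants)  # children, grandchildren, and so on

print(len(children), len(descendants))
for node in descendants:
    print(type(node).__name__, repr(node))
```

The title tag appears in both lists, while the string inside it only appears among the descendants.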

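Getting node content

The .string attribute has a notable behavior. If a tag contains no other tags, .string returns its text content. If a tag contains exactly one child tag, .string returns the innermost content as well. If a Tag has several children, it cannot decide which child's content .string should refer to, and the result is None.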

print(soup.head.string)
print(soup.title.string)
>++++++++++++++++++++++++
None
小说搜索网

The .strings attribute is meant for tags that contain more than one string; it can be iterated over:

for string in soup.strings:
    print(repr(string))
    
>+++++++++++++++++++++++++++++++++
'\n'
'\n'
'小说搜索网'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'hhhh'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'

The .stripped_strings attribute does the same but strips surrounding whitespace and skips strings that consist only of whitespace:

for string in soup.stripped_strings:
    print(repr(string))
    
>+++++++++++++++++++++
'小说搜索网'
'hhhh'

get_text()
If you only want the text inside a tag, use the get_text() method. It collects all text in the tag, including text in descendant tags, and returns it as a single Unicode string.

print(soup.get_text())

++++++++++++++++
小说搜索网






hhhh
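
The blank lines in that output come from the whitespace between tags. get_text() also accepts separator and strip arguments that tidy this up; the sketch below shows them on a made-up fragment, parsed with html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = "<div>\n<p> Hello </p>\n<p>world</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()                             # keeps all whitespace
clean = soup.get_text(separator=" ", strip=True)  # strips each piece, joins with a space

print(repr(raw))
print(repr(clean))
```

With strip=True, pieces that are empty after stripping are dropped before joining.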

get()

tag.get(attr) returns the value of the attr attribute on the tag:

for meta in soup.find_all("meta"):
    print(meta.get("content"))
-------------
text/html; charset=UTF-8
新版小说搜索小说网

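Parent nodes

Every Tag and every string has a parent: the Tag that contains it.
The .parent attribute gives an element's parent. Here the head tag is the parent of the title tag: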

print(soup.title)
print(soup.title.parent)

>++++++++++++++++++++++++++++++++++++++
<title>小说搜索网</title>
<head>
<title>小说搜索网</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="新版小说搜索小说网" name="description"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<script src="/js/jquery.js?2.0.0.1115" type="text/javascript"></script>
</head>

The .parents attribute iterates over all of an element's ancestors. The example below uses .parents to walk from the a tag up to the root:

for parent in soup.a.parents:
    print(parent.name)

>+++++++++++++++++++++++++++++
div
body
html
[document]

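Sibling nodes

In the soup.prettify() output we can see that title has several siblings. Siblings are nodes at the same level under the same parent. The .next_sibling attribute returns a node's next sibling and .previous_sibling returns the previous one; if there is no such node, the result is None.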

print(repr(soup.title.next_sibling))
print(soup.title.next_sibling.next_sibling)
print(repr(soup.meta.previous_sibling))
>+++++++++++++++++++++++++++++++++
'\n'
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
'\n'

The first result is whitespace: whitespace and line breaks between tags are nodes too, so a sibling may turn out to be a blank string or a newline.

The .next_siblings and .previous_siblings attributes iterate over the siblings of the current node:

for sibling in soup.title.next_siblings:
    print(repr(sibling))
>++++++++++++++++++++++++++
'\n'
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
'\n'
<meta content="新版小说搜索小说网" name="description"/>
'\n'
<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
'\n'
<script src="/js/jquery.js?2.0.0.1115" type="text/javascript"></script>
'\n'
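
Because whitespace strings show up between tags, it is often useful to skip ahead to the next sibling that is an actual tag. A minimal sketch follows; next_tag_sibling is a hypothetical helper, the fragment is made up, and html.parser is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<head><title>t</title>\n<meta name='a'/>\n<link href='x'/></head>"
soup = BeautifulSoup(html, "html.parser")


def next_tag_sibling(node):
    """Return the first following sibling that is a Tag, skipping strings."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None


print(next_tag_sibling(soup.title))
print(next_tag_sibling(soup.link))
```

The second call returns None because link is the last tag inside head.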

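Next and previous elements

Next and previous elements are reached through the .next_element and .previous_element attributes. Unlike .next_sibling and .previous_sibling, they are not limited to siblings but walk over all nodes regardless of level. For example, in <head><title>小说搜索网</title></head>, the next element after head is title.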

print(soup.head)

print(soup.head.next_element)

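To iterate over all following or preceding nodes, use the .next_elements and .previous_elements iterators, which move forward or backward through the parsed content as if the document were being parsed: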

for element in soup.span.next_elements:
    print(element)

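Searching the document tree

The find_all() method

find_all() searches the descendant tags of the current Tag and checks each one against the filter conditions. Its signature is:

find_all(name, attrs, recursive, text, limit, **kwargs)

1. The name parameter
The name parameter finds all tags with the given name; string nodes are ignored automatically. It accepts a string, a regular expression, a list, True, or a function.
The simplest filter is a string. Pass a string to a search method and BeautifulSoup looks for tags whose name matches it exactly. The example below finds all meta tags in the document; the return value is a list: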

print(soup.find_all("meta"))
>+++++++++++++++++++++++++++
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, <meta content="新版小说搜索小说网" name="description"/>]

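If a regular expression is passed, BeautifulSoup filters tag names with the pattern's search() method. The example below finds every tag whose name starts with h, so both html and head should be found: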

import re

for tag in soup.find_all(re.compile("^h")):
    print(tag.name)
>++++++++++++++++++++++++++++++
html
head

If a list is passed, BeautifulSoup returns everything that matches any item in the list:

print(soup.find_all(["title", "a"]))
>++++++++++++++++++++++++++++++++++++++
[<title>小说搜索网</title>, <a href="http://hhhhh.cc/"><!--HHHH--></a>]

The value True matches any tag. The code below finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)
>+++++++++++++++++++++++++++++  
html
head
title
meta
meta
link
script
body
span
div
a

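If no built-in filter fits, you can define a function that takes a Tag as its only argument and returns True when the element matches and False otherwise. For example, to keep elements that have both a class and an id attribute: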

def hasClass_Id(tag):
    return tag.has_attr("class") and tag.has_attr("id")

print(soup.find_all(hasClass_Id))

>+++++++++++++++++++++++++++++++
[<span class="home-box" id="homeBox">hhhh</span>]
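
A function filter can combine the tag name with attribute values as well. The sketch below keeps only a tags whose href starts with http; is_external_link is a hypothetical name, the markup is invented, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://hhhhh.cc/">site</a>'
    '<a href="/local">local</a>'
    '<link href="http://hhhhh.cc/favicon.ico"/>'
)
soup = BeautifulSoup(html, "html.parser")


def is_external_link(tag):
    """Match <a> tags whose href starts with http."""
    return tag.name == "a" and tag.get("href", "").startswith("http")


links = soup.find_all(is_external_link)
print(links)
```

The link tag is rejected by the name check and the second a tag by the href check.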

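2. The **kwargs parameter

kwargs are keyword arguments in Python. Any keyword argument whose name is not one of the search method's own parameters is treated as a filter on the tag attribute of that name. The value can be a string, a regular expression, a list, or True.
For example, passing id makes BeautifulSoup filter on each Tag's id attribute: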

print(soup.find_all(id = "homeBox"))
>+++++++++++++++++++++++++++++++++++++
[<span class="home-box" id="homeBox">hhhh</span>]

Passing href filters on each tag's href attribute. For example, to find the tags whose href contains "hhhhh" and those whose href contains "icon":

import re
print(soup.find_all( href =re.compile("hhhhh")))
print(soup.find_all( href =re.compile("icon")))
>+++++++++++++++++++++++++++++++++++++++++++++++++
[<a href="http://hhhhh.cc/"><!--HHHH--></a>]
[<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>]

To find all tags that have an href attribute, and all tags that have an id attribute, whatever the value:

print(soup.find_all(href = True))
print(soup.find_all(id = True))

>++++++++++++++++++++++++++++++++++++
[<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>, <a href="http://hhhhh.cc/"><!--HHHH--></a>]
[<span class="home-box" id="homeBox">hhhh</span>]

class is a Python keyword, so to filter by class, use class_ with a trailing underscore:

print(soup.find_all("span", class_ = "home-box"))
>++++++++++++++++++++++++++++++++++++++++++++++++++
[<span class="home-box" id="homeBox">hhhh</span>]

Several keyword arguments can be combined to filter on multiple attributes at once:

import re
print(soup.find_all(href = re.compile("ico"), rel = "shortcut icon"))
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
[<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>]

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo= "value">foo! </div>')
data_soup.find_all(data-foo = "value")

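This code is not valid Python, because data-foo is not a legal keyword name. Instead, pass a dictionary through the attrs parameter of find_all() to search for tags with such attributes: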

data_soup = BeautifulSoup('<div data-foo= "value">foo!</div>')
data_soup.find_all(attrs= {"data-foo" : "value"})
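
For reference, here is a runnable version of the attrs approach with its result printed. It names the parser explicitly (html.parser, so no lxml is needed); the second div is added only to show that non-matching values are filtered out.

```python
from bs4 import BeautifulSoup

html = '<div data-foo="value">foo!</div><div data-foo="other">bar</div>'
data_soup = BeautifulSoup(html, "html.parser")

# The attribute name is a dictionary key, so the hyphen is not a problem.
matches = data_soup.find_all(attrs={"data-foo": "value"})
print(matches)
```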

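3. The text parameter
The text parameter searches the strings in the document. Like name, it accepts a string, a regular expression, a list, or True. (Since bs4 4.4.0 the same parameter is also available under the name string.)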

print(soup.find_all(text = "小说搜索网"))
print(soup.find_all(text = ["小说搜索网", "hhhh"]))
>++++++++++++++++++++++++++++
['小说搜索网']
['小说搜索网', 'hhhh']

Although text is meant for searching strings, it can be combined with other arguments to filter tags: BeautifulSoup then finds the tags whose .string matches the text value.

print(soup.find_all("a", text = "HHHH"))
>+++++++++++++++++++++++++++++++++++
[<a href="http://hhhhh.cc/"><!--HHHH--></a>]

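4. The limit parameter

find_all() returns every match, which can be slow on a large document tree. If not all results are needed, the limit parameter caps how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns.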

print(soup.find_all("meta"))
print(soup.find_all("meta", limit = 1))
>++++++++++++++++++++++++++++++++++++++
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, <meta content="新版小说搜索小说网" name="description"/>]
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>]

5. The recursive parameter
When find_all() is called on a tag, BeautifulSoup examines all of its descendants. To search only the tag's direct children, pass recursive=False:

print(soup.find_all("title"))
print(soup.find_all("title", recursive=False))
>++++++++++++++++++++++++++++++++++++++++++++++
[<title>小说搜索网</title>]
[]

The find(name, attrs, recursive, text, **kwargs) method
The only difference from find_all() is that find_all() returns a list of all matches, whereas find() returns just the first match directly (or None when nothing matches).
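
The difference in return values matters most when nothing matches. A minimal sketch on a made-up two-paragraph document, parsed with html.parser so it does not depend on lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first = soup.find("p")              # a single Tag
all_p = soup.find_all("p")          # a list of Tags
missing = soup.find("table")        # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first, all_p, missing, missing_all)
```

Code that chains attribute access after find() should therefore check for None first, while a loop over find_all() simply runs zero times.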

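find_parents(name, attrs, text, limit, **kwargs)
find_parent(name, attrs, text, **kwargs)
find_all() and find() search only the descendants of the current node (children, grandchildren, and so on). find_parents() and find_parent() search the ancestors of the current node instead, using the same filters as the ordinary tag search methods.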
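
As a brief illustration of searching upward, the sketch below starts from an a tag nested in two div elements. The markup and ids are made up, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><div id="inner"><a href="#">x</a></div></div>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

nearest = a.find_parent("div")                          # closest matching ancestor
all_ids = [d["id"] for d in a.find_parents("div")]      # from nearest to outermost

print(nearest["id"])
print(all_ids)
```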

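CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot (.), and an id is prefixed with #. The same syntax can be used to select elements here through soup.select(), which returns a list.

1. Selecting by tag name
Tag names can be matched directly or level by level, and combinators select a tag's direct children or its siblings.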

# Find the title tag directly
print(soup.select("title"))
# Find the title tag by descending level by level
print(soup.select("html head title"))
# Direct child: find the title tag directly under head
print(soup.select("head > title"))
###########################################
[<title>小说搜索网</title>]
[<title>小说搜索网</title>]
[<title>小说搜索网</title>]

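# The selectors below target a document with <p> and class="sister" tags; html_str has none, so each returns [] here
# Find the tag with id="link1" directly under a p tag
print(soup.select("p > #link1"))
# Following siblings: find all siblings with class="sister" after the tag with id="link1"
print(soup.select("#link1 ~ .sister"))
# Adjacent sibling: find the class="sister" tag immediately after the tag with id="link1"
print(soup.select("#link1 + .sister"))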
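
The child and sibling combinators are easier to see on a document that actually contains such elements. The sketch below uses a made-up paragraph with three links of class sister (not part of html_str) and html.parser, so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

child = soup.select("p > #link1")           # direct child of p
following = soup.select("#link1 ~ .sister")  # all later siblings with class sister
adjacent = soup.select("#link1 + .sister")   # only the next element sibling

print([a["id"] for a in child])
print([a["id"] for a in following])
print([a["id"] for a in adjacent])
```

Both sibling combinators look at element siblings only, so the text between the links does not get in the way.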

2. Selecting by CSS class name

print(soup.select(".home-box"))
print(soup.select("[class~=home-box]"))
#######################################
[<span class="home-box" id="homeBox">hhhh</span>]
[<span class="home-box" id="homeBox">hhhh</span>]

3. Selecting by id

print(soup.select("#homeBox"))
print(soup.select("span#homeBox"))
##################################
[<span class="home-box" id="homeBox">hhhh</span>]
[<span class="home-box" id="homeBox">hhhh</span>]

4. Selecting by the presence of an attribute

print(soup.select("a[href]"))
print(soup.select("link[href]"))
#################################
[<a href="http://hhhhh.cc/"><!--HHHH--></a>]
[<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>]

5. Selecting by attribute value

print(soup.select('a[href="http://hhhhh.cc/"]'))
print(soup.select('a[href^="http://hhhhh"]'))
print(soup.select('link[href$="ico"]'))
print(soup.select('a[href*="h.c"]'))
##########################################
[<a href="http://hhhhh.cc/"><!--HHHH--></a>]
[<a href="http://hhhhh.cc/"><!--HHHH--></a>]
[<link href="/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>]
[<a href="http://hhhhh.cc/"><!--HHHH--></a>]
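
When only the first match is wanted, select_one() plays the same role for CSS selectors that find() plays for find_all(). A minimal sketch on made-up markup, parsed with html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = '<a href="http://hhhhh.cc/">site</a><a href="/about">about</a>'
soup = BeautifulSoup(html, "html.parser")

first_http = soup.select_one('a[href^="http"]')  # first match, as a Tag
no_table = soup.select_one("table")              # no match: None

print(first_http)
print(no_table)
```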