HTML Parsing, Part 4: Using BeautifulSoup4

Author: 磊布斯
Article link: https://blog.csdn.net/zhang__init__/article/details/78314079
#coding:utf8

# Part 1: Quick start
# Import the bs4 library
from bs4 import BeautifulSoup
# Create a string that contains HTML code
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
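# Create the BeautifulSoup object
# Method 1: create it directly from a string
# (in Python 3 html_str is already text, so no from_encoding argument is needed; the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(html_str, 'lxml')
# Method 2: create it from a file, e.g. after saving html_str as index.html
# soup = BeautifulSoup(open('index.html'), 'lxml')
# The document is converted to Unicode, and HTML entities are converted to Unicode characters
# Print the contents of the soup object
print(soup.prettify())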



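# Part 2: Object types
# BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object
# All objects fall into four kinds:
# Tag: a markup tag            NavigableString: the text inside a tag
# BeautifulSoup: a special Tag for the whole document     Comment: a comment in the document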

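# (1) Tag
# The title and a tags, together with their contents, are Tag objects
# Extract Tags from html_str (each expression returns the first matching tag)
# Extract title: print(soup.title)
# Extract a: print(soup.a)
# Extract p: print(soup.p)

# A Tag has two important attributes: .name gives the tag name, .attrs gives the tag's attributes
print(soup.name)        # the soup object itself reports the special name [document]
print(soup.title.name)  # this line prints: title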

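# A Tag's name can be modified as well as read;
# the change affects all HTML generated from the current BeautifulSoup object
soup.title.name = 'mytitle'
print(soup.title)    # None: there is no <title> tag any more
print(soup.mytitle)
# At this point the title tag has been renamed to mytitle
# Rename it back so that the later examples using soup.title still work
soup.mytitle.name = 'title'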

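# Tag attributes: in <p class="title"><b>The Dormouse's story</b></p>
# the class attribute has the value "title". Tag attributes are accessed the same way as a dictionary:
print(soup.p['class'])
print(soup.p.get('class'))

# All attributes can also be fetched at once with dot access
print(soup.p.attrs)   # .attrs returns every attribute of the Tag as a dictionary

# Attributes and content can be modified, just like name
soup.p['class'] = "myClass"
print(soup.p)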
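# A minimal sketch of the dictionary-style attribute access above, run on a separate snippet.
# The markup, the variable name demo and the built-in 'html.parser' are illustrative assumptions, not part of the original script.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p class="title main" id="intro"><b>text</b></p>', 'html.parser')

# class is a multi-valued attribute, so it comes back as a list
classes = demo.p['class']
# a single-valued attribute such as id comes back as a plain string
intro_id = demo.p['id']
# .get() returns None for a missing attribute instead of raising KeyError
missing = demo.p.get('href')
print(classes, intro_id, missing)

# assignment works like a dictionary as well
demo.p['class'] = 'myClass'
print(demo.p.attrs)
```

# Using .get() is the safer choice when a tag may not carry the attribute at all.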

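# (2) NavigableString
# Once you have a tag, use .string to get the text inside it
print(soup.p.string)
print(type(soup.p.string))

# BeautifulSoup wraps the strings inside a Tag in the NavigableString class
# A NavigableString behaves like a Python Unicode string;
# in Python 3, str() converts a NavigableString into a plain Unicode string (Python 2 used unicode()):
unicode_string = str(soup.p.string)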

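# (3) BeautifulSoup
# The BeautifulSoup object represents the whole document and can be treated as a special Tag object;
# it is not a real HTML or XML tag, so it has no tag name or attributes of its own;
# to offer the same interface as a Tag, it still exposes .name and .attrs
print(type(soup.name))
print(soup.name)    # [document]
print(soup.attrs)   # {}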

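# (4) Comment
# The comment parts of the document
print(soup.a.string)
print(type(soup.a.string))
# The content of the first a tag is a comment; .string returns it without the comment markers, and its type is Comment;
# if we do not know what .string holds for a tag, comments can get mixed into the extracted data;
# so when extracting strings, check the type first:
from bs4.element import Comment
if isinstance(soup.a.string, Comment):
    print(soup.a.string)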
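# One way to apply the type check above is a small helper that hides comments from the extracted text.
# This is only a sketch: visible_text is a hypothetical helper name, and the snippet and 'html.parser' are assumptions chosen to keep it self-contained.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text(tag):
    """Return tag.string as a plain str, or None if it is missing or a Comment."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)

demo = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a>',
    'html.parser',
)
texts = [visible_text(a) for a in demo.find_all('a')]
print(texts)  # [None, 'Tillie']
```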



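# Part 3: Navigating the document tree
# BeautifulSoup turns the HTML into a document tree that can be searched
# As in any tree structure, the basic concept is the node

# (1) Child nodes
# A Tag's child nodes: .contents and .children
# The .contents attribute returns a Tag's children as a list:
print(soup.head.contents)

# Because the result is a list, you can take its length and index into it:
print(len(soup.head.contents))
print(soup.head.contents[0].string)

# Note: a string has no child nodes, so it has no .contents attribute

# The .children attribute returns an iterator over a Tag's children:
for child in soup.head.children:
    print(child)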

# .contents and .children contain only a Tag's direct children.
# For example: <head><title>The Dormouse's story</title></head>
# The <head> tag has a single child, <title>;
# the <title> tag in turn has one child: the string "The Dormouse's story";
# that string is therefore a descendant of <head>, but not a direct child;
# .descendants iterates recursively over all of a Tag's descendants:
for child in soup.head.descendants:
    print(child)
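# The difference between direct children and descendants can be checked on the <head> example alone.
# A minimal sketch; the variable names and the 'html.parser' choice are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

demo = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'html.parser')

children = list(demo.head.children)        # only the <title> tag
descendants = list(demo.head.descendants)  # the <title> tag, then its string

print(len(children), len(descendants))  # 1 2
for node in descendants:
    kind = 'tag' if isinstance(node, Tag) else 'string'
    print(kind, node)
```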

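# The above covers getting child nodes; next is getting their text content,
# which uses three attributes: .string, .strings and .stripped_strings
#  .string  If a tag contains no other tag, .string returns the text inside it;
           # if a tag contains exactly one tag, it returns the innermost text;
           # if a tag has more than one child node, it returns None
print(soup.head.string)
print(soup.title.string)
print(soup.html.string)

#  .strings  For a tag that contains several strings; iterate over them:
for string in soup.strings:
    print(repr(string))

#  .stripped_strings  Same, but strips surrounding whitespace and skips whitespace-only strings
for string in soup.stripped_strings:
    print(repr(string))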
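# The three attributes side by side on one small paragraph.
# A sketch under assumptions: the snippet is made up for illustration and is parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p>  Once  <b>upon</b>\n a time </p>', 'html.parser')

single = demo.p.string                    # None: <p> has more than one child
raw = list(demo.p.strings)                # every string, whitespace untouched
clean = list(demo.p.stripped_strings)     # stripped, whitespace-only strings skipped

print(single)
print(raw)
print(clean)  # ['Once', 'upon', 'a time']
```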


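# (2) Parent nodes
# Every Tag or string has a parent: the Tag that contains it
# Use .parent to get an element's parent.
# In html_str, for example, the <head> tag is the parent of the <title> tag:
print(soup.title)
print(soup.title.parent)

# The .parents attribute iterates over all of an element's ancestors;
# here it walks from the <a> tag up to the root:
print(soup.a)
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)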


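# (3) Sibling nodes
# Nodes at the same level as the current node
# soup.prettify() shows the tree layout, which makes siblings easy to spot
# .next_sibling returns the next sibling
# .previous_sibling returns the previous sibling
# Whitespace and line breaks between tags also count as nodes
# .next_siblings and .previous_siblings iterate over the siblings after or before the current node:
for sibling in soup.a.next_siblings:
    print(repr(sibling))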

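# (4) Next and previous elements
# These ignore levels: they follow the order in which nodes were parsed rather than sibling relationships
# For <head><title>The Dormouse's story</title></head>, the next element after head is title
print(soup.head)
print(soup.head.next_element)

# .next_elements iterates over all following nodes; .previous_elements iterates over all preceding nodes
for element in soup.a.next_elements:
    print(repr(element))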
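# Siblings and next elements are easy to confuse, so here is a small comparison.
# A sketch only: the snippet, the variable names and 'html.parser' are assumptions; find_next_sibling() is used to skip over text nodes.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<head><title>T</title></head>'
    '<body><a id="link1">A</a>,\n<a id="link2">B</a></body>',
    'html.parser',
)

head = demo.head
print(head.next_sibling.name)   # body: the next node on the same level
print(head.next_element.name)   # title: the next node in parse order

first_link = demo.a
text_sibling = first_link.next_sibling             # the text ',\n' between the links
tag_sibling = first_link.find_next_sibling('a')    # the next <a> tag, text skipped
print(repr(str(text_sibling)), tag_sibling['id'])
```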


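# Part 4: Searching the document tree
# BeautifulSoup has many search methods; find_all() is covered here, and the others work similarly
# find_all() searches the Tag nodes below the current Tag and checks each against the filter conditions
# Signature: find_all(name, attrs, recursive, text, limit, **kwargs)
# (since Beautiful Soup 4.4.0 the text argument is also available under the name string)

# (1) The name parameter
# name finds all tags with the given name; string nodes are ignored automatically
# name can be a string, a regular expression, a list, True or a function
# The simplest filter is a string: BeautifulSoup looks for tags whose name matches the string exactly
# The following example finds all <b> tags in the document and returns them as a list:
print(soup.find_all('b'))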

# If you pass a regular expression, BeautifulSoup filters tag names with the pattern's search() method
# Find all tags whose names start with b, so both <body> and <b> are found:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
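# Why the ^ anchor matters: an unanchored pattern matches anywhere in the tag name.
# A minimal sketch on a made-up snippet, parsed with the built-in 'html.parser' (both are assumptions).

```python
import re
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<html><head><title>T</title></head><body><b>bold</b></body></html>',
    'html.parser',
)

# anchored: only names that start with "b"
starts_with_b = [tag.name for tag in demo.find_all(re.compile('^b'))]
# unanchored: any name that contains "t"
contains_t = [tag.name for tag in demo.find_all(re.compile('t'))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```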

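# If you pass a list, BeautifulSoup returns whatever matches any item in the list
# The following code finds all <a> and <b> tags in the document:
print(soup.find_all(["a","b"]))

# If you pass True, it matches any value
# The following code finds every tag but returns no string nodes:
for tag in soup.find_all(True):
    print(tag.name)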

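# If no other filter fits, you can define a function that takes a single argument, a Tag node
# The function returns True if the element matches, and False otherwise
# For example, a filter for elements that have both a class attribute and an id attribute:
def hasClass_Id(tag):
    return tag.has_attr('class') and tag.has_attr('id')
print(soup.find_all(hasClass_Id))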
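# The same kind of function filter, checked on a snippet where only one tag qualifies.
# A sketch under assumptions: has_class_and_id is an illustrative name, and the markup and 'html.parser' are made up for the example.

```python
from bs4 import BeautifulSoup

def has_class_and_id(tag):
    """Match tags that carry both a class and an id attribute."""
    return tag.has_attr('class') and tag.has_attr('id')

demo = BeautifulSoup(
    '<p class="story"><a class="sister" id="link1">x</a><b id="only-id">y</b></p>',
    'html.parser',
)

matched = [tag.name for tag in demo.find_all(has_class_and_id)]
print(matched)  # ['a']: <p> has no id and <b> has no class
```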



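# (2) The kwargs parameter
# kwargs stands for keyword arguments in Python
# A keyword argument whose name is not one of find_all()'s own parameters is treated as a filter on the tag attribute of that name
# Attribute filters accept a string, a regular expression, a list or True
# With an id argument, BeautifulSoup filters on each Tag's "id" attribute:
print(soup.find_all(id='link2'))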

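# With an href argument, BeautifulSoup filters on each Tag's "href" attribute
# For example, find the tags whose href contains "elsie":
import re
print(soup.find_all(href=re.compile("elsie")))

# The following code finds every Tag that has an id attribute, whatever its value:
print(soup.find_all(id=True))

# To filter on class, which is a Python keyword, append an underscore:
print(soup.find_all("a", class_="sister"))

# Several keyword arguments can be combined to filter on several attributes at once:
print(soup.find_all(href=re.compile("elsie"), id='link1'))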

# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes:
# data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")
# This is not valid Python, but you can pass a dictionary through find_all()'s attrs parameter to search for such attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
print(data_soup.find_all(attrs={"data-foo": "value"}))
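# A slightly larger sketch of the attrs dictionary, with two divs so the filter has something to reject.
# The second div, the variable names and 'html.parser' are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    'html.parser',
)

# attrs takes a plain dictionary, so any attribute name is allowed as a key
exact = demo.find_all(attrs={"data-foo": "value"})
# True as the value matches every tag that has the attribute at all
any_value = demo.find_all(attrs={"data-foo": True})

print([tag.get_text() for tag in exact])  # ['foo!']
print(len(any_value))                     # 2
```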


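# (3) The text parameter
# text searches the string content of the document
# Like name, text accepts a string, a regular expression, a list or True
# Note: in this html_str the first two links hold comments (" Elsie ", " Lacie "), so the exact string "Elsie" has no match here
print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie","Elsie","Lacie"]))
print(soup.find_all(text=re.compile("Dormouse")))

# text searches strings, but it can also be combined with other parameters to filter tags
# BeautifulSoup then finds the tags whose .string matches the text value
# The following code searches for <a> tags whose content is "Elsie":
print(soup.find_all("a", text="Elsie"))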


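# (4) The limit parameter
# find_all() returns every result, which can be slow on a large tree. limit caps the number of results, much like the LIMIT keyword in SQL
# In the following example three tags match, but only two are returned because of the limit:
print(soup.find_all("a", limit=2))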


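# (5) The recursive parameter
# When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants
# To search only the tag's direct children, pass recursive=False
print(soup.find_all("title"))
print(soup.find_all("title", recursive=False))   # []: <title> is not a direct child of the document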
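# limit and recursive together on one small tree, calling find_all() on the <html> tag so the direct-child case is visible.
# A sketch only; the markup, the variable names and 'html.parser' are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<html><head><title>T</title></head>'
    '<body><a>1</a><a>2</a><a>3</a></body></html>',
    'html.parser',
)

all_titles = demo.html.find_all('title')                       # searches every descendant
direct_titles = demo.html.find_all('title', recursive=False)   # <title> is a grandchild, so nothing
direct_heads = demo.html.find_all('head', recursive=False)     # <head> is a direct child
first_two = demo.find_all('a', limit=2)                        # stops after two matches

print(len(all_titles), direct_titles, [t.name for t in direct_heads])
print([a.get_text() for a in first_two])  # ['1', '2']
```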


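# Part 5: CSS selectors
# On the web, CSS is used to locate elements: a tag name is written as is, a class name is prefixed with . and an id with #
# The same syntax can be used to select elements here; soup.select() returns a list
# (1) Selecting by tag name
# You can select a tag directly, select through nested levels, or select the direct children and siblings of a tag

# Select the title tag directly
print(soup.select("title"))
# Select the title tag level by level
print(soup.select("html head title"))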

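# Select direct children
# Select the title tag directly under head
print(soup.select("head > title"))
# Select the tag with id="link1" directly under p
print(soup.select("p > #link1"))

# Select siblings
# Select all following siblings of id="link1" that have class=sister
print(soup.select("#link1 ~ .sister"))
# Select the sibling with class=sister that immediately follows id="link1"
print(soup.select("#link1 + .sister"))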

# (2) Selecting by CSS class name
print(soup.select(".sister"))
print(soup.select("[class~=sister]"))

# (3) Selecting by tag id
print(soup.select("#link1"))
print(soup.select("a#link1"))

# (4) Selecting by the presence of an attribute
print(soup.select('a[href]'))

# (5) Selecting by attribute value
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('a[href^="http://example.com/"]'))
print(soup.select('a[href$="tillie"]'))
print(soup.select('a[href*=".com/el"]'))
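# To see which elements each selector picks, it helps to print only the ids.
# A minimal sketch: ids is a hypothetical helper, the three links carry plain text instead of comments, and the built-in 'html.parser' is used; all of these are assumptions.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>',
    'html.parser',
)

def ids(selector):
    """Return the id of every element matched by a CSS selector."""
    return [tag['id'] for tag in demo.select(selector)]

print(ids('p > #link1'))             # ['link1']
print(ids('#link1 ~ .sister'))       # ['link2', 'link3']
print(ids('#link1 + .sister'))       # ['link2']
print(ids('a[href$="tillie"]'))      # ['link3']
print(ids('a[href*=".com/el"]'))     # ['link1']
```

# Note that an id selector is written without a space after the #, as in "#link1".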