Web Data Parsing -- BS4

Contents

1. Getting started with bs4

Introduction to bs4

Installing bs4 and its parsers

Parser comparison

Basic usage

2. Finding tag objects and extracting information

Finding tag objects

A tag object's name and attributes

Getting a tag's text content

Demo application

3. Traversing the document tree

Nested selection

Children and descendants

Parents and ancestors

Siblings

4. Searching the document tree

find_all()

The name parameter

Keyword arguments

Filtering by string

The limit parameter

find

Other methods

5. CSS selectors

6. A bs4 parsing demo


1. Getting started with bs4

Introduction to bs4

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages.

Because it works on the tree structure of an HTML document, that is where it is most effective.

The official description:

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need to scrape; because it is so simple, a complete application takes very little code.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or even days of work. If you are looking for the Beautiful Soup 3 documentation: Beautiful Soup 3 is no longer being developed, and the official site recommends Beautiful Soup 4 for current projects.

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Installing bs4 and its parsers

● Install: pip install beautifulsoup4 (the actual PyPI package name; pip install bs4 also works as a thin wrapper around it)

● Import: from bs4 import BeautifulSoup

Beautiful Soup supports the HTML parser from Python's standard library as well as several third-party parsers. If you install nothing extra, Python's built-in parser is used. The lxml parser is more powerful and faster, and is the recommended one:

● Install lxml: pip install lxml

Another option is html5lib, a pure-Python parser that parses documents the same way a browser does:

● Install html5lib: pip install html5lib

Parser comparison

In brief: html.parser ships with Python (no extra install) and offers decent speed with lenient parsing; lxml is very fast but depends on an external C library; html5lib is the slowest, yet parses pages exactly the way a browser does and generates valid HTML5.
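A minimal sketch of choosing a parser explicitly (html.parser ships with Python; the commented-out lines assume lxml and html5lib have been installed separately):

```python
from bs4 import BeautifulSoup

# The built-in parser needs no extra install:
soup = BeautifulSoup("<p>Some <b>bold</b> text</p>", "html.parser")
print(soup.b.text)  # bold

# The same call with third-party parsers (uncomment after
# `pip install lxml` / `pip install html5lib`):
# soup = BeautifulSoup("<p>Some <b>bold</b> text</p>", "lxml")
# soup = BeautifulSoup("<p>Some <b>bold</b> text</p>", "html5lib")
```

On well-formed HTML all three parsers build essentially the same tree; they mainly differ in speed and in how they repair broken markup.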

Basic usage

Everything starts from a soup object. Two ways to create one:

● soup = BeautifulSoup(open("index.html"))     # pass in a file object
● soup = BeautifulSoup("<html>data</html>")    # pass in an HTML string

When constructing the soup object you can also pass a parser argument; if you leave it out, Beautiful Soup parses with the best parser available.

The following HTML snippet will be used as the example many times below. It is a passage from *Alice's Adventures in Wonderland* (referred to from here on as the *Alice* document). Here we use it directly as an HTML string and extract all of the a tags:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# find all a tags:
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.find_all("a"))


# Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As you can see, compared with extracting HTML content via regular expressions, bs4 is much simpler and more convenient.

2. Finding tag objects and extracting information

Beautiful Soup transforms a complex HTML document into a tree of Python objects, one object per node.

All objects fall into 4 types:

● BeautifulSoup

● Tag

● NavigableString

● Comment

A tag object corresponds to a **tag** in the web page.

Finding tag objects

● soup.tagname: returns the first tag with that name

● soup.body: returns the contents of the body tag

● soup.tagname.tagname: chains like a path, descending through the tag's children

Tag lookup follows depth-first order; you can keep descending, and a missing tag returns None.

Continuing with the Alice document above:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
print("body:",soup.body)     # everything inside body
print("b tag:",soup.b)       # the first b tag by default
print("type:",type(soup.b))  # inspect the type
print("b tag:",soup.p.b)     # depth-first lookup: chain tag names like a path


# Output:
body: <body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
b tag: <b>BBB</b>
type: <class 'bs4.element.Tag'>
b tag: <b>The Dormouse's story</b>

A tag object's name and attributes

●  soup.tagname.name: the tag's name

●  soup.tagname[attribute]: the value of one attribute

●  soup.tagname.attrs: a dict of all the tag's attributes

Example:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
# name and attributes of a tag object:
print("tag name:",soup.p.b.name)
print("href attribute of the a tag:",soup.a["href"])
print("all attributes of the a tag:",soup.a.attrs)


# Output:
tag name: b
href attribute of the a tag: http://example.com/elsie
all attributes of the a tag: {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

Getting a tag's text content

There are two ways to get a tag's text content:

● soup.tagname.string: if the tag contains more than one piece of text, this returns None

● soup.tagname.text: concatenates all the text inside the tag; the recommended option

Example:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
# two ways to read a tag object's text:
print("a tag text, way 1:",soup.a.string)  # returns None if the tag holds several pieces of text
print("a tag text, way 2:",soup.a.text)    # concatenates all the text inside the tag
# Output:
a tag text, way 1: Elsie
a tag text, way 2: Elsie

Demo application

From the Alice document, build a dict of {a tag's text: a tag's href}:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

# build {a tag's text: a tag's href}
ret = soup.find_all("a")
d = {}
for tag in ret:
    val = tag.text
    h = tag.attrs.get("href")
    d[val] = h
print(d)

# shorter version: a dict comprehension
print({tag.text:tag["href"] for tag in soup.find_all("a")})


# Output:
{'Elsie': 'http://example.com/elsie', 'Lacie': 'http://example.com/lacie', 'Tillie': 'http://example.com/tillie'}
{'Elsie': 'http://example.com/elsie', 'Lacie': 'http://example.com/lacie', 'Tillie': 'http://example.com/tillie'}

3. Traversing the document tree

Nested selection

Nested selection chains attribute lookups:

● soup.head.title.text: the text of the title tag inside head

● soup.body.a.text: the text of the first a tag inside body
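The two lookups above can be sketched on a trimmed-down Alice document:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there was a sister named
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# nested selection: chain tag names like a path through the tree
print(soup.head.title.text)  # The Dormouse's story
print(soup.body.a.text)      # Elsie
```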

Children and descendants

Methods for children and descendants:

● soup.tagname.contents: a list of the tag's direct children

● soup.tagname.children: an iterator over the tag's direct children

● soup.tagname.descendants: a generator over all descendants, including every tag nested inside

Example:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
# children:
print(soup.p.contents)
print(soup.p.children)       # an iterator
print(list(soup.p.children)) # force it into a list

# descendants:
print("descendants:",list(soup.p.descendants))


# Output:
[<b>The Dormouse's story</b>]
<list_iterator object at 0x000002C6D36A83A0>
[<b>The Dormouse's story</b>]
descendants: [<b>The Dormouse's story</b>, "The Dormouse's story"]

Parents and ancestors

Methods for parents and ancestors:

● soup.tagname.parent: the tag's parent tag

● soup.tagname.parent.text: the text of the tag's parent

● soup.tagname.parents: a generator over all ancestors — the parent, the parent's parent, and so on

Example:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
# parents and ancestors:
print("parent:\n",soup.a.parent)
print("parent's text:\n",soup.b.parent.text)
print("ancestors:",soup.a.parents)

# Output:
parent:
 <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
parent's text:
 
BBB
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

ancestors: <generator object PageElement.parents at 0x00000223116AF100>

Siblings

Sibling methods:

● soup.tagname.next_sibling: the tag's next sibling node

● soup.tagname.next_sibling.next_sibling: the sibling after that

● soup.tagname.previous_sibling.previous_sibling: likewise in the other direction

● soup.tagname.previous_siblings: all preceding siblings -> a generator object

Example:

from bs4 import BeautifulSoup

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')
# siblings:
print("the a tag's next sibling:",soup.a.next_sibling)
print("the next sibling's next sibling:\n",soup.a.next_sibling.next_sibling)
print("the a tag's previous sibling:\n",soup.a.previous_sibling)
print("all of the a tag's previous siblings:\n",soup.a.previous_siblings)

# Output:
the a tag's next sibling: ,

the next sibling's next sibling:
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
the a tag's previous sibling:
 Once upon a time there were three little sisters; and their names were

all of the a tag's previous siblings:
 <generator object PageElement.previous_siblings at 0x0000022F61F4F1C0>

4. Searching the document tree

find_all()

BeautifulSoup defines many search methods; this section focuses on two: find_all() and find().

The other methods take similar parameters and work the same way.

find_all( name , attrs , recursive , string , limit, **kwargs ):

The parameters: (tag name, attribute dict, recursive search — defaults to True, text, result limit, attribute keyword arguments)

name: the tag name

recursive: whether to search recursively below the current position; if False, only the direct children of the current element are searched

string: searches by a tag's text content, and **returns the text itself, not Tag elements**

**kwargs: keyword arguments, accepted as attribute filters

The name parameter

The name parameter searches by tag name:

● soup.find_all(name='tagname'): find all tags with that name

● regular expression soup.find_all(name=re.compile('^tagname')): find all tags whose name matches the pattern

● list soup.find_all(name=['tagname1', 'tagname2']):

        pass a list to get every tag matching any entry in the list

● filter function: if no ready-made filter fits, you can define a function that takes a single element argument; return True if the element matches and should be kept, and False otherwise

from bs4 import BeautifulSoup
import re

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

#1. searching the document with find_all():
# name parameter as a string, i.e. a tag name
ret = soup.find_all(name="b")
print("ret:",ret)
# Output:
ret: [<b>BBB</b>, <b>The Dormouse's story</b>]



#2. regular expression:
ret1 = soup.find_all(name=re.compile('^b'))
print("regex - all tags starting with b:\n",ret1)
# Output:
regex - all tags starting with b:
 [<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, <b>BBB</b>, <b>The Dormouse's story</b>]



#3. list:
ret2 = soup.find_all(name=["a","b"])
print("ret2:",ret2)
# Output:
ret2: [<b>BBB</b>, <b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]



#4. filter function: fine-grained filtering
# keep tags that have both a class and an id attribute
def has_class_and_id(tag):
    return tag.has_attr('class') and tag.has_attr('id')
print("tags with both class and id:\n",
      soup.find_all(name=has_class_and_id))

# Output:
tags with both class and id:
 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Keyword arguments

Keyword arguments filter on attributes.

Two equivalent spellings:

● soup.find_all(attr=value): search by the given attribute and value

● soup.find_all(attrs={attr: value})

Searching on several attributes at once:

● soup.find_all(attr1=value1, attr2=value2): all listed attributes must match

● regex form soup.find_all(attr1=re.compile("^value1"), attr2=re.compile("^value2")):

        matches tags whose attr1 starts with value1 and whose attr2 starts with value2

Searching by class: to avoid a clash with Python's class keyword, write class_

soup.find_all(class_="value")

from bs4 import BeautifulSoup
import re

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

# searching the document with find_all():
# keyword arguments vs attrs: the two calls below are equivalent
ret = soup.find_all(href="http://example.com/lacie")
print(ret)

ret1 = soup.find_all(attrs={"href":"http://example.com/lacie"})
print(ret1)

# Output:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]




# matching two attribute values at once, way 1:
ret2 = soup.find_all(href="http://example.com/lacie",id="link2")
print("two attribute values, way 1:",ret2)

# matching several attribute values with regexes: href starts with http://, id starts with link
ret3 = soup.find_all(href=re.compile("^http://"),id=re.compile("^link"))
print("several attribute values:",ret3)

# Output:
two attribute values, way 1: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
several attribute values: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


# searching by class:
ret4 = soup.find_all(class_="sister")
print("ret4:",ret4)
# Output:
ret4: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Filtering by string

The string filter searches by text content:

● soup.find_all(string="text"): exact text match

● regex soup.find_all(string=re.compile("text")): all text nodes containing the pattern

from bs4 import BeautifulSoup
import re

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

# searching the document with find_all():
# filter by text:

ret = soup.find_all(string="Elsie")
print("ret:",ret)

# regex search: all text containing the pattern
ret1 = soup.find_all(string=re.compile("Dormouse"))
print("ret1:",ret1)


# Output:
ret: ['Elsie']
ret1: ["The Dormouse's story", "The Dormouse's story"]

The limit parameter

The limit parameter caps the number of results:

● soup.find_all("tagname", limit=n): stop after n matches

from bs4 import BeautifulSoup
import re

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

# searching the document with find_all():
# the limit parameter:
ret = soup.find_all(string=re.compile("Dormouse"),limit=1)
print("ret:",ret)

# Output:
ret: ["The Dormouse's story"]

Note: find_all() can also be called on a tag object

from bs4 import BeautifulSoup
import re

# The "Alice" document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>BBB</b>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

# find_all() called on a tag object searches only within that tag
tag = soup.p
ret = tag.find_all("b")
print(ret)

# Output:
[<b>The Dormouse's story</b>]

find

● How find_all() and find() differ:

find_all(): finds every matching tag and returns a list, even when only a single element matches; returns an empty list if nothing matches

find(): finds the first matching tag and returns it directly; returns `None` when nothing matches

● What they share:

find_all() and find() take exactly the same parameters
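A quick sketch of the difference, on a fragment of the Alice document:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were two little sisters named
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find("a"))          # first matching Tag
print(soup.find_all("a"))      # list of every match
print(soup.find("table"))      # no match -> None
print(soup.find_all("table"))  # no match -> []
```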

Other methods

● find_parents(): returns all matching ancestors — the parent, the parent's parent, and so on

● find_parent(): returns only the nearest matching parent

● find_next_siblings(): returns all following siblings that match

● find_next_sibling(): returns only the first following node that matches, text nodes included
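A short sketch of these relative-searching methods on the Alice document:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
first_a = soup.a

# nearest enclosing p tag:
print(first_a.find_parent("p")["class"])                 # ['story']
# matching ancestors, innermost first:
print([t.name for t in first_a.find_parents(["p", "body"])])  # ['p', 'body']
# all following sibling a tags:
print([t.text for t in first_a.find_next_siblings("a")])      # ['Lacie', 'Tillie']
```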

5. CSS selectors

The CSS selector method is select(css_selector).

For CSS selector syntax, see: https://www.w3school.com.cn/cssref/css_selectors.ASP

Methods and examples:

● soup.select("title"): all title tags

● soup.select("body tagname"): all tags with that name anywhere inside body

● soup.select("html head title"): title inside head inside html

● soup.select("head > title"): title tags that are direct children of head

● soup.select("p > a"): a tags that are direct children of a p tag

● soup.select("body > a"): a tags that are direct children of body

● soup.select("p > #link1"): elements with id link1 that are direct children of a p tag

● soup.select("#link1 ~ .sister"): all sibling elements with class sister that follow the element with id link1

● soup.select("#link1 + .sister"): the sibling element with class sister immediately after the element with id link1

● soup.select(".sister"): all elements with class sister

● soup.select("[class~=sister]"): all elements whose class attribute contains the word sister

● soup.select("#link1"): the element with id link1

● soup.select("a#link2"): a tags with id link2

● soup.select("#link1,#link2"): elements with id link1 or id link2

● soup.select('a[href]'): all a tags that have an href attribute

● soup.select('a[href^="http://example.com/"]'): a tags whose href starts with http://example.com/

● soup.select('a[href$="tillie"]'): a tags whose href ends with tillie

● soup.select('a[href*=".com/el"]'): a tags whose href contains .com/el

● select_one(): returns only the first matching element
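A few of the selectors above, sketched on the Alice document:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# child combinator:
print(soup.select("head > title")[0].text)           # The Dormouse's story
# tag plus class:
print([a.text for a in soup.select("p.story > a")])  # ['Elsie', 'Lacie', 'Tillie']
# id selector with select_one (first match only):
print(soup.select_one("#link2").text)                # Lacie
# attribute suffix match:
print([a["href"] for a in soup.select('a[href$="tillie"]')])  # ['http://example.com/tillie']
```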

6. A bs4 parsing demo

Using Douban Movie Top 250 as an example, extract the movie information with bs4.

URL: Douban Movie Top 250 (including delisted titles)

Inspecting the page shows that each movie title is nested inside several div tags.

A simple, beginner-level approach:

1. Paste the returned HTML content into a local html file (we will fetch pages directly and optimize the whole flow once web scraping itself is covered)

2. Create a new py file, load the html file, and extract the movie titles

from bs4 import BeautifulSoup

# way 1 to read the Douban movie html file
# with open("douban250.html",encoding="utf-8") as f:
#     data = f.read()
# print(data)
# BeautifulSoup(data,"html.parser")

# way 2:
soup = BeautifulSoup(open("douban250.html",encoding="utf-8"),"html.parser")
title_divs = soup.find_all('div', class_='title')
for title_div in title_divs:
    movie_name = title_div.a.text.strip()
    print(movie_name)


# Output:
肖申克的救赎 The Shawshank Redemption
霸王别姬
阿甘正传 Forrest Gump
泰坦尼克号 Titanic
这个杀手不太冷 Léon
美丽人生 La vita è bella
千与千寻 千と千尋の神隠し
辛德勒的名单 Schindler's List
盗梦空间 Inception
忠犬八公的故事 Hachi: A Dog's Tale
星际穿越 Interstellar
楚门的世界 The Truman Show
海上钢琴师 La leggenda del pianista sull'oceano
三傻大闹宝莱坞 3 Idiots
机器人总动员 WALL·E
放牛班的春天 Les choristes
无间道 無間道
疯狂动物城 Zootopia
大话西游之大圣娶亲 西遊記大結局之仙履奇緣
熔炉 도가니
教父 The Godfather
控方证人 Witness for the Prosecution
当幸福来敲门 The Pursuit of Happyness
怦然心动 Flipped
触不可及 Intouchables

That wraps up the basics of bs4.

As you can see, compared with regular expressions, extracting information from HTML this way is much simpler overall.

More scraping topics to come in later posts.
