Web Scraping (8): bs4, Part 1

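This chapter covers the use of BeautifulSoup4 in detail, including installing bs4, basic operations, object types, traversing the document tree, and the key topic of find and find_all. Worked examples show how to locate and extract tags, attributes, and text from HTML, and highlight how convenient find_all is for extracting multiple matches.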

Chapter 8: bs4, Part 1

1. Introduction to bs4

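Beautiful Soup is a library for extracting data from HTML or XML files.
It must be installed first. It is best to run pip install lxml before pip install bs4; otherwise code that asks for the lxml parser may fail.
bs4 does not require you to memorize a special syntax. You simply call its methods, which makes it more convenient than regular expressions or XPath.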
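Since a missing lxml install is the usual source of the error mentioned above, here is a minimal sketch of one way to guard against it. The helper name make_soup is hypothetical and not part of bs4; the sketch assumes that falling back to Python's built-in html.parser is acceptable for your documents.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else use html.parser.

    make_soup is an illustrative helper, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

The two parsers can build slightly different trees for malformed or partial HTML, so it is safer to pick one parser and use it consistently throughout a project.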

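2. Getting started with bs4

We will use a snippet of an HTML document to demonstrate how bs4 is used.

from bs4 import BeautifulSoup   # Import the BeautifulSoup class first; it is the most commonly used class in bs4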
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To extract content from the document above with Beautiful Soup, we first need to parse it into a bs4 object.

from bs4 import BeautifulSoup   # Import the BeautifulSoup class first; it is the most commonly used class in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, features='lxml')  # Pass two arguments: the document, and features='lxml', which selects the parser
print(soup)  # Print the resulting BeautifulSoup object

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>


For output with a clearer structure, print it like this:

print(soup.prettify())

This produces a clearer tree structure that makes the relationships between tags easy to see.

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

To print the title tag element, do this:

print(soup.title)

This gives:

<title>The Dormouse's story</title>

To get the tag's name and the string inside the tag, do this:

print(soup.title.name)
print(soup.title.string)

Output:

title
The Dormouse's story
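One detail worth knowing about .string: it only returns a value when the tag contains a single string (directly or through a single nested child). When a tag has several children it returns None, and get_text() is the usual alternative. The sketch below uses a made-up HTML fragment and the built-in html.parser (an assumption made here so it runs without lxml).

```python
from bs4 import BeautifulSoup

# Illustrative fragment, not taken from the article's document
html = "<p>Hello, <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

single = soup.b.string     # <b> has exactly one child, a string
mixed = soup.p.string      # <p> has three children, so this is None
text = soup.p.get_text()   # joins every string found under <p>

print(single)
print(mixed)
print(text)
```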

To get the p tag:

print(soup.p)

This returns only the first of the three p tags in the document:

<p class="title"><b>The Dormouse's story</b></p>

To find all of them, use the find_all method:

res = soup.find_all('p')
print(res,len(res))

The result is a list whose elements are the matching p tags. It contains 3 elements, so every p tag was found.

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>] 3
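find_all also accepts filters, and find is its single-result counterpart. The following is a short sketch of the most common forms, using a trimmed copy of the sample links and the built-in html.parser (chosen here as an assumption, so the snippet does not depend on lxml).

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")                       # first match only, like soup.a
sisters = soup.find_all("a", class_="sister")  # filter by CSS class
lacie = soup.find("a", id="link2")             # filter by attribute value
first_two = soup.find_all("a", limit=2)        # stop after two matches
missing = soup.find("table")                   # no match: find returns None

print(first_a == soup.a)
print([a.get_text() for a in sisters])
print(lacie.get_text(), len(first_two), missing)
```

Note the trailing underscore in class_: class is a reserved word in Python, so bs4 uses class_ as the keyword argument name. When nothing matches, find returns None while find_all returns an empty list.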

Each a tag has an href attribute that holds a URL. How do we get it? Like this:

links = soup.find_all('a')
for link in links:
    print(link.get('href'))

This gives us:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
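Attributes can also be read with dictionary-style subscripts. The difference matters when an attribute is missing: get returns None, while a subscript raises KeyError. Here is a small sketch with an invented fragment in which the second link has no href (html.parser is used as an assumption so that lxml is not required).

```python
from bs4 import BeautifulSoup

# Illustrative fragment: the second <a> deliberately lacks an href
html = '<a href="http://example.com/elsie" id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# .get() is safe for attributes that may be absent
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)

# Subscript access raises KeyError when the attribute does not exist
try:
    soup.find("a", id="link2")["href"]
    raised = False
except KeyError:
    raised = True
print(raised)
```

When scraping real pages, where not every link carries every attribute, get is usually the safer choice.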

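These are some introductory bs4 operations. As you can see, they are convenient and concise.

3. Types of bs4 objects

Tag: a tag
NavigableString: a navigable string
BeautifulSoup: the soup object
Comment: a comment

Let's get to know them by working through some code.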

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))

The output shows that all three are Tag objects.

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>

A Tag object supports the following operations:

print(soup.p.name)
print(soup.p.attrs)
print(soup.p.string)

Output:

p
{'class': ['title']}
The Dormouse's story

Of these, checking the type of the string:

print(type(soup.p.string))

gives:

<class 'bs4.element.NavigableString'>

This is the NavigableString type. It behaves like an ordinary string and supports the same operations, such as concatenation.
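The claim that a NavigableString works like a normal string can be checked directly, because NavigableString is a subclass of str. This is a minimal sketch using html.parser (an assumption so that lxml is not required).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>The Dormouse's story</b>", "html.parser")
s = soup.b.string

print(isinstance(s, str))   # NavigableString subclasses str

shout = s.upper() + "!"     # ordinary string methods and concatenation work
plain = str(s)              # convert to an ordinary Python string

print(shout)
print(plain)
```

Converting with str() is handy when the text will outlive the soup, since a NavigableString keeps references into the parse tree.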

print(type(soup))

We can see that this is a soup object:

<class 'bs4.BeautifulSoup'>

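Next, let's look at the comment type, which is not used often. We write an arbitrary comment:

html = '<a><!--Happy New Year!!--></a>'
soup = BeautifulSoup(html,'lxml')
print(soup.a.string)

First we print it to see what we get:

Happy New Year!!

The comment text is printed. Now let's look at its type.

html = '<a><!--Happy New Year!!--></a>'
soup = BeautifulSoup(html,'lxml')
print(type(soup.a.string))

We can see that it is a Comment object: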

<class 'bs4.element.Comment'>
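Because a printed comment looks the same as ordinary text, it can slip into scraped data unnoticed. One way to tell the two apart is an isinstance check, sketched below on an invented fragment with html.parser (an assumption so that lxml is not required). Comment is a subclass of NavigableString, so the Comment check has to come first.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Illustrative fragment: a comment followed by real text
html = "<p><!--a comment-->visible text</p>"
soup = BeautifulSoup(html, "html.parser")

comments = []
texts = []
for node in soup.p.children:
    if isinstance(node, Comment):            # test Comment before NavigableString
        comments.append(str(node))
    elif isinstance(node, NavigableString):  # ordinary text node
        texts.append(str(node))

print(comments)
print(texts)
```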

Through the operations above, we have now met all four object types.

4. Traversing the document tree

Let's first look at the commonly used parsers:

Parser  Usage  Advantages  Disadvantages