Python Web Scraping: Data Parsing (BeautifulSoup)

X-Hoshino

Article link: https://blog.csdn.net/qq_53221728/article/details/122942867

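BeautifulSoup is another data-parsing approach commonly used in Python web scraping. Using it comes down to two steps.

1. Instantiate a BeautifulSoup object and load the page source into it.

2. Call the object's attributes or methods to locate tags and extract data.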
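
The two steps can be sketched on a small inline document. The HTML string below is made up for illustration and is not the Douban page used in the rest of this article.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <a class="nav-login" href="/login">Sign in</a>
  </body>
</html>
"""

# Step 1: instantiate a BeautifulSoup object and load the page source into it.
soup = BeautifulSoup(html, "html.parser")

# Step 2: use the object's attributes and methods to locate tags and extract data.
page_title = soup.title.string   # text of the first <title> tag
login_href = soup.a["href"]      # value of the href attribute on the first <a> tag

print(page_title)
print(login_href)
```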

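How do you instantiate a BeautifulSoup object?

First install the bs4 library (pip install beautifulsoup4) and import BeautifulSoup. Then load either the source of a local HTML document or the source of a live web page into the BeautifulSoup object.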

from bs4 import BeautifulSoup
# Load the data from a local HTML document into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")

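html.parser is the HTML parser from Python's standard library. BeautifulSoup can parse XML as well as HTML, and different document types call for different parsers: for example "lxml" or "html5lib" for HTML and "xml" for XML (these need the lxml or html5lib packages installed). BeautifulSoup does not parse JSON; use Python's json module for that.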
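
When a page embeds JSON inside a tag, one common pattern is to let BeautifulSoup locate the tag and hand its text to the standard-library json module. The snippet below is a minimal sketch; the HTML, the id "movie-data", and the field names are invented for illustration.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page that embeds JSON inside a <script> tag.
html = """
<html><body>
<script type="application/json" id="movie-data">{"title": "Demo", "rank": 1}</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup locates the tag; it does not interpret the JSON inside it.
script_tag = soup.find("script", id="movie-data")

# The json module turns the raw text into Python objects.
movie = json.loads(script_tag.string)

print(movie["title"], movie["rank"])
```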

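import requests
from bs4 import BeautifulSoup
# Load the source of a live web page into the object
url = "https://movie.douban.com/top250"  # example URL (assumed); replace with your target page
headers = {"User-Agent": "Mozilla/5.0"}  # example header (assumed); some sites reject the default one
html = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")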

With the object instantiated, the next step is to call BeautifulSoup's attributes or methods.

tagName

tagName is the name of an HTML tag, such as title, div, a, or p. soup.tagName returns the first tag with that name in the document.

from bs4 import BeautifulSoup
# Load the data from a local HTML document into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.title)

>>  <title>
    豆瓣电影 Top 250
    </title>


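# To get only the content inside the tag, append .string, .text, or .get_text()
# .string returns only the text directly inside the tag (None if the tag has more than one child)
# .text / .get_text() return all of the text inside the tag, including text in nested tags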
print(soup.title.string)

>>  豆瓣电影 Top 250


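# To get the tag's attributes, append .attrs
# (reuse the soup object from above; the file has already been read, so parsing fp again would give an empty document)
print(soup.a.attrs)

# The attributes come back as a dict, so individual values are easy to read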
>>  {'href': 'https://accounts.douban.com/passport/login?source=movie', 'class': ['nav-login'], 'rel': ['nofollow']}
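
The difference between .string and .get_text(), and the ways of reading attributes, are easier to see on a tag that contains a nested tag. This is a sketch on a made-up snippet, not on the Douban page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the <div> mixes its own text with a nested <span>.
html = '<div id="box" class="item hot">Rank <span>1</span></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.div

# .string is None here because <div> has more than one child node.
direct = div.string

# .get_text() (same as .text) joins every string nested inside the tag.
all_text = div.get_text()

# .string works on <span>, which has exactly one text child.
span_text = div.span.string

# .attrs is a dict; tag["name"] and tag.get("name") read a single value.
div_id = div["id"]           # raises KeyError if the attribute is missing
div_classes = div["class"]   # class is multi-valued, so this is a list
missing = div.get("href")    # .get() returns None instead of raising

print(direct, repr(all_text), span_text, div_id, div_classes, missing)
```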

find()

Returns the first element in the document that matches the given tag name or attributes.

from bs4 import BeautifulSoup
# Load the data from a local HTML document into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.find('a'))   # passing a tag name to find() is equivalent to soup.tagName

>>  <a class="nav-login" href="https://accounts.douban.com/passport/login?source=movie" rel="nofollow">登录/注册</a>

# You can also filter by the tag's attributes, such as class_ or id
print(soup.find('a',class_="nav-login"))

>>  <a class="nav-login" href="https://accounts.douban.com/passport/login?source=movie" rel="nofollow">登录/注册</a>

# You can also pass only the attribute and omit the tag name
print(soup.find(class_="title"))

>>  <span class="title">肖申克的救赎</span>
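
find() returns None when nothing matches, so calling .get_text() or indexing on the result without a check raises an error on pages that lack the tag. A minimal sketch of a guard follows; text_of is a hypothetical helper and the HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = '<div class="info"><span class="title">Demo Movie</span></div>'
soup = BeautifulSoup(html, "html.parser")


def text_of(soup, name, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, class_=css_class)
    if tag is None:  # find() returns None when nothing matches
        return None
    return tag.get_text(strip=True)


found = text_of(soup, "span", "title")
not_found = text_of(soup, "span", "rating")

print(found)
print(not_found)
```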

find_all()

Used the same way as find(), but find_all() returns every matching tag as a list.

from bs4 import BeautifulSoup
# Load the data from a local HTML document into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.find_all('span', class_="title"))

>>  [<span class="title">肖申克的救赎</span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title">霸王别姬</span>, <span class="title">阿甘正传</span>......]
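
The list returned by find_all() holds tag objects, so in practice the text is usually pulled out in a second step. The sketch below assumes a made-up list of titles rather than the real Douban markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = """
<ol>
  <li><span class="title">Movie A</span><span class="title"> / Alt A</span></li>
  <li><span class="title">Movie B</span></li>
  <li><span class="title">Movie C</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet of Tag objects, not strings.
spans = soup.find_all("span", class_="title")

# Pull the text out of each tag; strip=True trims surrounding whitespace.
titles = [span.get_text(strip=True) for span in spans]

# limit caps the number of results returned.
first_two = soup.find_all("span", class_="title", limit=2)

# No match gives an empty result (unlike find(), which returns None).
none_found = soup.find_all("span", class_="rating")

print(titles)
print(len(first_two), len(none_found))
```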

select()

Looks up elements with CSS selectors: by tag, class name, attribute, id, or child and descendant relationships. It returns a list.

from bs4 import BeautifulSoup
# Load the data from a local HTML document into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.select('title'))     # tag

>>  [<title>
    豆瓣电影 Top 250
    </title>]


print(soup.select('.title'))    # class name, prefixed with .

>>  [<span class="title">肖申克的救赎</span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title">霸王别姬</span>, <span class="title">阿甘正传</span>......]


print(soup.select('#inp-query')) # id, prefixed with #

>>  [<input id="inp-query" maxlength="60" name="search_text" placeholder="搜索电影、电视剧、综艺、影人" size="22" value=""/>]


print(soup.select('span[class="title"]'))   # attribute selector; no trailing underscore after class

>>  [<span class="title">肖申克的救赎</span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title">霸王别姬</span>, <span class="title">阿甘正传</span>......]


print(soup.select('.info > div > a > span'))  # child selector: each '>' descends exactly one level

>>  [<span class="title">肖申克的救赎</span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title">霸王别姬</span>, <span class="title">阿甘正传</span>......]


print(soup.select('.info span'))  # descendant selector: a space matches any number of levels

>>  [<span class="title">肖申克的救赎</span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title">霸王别姬</span>, <span class="title">阿甘正传</span>......]
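
The difference between '>' and a space shows up once the markup has spans at different depths. The sketch below uses an invented structure loosely modelled on the selectors above; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = """
<div class="info">
  <div class="hd">
    <a href="/subject/1/"><span class="title">Movie A</span></a>
  </div>
  <p><span class="quote">A quote</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '>' matches direct children only, one level per '>'.
child_spans = soup.select(".info > div > a > span")

# A space matches descendants at any depth, so both spans are found.
all_spans = soup.select(".info span")

# select_one() returns the first match, or None when nothing matches.
first_link = soup.select_one(".info a")
href = first_link["href"] if first_link is not None else None
no_match = soup.select_one(".missing")

titles = [span.get_text(strip=True) for span in child_spans]

print(titles, len(all_spans), href, no_match)
```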

These are the most commonly used BeautifulSoup features. There are many more; if you are interested, see the official documentation.
