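Python Web Scraping, Data Parsing: the BeautifulSoup Module (Part 1): Basics and Traversal (continued from the previous post: parsing sailboat data from sailboatdata.com for the 2023 MCM spring contest)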

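I. Preparing the html file

First, we need to be clear about which data we need and find where it sits in the html.

1. Sailboat name: 11 METER

2. Sailboat Specifications

More sailboat data is available on the page, but extracting it works much the same way as for Sailboat Specifications, so it is omitted here.

For ease of demonstration, I have excerpted the relevant parts: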

<!doctype html>
<html lang="en-US">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="apple-itunes-app" content="app-id=1545197438">
    <link rel="profile" href="https://gmpg.org/xfn/11">

        
    <meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1'/>

    <!-- This site is optimized with the Yoast SEO plugin v20.6 - https://yoast.com/wordpress/plugins/seo/ -->
    <title>11 METER - sailboatdata</title>
    <link rel="canonical" href="https://sailboatdata.com/sailboat/11-meter/"/>
    <meta property="og:locale" content="en_US"/>
    <meta property="og:type" content="article"/>
    <meta property="og:title" content="11 METER - sailboatdata"/>
    <meta property="og:url" content="https://sailboatdata.com/sailboat/11-meter/"/>
    <meta property="og:site_name" content="sailboatdata"/>
    <meta property="article:publisher" content="https://www.facebook.com/sailboatdata"/>
    <meta property="article:modified_time" content="2023-05-12T18:11:13+00:00"/>
    <meta property="og:image" content="https://sailboatdata.com/wp-content/uploads/2023/03/11_meter_photo.jpg"/>
    <meta property="og:image:width" content="483"/>
    <meta property="og:image:height" content="640"/>
    <meta property="og:image:type" content="image/jpeg"/>
</head>

<body class="sailboat-template-default single single-sailboat postid-64231 wp-custom-logo aa-prefix-sailb-">
<div id="page" class="site">
    <a class="skip-link screen-reader-text" href="#primary">Skip to content</a>

    <main id="primary" class="site-main">
        <div class="container">
            <div class="row">
                <div class="col-lg-8 sailboat-content-wrapper">
                    <div class="boats-dimension">
                        <h1>11 METER</h1>

                    </div>
                    <div class="spec">
                        <div class="spec-bord">
                            <table class="table">
                                <tbody class="table-light">
                                <!-- start hull section -->
                                <tr>
                                    <td>Hull Type:</td>
                                    <td>Fin w/bulb &amp; spade rudder</td>
                                </tr>
                                <!-- end hull section -->
                                <!-- start rig section -->
                                <tr>
                                    <td>Rigging Type:</td>
                                    <td>Fractional Sloop</td>
                                </tr>
                                <!-- end rig section -->
                                <!-- start loa section -->
                                <tr>
                                    <td>LOA:</td>
                                    <td>33.80 ft / 10.30 m</td>
                                </tr>
                                <!-- end loa section -->
                                <!-- start lod section -->
                                <!-- end lod section -->
                                <!-- start lwl section -->
                                <tr>
                                    <td>LWL:</td>
                                    <td>26.90 ft / 8.20 m</td>
                                </tr>
                                <!-- end lwl section -->
                                <!-- start sa section -->
                                <tr>
                                    <td>S.A. (reported):</td>
                                    <td>450.00 ft² / 41.81 m²</td>
                                </tr>
                                <!-- end sa section -->
                                <!-- start beam section -->
                                <tr>
                                    <td>Beam:</td>
                                    <td>8.20 ft / 2.50 m</td>
                                </tr>
                                <!-- end beam section -->
                                <!-- start beam wl section -->
                                <!-- end beam wl section -->
                                <!-- start disp section -->
                                <tr>
                                    <td>Displacement:</td>
                                    <td>3,527.00 lb / 1,600 kg</td>
                                </tr>
                                <!-- end disp section -->
                                <!-- start ballast section -->
                                <tr>
                                    <td>Ballast:</td>
                                    <td>1,598.00 lb / 725 kg</td>
                                </tr>
                                <!-- end ballast section -->
                                <!-- start max draft section -->
                                <tr>
                                    <td>Max Draft:</td>
                                    <td>5.90 ft / 1.80 m</td>
                                </tr>
                                <!-- end max draft section -->
                                <!-- start min draft section -->
                                <!-- end min draft section -->

                                <!-- start hull construction section -->
                                <tr>
                                    <td>Construction:</td>
                                    <td>FG w/core,Composite</td>
                                </tr>

                                <!-- end hull construction section  -->
                                <!-- start ballast type section -->
                                <tr>
                                    <td>Ballast Type:</td>
                                    <td>Lead</td>
                                </tr>
                                <!-- end ballast type section  -->

                                <!-- start Bridgedeck Clearance section -->
                                <!-- end Bridgedeck Clearance section -->
                                <!-- start first built section -->
                                <tr>
                                    <td>First Built:</td>
                                    <td>1990</td>
                                </tr>
                                <!-- end first built section -->
                                <!-- start last built section -->
                                <!-- end last built section -->

                                <!-- start number built section -->
                                <tr>
                                    <td># Built:</td>
                                    <td>350</td>
                                </tr>
                                <!-- end number built section -->
                                <!-- start builder name section -->
                                <!-- end builder name section -->
                                <!-- start designer name section -->
                                <tr>
                                    <td>Designer:</td>
                                    <td>Ron Holland & Rolf Gyhlenius</td>
                                </tr>
                                <!-- end designer name section -->
                                </tbody>
                            </table>
                        </div>
                    </div>
                    <!-- Auxiliary Power/Tanks -->


                </div>

            </div>
        </div>
    </main><!-- #main -->
</div><!-- #page -->
</body>

</html>

 

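Looking at the structure, the document consists of a head and a body. The head contains several meta and link tags plus the title; the body contains an h1 with the sailboat name and a table whose rows hold the specifications.

II. BeautifulSoup basic types and usage

BeautifulSoup parses an HTML document and turns it into a tree in which every node is a Python object. There are four kinds of object: Tag, NavigableString, BeautifulSoup and Comment.

First we create a new file, testBs4.py.

0. Import the package, read the file, create the bs object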

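from bs4 import BeautifulSoup

# Read the file as bytes to mimic the response body we get when crawling a page
with open(r"./11meters.html", "rb") as f:
    html = f.read().decode("utf-8")  # decode the bytes to str

# Create the BeautifulSoup object: (document to parse, parser name)
bs = BeautifulSoup(html, "html.parser")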
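
As a side note, BeautifulSoup also accepts bytes directly and works out the encoding itself, so the explicit decode step is optional. A minimal sketch, where the byte string `raw` is a made-up stand-in for a response body and not part of the original script:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw bytes of a crawled response body
raw = "<html><head><title>11 METER - sailboatdata</title></head></html>".encode("utf-8")

# Passing bytes is allowed; BeautifulSoup decodes them before parsing
soup = BeautifulSoup(raw, "html.parser")
title_text = soup.title.string
print(title_text)
```

Decoding by hand, as done above, is still a reasonable choice when the encoding is already known.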

1. bs4.element.Tag (the first occurrence of a tag, together with everything nested under it)

print(type(bs.title),bs.title)
print("-"*50)
print(type(bs.meta),bs.meta)
print("-"*50)
print(type(bs.link),bs.link)
print("-"*50)
print(type(bs.head),bs.head)
print("-"*50)
print(type(bs.a),bs.a)
print("-"*50)
# print(type(bs.div),bs.div)
# print("-"*50)
D:\Anaconda3\python.exe C:\Users\和谐号\PycharmProjects\pythonProject\2023-08-22-sailboatData\testBs4.py 
<class 'bs4.element.Tag'> <title>11 METER - sailboatdata</title>
--------------------------------------------------
<class 'bs4.element.Tag'> <meta charset="utf-8"/>
--------------------------------------------------
<class 'bs4.element.Tag'> <link href="https://gmpg.org/xfn/11" rel="profile"/>
--------------------------------------------------
<class 'bs4.element.Tag'> <head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="app-id=1545197438" name="apple-itunes-app"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<!-- This site is optimized with the Yoast SEO plugin v20.6 - https://yoast.com/wordpress/plugins/seo/ -->
<title>11 METER - sailboatdata</title>
<link href="https://sailboatdata.com/sailboat/11-meter/" rel="canonical">
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type">
<meta content="11 METER - sailboatdata" property="og:title"/>
<meta content="https://sailboatdata.com/sailboat/11-meter/" property="og:url"/>
<meta content="sailboatdata" property="og:site_name"/>
<meta content="https://www.facebook.com/sailboatdata" property="article:publisher"/>
<meta content="2023-05-12T18:11:13+00:00" property="article:modified_time"/>
<meta content="https://sailboatdata.com/wp-content/uploads/2023/03/11_meter_photo.jpg" property="og:image"/>
<meta content="483" property="og:image:width"/>
<meta content="640" property="og:image:height"/>
<meta content="image/jpeg" property="og:image:type"/>
</meta></meta></link></meta></head>
--------------------------------------------------
<class 'bs4.element.Tag'> <a class="skip-link screen-reader-text" href="#primary">Skip to content</a>
--------------------------------------------------

Process finished with exit code 0

A Tag can be converted to a string with str().
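
The behaviour of a Tag can be sketched on a small fragment. The HTML string below is a shortened copy of the page above, used here only for illustration:

```python
from bs4 import BeautifulSoup

# A shortened fragment modelled on the page above (illustration only)
html = (
    "<html><head><title>11 METER - sailboatdata</title></head>"
    '<body><a class="skip-link" href="#primary">Skip to content</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a             # the first <a> tag in the document
tag_name = tag.name      # the tag's name as a plain str
tag_text = str(tag)      # the Tag serialized back to HTML text
missing = soup.table     # a tag that does not occur gives None

print(tag_name)
print(tag_text)
print(missing)
```

Because a missing tag gives None instead of raising an error, it is worth checking for None before chaining further attribute access.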

2. bs4.element.NavigableString (the text inside the first occurrence of a tag)

print(type(bs.title.string),bs.title.string)
print("-"*50)
print(type(bs.meta.string),bs.meta.string)
print("-"*50)
print(type(bs.link.string),bs.link.string)
print("-"*50)
print(type(bs.head.string),bs.head.string)
print("-"*50)
print(type(bs.a.string),bs.a.string)
print("-"*50)
# print(type(bs.div),bs.div)
# print("-"*50)

<class 'bs4.element.NavigableString'> 11 METER - sailboatdata
--------------------------------------------------
<class 'NoneType'> None
--------------------------------------------------
<class 'NoneType'> None
--------------------------------------------------
<class 'NoneType'> None
--------------------------------------------------
<class 'bs4.element.NavigableString'> Skip to content
--------------------------------------------------

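When a tag has no text (meta, link) or has more than one child (head), .string is None, and its type is NoneType rather than bs4.element.NavigableString.

A NavigableString can be converted to a string with str().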
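
A minimal sketch of when .string gives a value and when it gives None, using one table row copied from the page above. The use of get_text() as a fallback is my suggestion, not something the original script does:

```python
from bs4 import BeautifulSoup

# One row copied from the specifications table (illustration only)
html = "<table><tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

cell_string = soup.td.string   # <td> has exactly one string child
row_string = soup.tr.string    # <tr> has two children, so .string is None
row_text = soup.tr.get_text(" ", strip=True)  # join all nested strings instead
plain = str(cell_string)       # NavigableString converted to a plain str

print(type(cell_string).__name__, cell_string)
print(row_string)
print(row_text)
```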

3. bs4.BeautifulSoup (the whole document object)

print(type(bs.name),bs.name)
print("-"*50)
print(type(bs.attrs),bs.attrs)
print("-"*50)
print(type(bs),bs)

<class 'str'> [document]
--------------------------------------------------
<class 'dict'> {}
--------------------------------------------------
<class 'bs4.BeautifulSoup'> <!DOCTYPE doctype html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="app-id=1545197438" name="apple-itunes-app"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<!-- This site is optimized with the Yoast SEO plugin v20.6 - https://yoast.com/wordpress/plugins/seo/ -->
<title>11 METER - sailboatdata</title>
<link href="https://sailboatdata.com/sailboat/11-meter/" rel="canonical">

......

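4. bs4.element.Comment (a special kind of NavigableString)

When the content of a tag is a comment, .string no longer returns a NavigableString but a Comment. The comment is also not printed verbatim: the output shows the text with the comment markers removed.

(1) First, with the a tag containing plain text, as in the html above: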

print(type(bs.a),bs.a)
print(type(bs.a.string),bs.a.string)
print(type(bs.a.attrs),bs.a.attrs)

 <class 'bs4.element.Tag'> <a class="skip-link screen-reader-text" href="#primary">Skip to content</a>
<class 'bs4.element.NavigableString'> Skip to content
<class 'dict'> {'class': ['skip-link', 'screen-reader-text'], 'href': '#primary'}

(2) Then change the content of the a tag into a comment, <!--Skip to content-->:

print(type(bs.a),bs.a)
print(type(bs.a.string),bs.a.string)
print(type(bs.a.attrs),bs.a.attrs)

 <class 'bs4.element.Tag'> <a class="skip-link screen-reader-text" href="#primary"><!--Skip to content--></a>
<class 'bs4.element.Comment'> Skip to content
<class 'dict'> {'class': ['skip-link', 'screen-reader-text'], 'href': '#primary'}
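
Since a Comment prints just like ordinary text, a type check is the safe way to tell the two apart. A small sketch, assuming a made-up fragment with one plain link and one commented-out link:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment: the second link's content is a comment
html = (
    '<p><a href="#a">Skip to content</a>'
    '<a href="#b"><!--Skip to content--></a></p>'
)
soup = BeautifulSoup(html, "html.parser")

text_link, comment_link = soup.p.contents

# Both .string values print the same text, so check the type explicitly
is_comment = [isinstance(a.string, Comment) for a in (text_link, comment_link)]
print(is_comment)

# Comment is a subclass of NavigableString
print(isinstance(comment_link.string, NavigableString))
```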

5. .attrs: the attributes of a tag (a dict of key-value pairs)

print(type(str(bs.title.attrs)),str(bs.title.attrs))
print("-"*50)
print(type(bs.meta.attrs),bs.meta.attrs)
print("-"*50)
print(type(bs.link.attrs),bs.link.attrs)
print("-"*50)
print(type(bs.head.attrs),bs.head.attrs)
print("-"*50)
print(type(bs.a.attrs),bs.a.attrs)
print("-"*50)

<class 'str'> {}
--------------------------------------------------
<class 'dict'> {'charset': 'UTF-8'}
--------------------------------------------------
<class 'dict'> {'rel': ['profile'], 'href': 'https://gmpg.org/xfn/11'}
--------------------------------------------------
<class 'dict'> {}
--------------------------------------------------
<class 'dict'> {'class': ['skip-link', 'screen-reader-text'], 'href': '#primary'}
--------------------------------------------------

When a tag has no attributes, .attrs is an empty dict.
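
Besides reading the whole .attrs dict, a single attribute can be read from the Tag itself. A minimal sketch on the a tag from the page above:

```python
from bs4 import BeautifulSoup

html = '<a class="skip-link screen-reader-text" href="#primary">Skip to content</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]                # dict-style access; raises KeyError if absent
classes = link["class"]            # class is multi-valued, so this is a list
missing = link.get("id")           # .get() gives None when the attribute is absent
has_href = link.has_attr("href")   # explicit membership test

print(href, classes, missing, has_href)
```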


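The above only introduces the object types in BeautifulSoup and their characteristics. It is of little use for actually extracting data, because access by tag name (for example bs.td) only returns the first match.

III. Traversal in BeautifulSoup (operations on Tag)

1. .contents attribute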

print(bs.body)
print("-"*500)
print(type(bs.body.contents))
for item in bs.body.contents:
    print(item)

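bs.body is a Tag object. Its .contents attribute is a list of its direct child nodes (tags, strings and comments), so looping over it is a little like looping over the result of readlines().

The two outputs of the code above have exactly the same main content. The difference is that .contents holds only what is inside body, so the body start and end tags themselves do not appear:

.body: the output begins with the <body class="..."> start tag and ends with </body>.

.body.contents: the output begins directly with the first child of body.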
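
One detail worth knowing is that .contents also includes the whitespace between tags as string nodes. A minimal sketch on a made-up three-line body:

```python
from bs4 import BeautifulSoup

# Hypothetical body with a newline between each pair of tags
html = "<body>\n<h1>11 METER</h1>\n<p>spec</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

children = soup.body.contents                 # a list of direct children
kinds = [type(c).__name__ for c in children]  # class name of each child

print(type(children).__name__, len(children))
print(kinds)
```

The newline strings are real nodes in the tree, which is why loops over .contents usually need to filter by type.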

 

2. .children attribute

print(bs.body)
print("-"*500)
print(type(bs.body.children))
for item in bs.body.children:
    print(item)

 

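Compared with .contents, the only difference seems to be the data type: .contents is a list, while .children is an iterator (a list_iterator in the version used here).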
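
Traversal with .children is enough to pull the specifications out of the table. The following is a sketch under stated assumptions, not the method used later in this series: the fragment is a two-row excerpt of the table above, and the dict name `specs` is my own.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Two rows excerpted from the specifications table (illustration only)
html = """
<table class="table">
  <tbody class="table-light">
    <!-- start loa section -->
    <tr>
      <td>LOA:</td>
      <td>33.80 ft / 10.30 m</td>
    </tr>
    <!-- end loa section -->
    <tr>
      <td>Beam:</td>
      <td>8.20 ft / 2.50 m</td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

specs = {}
for row in soup.tbody.children:
    if not isinstance(row, Tag):   # skip whitespace strings and comments
        continue
    cells = [c for c in row.children if isinstance(c, Tag)]
    key = cells[0].string.strip().rstrip(":")   # "LOA:" becomes "LOA"
    specs[key] = cells[1].string.strip()

print(specs)
```

The isinstance check against Tag is what filters out both the indentation strings and the "start ... section" comments.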

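3. Others

(1) .descendants: all descendants of a Tag (children, grandchildren and so on)
(2) .strings: when a Tag contains several strings, that is, its descendants hold text, this yields each of them for iteration
(3) .stripped_strings: used like .strings, but strips the surrounding whitespace and skips whitespace-only strings
(4) .parent: the parent node of a Tag
(5) .parents: walks up through all ancestors of the element; returns a generator
(6) .previous_sibling: the node immediately before the current Tag at the same level. In indented HTML this is usually a whitespace string, namely the newline and spaces between the current tag and the previous one, rather than the previous tag itself
(7) .next_sibling: the node immediately after the current Tag at the same level. In indented HTML this is usually a whitespace string, namely the newline and spaces between the current tag and the next one
(8) .previous_siblings: all siblings before the current Tag; returns a generator
(9) .next_siblings: all siblings after the current Tag; returns a generator
(10) .previous_element: the object (string or tag) parsed immediately before the current one; it may be the same as previous_sibling, but usually is not
(11) .next_element: the object (string or tag) parsed immediately after the current one; it may be the same as next_sibling, but usually is not
(12) .previous_elements: returns a generator that walks backwards through the parsed document
(13) .next_elements: returns a generator that walks forwards through the parsed document
(14) .has_attr(name): a method that checks whether the Tag has the attribute called name

Traversal seems to require knowing the structure of the html very well and then moving the pointer back and forth between previous, next, parent and child nodes. It is not as convenient as the search methods covered next.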
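
A few of the attributes listed above can be tried on a compact two-row table. This is a minimal sketch; the fragment is written without indentation inside each row so that the sibling of a td is the next td and not whitespace:

```python
from bs4 import BeautifulSoup

# Compact excerpt: newlines only between rows, none inside a row
html = """<table><tbody>
<tr><td>LOA:</td><td>33.80 ft / 10.30 m</td></tr>
<tr><td>Beam:</td><td>8.20 ft / 2.50 m</td></tr>
</tbody></table>"""
soup = BeautifulSoup(html, "html.parser")

first_td = soup.td
parent_name = first_td.parent.name              # the enclosing <tr>
ancestors = [p.name for p in first_td.parents]  # up to the document root
value = first_td.next_sibling.string            # the neighbouring <td>
inner = first_td.next_element                   # the string inside first_td
between_rows = soup.tr.next_sibling             # newline between the two rows
texts = list(soup.tbody.stripped_strings)       # all text, whitespace removed

print(parent_name, ancestors)
print(value, repr(between_rows))
print(texts)
```

Here next_sibling and next_element differ: the first stays on the same level, while the second descends into the tag's own content.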