Learning Python Web Scraping: The BeautifulSoup Library

maizeman126

Published 2024-05-07 01:00:00


Tags: python, web scraping, beautifulsoup

Article link: https://blog.csdn.net/maizeman126/article/details/137675596


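Reference: Deng Wei, python网络爬虫技术与应用 (Python Web Crawler Technology and Applications)

BeautifulSoup is a Python library for parsing HTML and XML, and it makes extracting data from web pages straightforward.

BeautifulSoup parses an HTML document into a tree of Python objects, so the whole page can be navigated much like nested dictionaries and lists. Compared with pulling data out by regular expressions, this greatly simplifies processing.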

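# Import the libraries
from bs4 import BeautifulSoup
import urllib.request
# Target URL, using Baidu as an example
url="http://www.baidu.com"
# Open the URL and fetch the response
resp=urllib.request.urlopen(url)
# Read the raw HTML of the page (bytes)
html=resp.read()
# Create the BeautifulSoup object; naming the parser explicitly avoids a warning
bs=BeautifulSoup(html,"lxml")
# Pretty-print the parsed document
print(bs.prettify())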

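1. Node selectors

Using a node's name directly as an attribute of the soup object selects that node element, and its string attribute returns the text inside the node. Only the first matching node is returned, so this style of parsing works best when the hierarchy leading to a single node is clear.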

html="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"lxml") # use the lxml parser
print(soup.title) # the title node of the soup object
print(type(soup.title)) # <class 'bs4.element.Tag'>
print(soup.title.string) # the text inside title
print(soup.head)
print(soup.p) # only the first p node

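2. Extracting information

(1) Getting the name. Use the name attribute to get a node's name.

(2) Getting attributes. After selecting a node element, call attrs to get all of its attributes as a dictionary.

# Print the node name
print(soup.title.name)
# Print all attributes of the first p node
print(soup.p.attrs)
# Print the value of the p node's name attribute (two equivalent forms)
print(soup.p.attrs["name"])
print(soup.p["name"])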
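
A few details of attribute access are easy to trip over, so here is a minimal sketch. The HTML snippet is made up for illustration, and it uses the built-in html.parser (an assumption, chosen so the example runs without lxml). It shows that class comes back as a list, that the tag's own name differs from its name attribute, and that get() avoids a KeyError for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, modelled on the p node used in the article
html = '<p class="title bold" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

tag_name = p.name          # the tag's own name: 'p'
name_attr = p["name"]      # the HTML attribute called name: 'dromouse'
classes = p["class"]       # multi-valued attribute, returned as a list
missing = p.get("id")      # None instead of raising KeyError
fallback = p.get("id", "no-id")

print(tag_name, name_attr, classes, missing, fallback)
```

Indexing with p["id"] would raise KeyError here, so get() is the safer choice when an attribute may be absent.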

(3) Getting the content

Use the string attribute to get the text contained in a node element.

print(soup.p.string)
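
One caveat worth sketching: string only returns text when the node has a single text-bearing path down to one string. For a node with several children it returns None, and get_text() is the usual alternative. The snippet below is a made-up reduction of the article's HTML and uses the built-in html.parser (an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one p with a single child, one p with mixed children
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a href="#">Lacie</a> and <a href="#">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

single = first.string        # text is reached through the single <b> child
mixed = second.string        # None: the node has several children
all_text = second.get_text() # concatenates every string inside the node

print(single)
print(mixed)
print(all_text)
```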

(4) Nested selection

html2="""
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
# Create the object from html2
soup=BeautifulSoup(html2,"lxml")
# Returns the title node, i.e. <title>...</title>
print(soup.head.title)
# Returns the object type, still bs4.element.Tag
print(type(soup.head.title))
# Returns the title text
print(soup.head.title.string)

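3. Relational selection

Sometimes the desired node cannot be selected in a single step. In that case, select some node element first and then use it as the base for selecting its child, parent, or sibling nodes.

(1) Parent and ancestor nodes. To get the parent of a node element, use the parent attribute; the parents attribute yields all of its ancestors.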

html="""
<html><head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters, and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a></p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"lxml")
print(soup.a.parent)
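
To make the difference between the direct parent and the full ancestor chain concrete, here is a small sketch. The HTML is a trimmed, made-up version of the example above, parsed with the built-in html.parser (an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed document
html = "<html><body><p class='story'><a id='link1'><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
span = soup.span

# parent is the single enclosing node
direct_parent = span.parent.name

# parents is a generator that walks up to the document root
ancestors = [node.name for node in span.parents]

print(direct_parent)
print(ancestors)  # the last entry is the BeautifulSoup object itself
```

The final ancestor is the BeautifulSoup object, whose name is reported as [document].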

(2) Sibling nodes. Use next_sibling and previous_sibling to get the adjacent nodes at the same level; next_siblings and previous_siblings yield all following and preceding siblings.

html="""
<html><head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters, and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well
</p>
"""
# Import the library
from bs4 import BeautifulSoup
# Use the lxml parser
soup=BeautifulSoup(html,"lxml")
print("NextSibling",soup.a.next_sibling)
print("PrevSibling",soup.a.previous_sibling)
print("NextSiblings",list(enumerate(soup.a.next_siblings)))
print("PrevSiblings",list(enumerate(soup.a.previous_siblings)))
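
Note that the text between tags, including bare newlines, counts as a sibling too, so next_sibling often returns a string rather than the next tag. A minimal sketch of how to skip over text with find_next_sibling() and find_next_siblings(); the HTML is a made-up reduction of the example above, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with newlines between the a nodes
html = """<p class="story">Once upon a time
<a id="link1">Elsie</a>
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the newline text node, not the next <a>
raw_next = first.next_sibling

# find_next_sibling / find_next_siblings filter siblings by tag name
next_tag_id = first.find_next_sibling("a")["id"]
following_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(raw_next))
print(next_tag_id)
print(following_ids)
```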

4. Extracting information

html="""
<html><head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters, and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
# Import the library
from bs4 import BeautifulSoup
# Use the lxml parser
soup=BeautifulSoup(html,"lxml")
print("NextSibling:")
# The next sibling here is the newline text between the two a nodes
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print("Parent:")
# parents is a generator, so convert it to a list before indexing
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs["class"])

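If the result is a single node, you can call string, attrs, and similar attributes on it directly to get its text and attributes. If the result is a generator over multiple nodes, convert it to a list, pick out an element, and then call string, attrs, and so on to get that node's text and attributes.

find_all() queries all elements that match the given conditions, for example: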
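
The same list-conversion pattern applies to child nodes, which the article mentions but does not demonstrate. A minimal sketch, assuming a made-up ul snippet and the built-in html.parser: contents is already a list, while children and descendants are iterators.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet without whitespace between the li nodes
html = '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# contents is a plain list of direct children
contents_is_list = isinstance(ul.contents, list)

# children is an iterator: convert to a list before indexing
items = list(ul.children)
second_text = items[1].string

# descendants walks the whole subtree: three li tags plus their three strings
descendant_count = len(list(ul.descendants))

print(contents_is_list, second_text, descendant_count)
```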

html="""
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"lxml")
# Query by node name with find_all and print the result
print(soup.find_all(name="ul"))
print(type(soup.find_all(name="ul")[0]))

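This example calls find_all() with the name argument set to ul, i.e. it queries all ul nodes. The result behaves like a list (a bs4 ResultSet), and each element is still of type bs4.element.Tag.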
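
Because each result is itself a Tag, find_all() can be called again on it, and it also accepts attribute filters. The following is a sketch under the assumption that the panel HTML looks like the example above (trimmed here) and that html.parser is acceptable in place of lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the panel example; treat it as illustrative data
html = """
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Nested query: for every ul, collect the text of its li nodes
texts_by_list = {ul["id"]: [li.string for li in ul.find_all("li")]
                 for ul in soup.find_all("ul")}

# Filter by attribute dictionary, or by class using the class_ keyword
by_id = soup.find_all(attrs={"id": "list-2"})
small_lists = soup.find_all("ul", class_="list-small")

# find() returns only the first match; limit caps the number of results
first_li = soup.find("li")
two_items = soup.find_all("li", limit=2)

print(texts_by_list)
print(small_lists[0]["id"], first_li.string, len(two_items))
```

The keyword is class_ rather than class because class is a reserved word in Python.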

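5. Parsing a local web page

Import BeautifulSoup from bs4, use the open function to open the web page file stored locally, parse it with BeautifulSoup, and print the required data once parsing is done.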

from bs4 import BeautifulSoup
with open("fff.html",encoding="utf-8") as web_data:
    soup=BeautifulSoup(web_data.read(),"lxml")
    print(soup.title)
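
The snippet above needs an existing fff.html. To try the same steps without one, the sketch below first writes a small made-up page into a temporary directory and then parses it. The page content is hypothetical, and html.parser is used as an assumption so that lxml is not required. It also shows that BeautifulSoup accepts the open file object directly, without calling read().

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content standing in for a real local file
page = "<html><head><title>Local page</title></head><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fff.html")

    # Write the sample file so the example is self-contained
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

    # Parse it: the file object can be passed to BeautifulSoup as-is
    with open(path, encoding="utf-8") as web_data:
        soup = BeautifulSoup(web_data, "html.parser")

title_text = soup.title.string
print(title_text)
```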

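6. Parsing an online web page

To parse an online page, first request the page data and bring it to the local machine, then parse it. So besides BeautifulSoup, you also need to import the HTTP request module requests.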

# Import the libraries
from bs4 import BeautifulSoup
import requests
url="https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html"
# Request the URL
web_data=requests.get(url)
# Create the BeautifulSoup object from the response text, not the Response object itself
soup=BeautifulSoup(web_data.text,"lxml")
# Print the parsed document
print(soup)
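
In practice a request can hang or return an error page, so it may help to wrap the download in a small helper. The function name fetch_soup, the User-Agent value, and the 10-second timeout below are all assumptions made for illustration, not part of the original article; html.parser is again used so the sketch does not depend on lxml.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and return its parsed tree (hypothetical helper)."""
    # Assumption: some sites reject the default client string, so send a browser-like one
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently parsing an error page
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")
```

A call such as fetch_soup("http://www.baidu.com").title would then return the page's title node, provided the site is reachable and responds successfully.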
