A Short Summary of Web Scraping with BeautifulSoup

Gao__xi

Category: Python Web Scraping Basics; Tags: BeautifulSoup

Article link: https://blog.csdn.net/Gao__xi/article/details/88674986

Personal Takeaways

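The ultimate goal of a crawler is to extract information from web pages, that is, from their HTML documents. The work can be broken into the following steps:

1. Get the HTML document.
2. Locate the tags that contain the content you want.
3. Extract the content itself (usually the text inside a tag, or the href of an a link).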
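
The three steps above can be sketched as follows. This is only a minimal illustration, not the author's code: the `extract_links` helper and the inline `html_doc` string are hypothetical names introduced here, and the built-in `html.parser` is used so that no extra parser package is required. In a real crawler, step 1 would download the page (for example with `urllib.request.urlopen`) instead of using a hard-coded string.

```python
from bs4 import BeautifulSoup

# Step 1: obtain the HTML document. A hard-coded string stands in for a
# downloaded page here (an assumption, so the sketch runs offline).
html_doc = """
<html><body>
<a href="https://www.baidu.com">2222</a>
<a href="https://www.taobao.com"><p>3333</p></a>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag in the document."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: locate the tags that hold the wanted content.
    a_tags = soup.select("a")
    # Step 3: pull the text and the href attribute out of each tag.
    return [(a.get_text(strip=True), a.get("href")) for a in a_tags]


links = extract_links(html_doc)
for text, href in links:
    print(text, href)
```

Returning plain tuples keeps the parsing step separate from whatever is done with the data afterwards, such as printing or saving it.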

Code

from bs4 import BeautifulSoup
htmltext='''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Selectors</title>
</head>
<body>
<!-- class test -->
<p class="p_class">1</p>
<p class="p_class">1</p>
<p class="p_class">1</p>
<p class="p_class">1</p>
<!-- href test -->
<a  id="mya" class="myaclass" href="https://www.baidu.com">CSDN Blog<p><p>insert</p></p><p>dddd</p></a>
<a href="https://www.baidu.com" >2222</a>
<a href="https://www.taobao.com"><p>3333</p></a>
<a href="https://www.taobao.com">4444</a>
<!-- id test -->
<p id="myp1">1</p>
<p id="myp2">2</p>
<p id="myp3">3</p>
<p id="myp4">4</p>
<p id="myp5" class="lbj" >5</p>
</body>
</html>'''
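soup = BeautifulSoup(htmltext, 'lxml')  # the 'lxml' parser requires the lxml package to be installed
aTag = soup.select("a")[0]  # select() returns a list of all matches; take the first <a>
# Getting attributes, method 1: subscript access (raises KeyError if the attribute is missing)
print(aTag["id"])
print(aTag["class"])  # a tag can have several classes, so this returns a list
print(aTag["href"])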

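# Getting attributes, method 2: .get() (returns None if the attribute is missing)
print(aTag.get("id"))
print(aTag.get("class"))
print(aTag.get("href"))
# .text and .string: getting a tag's text content
print(aTag.text)    # all text under the tag, including children and grandchildren
print(aTag.string)  # the text only when the tag has exactly one child; with several children, as here, it is None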
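
The difference between the two attribute styles, and between `.string` and `.text`, is easier to see on a smaller document. The sketch below is an illustration added here, not part of the original listing: the `snippet` string and all variable names are made up, and `html.parser` is assumed so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A small, hypothetical document: one <a> with mixed children, one simple <p>.
snippet = (
    '<a id="mya" class="x y" href="https://www.baidu.com">CSDN<p>insert</p></a>'
    '<p id="myp1">1</p>'
)
soup = BeautifulSoup(snippet, "html.parser")

a_tag = soup.select_one("a")       # first match, or None if nothing matches
p_tag = soup.select_one("#myp1")   # CSS id selector

# Attribute access: [] raises KeyError for a missing attribute, .get() returns None.
try:
    a_tag["title"]
    raised = False
except KeyError:
    raised = True
missing = a_tag.get("title")       # None
classes = a_tag["class"]           # ['x', 'y'] because class is multi-valued

# Text access.
single = p_tag.string              # '1': the tag has exactly one child
multi = a_tag.string               # None: the tag has more than one child
joined = a_tag.get_text()          # 'CSDNinsert': every string, concatenated
separated = a_tag.get_text(" ", strip=True)  # 'CSDN insert'

print(raised, missing, classes, single, multi, joined, separated)
```

When the structure of a tag is not known in advance, `get_text()` with a separator tends to be the safer choice, since `.string` silently becomes `None` as soon as the tag gains a second child.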