Basic usage of BeautifulSoup

Latest recommended article published on 2024-08-12 23:17:41

睡觉对我很重要


Category: python. Tags: python

Article link: https://blog.csdn.net/weixin_44724691/article/details/105620627


Basic elements

(1) Tag
(2) Name
(3) Attributes
(4) NavigableString
(5) Comment
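
As a rough illustration of how these five elements show up in practice, the sketch below parses a small inline snippet. The HTML string and variable names are made up for this example and are not part of the original demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline snippet standing in for a downloaded page
html = '<p class="title" id="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p               # Tag: the <p> element itself
name = tag.name            # Name: the tag's name as a str
attrs = tag.attrs          # Attributes: a dict of the tag's attributes
text, note = tag.contents  # a NavigableString followed by a Comment

print(type(tag).__name__, name, attrs)
print(type(text).__name__, type(note).__name__)
```

Comment is a subclass of NavigableString, so checking for Comment first is the safer order when both kinds of node may appear.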

Related methods

(1) Getting tags

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
r.raise_for_status()  # check the status code before using the response
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # the parsed document, the "soup"
print(soup.title)
# <title>This is a python demo page</title>
print(soup.a)
# <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
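
One detail worth seeing in isolation: accessing a tag by name as an attribute returns only the first match, and returns None when no such tag exists. The sketch below uses a made-up inline page rather than the real demo URL, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two links and no <table>
html = ('<html><head><title>Demo</title></head><body>'
        '<a id="link1">First</a><a id="link2">Second</a>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.a   # only the first <a>, even though there are two
missing = soup.table  # None, because the page has no <table>

print(first_link['id'])
print(missing)
```

To get every matching tag rather than the first one, find_all is the usual tool.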

(2) Attributes

tag = soup.a
print(soup.a.parent.name)
print(soup.a.name)
print(soup.a.parent.parent.name)
# p a body
print(type(soup.a.name))
# <class 'str'>
print(soup.a.attrs)
# {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
print(type(soup.a.attrs))
# <class 'dict'>
print(tag.attrs['class'])
# ['py1']
print(type(tag.attrs['class']))
# <class 'list'>; attrs is accessed like a dictionary

print(type(tag))
# <class 'bs4.element.Tag'>; .attrs returns a dict even when the tag has no attributes

print(tag.string)
# Basic Python
print(soup.p.string)
# The demo python introduces several python courses.
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
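
The attribute dictionary can also be reached through the tag itself. The following is a minimal sketch on a made-up link (the URL and the extra class are assumptions for illustration); it shows index access, the list returned for the multi-valued class attribute, and .get for attributes that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical link, not taken from the demo page
html = '<a class="py1 external" href="http://example.com/course" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.a

href = tag['href']      # shorthand for tag.attrs['href']
classes = tag['class']  # class is multi-valued, so this is a list
title = tag.get('title')  # None instead of a KeyError when the attribute is missing

print(href, classes, title)
```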


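html2 = '''
<b><!--this is a comment--></b> <p>this is not a comment </p>
'''
soup = BeautifulSoup(html2, 'html.parser')
print(soup.b.string)
# this is a comment
print(type(soup.b.string))
# <class 'bs4.element.Comment'>
print(soup.p.string)
# this is not a comment
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

This is worth noting: .string returns the comment text without the <!-- --> markers, so a comment and ordinary text look the same when printed. Only the type tells them apart.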
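
One possible way to act on this is to check the type explicitly before using a string. The helper name is_comment below is hypothetical, and the sketch simply filters comments out of all the strings in the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = '<b><!--this is a comment--></b> <p>this is not a comment </p>'
soup = BeautifulSoup(html, 'html.parser')


def is_comment(node):
    """Return True when node is an HTML comment rather than ordinary text."""
    return isinstance(node, Comment)


# Keep every string in the tree except the comments
visible = [
    node for node in soup.descendants
    if isinstance(node, NavigableString) and not is_comment(node)
]

print(is_comment(soup.b.string), is_comment(soup.p.string))
print(visible)
```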

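Traversal

(1) Downward traversal
Attributes: $\begin{cases} \texttt{.contents} & \text{list of child nodes: all children of the tag, stored in a list} \\ \texttt{.children} & \text{iterator over the child nodes, similar to } \texttt{.contents} \\ \texttt{.descendants} & \text{iterator over all descendant nodes} \end{cases}$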

print(soup.head.contents)
#[<title>This is a python demo page</title>]
print(soup.body.contents)
# ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
# <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
print(len(soup.body.contents))
# 5

The iterator types can simply be looped over with for ... in.
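
As a small sketch of the difference between the two iterators, the example below loops over .children and .descendants of a made-up body (the HTML is an assumption, not the demo page) and keeps only the Tag nodes, since both iterators also yield text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with two levels of nesting
html = '<body><p><b>bold</b></p><p>plain <a>link</a></p></body>'
soup = BeautifulSoup(html, 'html.parser')

# Direct children only
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]
# Children, grandchildren, and so on, in document order
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```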
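(2) Upward traversal
$\begin{cases} \texttt{.parent} & \text{the parent tag} \\ \texttt{.parents} & \text{iterator over the ancestor tags, for looping upward through them} \end{cases}$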

for parent in soup.a.parents:
    # The last ancestor is the BeautifulSoup object itself, whose name is '[document]'
    print(parent.name)
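
To make the order of the ancestors concrete, here is a minimal sketch on a made-up document (the HTML is an assumption for illustration). It collects the names yielded by .parents into a list.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single link
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Walk from the <a> tag up to the root of the parse tree
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)
```

The final entry is the BeautifulSoup object, which is why the loop ends with '[document]' rather than 'html'.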

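(3) Sibling traversal (nodes under the same parent)
$\begin{cases} \texttt{.next\_sibling} & \text{the next sibling node} \\ \texttt{.previous\_sibling} & \text{the previous sibling node} \\ \texttt{.next\_siblings} & \text{iterator over all following siblings} \\ \texttt{.previous\_siblings} & \text{iterator over all preceding siblings} \end{cases}$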

print(soup.a.next_sibling)
# and
print(type(soup.a.next_sibling))
# <class 'bs4.element.NavigableString'>

This shows that a NavigableString is also a node in the tree, so sibling traversal may return text rather than a tag.
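
When only tags are wanted, one option is to filter the siblings by type, or to use find_next_sibling, which skips text nodes. The sketch below assumes a simplified paragraph modeled loosely on the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified paragraph with text between two links (an assumption for this example)
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

raw_sibling = first.next_sibling            # the text node ' and '
tag_sibling = first.find_next_sibling('a')  # the next <a>, skipping text
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_sibling))
print(tag_sibling['id'], len(tag_siblings))
```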

Prettier output

print(soup.prettify())
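
For a sense of what prettify does, the sketch below applies it to a tiny made-up snippet. It returns a string with each tag and each text node on its own line, indented by nesting depth. Because it inserts whitespace, it is probably best treated as a display aid rather than as input for further parsing.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet
html = '<p><b>bold</b></p>'
soup = BeautifulSoup(html, 'html.parser')

pretty = soup.prettify()  # a str, not a new soup
print(pretty)

# The text is preserved, but now surrounded by added whitespace
reparsed = BeautifulSoup(pretty, 'html.parser')
print(repr(reparsed.b.get_text(strip=True)))
```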

Searching

find_all(name,attrs,recursive,string,**kwargs)
Returns a list of every match.
(1) By tag name

print(soup.find_all('a'))  # a single tag name
print(soup.find_all(['a','b']))  # both <a> and <b> tags

import requests
import re
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
for tag in soup.find_all(re.compile('b')):
    print(tag.name)
    # body, b (every tag whose name contains 'b')
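
A compiled pattern is matched with search, so re.compile('b') matches any tag name that contains a b. The sketch below, on a made-up document, compares that with anchored patterns. The HTML is an assumption chosen so that the three patterns give different results.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document containing <body>, <b> and <table>
html = '<html><body><p><b>bold</b></p><table></table></body></html>'
soup = BeautifulSoup(html, 'html.parser')

contains_b = [t.name for t in soup.find_all(re.compile('b'))]     # 'b' anywhere in the name
starts_with_b = [t.name for t in soup.find_all(re.compile('^b'))]  # name begins with 'b'
exactly_b = [t.name for t in soup.find_all(re.compile('^b$'))]     # name is exactly 'b'

print(contains_b)
print(starts_with_b)
print(exactly_b)
```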

(2) By attribute

print(soup.find_all('a','py1'))  # <a> tags whose class is 'py1'
print(soup.find_all(id='link1'))  # tags whose id is 'link1'
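
The attribute filters can be combined with regular expressions as well. The following sketch uses a simplified paragraph modeled on the demo page (an assumption, so it runs offline) and shows lookups by class, by id, by an id pattern, and by string content.

```python
import re

from bs4 import BeautifulSoup

# Simplified stand-in for the two links on the demo page
html = ('<p><a class="py1" id="link1">Basic Python</a> and '
        '<a class="py2" id="link2">Advanced Python</a></p>')
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.find_all('a', class_='py1')             # class_ avoids the Python keyword
by_id = soup.find_all(id='link1')                       # exact id match
by_id_pattern = soup.find_all(id=re.compile('^link'))   # ids starting with 'link'
by_string = soup.find_all(string=re.compile('Python'))  # text nodes, not tags
no_match = soup.find_all(id='py1')                      # 'py1' is a class here, not an id

print(len(by_class), len(by_id), len(by_id_pattern), len(by_string), no_match)
```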

<tag>(…) is equivalent to <tag>.find_all(…)
soup(…) is equivalent to soup.find_all(…)
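
A quick way to convince yourself of this equivalence is to compare the two call styles directly. The snippet below is a minimal sketch on made-up HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two links
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, 'html.parser')

via_call = soup('a')               # calling the soup object directly
via_find_all = soup.find_all('a')  # the explicit method
via_tag_call = soup.p('a')         # the same shortcut on a Tag

print(via_call == via_find_all, len(via_tag_call))
```

The explicit find_all form tends to be easier to read in longer scripts, so the shortcut is mostly a convenience for interactive use.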
