A summary of BeautifulSoup methods

入梦游

Last modified: 2022-11-20 23:16:06

Category: Python study notes. Tags: beautifulsoup, web scraping

First published: 2022-11-20 23:04:13

Article link: https://blog.csdn.net/weixin_66651900/article/details/127955492

The BeautifulSoup module

The beautifulsoup module
Summary
Test source

The beautifulsoup module

Import the module
from bs4 import BeautifulSoup

Instantiate the class
soup = BeautifulSoup(html,'lxml')  # 1st argument: the markup to parse; 2nd argument: the parser
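As a minimal sketch of the instantiation step, the snippet below parses a short HTML string. The sample markup is made up for illustration, and it assumes the parser `html.parser` (bundled with Python) instead of `lxml`, which has to be installed separately.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the article's test source.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the markup to parse. Second argument: the parser name.
# "html.parser" is used here only to avoid the extra lxml dependency.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Demo
print(soup.p.string)      # Hello
```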

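Simple selection

Get the first p tag object
soup.p
Get the p tag's name
soup.p.name
Get an attribute value of the p tag (the key must exist on the tag, otherwise a KeyError is raised; the first p tag in the test source has no href, so name is used here)
soup.p.attrs['name']
Get the p tag's text content
soup.p.string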
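The attribute-style access above can be tried on a small fragment. This is only a sketch: the markup is a shortened, hypothetical variant of the test source, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the test source.
html = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<p class="story">second</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                # attribute access returns only the first <p>
print(first_p.name)             # p  (the tag name, not the "name" attribute)
print(first_p.attrs["name"])    # dromouse
print(first_p.attrs["class"])   # ['title']  (class is multi-valued, so it is a list)
print(first_p.string)           # The Dormouse's story
```

Note that `.name` is the tag's name, while the HTML attribute called `name` is read through `attrs` or square brackets.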

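The find_all() method

Select by tag name

soup.find_all('p')

Filter by attribute: all tags whose given attribute equals value (tag is a custom attribute used in the test source)

soup.find_all(class_='value')

soup.find_all(id='value')

soup.find_all(tag='value')

Combined filters:

Select the p tags whose class is value
soup.find_all('p',{'class':'value'})

Select the p tags that have both class_='value' and text='text content' (since Beautiful Soup 4.4.0 the text argument is named string; text is the older name)

soup.find_all('p',class_='value',text='text content')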
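The filters above can be compared side by side on a small document. This is an illustrative sketch: the markup is invented, `html.parser` is an assumed parser choice, and `string=` is used in place of the older `text=` name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<p class="story">first</p>
<p class="story" id="w">second</p>
<p class="other">third</p>
<a class="story" href="#">link</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")                             # every <p>
by_class = soup.find_all(class_="story")               # any tag with class "story"
p_by_class = soup.find_all("p", {"class": "story"})    # only <p> with that class
by_id = soup.find_all(id="w")                          # any tag with id "w"
p_by_text = soup.find_all("p", class_="story", string="second")

print(len(all_p), len(by_class), len(p_by_class), len(by_id), len(p_by_text))
# 3 3 2 1 1
```

The count for `by_class` includes the `<a>` tag, which is why adding the tag name as the first argument narrows the result.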

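The select() method

Select by tag

soup.select('p')

Select by attribute:
A class selector is prefixed with '.', an id selector is prefixed with '#', and any other attribute must be written inside square brackets []

soup.select('.story')

soup.select('#w')

soup.select('[href="http://example.com/lacie"]')

Combined selectors

soup.select('p.story')

soup.select('p#w')

soup.select('a[href="http://example.com/lacie"]')

soup.select('p a[href="http://example.com/lacie"]')# a space selects descendants: a tags with href="http://example.com/lacie" anywhere inside a p tag

Child tag lookup (direct children only)

soup.select('head>title')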
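A short sketch of the CSS selectors above, run against invented markup with `html.parser` as an assumed parser. It also shows the difference between the descendant combinator (space) and the child combinator (`>`).

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<html><head><title>Demo</title></head>
<body>
<p class="story" id="w">text <b><a href="http://example.com/lacie" id="link2">Lacie</a></b></p>
<p class="title"><b>bold</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select(".story")[0]["id"])          # w         (class selector)
print(soup.select("#w")[0]["class"])           # ['story'] (id selector)
print(soup.select("head > title")[0].string)   # Demo      (direct child)

# The <a> is nested inside <b>, so it is a descendant of <p> but not a child.
descendants = soup.select('p a[href="http://example.com/lacie"]')
children = soup.select('p > a[href="http://example.com/lacie"]')
print(len(descendants), len(children))         # 1 0
```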

Getting attributes

['']

print(soup.p['name'])

attrs['']

print(soup.p.attrs['tag'])

get()

soup.p.get('name')
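The three access styles return the same value when the attribute exists and differ only when it is missing. A minimal sketch on invented markup, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<p class="title" name="dromouse">x</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["name"])         # dromouse
print(p.attrs["name"])   # dromouse
print(p.get("name"))     # dromouse

# For a missing attribute, get() returns None (or a supplied default) ...
missing = p.get("href")
fallback = p.get("href", "no-link")
print(missing, fallback)  # None no-link

# ... while square brackets raise KeyError.
try:
    p["href"]
    raised = False
except KeyError:
    raised = True
print(raised)             # True
```

When an attribute may be absent, as with `href` on tags scraped from real pages, `get()` is usually the safer choice.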

Getting text

string
Gets the p tag's text content

print(soup.p.string)

get_text()
Gets all text content under the p tag

print(soup.p.get_text())
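The difference between the two is easiest to see on a tag that mixes text with a nested tag. The following sketch uses invented markup and assumes `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> mixes text and a tag, the second has one child.
html = "<p>Hello <b>world</b>!</p><p><b>only child</b></p>"
soup = BeautifulSoup(html, "html.parser")
mixed, single = soup.find_all("p")

print(mixed.string)       # None  (more than one child, so .string is ambiguous)
print(mixed.get_text())   # Hello world!
print(single.string)      # only child  (a single child tag passes its string up)
print(single.get_text())  # only child
```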

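Summary

find_all() and select() both return a list of Tag objects (find_all() returns a ResultSet, which is a list subclass). Attributes and text cannot be read from the list itself; index into it or, more commonly, iterate over it with a for loop and read them from each tag.
.string returns text only when the tag contains a single string (or a single child tag that itself has one); if the tag mixes text with other tags, it returns None. get_text() extracts all text under the tag, including the text of nested tags.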
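A sketch of the iteration pattern described in the summary, using invented markup and `html.parser` as an assumed parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(isinstance(links, list))  # True

# The result is a list of tags, not a tag, so attributes and text are
# read from each element in turn.
pairs = [(a.get("href"), a.get_text()) for a in links]
for href, text in pairs:
    print(href, text)
```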

Test source

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>

    <body>

        <p class="title" name="dromouse" tag='第一个p标签'>
            <b>The Dormouse's story</b>
        </p>
        <p class="story" tag='第二个p标签'>
            你好！
            <a href="http://example.com/elsie" class="sister" id="link1" tag='第二个标签下的第1个a标签'>
                <!-- 注释 -->
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2" tag='第二个标签下的第2个a标签'>
                世界
            </a>
            或
            <a href="http://example.com/tillie" class="sister" id="link3" tag='第二个标签下的第3个a标签'>
                hello!
            </a>;
                world
        </p>
        <p class="story" tag='第三个p标签'>第三个</p>
        <p class="story_1" id='q' tag='第四个p标签'>第四个</p>
        <p id='w' tag='第五个p标签'>
            <b class='title'>这是b标签</b>
        </p>
        

    </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')

# print(soup.p) # get the first p tag object

# print(soup.p['name'])
# print(soup.p.attrs['tag']) # get an attribute of the <p> tag

# print(soup.p.string)      # get the text content of the <p> tag
# print(soup.p.get_text())  # get all text content under the <p> tag

# print(soup.p.b)           # get the b tag object inside the p tag
# print(soup.p.b.string)    # get the text content of the b tag inside the <p> tag

# soup.find_all('p')
# soup.find_all('p', {'name':"dromouse"})# Question: does lookup by multiple attributes work?
# soup.find_all('p',text='第三个')
# soup.find_all('p',class_='story_1')

# soup.find_all(class_='story')
# soup.find_all(id='w')
# soup.find_all(tag='第四个p标签')

# soup.select('p')
# soup.select('p.title')
# soup.select('p#q')

# soup.select('.title')
# soup.select('#q')
# soup.select('[href="http://example.com/lacie"]')

# soup.select('p .title')# a space selects descendants at any depth (tags with class title inside a p tag)

# soup.select('a[href="http://example.com/lacie"]')# attribute lookup
# soup.select('p a[href="http://example.com/lacie"]')# combined attribute lookup

# soup.select('head>title')# child tag lookup
# soup.p.get('name')
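On the question of lookup by multiple attributes raised in the comments above: it appears to work in each of the forms below. This is a sketch on invented markup, assuming `html.parser`; a tag has to satisfy every listed attribute to match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<p class="story" id="q">a</p>
<p class="story">b</p>
<p class="title" id="q2">c</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to require class="story" AND id="q" on a <p>.
by_dict = soup.find_all("p", attrs={"class": "story", "id": "q"})
by_kwargs = soup.find_all("p", class_="story", id="q")
by_css = soup.select('p.story[id="q"]')

print([p.get_text() for p in by_dict])    # ['a']
print([p.get_text() for p in by_kwargs])  # ['a']
print([p.get_text() for p in by_css])     # ['a']
```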
