The find Methods for Parsing Web Pages in a Crawler

Article link: https://blog.csdn.net/wzxeleanor/article/details/122441520

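Table of contents

The find_all() method
- Search scope
- - Parameters of find_all()
The find() method
Extracting a node versus extracting its content
- Extracting an element's attributes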

The find_all() method

find_all() returns a list-like iterable containing every Tag object that satisfies the given conditions.
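As a minimal sketch of what "list-like" means here, the snippet below runs find_all() on a tiny, made-up HTML string (the markup is an assumption for illustration, not the article's sample page) and treats the result like a list.

```python
from bs4 import BeautifulSoup

# Hypothetical two-div document, used only for this sketch
html = '<div id="a"><p>one</p></div><div id="b"><p>two</p></div>'
bs = BeautifulSoup(html, 'html.parser')

divs = bs.find_all('div')

# The result supports len(), indexing, and iteration, like a list
count = len(divs)
first_div = divs[0]
ids = [d['id'] for d in divs]

print(count)
print(first_div)
print(ids)
```

Because the result behaves like a list, an empty result is simply an empty sequence, so a for loop over it runs zero times instead of raising an error.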

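Search scope

You can search from a BeautifulSoup object or from a Tag object. Searching from a BeautifulSoup object covers the whole document; searching from a Tag object covers only that node's descendants.

BeautifulSoup object.find_all()

Tag object.find_all()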
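The difference between the two scopes can be sketched as follows. The HTML is a shortened stand-in for a page with one header and two poem sections; the titles "Poem 1" and "Poem 2" are placeholders chosen for this example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical page: a header div plus two poem divs
html = '''
<div id="header"><h1>Title</h1></div>
<div class="poems" id="section1"><h2>Poem 1</h2><p>text 1</p></div>
<div class="poems" id="section2"><h2>Poem 2</h2><p>text 2</p></div>
'''
bs = BeautifulSoup(html, 'html.parser')

# Search the whole document
all_h2 = bs.find_all('h2')

# Search only inside one Tag (the second <div>, i.e. the first poem)
first_poem = bs.find_all('div')[1]
h2_in_first = first_poem.find_all('h2')

print(len(all_h2))
print(len(h2_in_first))
```

Narrowing the scope to a Tag first is a common way to keep later searches from picking up matches elsewhere on the page.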

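Parameters of find_all()

The first kind of parameter: an HTML element name.

Pass an HTML element name to find_all() to search for all Tag objects whose element name matches. For example, BeautifulSoup object.find_all('div') returns every <div> element in the BeautifulSoup object. The element name must be passed as a string, in quotes (' '), with one element name per string. To match several element names in one call, find_all() also accepts a list of names, such as ['h2', 'h3'].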

from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
    <meta charset="utf-8">
    <title>大川神的爬虫世界</title>
    </head>
    <body>
    <div id="header">
    <h1>川神教你HTML</h1>
    </div>
    <div class="poems" id="section1">
    <h2>静夜思</h2>
    <h3>李白（唐）</h3>
    <p>床前明月光，疑是地上霜。<br>举头望明月，低头思故乡。</p>
    </div>
    <div class="poems" id="section2">
    <h2>早发白帝城</h2>
    <h3>李白（唐）</h3>
    <p>朝辞白帝彩云间，千里江陵一日还。<br>两岸猿声啼不住，轻舟已过万重山。</p>
    </div>
    </body>
    </html>
    '''
# Parse the HTML document
bs = BeautifulSoup(html, 'html.parser')
# Use find_all() to get all <div> nodes
div_all = bs.find_all('div')

# Print the result
print(div_all)

The two key statements are:

bs = BeautifulSoup(html, 'html.parser')

div_all = bs.find_all('div')

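The second kind of parameter: HTML element attributes.

Pass HTML element attributes to find_all() to search for Tag objects by attribute (such as id, class, or href). Attributes are passed in the form name=value, and you can pass zero or more of them in one call. The parameter name is usually the attribute name, and the parameter value is the attribute value to match.

Note that the HTML attribute class collides with the Python reserved keyword class. When you use the class attribute as a parameter, append an underscore and write class_ instead.

BeautifulSoup object.find_all(class_='poems') searches the BeautifulSoup object for the Tag objects of all elements that have the attribute class="poems".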

# Use find_all() to get the nodes of all HTML elements with the attribute class="poems"
poems_all = bs.find_all(class_='poems')

# Print the result
print(poems_all)

The key statement is poems_all = bs.find_all(class_='poems'). Writing class without the trailing underscore is a Python syntax error.
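A few more ways to filter by attribute are sketched below: by id, by several attributes at once, and through the attrs dictionary, which avoids the class_ spelling. The shortened HTML is an assumption made for this example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of a page with two poem sections
html = '''
<div id="header"><h1>Title</h1></div>
<div class="poems" id="section1"><h2>Poem 1</h2></div>
<div class="poems" id="section2"><h2>Poem 2</h2></div>
'''
bs = BeautifulSoup(html, 'html.parser')

# Filter by a single attribute
by_id = bs.find_all(id='section2')

# Filter by several attributes: every condition must hold
by_both = bs.find_all(class_='poems', id='section1')

# Same class filter expressed with the attrs dictionary
by_attrs = bs.find_all(attrs={'class': 'poems'})

print(by_id)
print(by_both)
print(len(by_attrs))
```

The attrs dictionary is also handy for attribute names that are not valid Python keyword arguments, such as names containing a hyphen.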

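Because find_all() returns only Tag objects that satisfy all of the given conditions, you can combine the two kinds of parameters to locate Tag objects more precisely. When you use an element name together with element attributes, pass the element name as the first argument, followed by zero or more attributes.

To search a BeautifulSoup object for all elements that are named div and have the attribute class="poems", write: BeautifulSoup object.find_all('div', class_='poems').

Note that find_all() does not return a Tag object. It returns a list-like iterable of Tag objects, so you usually need a for loop to get at the individual Tag objects.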

from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
    <meta charset="utf-8">
    <title>大川神的爬虫世界</title>
    </head>
    <body>
    <div id="header">
    <h1>川神教你HTML</h1>
    </div>
    <div class="poems" id="section1">
    <h2>静夜思</h2>
    <h3>李白（唐）</h3>
    <p>床前明月光，疑是地上霜。<br>举头望明月，低头思故乡。</p>
    </div>
    <div class="poems" id="section2">
    <h2>早发白帝城</h2>
    <h3>李白（唐）</h3>
    <p>朝辞白帝彩云间，千里江陵一日还。<br>两岸猿声啼不住，轻舟已过万重山。</p>
    </div>
    </body>
    </html>
    '''
# Parse the HTML document
bs = BeautifulSoup(html, 'html.parser')
# Use find_all() to get the nodes of all HTML elements with the attribute class="poems"
poems_all = bs.find_all(class_='poems')

# Loop over the find_all() result poems_all and print each node
for poem in poems_all:
    print('------ Tag object ------')
    print(poem)
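To illustrate why combining an element name with an attribute can matter, here is a small sketch. It assumes a made-up page in which a <p> element also carries class="poems", which is not the case in the sample page above.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a <div> and a <p> share the same class
html = (
    '<div class="poems" id="section1"><h2>Poem 1</h2></div>'
    '<p class="poems">A stray paragraph</p>'
)
bs = BeautifulSoup(html, 'html.parser')

# Attribute only: matches both elements
class_only = bs.find_all(class_='poems')

# Element name first, then the attribute: matches only the <div>
div_only = bs.find_all('div', class_='poems')

# Tag.name gives the element name of each match
class_only_names = [tag.name for tag in class_only]
div_only_names = [tag.name for tag in div_only]

print(class_only_names)
print(div_only_names)
```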

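The find() method

find() has the same scope as find_all(): it can be called on both BeautifulSoup objects and Tag objects. It also takes the same parameters. The difference is the return value: find() returns a single Tag object, namely the first Tag object within the search scope that satisfies the conditions. In this respect it resembles accessing an element by .element name (for example, bs.div).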


from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
    <meta charset="utf-8">
    <title>大川神的爬虫世界</title>
    </head>
    <body>
    <div id="header">
    <h1>川神教你HTML</h1>
    </div>
    <div class="poems" id="section1">
    <h2>静夜思</h2>
    <h3>李白（唐）</h3>
    <p>床前明月光，疑是地上霜。<br>举头望明月，低头思故乡。</p>
    </div>
    <div class="poems" id="section2">
    <h2>早发白帝城</h2>
    <h3>李白（唐）</h3>
    <p>朝辞白帝彩云间，千里江陵一日还。<br>两岸猿声啼不住，轻舟已过万重山。</p>
    </div>
    </body>
    </html>
    '''
# Parse the HTML document
bs = BeautifulSoup(html, 'html.parser')
# Use find() to get the first node that satisfies the conditions
poem1_tag = bs.find('div', class_='poems')
# Use find() to extract the <h2> node from poem1_tag
h2_tag = poem1_tag.find('h2')
# Print the result
print(h2_tag)
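The relationship between find(), find_all(), and the .element name shortcut can be sketched as below, along with one detail worth checking in real code: find() returns None when nothing matches. The shortened HTML and the searched-for <table> element are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical page with two poem sections
html = '''
<div class="poems" id="section1"><h2>Poem 1</h2></div>
<div class="poems" id="section2"><h2>Poem 2</h2></div>
'''
bs = BeautifulSoup(html, 'html.parser')

first_by_find = bs.find('div')
first_by_find_all = bs.find_all('div')[0]
first_by_dot = bs.div

# All three refer to the first <div> in the document
same_node = first_by_find == first_by_find_all == first_by_dot

# No <table> exists here, so find() returns None rather than raising
missing = bs.find('table')

# Guard before chaining further calls on the result
table_id = missing['id'] if missing is not None else None

print(same_node)
print(missing)
```

Chained calls such as bs.find('table').find('tr') fail with an AttributeError when the first find() returns None, so a None check like the one above is a reasonable habit on pages whose structure may vary.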


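Extracting a node versus extracting its content

Printing a Tag object shows the extracted node itself, including its opening and closing tags.

To get only the content inside the node, use the syntax Tag object.text.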
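The distinction between a node and its text can be sketched as follows. The shortened HTML is an assumption for this example; get_text() with a separator is shown as one possible way to keep text on either side of a <br> apart.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical poem section with a <br> inside the paragraph
html = '<div class="poems"><h2>Poem 1</h2><p>line one<br>line two</p></div>'
bs = BeautifulSoup(html, 'html.parser')

h2_tag = bs.find('h2')
node_as_string = str(h2_tag)   # the node, tags included
node_text = h2_tag.text        # only the text inside the node

p_tag = bs.find('p')
joined = p_tag.text                          # text pieces run together
separated = p_tag.get_text(separator='\n')   # pieces joined by a newline

print(node_as_string)
print(node_text)
print(joined)
print(separated)
```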

Extracting an element's attributes

Tag object['attribute name'] extracts the value of the given attribute from the corresponding element.

from bs4 import BeautifulSoup

html = '''
    <html>
    <head>
    <meta charset="utf-8">
    <title>大川神的爬虫世界</title>
    </head>
    <body>
    <div id="header">
    <h1>川神教你HTML</h1>
    </div>
    <div class="poems" id="section1">
    <h2>静夜思</h2>
    <h3>李白（唐）</h3>
    <p>床前明月光，疑是地上霜。<br>举头望明月，低头思故乡。</p>
    </div>
    <div class="poems" id="section2">
    <h2>早发白帝城</h2>
    <h3>李白（唐）</h3>
    <p>朝辞白帝彩云间，千里江陵一日还。<br>两岸猿声啼不住，轻舟已过万重山。</p>
    </div>
    </body>
    </html>
    '''
# Parse the HTML document
bs = BeautifulSoup(html, 'html.parser')
# Use find() to get the id attribute value of the first node that satisfies the conditions
poem1_id = bs.find('div', class_='poems')['id']

# Print the result
print(poem1_id)
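A few related ways to read attributes are sketched below, using a shortened stand-in for the sample page. The attribute name title is an assumption chosen because it is absent from the element.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical poem section
html = '<div class="poems" id="section1"><h2>Poem 1</h2></div>'
bs = BeautifulSoup(html, 'html.parser')

poem = bs.find('div', class_='poems')

poem_id = poem['id']            # a plain string
poem_classes = poem['class']    # class can hold several values, so this is a list
missing = poem.get('title')     # .get() returns None for an absent attribute
all_attrs = poem.attrs          # every attribute as a dictionary

print(poem_id)
print(poem_classes)
print(missing)
print(all_attrs)
```

Square-bracket access raises a KeyError when the attribute does not exist, so .get() is the safer choice when an attribute may be missing on some elements.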