37 - Extracting child tags from a tag obtained with bs4 in Python

ystraw_ah

Article link: https://blog.csdn.net/qq_39451578/article/details/97861845

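If you only need to extract something like an attribute value from a single tag, this article is enough:

23 - Scraping all data inside a tags with BeautifulSoup in Python

If the tags are nested, the approach below works. It is not very concise, but it solves the problem: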

You cannot simply findAll every tr tag on the page, because that also returns many unrelated rows. Instead, locate the table first: its id or class identifies it uniquely.
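To illustrate why the search should be scoped to the table, here is a minimal sketch. The two-table page and the id `scores` are made up for the example, and it uses the built-in `html.parser` instead of lxml so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the one with id="scores" is wanted.
html = """
<table id="nav"><tr><td>Home</td></tr><tr><td>About</td></tr></table>
<table id="scores" class="GridViewStyle">
  <tr><td>2016</td><td>67</td></tr>
  <tr><td>2017</td><td>65</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching the whole document also returns the navigation rows.
all_rows = soup.find_all("tr")

# Searching inside the table selected by id returns only its rows.
table = soup.find("table", id="scores")
score_rows = table.find_all("tr")

# The class attribute works too; class_ avoids clashing with the Python keyword.
same_table = soup.find("table", class_="GridViewStyle")

print(len(all_rows), len(score_rows), same_table["id"])
```

The page-wide search finds the rows of both tables, while the scoped search finds only the rows of the score table.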

The key question is how to get at the data we want: how do we extract the row of data that each tr contains?

My approach:

findAll --- table => returns a list containing exactly one table, which is the table I want

Call findAll again on that table:

findAll --- tr ==> returns a list containing every <tr>....</tr>; we now have each row of data we need
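The two steps above can be sketched as follows. This is only an illustration: the table content is shortened and the column names are placeholders, `find` is used in place of `findAll(...)[0]`, and `html.parser` stands in for lxml.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the score table; header names are placeholders.
html = """
<table id="ctl00_MainContentPlaceHolder_GridScore">
  <tr><th>year</th><th>score</th></tr>
  <tr><td>2016</td><td>67</td></tr>
  <tr><td>2017</td><td>65</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find returns the first match or None, so no [0] indexing is needed.
table = soup.find("table", id="ctl00_MainContentPlaceHolder_GridScore")
if table is None:
    raise ValueError("table not found")

# Step 2: the result is a Tag, so it can be searched again directly.
rows = table.find_all("tr")

# The same two steps written as a single CSS selector.
rows_css = soup.select("table#ctl00_MainContentPlaceHolder_GridScore tr")

print(len(rows), len(rows_css))
```

Checking for `None` gives a clear error when the id is missing from the page, instead of an `IndexError` on an empty list.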

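How do we extract each value? There are two methods:

The first is to iterate over each <tr>, use findAll to find the td tags, and read each td's string.

The second is to iterate over each <tr>, use findAll to find each td, and use replace to strip out <td> and </td>.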
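Both methods work on plain cells like the ones in this table. The sketch below shows where they start to differ; the row is hypothetical (one cell with an attribute, one with a nested tag), and `get_text` is offered as a third option that the original approach does not use.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the second cell has an attribute, the third a nested tag.
html = (
    "<table><tr>"
    "<td>2016</td>"
    '<td class="num">67</td>'
    "<td><b>5.5</b> credits</td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
cells = soup.find("tr").find_all("td")

# replace strips only the exact text "<td>", so attributes and inner tags remain.
by_replace = [str(td).replace("<td>", "").replace("</td>", "") for td in cells]

# .string is None when a cell has more than one child node.
by_string = [td.string for td in cells]

# get_text collects all text inside the cell, whatever the nesting.
by_get_text = [td.get_text(" ", strip=True) for td in cells]

print(by_replace)
print(by_string)
print(by_get_text)
```

For the simple table in this article all three give the same text, so the choice only matters once cells carry attributes or inner markup.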

The full script is as follows:

# -*- coding:utf-8 -*-
# Python 3 (urllib.request does not exist in Python 2.7)
# XiaoDeng
# http://tieba.baidu.com/p/2460150866
# Tag operations


from bs4 import BeautifulSoup
import urllib.request  # only needed for the commented-out download below

# For a URL, the page can be fetched like this
# html_doc = "http://tieba.baidu.com/p/2460150866"
# req = urllib.request.Request(html_doc)
# webpage = urllib.request.urlopen(req)
# html = webpage.read()



html = """
<table class="GridViewStyle" cellspacing="0" rules="all" border="1" id="ctl00_MainContentPlaceHolder_GridScore">
		<caption>
			(共49条/1页)
		</caption>
		<tr class="HeaderStyle">
			<th scope="col">学年</th><th scope="col">学期</th><th scope="col">课程名称</th><th scope="col">课程学分</th><th scope="col">考试类型</th><th scope="col">考试成绩</th><th scope="col">所获学分</th><th scope="col">考试成绩3</th>
		</tr>
		<tr>
			<td>2016</td><td>3</td><td>高等数学Ⅰ(一)</td><td>5.5</td><td>正常</td><td>67</td><td>5.5</td><td>107356</td>
		</tr>
		<tr>
			<td>2017</td><td>1</td><td>高等数学Ⅰ(二)</td><td>5.5</td><td>正常</td><td>65</td><td>5.5</td><td>111481</td>
		</tr>
	</table>
</div>
"""
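bs = BeautifulSoup(html, 'lxml')
# findAll is the older name for find_all; both return a list of matches
score = bs.findAll('table', attrs={"id": 'ctl00_MainContentPlaceHolder_GridScore'})
# Take the only match. It is still a bs4 Tag, not a string,
# so there is no need to run BeautifulSoup on it again.
bs2 = score[0]
score = bs2.findAll('tr')  # find every row
for i in score:
    rt = i.findAll('td')  # find every cell in the row
    # print(rt)
    if len(rt) == 0:  # the header row has only th cells, so skip it
        continue
    ## One method: strip the tags with replace
    for j in rt:
        # print('j: ', j)
        sj = str(j)
        sj = sj.replace("<td>", '')
        sj = sj.replace('</td>', '')
        print(sj)
    ## The other method: read the text through .string
    for j in rt:
        print(j.string)
    print('=============================')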
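The script above prints each cell on its own line. If the rows are needed as data, one possible variation is to pair every cell with its column header. This is a sketch under assumptions: the shortened table, the id `scores`, and the English header names are placeholders, and `html.parser` replaces lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the score table.
html = """
<table id="scores">
  <tr><th>year</th><th>term</th><th>score</th></tr>
  <tr><td>2016</td><td>3</td><td>67</td></tr>
  <tr><td>2017</td><td>1</td><td>65</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="scores")

# Column names come from the th cells of the header row.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if not cells:  # the header row has no td cells
        continue
    # Pair each cell with its column name.
    records.append(dict(zip(headers, cells)))

print(records)
```

Each entry in `records` is a dictionary such as `{'year': '2016', 'term': '3', 'score': '67'}`, which is easier to filter or write to a file than printed lines.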