How to output text content with BeautifulSoup in Python: extracting text that is not inside a given tag...

Latest recommended article published 2023-09-05 18:52:00

炒锅电解氯化钠

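Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_32023091/article/details/111967467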

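I have the following variable, header, whose content is:

Andrew Anglin

Daily Stormer

February 11, 2017

I only want to extract the date, February 11, 2017, from this variable.

How can I do this with BeautifulSoup in Python?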

Solution:

If you know the date is always the last text node in the header variable, you can access the .contents property and take the last element of the returned list:

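from bs4 import BeautifulSoup

# html is assumed to hold the page markup as a string
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# The last child of the <p> element is the date text node
header.contents[-1].strip()

> February 11, 2017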

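Alternatively, you can split the element's text on newlines and take the last element of the resulting list:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# The date is the last line of the combined text
header.text.split('\n')[-1]

> February 11, 2017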
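
The two approaches above can be tried end to end with the sketch below. The markup in it is an assumption: the original HTML is not shown in this post, so the string was chosen only to be consistent with the outputs quoted here (two bare text nodes and one name wrapped in a tag, separated by line breaks).

```python
from bs4 import BeautifulSoup

# Assumed markup: a guess that is consistent with the outputs quoted in this post.
html = "<p>Andrew Anglin<br/>\n<strong>Daily Stormer</strong><br/>\nFebruary 11, 2017</p>"

soup = BeautifulSoup(html, "html.parser")
header = soup.find("p")

# Option 1: the last child of <p> is the bare date text node.
last_node = header.contents[-1].strip()

# Option 2: the date is the last line of the element's combined text.
last_line = header.get_text().split("\n")[-1].strip()

print(last_node)
print(last_line)
```

Both options depend on the date coming last. If the markup places anything after the date, neither will return it.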

If you do not know where the date text node is, another option is to parse out every string that matches a date pattern:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# Match "<word> <1-2 digit day>, <4 digit year>" and take the first hit
re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]

> February 11, 2017
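
Note that indexing the findall result with [0] raises an IndexError when nothing matches. A minimal sketch of a more defensive variant follows. The helper name first_date and the narrower [A-Za-z]+ month pattern are illustrative choices, not part of the original answer, and the sample string is a stand-in for header.text.

```python
import re

# Month name, day, four-digit year, e.g. "February 11, 2017".
DATE_PATTERN = re.compile(r"[A-Za-z]+ \d{1,2}, \d{4}")


def first_date(text):
    """Return the first date-like substring in text, or None if there is none."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


# Stand-in for header.text; the exact whitespace is an assumption.
header_text = "Andrew Anglin\nDaily Stormer\nFebruary 11, 2017"

print(first_date(header_text))        # February 11, 2017
print(first_date("no date in here"))  # None
```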

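However, as your title suggests, if you only want to retrieve the text nodes that are not wrapped in an element tag, you can filter out the elements like this: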

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# Text nodes have no tag name; skip whitespace-only nodes
text_nodes = [e.strip() for e in header if not e.name and e.strip()]

Keep in mind that, because the first text node is not wrapped either, this returns the following:

> ['Andrew Anglin', 'February 11, 2017']
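
The same filter can also be written with an explicit type check, which some readers may find easier to follow than testing for a missing tag name. This is a sketch only: the helper name bare_text_nodes is hypothetical, and the markup is again an assumed reconstruction rather than the original HTML.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed markup (not shown in the post): only "Daily Stormer" sits inside a tag.
html = "<p>Andrew Anglin<br/>\n<strong>Daily Stormer</strong><br/>\nFebruary 11, 2017</p>"

header = BeautifulSoup(html, "html.parser").find("p")


def bare_text_nodes(tag):
    """Return stripped, non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


print(bare_text_nodes(header))
```

Only direct children are inspected, so text nested inside a child element (here the wrapped name) is left out.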

Of course, you can also combine the last two options and parse the date string out of the returned text nodes:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

# Check each bare text node and print the one that is exactly a date
for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017
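
If you need an actual date object rather than a string, one possible variation swaps the regular expression for datetime.strptime, which also rejects impossible dates such as "February 31, 2017". This is a sketch and not part of the original answer: the helper name parse_date_node is hypothetical, the list is a stand-in for the filtered text nodes, and the %B directive assumes English month names under the current locale.

```python
from datetime import datetime


def parse_date_node(text, fmt="%B %d, %Y"):
    """Return a date if text is exactly a 'Month D, YYYY' string, else None."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None


# Stand-in for the text nodes produced by the filtering step.
text_nodes = ["Andrew Anglin", "February 11, 2017"]

dates = [d for d in map(parse_date_node, text_nodes) if d is not None]
print(dates[0].isoformat())  # 2017-02-11
```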

Tags: python-3-x, beautifulsoup, web-scraping, html, python

Source: https://codeday.me/bug/20191026/1935087.html
