Fixing None results when extracting tag content with BeautifulSoup's .string attribute

hq_686842

Category: Python

Article link: https://blog.csdn.net/qq_41417974/article/details/103084351

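Two common reasons tag.string returns None, and how to work around them:

I. A tag containing both a bs4.element.NavigableString and a bs4.element.Comment makes .string return None.
II. A br tag inside the tag makes .string return None.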

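I. A tag containing both a bs4.element.NavigableString and a bs4.element.Comment makes .string return None.

1. Using .string on a tag that has a single child node works as expected. On a tag that contains both a comment and text, however, .string returns None. Such a tag has two child nodes, a comment and a text node, and .string is only defined when a tag has exactly one child, so BeautifulSoup cannot tell which node is meant. In the example that follows, the comment is available as contents[0] and the text as contents[1].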

from bs4 import BeautifulSoup

# .string can return a bs4.element.NavigableString or a bs4.element.Comment
soup = BeautifulSoup("""<b><!--This is a comment--></b>
                         <p>text content</p>""", 'html.parser')
print(type(soup.b.string))  # <class 'bs4.element.Comment'>
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

newsoup = BeautifulSoup("""<a>text content of a</a>
                      <b><!--This is a comment-->text content</b>""", "html.parser")
print(newsoup.a.string)  # text content of a
print(newsoup.b.string)  # None, because b has two child nodes
print(newsoup.b.contents[0])  # This is a comment
print(newsoup.b.contents[1])  # text content
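Indexing into contents only works when the position of each node is known in advance. As a rough sketch of a more general approach, the helper below walks the nodes under a tag and keeps only the text nodes that are not comments. The name visible_text is hypothetical and introduced here for illustration; it is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def visible_text(tag):
    """Join the text nodes under `tag`, skipping comment nodes (hypothetical helper)."""
    parts = [
        str(node)
        for node in tag.descendants
        # Comment is a subclass of NavigableString, so it must be excluded explicitly
        if isinstance(node, NavigableString) and not isinstance(node, Comment)
    ]
    return "".join(parts).strip()


soup = BeautifulSoup("<b><!--This is a comment-->text content</b>", "html.parser")
print(soup.b.string)         # None: <b> has two child nodes
print(visible_text(soup.b))  # text content
```

Because the helper filters by node type rather than by position, it should keep working when the comment comes after the text or when there are several comments.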

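2. When a document has many tags that mix bs4.element.NavigableString and bs4.element.Comment nodes, extracting the text tag by tag becomes tedious. In that case, pull the comments out of the tree first, then handle the comments and the text separately.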

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("""<a><!--This is the first comment-->1111</a>
                        <a><!--This is the second comment-->222</a>
                        <a class="attrs"></a>
                        <c></c>""", 'html.parser')
# find_all(string=...) replaces the deprecated findAll(text=...)
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Print only the comments:
for comment in comments:
    print(comment)  # This is the first comment   This is the second comment
# Remove the comments from the tree:
for comment in comments:
    comment.extract()
print(soup)  # <a>1111</a>  <a>222</a>  <a class="attrs"></a> <c></c> (whitespace between tags omitted)
# Extract the text content of each tag
tag_a = soup.find_all("a")
print("Text content of the tags:")
for i in tag_a:
    # The empty <a class="attrs"></a> still yields None, so skip it
    if i.string is not None:
        print(i.string)  # 1111  222
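Calling extract() modifies the parsed tree, which may be unwanted if the comments are needed later. The sketch below shows one possible non-destructive variant: it sorts the direct children of each tag into comments and text without removing anything. The function name split_nodes is hypothetical and not a BeautifulSoup API.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def split_nodes(tag):
    """Return (comments, texts) for the direct children of `tag` (hypothetical helper)."""
    comments, texts = [], []
    for node in tag.children:
        if isinstance(node, Comment):
            # Check Comment first: it is a subclass of NavigableString
            comments.append(str(node))
        elif isinstance(node, NavigableString):
            texts.append(str(node))
    return comments, texts


soup = BeautifulSoup(
    '<a><!--This is the first comment-->1111</a>'
    '<a><!--This is the second comment-->222</a>'
    '<a class="attrs"></a>',
    "html.parser",
)
results = [split_nodes(a) for a in soup.find_all("a")]
for comments, texts in results:
    print(comments, texts)
```

The tree is left untouched, so the comments remain in the output of str(soup) afterwards.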

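II. A br tag inside the tag makes .string return None.

1. <br> inserts a simple line break.
2. The <br> tag is an empty tag (it has no end tag, so <br></br> is wrong). In XHTML, the end tag is folded into the start tag, that is, <br />.
3. Note that <br> simply starts a new line, whereas a browser that encounters a <p> tag usually inserts some vertical spacing between adjacent paragraphs.
* A <br> inside a tag counts as a second child node next to the text, which is why .string returns None. In this case, the replace method can be used to strip the <br> tags from the markup before parsing.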

from bs4 import BeautifulSoup

html = '''<html>
               <tbody>
                   <a>text content 1<br/></a>
                   <a>text content 2<br></a>
                </tbody>
        </html>'''
soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('a'):
    print(i.string)  # None  None
# The page is just a string, so the br tags can be stripped with replace
new_html = html.replace('<br>', '').replace('<br/>', '')
print(new_html)
soup = BeautifulSoup(new_html, 'lxml')
for i in soup.find_all('a'):
    print(i.string)  # text content 1  text content 2
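String replacement only catches the exact spellings it is given, so a variant such as <br /> would slip through. As a hedged alternative, the sketch below shows two other options using calls that exist in BeautifulSoup: get_text(strip=True), which reads the text without touching the tree, and decompose(), which removes every parsed br tag regardless of how it was written. The sample markup is made up for illustration, and html.parser is used here so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup with three spellings of the br tag
html = (
    "<div>"
    "<a>text content 1<br/></a>"
    "<a>text content 2<br></a>"
    "<a>text content 3<br /></a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Option 1: read the text directly; br tags contribute no text
texts_via_get_text = [a.get_text(strip=True) for a in soup.find_all("a")]
print(texts_via_get_text)

# Option 2: remove the parsed br tags, after which .string works again
for br in soup.find_all("br"):
    br.decompose()
texts_via_string = [str(a.string) for a in soup.find_all("a")]
print(texts_via_string)
```

Option 1 is usually enough when only the text is needed; option 2 may be preferable when later code relies on .string.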