Why findAll fails when using BeautifulSoup in a Python web scraper

Latest recommended article published on 2023-01-06 15:17:21

perfecttshoot

Category: Python web scraping. Tags: BeautifulSoup parser, Linux, Word, reading with Python, xml, lxml

Article link: https://blog.csdn.net/wanght89/article/details/78184883

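I have recently been studying Python web scraping and using BeautifulSoup to parse and search documents. When reading a Word document with Python on Linux, the document first has to be turned into XML (here by unzipping the .docx file and reading word/document.xml). I then used findAll to locate the text in that XML, but findAll returned no results. After some investigation, the cause turned out to be that the BeautifulSoup parser had not been set to xml, so every later lookup failed. The corrected code snippet is as follows: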

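from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

from bs4 import BeautifulSoup

# Download the .docx file and wrap the raw bytes in a file-like object.
wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)

# A .docx file is a zip archive; the body text is stored in word/document.xml.
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')

# Pass "xml" explicitly so BeautifulSoup uses its XML parser (requires lxml).
wordObj = BeautifulSoup(xml_content.decode('utf-8'), "xml")

# Each <w:t> element holds one run of text from the document.
textString = wordObj.findAll("w:t")
for textElem in textString:
    print(textElem.text)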

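I wrote this up in the hope that, when findAll fails for you, the first thing you check is whether BeautifulSoup is parsing the document with the wrong structure because no parser was specified.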
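
When findAll comes back empty, it can help to confirm that the XML really contains w:t elements before blaming the parser. The sketch below is one possible cross-check using only the Python standard library instead of BeautifulSoup. It is not part of the original fix: the helper names build_sample_docx and extract_text_runs are hypothetical, and the sample archive is a minimal stand-in for word/document.xml, not a valid Word file.

```python
from io import BytesIO
from xml.sax.saxutils import escape
from zipfile import ZipFile
import xml.etree.ElementTree as ET

# Namespace URI that the "w:" prefix is bound to in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx(paragraphs):
    """Build an in-memory zip holding only word/document.xml.

    Hypothetical test fixture: enough structure to exercise the
    extraction below, but not a complete .docx file.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


def extract_text_runs(docx_file):
    """Return the text of every w:t element found in word/document.xml."""
    with ZipFile(docx_file) as archive:
        xml_content = archive.read("word/document.xml")
    root = ET.fromstring(xml_content)
    # ElementTree addresses namespaced tags as "{namespace-uri}localname".
    return [node.text or "" for node in root.iter(f"{{{W_NS}}}t")]


if __name__ == "__main__":
    sample = build_sample_docx(["A Word Document", "on a Website"])
    for run in extract_text_runs(sample):
        print(run)
```

If this check finds text runs but findAll("w:t") does not, the document is fine and the parser choice passed to BeautifulSoup is the more likely culprit. To try it on a real file, pass the BytesIO object built from the downloaded bytes to extract_text_runs.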