Python: extracting the body text and links from an HTML file

Latest recommended article published on 2024-07-29 02:32:41

qq_33643943

Tags: python

Article link: https://blog.csdn.net/qq_33643943/article/details/82999763

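from bs4 import BeautifulSoup


def sechBodyURL(path):
    # My HTML file is GBK-encoded, so the encoding is passed explicitly;
    # errors='ignore' silently drops any bytes that cannot be decoded.
    with open(path, encoding='gbk', errors='ignore') as fp:
        text = BeautifulSoup(fp, 'html.parser')
    # Print the target of every <a> tag that actually has an href attribute
    # (indexing u['href'] on a tag without one would raise KeyError).
    for u in text.find_all('a', href=True):
        print(u['href'])
    # get_text() returns all text in the document with the tags stripped.
    content = text.get_text().strip()
    print(content)
    return content


sechBodyURL('20test.html')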

The output is shown below. The Chinese text is garbled because of an encoding problem in the original HTML file.
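
If the garbling comes from hard-coding the wrong encoding, one possible workaround is to read the file as raw bytes and let BeautifulSoup choose the encoding itself. The sketch below is only an illustration under that assumption: the helper names extract_links_and_text and extract_from_file are hypothetical and not part of the original code. It relies on BeautifulSoup's from_encoding argument and its original_encoding attribute.

```python
from bs4 import BeautifulSoup


def extract_links_and_text(html_bytes, from_encoding=None):
    """Return (links, text, detected_encoding) for raw HTML bytes.

    Hypothetical helper for illustration, not part of the original post.
    """
    # Passing bytes (not str) lets BeautifulSoup pick the encoding itself,
    # e.g. from a <meta charset=...> declaration; from_encoding overrides it.
    soup = BeautifulSoup(html_bytes, 'html.parser', from_encoding=from_encoding)
    # href=True skips <a> tags that have no href attribute.
    links = [a['href'] for a in soup.find_all('a', href=True)]
    text = soup.get_text().strip()
    # original_encoding reports which encoding was actually used to decode.
    return links, text, soup.original_encoding


def extract_from_file(path, from_encoding=None):
    # Read in binary mode so that no decoding happens before parsing.
    with open(path, 'rb') as fp:
        return extract_links_and_text(fp.read(), from_encoding)
```

Calling extract_from_file('20test.html') returns the links and text instead of printing them, and the third element of the result shows which encoding was chosen. If that guess is wrong, passing from_encoding explicitly (for example 'gbk' or 'utf-8') forces a specific one.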