Python scraping for beginners: why print(bs.h1) returns None after parsing with bs4, and how to fix it

Latest recommended article published on 2022-08-26 02:19:09

DelisPhi


Category: Memos. Tags: python

Article link: https://blog.csdn.net/DelisPhi/article/details/107608416


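When learning Python web scraping, parsing a page with BeautifulSoup can leave print(bs.h1) returning None. The cause is that html.read() consumes the response: once it has been called, any further read returns empty data, so the later parse has nothing to work with. The fixes are to read the data once and store it in a variable, or to use the requests library. The article gives example code using requests, along with caveats and references.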

Summary generated by CSDN using AI
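
The cause described in the summary can be reproduced without touching the network. The following is a minimal sketch, assuming that an `io.BytesIO` object is a fair stand-in for the response returned by `urlopen` (both are file-like byte streams). The names `page`, `stream`, `first`, and `second` are illustrative and do not come from the original post.

```python
import io

# Stand-in for the object returned by urlopen(): a file-like byte stream.
# (Assumption: BytesIO is used here only so the example runs offline.)
page = b"<html><body><h1>An Interesting Title</h1></body></html>"
stream = io.BytesIO(page)

first = stream.read()    # reads everything and leaves the cursor at the end
second = stream.read()   # nothing is left to read, so this is empty

print(first)
print(second)

# Fix: read once, keep the bytes in a variable, and reuse that variable,
# e.g. pass `first` (not `stream`) to BeautifulSoup(first, 'html.parser').
```

Passing the already-read stream to BeautifulSoup gives it an empty document, which is why `bs.h1` comes back as None.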

I am using Python 3.7. The code was tested successfully both on Anaconda (Python 3.7) and with a separately installed bs4 4.9.1.

I had just started learning scraping, and my very first BeautifulSoup example failed: print(bs.h1) returned None even though the page clearly contains an h1 tag.

Take the following code as an example.

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
print(html.read())

If the page loads correctly, the output looks like this:

“b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum……”

But if we simply append the bs4 parsing code, it goes wrong. For example:

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
print(html.read())
# The lines below are newly added
bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)

The output is:

“b'<html>\n<head>\n<titl