Python web scraping: how to fix NoneType errors when parsing with bs4

Latest recommended article published on 2024-06-04 15:04:27

超级小白小温

Tags: python, web scraping, programming languages

Article link: https://blog.csdn.net/weixin_50746407/article/details/126535370

I. Scraping target: images from an image gallery website

II. Source code

The source code is as follows:

# 1. Fetch the main page source and extract the child-page links (href)
# 2. Fetch each child page via its href and find the image download URL (img -> src)
# 3. Download the image
import requests
from bs4 import BeautifulSoup
import time

url = "https://www.umei.cc/bizhitupian/fengjingbizhi/"
headers = {
    "User-Agent": "********************"
}

resp = requests.get(url=url, headers=headers)
resp.encoding = resp.apparent_encoding  # handle the encoding

# Hand the page source to bs and look up data on the bs object
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="pic-box").find_all("a")
# print(alist)
for a in alist:
    href = ("https://www.umei.cc"+a.get('href'))  # get() returns the attribute value directly
    # Fetch the child page source
    child_page_resp = requests.get(href)
    child_page_resp.encoding = child_page_resp.apparent_encoding
    child_page_text = child_page_resp.text
    # print(href)
    # Extract the image download URL from the child page
    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("section", class_="img-content")
    img = p.find("img")
    src = img.get("src")
    # Download the image
    img_resp = requests.get(src)
    # img_resp.content  # the raw bytes of the image
    img_name = src.split("/")[-1]  # everything after the last / in the URL
    with open(img_name, mode="wb") as f:
        # Write the image bytes into a new file to save the picture
        f.write(img_resp.content)

    print("{0} downloaded successfully!".format(img_name))
    time.sleep(1)
print("All downloads finished!")

resp.close()

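The output is as follows:

An error occurs here. The traceback points to line 30, the statement "img = p.find("img")", and the message is 'NoneType' object has no attribute 'find'. In other words, p is None (find() returns None when no matching tag exists), and None has no find method, so the statement cannot run.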
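The same error can be reproduced without touching the network. The snippet below is only a minimal sketch: the HTML string is made up for illustration and is not the real page. It shows that find() returns None when nothing matches, and that calling find on that None raises the AttributeError seen above.

```python
from bs4 import BeautifulSoup

# Made-up page (an assumption) that has no <section class="img-content">.
html = "<html><body><div class='other'><img src='a.jpg'></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no tag matches the search.
p = soup.find("section", class_="img-content")
print(p is None)

error_message = None
try:
    p.find("img")  # same call as line 30 of the script
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```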

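The debugging process is as follows:

1. Uncomment the "print(alist)" statement in the source code so that it runs;

2. Comment out every statement inside the for loop;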

# 1. Fetch the main page source and extract the child-page links (href)
# 2. Fetch each child page via its href and find the image download URL (img -> src)
# 3. Download the image
import requests
from bs4 import BeautifulSoup
import time

url = "https://www.umei.cc/bizhitupian/fengjingbizhi/"
headers = {
    "User-Agent": "*********"
}

resp = requests.get(url=url, headers=headers)
resp.encoding = resp.apparent_encoding  # handle the encoding

# Hand the page source to bs and look up data on the bs object
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="pic-box").find_all("a")
print(alist)
# for a in alist:
#     href = ("https://www.umei.cc"+a.get('href'))  # get() returns the attribute value directly
#     # Fetch the child page source
#     child_page_resp = requests.get(href)
#     child_page_resp.encoding = child_page_resp.apparent_encoding
#     child_page_text = child_page_resp.text
#     # print(href)
#     # Extract the image download URL from the child page
#     child_page = BeautifulSoup(child_page_text, "html.parser")
#     p = child_page.find("section", class_="img-content")
#     img = p.find("img")
#     src = img.get("src")
#     # Download the image
#     img_resp = requests.get(src)
#     # img_resp.content  # the raw bytes of the image
#     img_name = src.split("/")[-1]  # everything after the last / in the URL
#     with open(img_name, mode="wb") as f:
#         # Write the image bytes into a new file to save the picture
#         f.write(img_resp.content)
#
#     print("{0} downloaded successfully!".format(img_name))
#     time.sleep(1)
# print("All downloads finished!")

resp.close()

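The result is as follows:

Here we can see that the problem appears as soon as the page source is handed to bs: the first two a tags it returns contain no data.

3. Next, comment out the print(alist) above, run the code below, and observe the result.

The only change here is that the print(href) statement is executed: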

# 1. Fetch the main page source and extract the child-page links (href)
# 2. Fetch each child page via its href and find the image download URL (img -> src)
# 3. Download the image
import requests
from bs4 import BeautifulSoup
import time

url = "https://www.umei.cc/bizhitupian/fengjingbizhi/"
headers = {
    "User-Agent": "*************"
}

resp = requests.get(url=url, headers=headers)
resp.encoding = resp.apparent_encoding  # handle the encoding

# Hand the page source to bs and look up data on the bs object
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="pic-box").find_all("a")
# print(alist)
for a in alist:
    href = ("https://www.umei.cc"+a.get('href'))  # get() returns the attribute value directly
    # Fetch the child page source
    child_page_resp = requests.get(href)
    child_page_resp.encoding = child_page_resp.apparent_encoding
    child_page_text = child_page_resp.text
    print(href)
#     # Extract the image download URL from the child page
#     child_page = BeautifulSoup(child_page_text, "html.parser")
#     p = child_page.find("section", class_="img-content")
#     img = p.find("img")
#     src = img.get("src")
#     # Download the image
#     img_resp = requests.get(src)
#     # img_resp.content  # the raw bytes of the image
#     img_name = src.split("/")[-1]  # everything after the last / in the URL
#     with open(img_name, mode="wb") as f:
#         # Write the image bytes into a new file to save the picture
#         f.write(img_resp.content)
#
#     print("{0} downloaded successfully!".format(img_name))
#     time.sleep(1)
# print("All downloads finished!")

resp.close()

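The result is as follows:

Here we can see that because the first two a tags we obtained carry no data, the URLs built from them are wrong. That is what breaks the later parsing of the child page source: on the pages behind the 1st and 2nd URLs, p = child_page.find("section", class_="img-content") cannot find a matching section.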
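To spot which a tags are the odd ones, it can help to list every anchor with its index, its href, and whether it wraps an img. The snippet below is only a sketch on a made-up HTML fragment: the markup and the hrefs are assumptions, not the real listing page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the listing page (an assumption, not the real markup).
html = """
<div class="pic-box">
  <a href="/bizhitupian/"></a>
  <a></a>
  <a href="/bizhitupian/fengjingbizhi/1.htm"><img src="t1.jpg"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find("div", class_="pic-box").find_all("a")

# One row per anchor: (index, href or None, does it contain an <img>?)
report = [(i, a.get("href"), a.find("img") is not None)
          for i, a in enumerate(anchors)]
for row in report:
    print(row)
```

Anchors whose href is None, or that contain no img, are the ones worth a closer look before following their links.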

We modify the code inside the for loop and run it as follows (the changed part prints the type of the p we obtained):

for a in alist:
    href = ("https://www.umei.cc"+a.get('href'))  # get() returns the attribute value directly
    # Fetch the child page source
    child_page_resp = requests.get(href)
    child_page_resp.encoding = child_page_resp.apparent_encoding
    child_page_text = child_page_resp.text
    # print(href)
    # Extract the image download URL from the child page
    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("section", class_="img-content")
    print(type(p))
#     img = p.find("img")
#     src = img.get("src")
#     # Download the image
#     img_resp = requests.get(src)
#     # img_resp.content  # the raw bytes of the image
#     img_name = src.split("/")[-1]  # everything after the last / in the URL
#     with open(img_name, mode="wb") as f:
#         # Write the image bytes into a new file to save the picture
#         f.write(img_resp.content)
#
#     print("{0} downloaded successfully!".format(img_name))
#     time.sleep(1)
# print("All downloads finished!")

We can now clearly see why NoneType appears: the first two a tags obtained at the start carry no data, so when bs parses the corresponding child pages, p is None, and None has no find method.
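One common way to keep a single bad page from crashing the whole loop is to check for None before chaining another find. The helper below is a hedged sketch: extract_img_src is a hypothetical name, and both HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_img_src(page_html):
    """Return the src of the first img inside section.img-content, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    section = soup.find("section", class_="img-content")
    if section is None:  # the page has no such section
        return None
    img = section.find("img")
    if img is None:  # the section exists but holds no image
        return None
    return img.get("src")


# Made-up child pages (assumptions): one with the section, one without.
good = '<section class="img-content"><img src="https://example.com/a.jpg"></section>'
bad = "<div>no image here</div>"

print(extract_img_src(good))
print(extract_img_src(bad))
```

Inside the for loop, a None result could simply be skipped with continue, so the remaining links are still processed.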

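The solution is as follows:

The statement that needs to change is the one that builds alist. Append [2:] to the original expression so that alist starts at index 2 (the third element); the first two empty entries are skipped, and the NoneType problem goes away.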

# 1. Fetch the main page source and extract the child-page links (href)
# 2. Fetch each child page via its href and find the image download URL (img -> src)
# 3. Download the image
import requests
from bs4 import BeautifulSoup
import time

url = "https://www.umei.cc/bizhitupian/fengjingbizhi/"
headers = {
    "User-Agent": "************"
}

resp = requests.get(url=url, headers=headers)
resp.encoding = resp.apparent_encoding  # handle the encoding

# Hand the page source to bs and look up data on the bs object
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="pic-box").find_all("a")[2:]
# print(alist)
for a in alist:
    href = ("https://www.umei.cc"+a.get('href'))  # get() returns the attribute value directly
    # Fetch the child page source
    child_page_resp = requests.get(href)
    child_page_resp.encoding = child_page_resp.apparent_encoding
    child_page_text = child_page_resp.text
    # print(href)
    # Extract the image download URL from the child page
    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("section", class_="img-content")
    img = p.find("img")
    src = img.get("src")
    # Download the image
    img_resp = requests.get(src)
    # img_resp.content  # the raw bytes of the image
    img_name = src.split("/")[-1]  # everything after the last / in the URL
    with open(img_name, mode="wb") as f:
        # Write the image bytes into a new file to save the picture
        f.write(img_resp.content)

    print("{0} downloaded successfully!".format(img_name))
    time.sleep(1)
# print("All downloads finished!")

resp.close()
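The [2:] slice works as long as the page keeps exactly two empty links at the front. A possible alternative, shown here only as a sketch, is to filter the anchors by what they contain instead of by position. The HTML fragment is made up, and the rule "a real picture link wraps an img" is an assumption about the page, not something confirmed above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.umei.cc"

# Made-up stand-in for the listing page (an assumption, not the real markup).
html = """
<div class="pic-box">
  <a href="/bizhitupian/"></a>
  <a href="/"></a>
  <a href="/bizhitupian/fengjingbizhi/1.htm"><img src="t1.jpg"></a>
  <a href="/bizhitupian/fengjingbizhi/2.htm"><img src="t2.jpg"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="pic-box")

links = []
if box is not None:
    # href=True keeps only anchors that actually have an href attribute.
    for a in box.find_all("a", href=True):
        if a.find("img") is None:  # assumed rule: picture links wrap a thumbnail
            continue
        links.append(urljoin(BASE, a["href"]))  # build an absolute URL

for link in links:
    print(link)
```

Each URL in links could then go through the same child-page steps as in the script above.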