Day one of learning web scraping: I ran into a problem and need help. Thanks, everyone.

Latest recommended article published on 2022-04-27 22:58:29

爱吃冰ql

Tags: web scraping, python, html

Article link: https://blog.csdn.net/weixin_48679181/article/details/122533410

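On my first day of learning web scraping, I tried to scrape images from a web page and ran into a problem. From what I found while searching, the cause seems to be that the links in the HTML returned by the first request do not start with www, so the second request, which goes to those addresses without www, fails. The details are below. If anyone understands this, could you please take a look? Thanks!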

import requests

from bs4 import BeautifulSoup
resp = requests.get("http://www.juimg.com/sucai/miao-18710441.html")
resp.encoding = 'utf-8'
# print(resp.text)

main_page=BeautifulSoup(resp.text,"html.parser")
alst = main_page.find("div",attrs={"class":"pageLayout"}).find_all("a",attrs={"class":"worksListPic"})
n = 1
# print(alst)
for a in alst:
    href = a.get("href")
    print(href)
    # resp1 = requests.get(href)
    # resp1.encoding='utf-8'
    # print(resp1.text)
    # child_page = BeautifulSoup(resp1.text,"html.parser")
    # src = child_page.find("div",attrs={"class":"wra"}).find("img").get("src")
    # print(src)

The code above directly collects the page address that each image hyperlink points to:

/tupian/202109/shenghuoqita_3005738.html
/tupian/202109/shenghuoqita_3002965.html
/tupian/202109/shenghuoqita_3001484.html
/tupian/202109/shenghuoqita_2999886.html
/tupian/202109/shenghuoqita_2999881.html
/tupian/202109/shenghuoqita_2998247.html
/tupian/202109/qitafengguang_2994092.html
/tupian/202109/qitafengguang_2994097.html
/tupian/202109/qitafengguang_2992458.html
/tupian/202108/jiaotonggongju_2989851.html
/tupian/202108/jiaotonggongju_2988147.html
/tupian/202108/shenghuoqita_2984981.html
/tupian/202108/shenghuoqita_2982736.html
/tupian/202108/shenghuoqita_2982619.html
/tupian/202108/shenghuoqita_2982620.html
/tupian/202108/shenghuoqita_2982621.html
/tupian/202106/shenghuorenwu_2960130.html
/tupian/202106/shenghuorenwu_2959743.html
/tupian/202106/shenghuorenwu_2958855.html
/tupian/202106/shenghuorenwu_2956762.html
/tupian/202106/shenghuorenwu_2953540.html
/tupian/202106/shenghuorenwu_2952361.html
/tupian/202106/shenghuorenwu_2939377.html
/tupian/202106/shenghuorenwu_2939040.html
/tupian/202105/shenghuorenwu_2937825.html
/tupian/202105/shenghuorenwu_2935762.html
/tupian/202105/shenghuorenwu_2933613.html
/tupian/202105/shenghuorenwu_2932240.html
/tupian/202105/shenghuorenwu_2931822.html
/tupian/202105/xiandaishangwu_2929953.html
/tupian/202105/shenghuorenwu_2917590.html
/tupian/202105/shenghuorenwu_2911325.html
/tupian/202105/shenghuorenwu_2910381.html
/tupian/202105/shenghuorenwu_2909078.html
/tupian/202105/shenghuorenwu_2908800.html
/tupian/202105/shenghuorenwu_2908762.html
/tupian/202105/shenghuorenwu_2908463.html
/tupian/202105/shenghuorenwu_2908464.html
/tupian/202105/shenghuorenwu_2907920.html
/tupian/202105/shenghuorenwu_2907921.html

Process finished with exit code 0
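
A quick way to see what kind of links these are is to parse one of them with the standard library. The sketch below is only an illustration: the helper name `is_relative` is hypothetical and not part of the original script. It shows that each printed href has neither a scheme (`http`) nor a host, so it is a site-relative path rather than a full address with a missing `www`.

```python
from urllib.parse import urlparse

# One of the hrefs printed above, copied verbatim.
href = "/tupian/202109/shenghuoqita_3005738.html"

parts = urlparse(href)
print(repr(parts.scheme))  # empty string: no "http" or "https"
print(repr(parts.netloc))  # empty string: no host such as www.juimg.com
print(parts.path)          # the whole href is only a path


def is_relative(url: str) -> bool:
    """Hypothetical helper: True when the URL lacks a scheme or a host."""
    p = urlparse(url)
    return not (p.scheme and p.netloc)
```

A link for which `is_relative` returns True has to be combined with the address of the page it came from before it can be requested.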

The output of that run is shown above.
Requesting these pages in a second step then fails:

import requests

from bs4 import BeautifulSoup
resp = requests.get("http://www.juimg.com/sucai/miao-18710441.html")
resp.encoding = 'utf-8'
# print(resp.text)

main_page=BeautifulSoup(resp.text,"html.parser")
alst = main_page.find("div",attrs={"class":"pageLayout"}).find_all("a",attrs={"class":"worksListPic"})
n = 1
# print(alst)
for a in alst:
    href = a.get("href")
    print(href)
    resp1 = requests.get(href)
    resp1.encoding='utf-8'
    print(resp1.text)
    # child_page = BeautifulSoup(resp1.text,"html.parser")
    # src = child_page.find("div",attrs={"class":"wra"}).find("img").get("src")
    # print(src)

The error output is:

/tupian/202109/shenghuoqita_3005738.html
Traceback (most recent call last):
  File "C:/Users/24493/PycharmProjects/untitled/venv/爬虫/pic.py", line 15, in <module>
    resp1 = requests.get(href)
  File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\sessions.py", line 515, in request
    prep = self.prepare_request(req)
  File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\sessions.py", line 453, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\models.py", line 318, in prepare
    self.prepare_url(url, params)
  File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\models.py", line 392, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/tupian/202109/shenghuoqita_3005738.html': No scheme supplied. Perhaps you meant http:///tupian/202109/shenghuoqita_3005738.html?

Process finished with exit code 1
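
The `MissingSchema` message says that the string passed to `requests.get` has no scheme: the href is a path relative to the site, not a complete URL. One common way to handle this, sketched here under the assumption that the links should resolve against the list page that was requested first, is `urllib.parse.urljoin`. The names `BASE_URL` and `to_absolute` are illustrative and do not appear in the original script.

```python
from urllib.parse import urljoin

# The list page requested in the post; relative hrefs are resolved against it.
BASE_URL = "http://www.juimg.com/sucai/miao-18710441.html"


def to_absolute(href: str, base: str = BASE_URL) -> str:
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


full_url = to_absolute("/tupian/202109/shenghuoqita_3005738.html")
print(full_url)
```

In the loop, the request would then probably become `resp1 = requests.get(urljoin(resp.url, href))`, where `resp.url` is the address of the first response. An href that is already absolute passes through `urljoin` unchanged, so the same call is safe for both kinds of link. Whether the detail pages then return the expected HTML has not been verified here.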
