Day one of learning web scraping: I tried to scrape images from a web page but ran into a problem. From what I've looked up, it seems the hrefs in the HTML returned by the first request are missing the leading www, so the second round of requests to those addresses fails. Details below — could someone who understands this take a look? Thanks!
import requests
from bs4 import BeautifulSoup
resp = requests.get("http://www.juimg.com/sucai/miao-18710441.html")
resp.encoding = 'utf-8'
# print(resp.text)
main_page = BeautifulSoup(resp.text, "html.parser")
alst = main_page.find("div", attrs={"class": "pageLayout"}).find_all("a", attrs={"class": "worksListPic"})
n = 1
# print(alst)
for a in alst:
    href = a.get("href")
    print(href)
    # resp1 = requests.get(href)
    # resp1.encoding = 'utf-8'
    # print(resp1.text)
    # child_page = BeautifulSoup(resp1.text, "html.parser")
    # src = child_page.find("div", attrs={"class": "wra"}).find("img").get("src")
    # print(src)
This prints the link for each image's detail page directly:
/tupian/202109/shenghuoqita_3005738.html
/tupian/202109/shenghuoqita_3002965.html
/tupian/202109/shenghuoqita_3001484.html
/tupian/202109/shenghuoqita_2999886.html
/tupian/202109/shenghuoqita_2999881.html
/tupian/202109/shenghuoqita_2998247.html
/tupian/202109/qitafengguang_2994092.html
/tupian/202109/qitafengguang_2994097.html
/tupian/202109/qitafengguang_2992458.html
/tupian/202108/jiaotonggongju_2989851.html
/tupian/202108/jiaotonggongju_2988147.html
/tupian/202108/shenghuoqita_2984981.html
/tupian/202108/shenghuoqita_2982736.html
/tupian/202108/shenghuoqita_2982619.html
/tupian/202108/shenghuoqita_2982620.html
/tupian/202108/shenghuoqita_2982621.html
/tupian/202106/shenghuorenwu_2960130.html
/tupian/202106/shenghuorenwu_2959743.html
/tupian/202106/shenghuorenwu_2958855.html
/tupian/202106/shenghuorenwu_2956762.html
/tupian/202106/shenghuorenwu_2953540.html
/tupian/202106/shenghuorenwu_2952361.html
/tupian/202106/shenghuorenwu_2939377.html
/tupian/202106/shenghuorenwu_2939040.html
/tupian/202105/shenghuorenwu_2937825.html
/tupian/202105/shenghuorenwu_2935762.html
/tupian/202105/shenghuorenwu_2933613.html
/tupian/202105/shenghuorenwu_2932240.html
/tupian/202105/shenghuorenwu_2931822.html
/tupian/202105/xiandaishangwu_2929953.html
/tupian/202105/shenghuorenwu_2917590.html
/tupian/202105/shenghuorenwu_2911325.html
/tupian/202105/shenghuorenwu_2910381.html
/tupian/202105/shenghuorenwu_2909078.html
/tupian/202105/shenghuorenwu_2908800.html
/tupian/202105/shenghuorenwu_2908762.html
/tupian/202105/shenghuorenwu_2908463.html
/tupian/202105/shenghuorenwu_2908464.html
/tupian/202105/shenghuorenwu_2907920.html
/tupian/202105/shenghuorenwu_2907921.html
Process finished with exit code 0
That is the output. But when I actually request those addresses in a second pass, it errors out:
import requests
from bs4 import BeautifulSoup
resp = requests.get("http://www.juimg.com/sucai/miao-18710441.html")
resp.encoding = 'utf-8'
# print(resp.text)
main_page = BeautifulSoup(resp.text, "html.parser")
alst = main_page.find("div", attrs={"class": "pageLayout"}).find_all("a", attrs={"class": "worksListPic"})
n = 1
# print(alst)
for a in alst:
    href = a.get("href")
    print(href)
    resp1 = requests.get(href)
    resp1.encoding = 'utf-8'
    print(resp1.text)
    # child_page = BeautifulSoup(resp1.text, "html.parser")
    # src = child_page.find("div", attrs={"class": "wra"}).find("img").get("src")
    # print(src)
The error shown is:
/tupian/202109/shenghuoqita_3005738.html
Traceback (most recent call last):
File "C:/Users/24493/PycharmProjects/untitled/venv/爬虫/pic.py", line 15, in <module>
resp1 = requests.get(href)
File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\sessions.py", line 453, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\24493\PycharmProjects\untitled\venv\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/tupian/202109/shenghuoqita_3005738.html': No scheme supplied. Perhaps you meant http:///tupian/202109/shenghuoqita_3005738.html?
Process finished with exit code 1
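For context on what the traceback is reporting: the hrefs are site-relative paths with no scheme or host at all (not just a missing www), so requests cannot use them as-is. A minimal sketch of resolving one of them with the standard library's urllib.parse.urljoin, assuming the listing page above as the base URL:

```python
from urllib.parse import urljoin

# The page the hrefs came from serves as the base for resolution.
base = "http://www.juimg.com/sucai/miao-18710441.html"

# One of the site-relative paths from the output above.
href = "/tupian/202109/shenghuoqita_3005738.html"

# urljoin resolves the relative path against the base page,
# keeping the scheme and host and replacing the path.
full_url = urljoin(base, href)
print(full_url)  # http://www.juimg.com/tupian/202109/shenghuoqita_3005738.html
```

In the loop this would become something like resp1 = requests.get(urljoin(base, href)), which should avoid the MissingSchema error.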