Tonight I had planned to grind practice problems, but my roommate asked me to write a crawler for him (exams are coming up, after all). Scraping the text went fine, but when scraping the images, even though every image URL was correct, the script crashed after downloading only 5 of the 20 images, dumping a long traceback ending in raise HTTPError(req.full_url, code, msg, hdrs, fp).
At first I assumed the bug was mine. I hunted for ages (ugh) and dug through plenty of references without finding an answer, so in the end I simply switched to a different approach.
Looking it up afterwards: urllib.request.urlretrieve is fine with small amounts of data but tends to choke once there is more of it. (Though nothing else I had scraped before showed this problem.)
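One plausible cause (my own assumption, not something I verified): urlretrieve sends Python's default User-Agent and offers no argument for custom headers, so a site that blocks bot-looking clients can start returning HTTP errors partway through a batch. If you want to stay with urllib, attaching the headers to a Request object works around that; the URL and header value below are placeholders:

```python
import urllib.request

# Placeholder values for illustration -- swap in the real image URL and headers.
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request('https://example.com/pic.jpg', headers=headers)

def download_with_headers(req, path):
    # urlopen(req) sends the headers attached to the Request object,
    # which plain urlretrieve cannot do.
    with urllib.request.urlopen(req) as resp, open(path, 'wb') as f:
        f.write(resp.read())
```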
The code that raised the error:
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 24 19:46:58 2021
@author: sys
"""
import requests
import urllib.request
from lxml import etree
import pandas
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
'Cookie': 'll="118229"; bid=URZuuPTNxFs; _ga=GA1.2.378646834.1634915868; __gads=ID=67f008fccb5d9129-2219de0694ce00cd:T=1636019292:RT=1636019292:S=ALNI_MaEL_LEuiBzhcr_ayCuHR30-akN3Q; douban-fav-remind=1; __utmz=30149280.1637231848.13.6.utmcsr=link.csdn.net|utmccn=(referral)|utmcmd=referral|utmcct=/; ap_v=0,6.0; __utma=30149280.378646834.1634915868.1638098811.1638364973.20; _pk_ses.100001.afe6=*; __utmc=30149280; __utmt=1; _pk_id.100001.afe6=f55a9d3464457532.1637828284.3.1638365266.1637831705.; __utmb=30149280.8.10.1638364973'}
name=[]
like=[]
zhaopian=[]
local=r'C:\Users\Lenovo\Desktop\图片h/'
url="https://music.douban.com/artists/genre_page/6/3"
hhh=requests.get(url, headers=headers)
h=etree.HTML(hhh.text)
hh=h.xpath('//*[@id="content"]/div/div[1]/div[2]/div/div/div[2]/a/text()')
# //*[@id="content"]/div/div[1]/div[2]/div/div[2]/div[2]/a
hhhh=h.xpath('//*[@id="content"]/div/div[1]/div[2]/div/div/div[2]/div/text()')
name.extend(hh)
like.extend(hhhh)
hhhhh=h.xpath('//*[@id="content"]/div/div[1]/div[2]/div/div/div[1]/a/img/@src')
# //*[@id="content"]/div/div[1]/div[2]/div/div[2]/div[1]/a/img
zhaopian.extend(hhhhh)
zidian={"歌手":name,'喜欢人数':like}
pa=pandas.DataFrame(zidian)
pa.to_csv('歌手信息.csv',encoding=('utf-8-sig'))
for i in range(0,len(name)):
    # if i!=5:
    zz=zhaopian[i]
    nam=name[i]
    print(i)
    urllib.request.urlretrieve(zhaopian[i],local+name[i]+'.jpg')
It errored out on the 6th image.
The fix:
Save the images with the requests module instead:
for i in range(0,len(name)):
    r=requests.get(zhaopian[i])
    # the zhaopian list holds the URL of each image
    with open(local+name[i]+'.jpg', 'wb') as f:
        f.write(r.content)
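For a bit more robustness, here is a hedged sketch (the helper name and retry policy are my own choices, not part of the original fix): check the status code, set a timeout, and back off briefly between attempts so one transient block doesn't kill the whole loop:

```python
import time
import requests

def save_image(url, path, headers=None, retries=3):
    # Hypothetical helper: retry with a short pause instead of crashing
    # the whole loop on the first bad response.
    for attempt in range(retries):
        r = requests.get(url, headers=headers, timeout=10)
        if r.ok:
            with open(path, 'wb') as f:
                f.write(r.content)
            return True
        time.sleep(1)
    return False
```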
The final code:
import requests
from lxml import etree
import pandas
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
'Cookie': 'll="118229"; bid=URZuuPTNxFs; _ga=GA1.2.378646834.1634915868; __gads=ID=67f008fccb5d9129-2219de0694ce00cd:T=1636019292:RT=1636019292:S=ALNI_MaEL_LEuiBzhcr_ayCuHR30-akN3Q; douban-fav-remind=1; __utmz=30149280.1637231848.13.6.utmcsr=link.csdn.net|utmccn=(referral)|utmcmd=referral|utmcct=/; ap_v=0,6.0; __utma=30149280.378646834.1634915868.1638098811.1638364973.20; _pk_ses.100001.afe6=*; __utmc=30149280; __utmt=1; _pk_id.100001.afe6=f55a9d3464457532.1637828284.3.1638365266.1637831705.; __utmb=30149280.8.10.1638364973'}
name=[]
like=[]
zhaopian=[]
local=r'C:\Users\Lenovo\Desktop\图片h/'
url="https://music.douban.com/artists/genre_page/6/3"
hhh=requests.get(url, headers=headers)
h=etree.HTML(hhh.text)
hh=h.xpath('//*[@id="content"]/div/div[1]/div[2]/div/div/div[2]/a/text()')
# //*[@id="content"]/div/div[1]/div[2]/div/div[2]/div[2]/a
hhhh=h.xpath('//*[@id="content"]/div/div[1]/div[2]/div/div/div[2]/div/text()')
name.extend(hh)
like.extend(hhhh)
hhhhh=h.xpath('//*[@id="content"]/div/div[1]/div[2]/div/div/div[1]/a/img/@src')
# //*[@id="content"]/div/div[1]/div[2]/div/div[2]/div[1]/a/img
zhaopian.extend(hhhhh)
zidian={"歌手":name,'喜欢人数':like}
pa=pandas.DataFrame(zidian)
pa.to_csv('歌手信息.csv',encoding=('utf-8-sig'))
for i in range(0,len(name)):
    r=requests.get(zhaopian[i])
    with open(local+name[i]+'.jpg', 'wb') as f:
        f.write(r.content)
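One more thing worth guarding against (an assumption on my part, since the names I scraped happened to be safe): an artist name containing characters like / or ? is illegal in a Windows filename and would make open() fail. A small sanitizing helper:

```python
import re

def safe_name(name):
    # Replace the characters Windows forbids in filenames with '_'.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

# e.g. safe_name('AC/DC') -> 'AC_DC'
```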
And so a whole evening went to one small problem, sigh. Still, I'd say I learned something!