Analysis approach referenced from: 《爬取7K小说网用户书架信息》 (CSDN blog).
For some unknown reason, a few heroes cannot be scraped; I'll set that aside for now. (Most likely the page fills in part of the hero list with JavaScript, so a plain requests.get only sees the static portion.)
Code:
# Scrape 王者荣耀 (Honor of Kings) hero images, save them in heroPhoto under
# the same directory, and name each image <hero name>.jpg
# https://pvp.qq.com/web201605/herolist.shtml
import os
import requests
from lxml import etree
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
url = 'https://pvp.qq.com/web201605/herolist.shtml'
response = requests.get(url)
content = response.content
os.makedirs('heroPhoto', exist_ok=True)  # make sure the output directory exists
# XPath parsing
# html = etree.HTML(content)
# image_urls = html.xpath('//ul[@class="herolist clearfix"]/li/a/img/@src')
# print(image_urls)
# hero_list = html.xpath('//ul[@class="herolist clearfix"]/li/a/text()')
# print(hero_list)
# for i in range(len(image_urls)):  # Can't XPath parse element nodes directly?
#     image_url = image_urls[i]
#     name = hero_list[i]
#     url = f'https:{image_url}'
#     jpg_content = requests.get(url).content
#     with open(f'heroPhoto/{name}.jpg', 'wb') as file:
#         file.write(jpg_content)
#     print(f'Saved image {i}')
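To answer the question in the comment above: yes, lxml can evaluate XPath relative to an element node, which keeps each hero's name and image paired instead of relying on two parallel lists. A minimal sketch on hand-written markup that mimics the herolist structure (the HTML here is illustrative, not the live page):

```python
from lxml import etree

# Illustrative snippet mimicking the herolist structure, not the live page
snippet = '''
<ul class="herolist clearfix">
  <li><a href="#"><img src="//game.gtimg.cn/a.jpg"/>HeroA</a></li>
  <li><a href="#"><img src="//game.gtimg.cn/b.jpg"/>HeroB</a></li>
</ul>
'''
html = etree.HTML(snippet)
for li in html.xpath('//ul[@class="herolist clearfix"]/li'):
    # an XPath starting with "./" is evaluated relative to this <li> node
    name = li.xpath('./a/text()')[0]
    src = li.xpath('./a/img/@src')[0]
    print(name, f'https:{src}')
```

Because name and src come from the same `li`, a hero with a missing image simply raises an IndexError for that one entry rather than shifting every later pair.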
# BeautifulSoup parsing
# soup = BeautifulSoup(content, 'lxml')
# image_a = soup.select('.herolist.clearfix li a')  # find the a nodes
# for a in image_a:
#     href = a.img.attrs['src']
#     hero_name = a.get_text()
#     hero_url = f'https:{href}'
#     jpg_content = requests.get(hero_url).content
#     with open(f'heroPhoto/{hero_name}.jpg', 'wb') as f:
#         f.write(jpg_content)
#     print(f'Saved image for {hero_name}')
# pyquery parsing
doc = pq(content)
image_as = doc('.herolist.clearfix li a').items()
print(image_as, type(image_as))
for a in image_as:
    href = a.find('img').attr('src')
    hero_name = a.text()
    hero_url = f'https:{href}'
    jpg_content = requests.get(hero_url).content
    with open(f'heroPhoto/{hero_name}.jpg', 'wb') as f:
        f.write(jpg_content)
    print(f'Saved image for {hero_name}')
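As for the heroes that never show up: the page appears to fill most of the list in with JavaScript from a JSON file, which requests never executes. A sketch of the workaround, assuming the commonly cited herolist.json endpoint and avatar URL pattern (neither is confirmed in this article, so treat both as assumptions to verify):

```python
# herolist.json entries look roughly like {"ename": 105, "cname": "廉颇", ...};
# ename is the numeric hero id. The URL pattern below is an assumption.
def hero_image_url(hero):
    eid = hero['ename']
    return f'https://game.gtimg.cn/images/yxzj/img201606/heroimg/{eid}/{eid}.jpg'

sample = {'ename': 105, 'cname': '廉颇'}
print(sample['cname'], hero_image_url(sample))
# A full run would first fetch the list, e.g.:
# heroes = requests.get('https://pvp.qq.com/web201605/js/herolist.json').json()
# and then download hero_image_url(hero) for every entry, as in the loop above.
```

Fetching the JSON directly would sidestep the JavaScript problem entirely, since the full list arrives as data rather than rendered HTML.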
The scraped images are shown below:
That's the end of this post. I'm a beginner, so if there are mistakes, corrections are welcome, and if you have questions, feel free to discuss. If this post helped you, a little like would be encouraging. Thanks, everyone, and keep at it!