This article scrapes the target site and downloads the images shown in the area below.
Preparation:
Tool: Spyder
Libraries: requests, and BeautifulSoup from bs4
1. Fetch the target page
url = "https://www.youmeitu.com/"
resp = requests.get(url)
resp.encoding = 'utf-8' # set the encoding
2. Parse the data: hand the page source to BeautifulSoup to build a bs object
main_page = BeautifulSoup(resp.text,"html.parser")
3. Locate the target information
alist = main_page.find("div",attrs={"class":"IndexListUnion_2"}).find_all("a")
This means: find the <div> whose class attribute is IndexListUnion_2; the trailing find_all("a") then collects every <a> tag inside that div.
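The find/find_all pattern above can be sketched on a tiny, made-up HTML snippet (the class name mirrors the one on the target site, but the links are invented for illustration):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the real page source.
html = """
<div class="IndexListUnion_2">
  <a href="/a/1.html">one</a>
  <a href="/a/2.html">two</a>
</div>
"""
page = BeautifulSoup(html, "html.parser")
div = page.find("div", attrs={"class": "IndexListUnion_2"})  # locate the div by class
links = div.find_all("a")                                    # every <a> inside it
hrefs = [a.get("href") for a in links]
print(hrefs)  # ['/a/1.html', '/a/2.html']
```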
4. Follow each <a> tag to its child page and download the image there
Note that the base URL must be prepended here:
child link = base URL + the href value from the child page's <a> tag
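The manual concatenation used below works for this site; the standard library's urljoin does the same joining while also handling leading slashes and relative paths. A small sketch with a hypothetical href value:

```python
from urllib.parse import urljoin

url = "https://www.youmeitu.com/"
href = "/weimeitupian/xxx.html"  # hypothetical href taken from an <a> tag

# Manual concatenation, as this article does:
manual = url + href.strip("/")
# urljoin handles the leading slash for you:
joined = urljoin(url, href)

print(manual)  # https://www.youmeitu.com/weimeitupian/xxx.html
print(joined)  # https://www.youmeitu.com/weimeitupian/xxx.html
```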
for a in alist:
    href = url + a.get('href').strip("/")
    # fetch the child page's source
    child_page_resp = requests.get(href)
    child_page_resp.encoding = 'utf-8'
    child_page_text = child_page_resp.text
    # extract the image's download path from the child page
    child_page = BeautifulSoup(child_page_text, "html.parser")
    div = child_page.find("div", attrs={"class": "ImageBody"})
    img = div.find("img")
    src = img.get("src")
    # download the image
    img_resp = requests.get(url + src.strip("/"))
    img_name = src.split("/")[-1]
    with open("img/" + img_name, mode="wb") as f:
        f.write(img_resp.content)  # write the image bytes to the file
    print("over!!!", img_name)
- href = url + a.get('href').strip("/"): builds the full link for each child page
- child_page_resp = requests.get(href): fetches the child page's source
- child_page_resp.encoding = 'utf-8': sets the encoding to avoid garbled text
- child_page_text = child_page_resp.text: reads the response as text
child_page = BeautifulSoup(child_page_text,"html.parser")
div = child_page.find("div",attrs={"class":"ImageBody"})
img = div.find("img")
src = img.get("src")
Parse this text with BeautifulSoup and drill down to the image step by step.
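The same drill-down can be sketched on a made-up child-page snippet (the class name matches the site; the src value is invented):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a child page's source.
child_html = '<div class="ImageBody"><img src="/upload/1.jpg"></div>'
child_page = BeautifulSoup(child_html, "html.parser")
div = child_page.find("div", attrs={"class": "ImageBody"})  # the div holding the image
img = div.find("img")                                       # the <img> tag inside it
src = img.get("src")                                        # its download path
print(src)  # /upload/1.jpg
```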
5. Download the image
Request the image via its address:
img_resp = requests.get(url+src.strip("/"))
Derive the image's file name by slicing its path:
img_name = src.split("/")[-1]
Save it under the img folder:
with open("img/" + img_name, mode="wb") as f:
    f.write(img_resp.content)  # write the image bytes to the file
print("over!!!", img_name)
The complete code:
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 3 12:14:05 2021
@author: yingzi
E-mail:guotaomath@163.com
"""
import requests
from bs4 import BeautifulSoup
import time
import os

url = "https://www.youmeitu.com/"
resp = requests.get(url)
resp.encoding = 'utf-8'
# hand the page source to BeautifulSoup
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", attrs={"class": "IndexListUnion_2"}).find_all("a")
os.makedirs("img", exist_ok=True)  # make sure the output folder exists
for a in alist:
    href = url + a.get('href').strip("/")
    # fetch the child page's source
    child_page_resp = requests.get(href)
    child_page_resp.encoding = 'utf-8'
    child_page_text = child_page_resp.text
    # extract the image's download path from the child page
    child_page = BeautifulSoup(child_page_text, "html.parser")
    div = child_page.find("div", attrs={"class": "ImageBody"})
    img = div.find("img")
    src = img.get("src")
    # download the image
    img_resp = requests.get(url + src.strip("/"))
    img_name = src.split("/")[-1]
    with open("img/" + img_name, mode="wb") as f:
        f.write(img_resp.content)  # write the image bytes to the file
    print("over!!!", img_name)
    time.sleep(1)  # pause between downloads to go easy on the server
print("all over!!!")