Scraping Images with Python

Latest recommended article published on 2024-04-12 14:01:35

...天晴...

Tags: python

Article link: https://blog.csdn.net/qq_43055002/article/details/114221147

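Implementation steps:
1: Import the packages with import.
2: Fetch the raw, uncleaned page: requests.get(url).text
3: Parse the page down to exactly the part you want to scrape: re.findall, .content, with open(filename, 'wb') as f: f.write(...). The with block closes the file, so a separate f.close() is not needed.
4: Save the image content. This can be written together with the parsing step (I did not separate them because I fiddled with it for a whole evening and still could not get it working... hahaha).
5: Build the URLs for paging: for, if, elif
6: Entry point.
7: Additional notes:
    1: The same kind of image can be offered under several URLs to choose from.
    2: Use part of the URL as the file name of the saved image: split('separator')[index]
    3: There are many ways to parse a page. Learn one of them well first, then try another.
    4: Do not try to get the final result in one go. When something goes wrong, print intermediate values and their types: print(x), print(type(x))
    5: Reference: https://blog.csdn.net/qq_44921056/article/details/114124982
    6: Reference: https://blog.csdn.net/weixin_42555080?spm=1001.2014.3001.5509
    7: The two bloggers in 5 and 6 write really well; at least I find their articles easy to follow.
    8: Scraping images is quite fun. I hope to move on to more advanced scraping later and make my code more layered and project-like.
    9: I also highly recommend the image site used here; it is an easy place to start.
    10: The site and the blogs mentioned here are ones I found online myself. If anything infringes your rights, please contact me and I will remove it promptly!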
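
Steps 3 and 4 and notes 2 and 4 can be tried without touching the network. The following is only a minimal sketch: sample_html is a made-up fragment shaped like a thumbnail listing (not copied from a real page), and extract_image_paths and filename_from_path are hypothetical helper names that do not appear in the original script.

```python
import re

# Hypothetical HTML fragment (an assumption for illustration only).
sample_html = '''
<li><a href="/tupian/1.html"><img src="/uploads/allimg/210101/123456-abc111.jpg" alt="a" /></a></li>
<li><a href="/tupian/2.html"><img src="/uploads/allimg/210102/654321-def222.jpg" alt="b" /></a></li>
'''


def extract_image_paths(html):
    """Return the relative path of every .jpg found under /uploads/allimg."""
    # re.S lets '.' match newlines, so a tag split across lines still matches.
    return re.findall(r'<img.*?src.+?(/uploads/allimg.+?jpg)', html, re.S)


def filename_from_path(path):
    """Use the text after the last '-' in the path as the file name."""
    return path.split('-')[-1]


paths = extract_image_paths(sample_html)
names = [filename_from_path(p) for p in paths]

# Print intermediate values and their types while debugging (note 4).
print(type(paths), paths)
print(names)
```

Testing the pattern on a saved fragment like this first makes it easier to see whether a problem lies in the regular expression or in the download step.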

The code is as follows:
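import os
import re
import time

import requests
from fake_useragent import UserAgent

# NOTE: verify_ssl and path are arguments of older fake_useragent releases;
# newer releases may only accept a plain UserAgent() call.
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
# Send a random User-Agent string with every request.
headers = {'User-Agent': ua.random}


def get_html(url):
    # Fetch the raw page and hand its text to the parser.
    response = requests.get(url, headers=headers)
    parse_html(response.text)
    # print(response.text)


def parse_html(content):
    image_dirpath = 'C:/DownLoad/PycharmWork/爬取保存位置/fengjing/'
    # makedirs also creates any missing parent folders.
    os.makedirs(image_dirpath, exist_ok=True)
    # Capture the relative path of every image under /uploads/allimg.
    img_paths = re.findall('<img.*?src.+?(/uploads/allimg.+?jpg)', content, re.S)
    # <img.*?src.*?(-.*?.jpg)
    for img_path in img_paths:
        # The part after the last '-' in the path becomes the file name.
        save_path = image_dirpath + img_path.split('-')[-1]
        img_url = 'http://pic.netbian.com' + img_path
        image_content = requests.get(img_url, headers=headers).content
        # The with block closes the file on exit.
        with open(save_path, 'wb') as f:
            f.write(image_content)
        print('Image {} saved'.format(save_path))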

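"""
url_1 exists because the URL of the first page differs from the URLs of the later pages,
so the URLs are split into the first page and the following pages.
If page is less than 2, only the first page is downloaded.
"""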

def get_pages(choose, page):
    if choose == 0:
        url_1 = 'http://pic.netbian.com/4kfengjing/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kfengjing/index_{}.html'.format(i)
            get_html(url)
            time.sleep(5)
    elif choose == 1:
        url_1 = 'http://pic.netbian.com/4kmeinv/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kmeinv/index_{}.html'.format(i)
            get_html(url)
            time.sleep(5)
    elif choose == 2:
        url_1 = 'http://pic.netbian.com/4kdongwu/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kdongwu/index_{}.html'.format(i)
            get_html(url)
            time.sleep(5)
    elif choose == 3:
        url_1 = 'http://pic.netbian.com/4kmingxing/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kmingxing/index_{}.html'.format(i)
            get_html(url)
            time.sleep(5)
    elif choose == 4:
        url_1 = 'http://pic.netbian.com/4kqiche/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kqiche/index_{}.html'.format(i)
            get_html(url)
            time.sleep(5)
    elif choose == 5:
        url_1 = 'http://pic.netbian.com/4kyouxi/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kyouxi/index_{}.html'.format(i)
            get_html(url)
            # Pause between requests so the site is not put under extra load.
            time.sleep(5)
    elif choose == 6:
        url_1 = 'http://pic.netbian.com/4kmeishi/index.html'
        get_html(url_1)
        for i in range(2, page + 1):
            url = 'http://pic.netbian.com/4kmeishi/index_{}.html'.format(i)
            get_html(url)
            time.sleep(5)

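if __name__ == '__main__':
    print('This tool can scrape the following kinds of images:')
    # The labels follow the URL slugs: 2 is 4kdongwu (animals), 6 is 4kmeishi (food).
    print('0: scenery  1: beauties  2: animals  3: celebrities  4: cars  5: games  6: food')
    print('Enter the number of your choice:')
    chooses = input('Your number: ')
    print('Enter how many pages you want to fetch:')
    pages = input('Number of pages: ')
    # input() returns strings, so both values are converted to int.
    get_pages(int(chooses), int(pages))
    # get_pages(1, 3)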
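
The branches of get_pages differ only in the category part of the URL, so the paging rule (index.html for the first page, index_N.html from page 2 on) could be written once. This is just one possible sketch: BASE_URL, CATEGORIES and build_page_urls are hypothetical names introduced here, and the slugs are the ones that appear in the URLs of the script above.

```python
BASE_URL = 'http://pic.netbian.com'

# Menu number -> category slug, as used in the URLs of the script.
CATEGORIES = {
    0: '4kfengjing',
    1: '4kmeinv',
    2: '4kdongwu',
    3: '4kmingxing',
    4: '4kqiche',
    5: '4kyouxi',
    6: '4kmeishi',
}


def build_page_urls(choose, page):
    """Return the listing URLs for pages 1..page of the chosen category."""
    slug = CATEGORIES[choose]
    # The first page has no number in its file name.
    urls = ['{}/{}/index.html'.format(BASE_URL, slug)]
    # Later pages are numbered starting from 2.
    for i in range(2, page + 1):
        urls.append('{}/{}/index_{}.html'.format(BASE_URL, slug, i))
    return urls


for url in build_page_urls(0, 3):
    print(url)
```

With a helper like this, get_pages would only need to loop over the returned URLs, call get_html(url) for each one, and keep the time.sleep(5) pause between requests.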