Crawling the GIF发源地 site with Python

Most recent recommended article published on 2020-08-05 08:03:18

wenpi_linuxer

Category: talk is cheap. Tags: multiprocess crawler, saving GIFs into separate folders, BeautifulSoup, requests

Article link: https://blog.csdn.net/qq_41603639/article/details/84944076

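I spent another half day improving the code today. The previous version dumped every download into one big folder; this version saves each post into its own subfolder. For the encoding I used html=response.text.encode('iso-8859-1').decode('utf-8'): requests had decoded the page as ISO-8859-1, so encoding the text back with ISO-8859-1 recovers the original bytes, which are then decoded as UTF-8. This yields the Chinese post titles without mojibake, and each title is used as the name of a subfolder inside the main folder, so every post is crawled into its own directory. For the crawling approach itself, see my previous post; I won't repeat it here.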
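To make the encoding step concrete, here is a minimal offline sketch of the round trip. The helper name fix_mojibake and the sample string are made up for illustration; no request is sent.

```python
def fix_mojibake(text: str) -> str:
    """Recover UTF-8 text that was mistakenly decoded as ISO-8859-1."""
    # ISO-8859-1 maps every character back to exactly one byte, so encoding
    # with it restores the original bytes, which can then be decoded as UTF-8.
    return text.encode('iso-8859-1').decode('utf-8')


raw_bytes = '动图'.encode('utf-8')           # bytes as a server would send them
garbled = raw_bytes.decode('iso-8859-1')     # what the text looks like after the wrong guess
restored = fix_mojibake(garbled)
print(restored == '动图')
```

The trick only works when the text really was decoded as ISO-8859-1, which is what requests falls back to for text responses that declare no charset. An alternative with the same effect is to set response.encoding = 'utf-8' before reading response.text.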
Here is the crawl result first:

![Crawl result](https://img-blog.csdnimg.cn/20181210181444182.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxNjAzNjM5,size_16,color_FFFFFF,t_70)
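Since each post title becomes a folder name, it may help to clean the title first, because Windows rejects characters such as : and ? in folder names. A rough sketch under that assumption; make_post_folder and the sample title are hypothetical and not part of the script:

```python
import os
import re
import tempfile


def make_post_folder(base_dir: str, title: str) -> str:
    """Create (or reuse) a subfolder named after a post title and return its path."""
    # Replace characters that Windows does not allow in file and folder names.
    safe = re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'
    folder = os.path.join(base_dir, safe)
    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    return folder


base = tempfile.mkdtemp()                              # throwaway directory for the demo
first = make_post_folder(base, ' funny: cats? ')
second = make_post_folder(base, ' funny: cats? ')     # running again reuses the folder
print(first == second)
```

Using exist_ok=True means a second run of the crawler does not crash on folders left over from the first run.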


import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
import time
from requests.exceptions import RequestException


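def Download_gif(url,path):
	"""Crawl one listing page and save each post's images into its own subfolder of path."""
	headers={
			'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
			'Connection':'close'
			}
	html=requests.get(url,headers=headers)
	soup=BeautifulSoup(html.text,'html.parser')
	gif_url=soup.find_all('a',class_='focus')		# all post links on this listing page; find_all returns a list of Tag objects
	for girls_url in gif_url:				# iterate over the Tag objects
		eachgirl_url=girls_url['href']		# a Tag supports dict-style attribute access; 'href' holds the post URL
		response=requests.get(eachgirl_url,headers=headers)
		html=response.text.encode('iso-8859-1').decode('utf-8')		# recover the raw bytes, then decode them as UTF-8 (assumes requests guessed ISO-8859-1)
		soup=BeautifulSoup(html,'html.parser')				# parse the post page
		title=soup.find('h1',class_='article-title')
		title=title.get_text()						# the post title becomes the subfolder name for this post

		os.makedirs(path+title,exist_ok=True)		# create the subfolder; no error if it already exists from an earlier run
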
		page=soup.find('div',class_='article-paging').find_all('span')		# each post is split into pages, so read the total page count first
		max_page=page[-1].text							# the last span inside div.article-paging holds the number of pages
		each_url=eachgirl_url						# keep the post URL under its own name, distinct from gif_url; mixing them up caused errors while debugging
		for i in range(1,int(max_page)+1):			# visit every page of the post and download its images or GIFs
			pic=each_url+str(i)				# URL of page i
			html=requests.get(pic,headers=headers)		# request page i
			soup=BeautifulSoup(html.text,'html.parser')		# parse page i
			pic_url=soup.find_all('img',class_='aligncenter')		# all image tags on this page

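			for a_url in pic_url:			# iterate over the image tags (Tag objects with dict-style attribute access)
				a_url=a_url.get('src')			# the src attribute holds the image or GIF URL
				if a_url is None:					# skip tags without a src so the crawler keeps running
					continue
				file_name=a_url.split('/')[-1]			# the file name is the last segment of the URL
				if not file_name.endswith(('.gif','.jpg','.jpeg')):
					continue					# skip other file types instead of abandoning the whole listing page
				print(a_url+' download started')
				try:
					html=requests.get(a_url,headers=headers)		# fetch the image or GIF bytes
					with open(os.path.join(path+title,file_name),'wb') as f:		# write into this post's subfolder
						f.write(html.content)
					time.sleep(0.2)			# short pause between downloads
				except RequestException:
					continue					# one failed download should not stop the remaining images
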
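if __name__=='__main__':
	path='C://Users/panenmin/Desktop/GIF/'		# output folder; change it to a path on your own machine and be sure to use forward slashes
	start_url='https://www.gifjia5.com/category/neihan/page/'	# listing pages are start_url plus the page number
	pool=Pool(6)			# pool of 6 worker processes
	tasks=[]
	for i in range(1,23):		# listing pages 1 to 22
		url=start_url+str(i)
		tasks.append((i,pool.apply_async(Download_gif,args=(url,path))))		# apply_async does not block, so pages are crawled in parallel (pool.apply would run them one by one)
	pool.close()		# no more tasks will be submitted
	for i,task in tasks:
		task.wait()		# block until this page's worker has returned
		print('page %d finished'%i)
	pool.join()		# wait for the worker processes to exit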
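
The page URLs and file-name check in the script can also be pulled out into small helpers that are easy to try without touching the network. This is only a sketch: page_urls, image_file_name and the example.com URLs are hypothetical, and stripping the query string is an addition the script itself does not do.

```python
import posixpath
from urllib.parse import urlsplit

ALLOWED_EXTENSIONS = ('.gif', '.jpg', '.jpeg')


def page_urls(post_url, max_page):
    """Build the URL of every page of a post: post_url + '1', post_url + '2', ..."""
    return [post_url + str(i) for i in range(1, max_page + 1)]


def image_file_name(src):
    """Return the file name for an image URL, or None if it should be skipped."""
    if not src:
        return None                                   # tag had no src attribute
    name = posixpath.basename(urlsplit(src).path)     # drop any ?query part before taking the last segment
    if name.lower().endswith(ALLOWED_EXTENSIONS):
        return name
    return None                                       # not a GIF or JPEG


print(page_urls('https://example.com/post/', 3))
print(image_file_name('https://example.com/a/b/cat.gif?x=1'))
```

Returning None for unwanted files lets the download loop skip a single image with continue rather than leaving the function.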

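To retry failed requests a few times, one common approach with requests is to mount an HTTPAdapter configured with a urllib3 Retry object on a Session. A minimal sketch; build_session, the retry count of 5, the backoff factor and the status list are my assumptions rather than values tuned for this site:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(retries: int = 5) -> requests.Session:
    """Return a Session whose HTTP and HTTPS requests are retried on failure."""
    retry = Retry(
        total=retries,
        backoff_factor=0.5,                       # wait a little longer before each new attempt
        status_forcelist=[500, 502, 503, 504],    # also retry on these server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


session = build_session()
# session.get(url, headers=headers, timeout=10) could then stand in for requests.get(url, headers=headers)
```

With multiprocessing, each worker process would need to build its own session, for example at the top of Download_gif.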