Image Host Crawler

Latest recommended article published 2024-08-13 09:00:00

江前云后


Categories: [Crawlers] [Python]

Article link: https://blog.csdn.net/songyu0120/article/details/47811507


Reposting an answer I wrote on Zhihu.

Anyone who answers without posting the code is just messing with you!

===========================

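This is the first crawler I ever wrote, a long time ago, to scrape an image hosting site I had just discovered (yes, the server is in the US and it has what you are looking for; go explore it yourself).

The connection is mediocre and drops often, so be patient.

I don't know whether this will get censored, but to satisfy everyone's curiosity, here it is: http://ihostimg.com/

I wrote it a few months ago and just tried it again. It still runs, so the site's markup seems to be basically unchanged.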

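1. The image host crawler

What it does: find someone's album listing page on the site yourself, enter the number of the album to start from, and the script downloads every album from that one onward to local disk, creating one folder per album named after the album title shown on the page.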
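The folder-per-album step can be looked at in isolation. Below is a minimal Python 3 sketch, not the script's own code: `album_dir`, the choice of root folder, and the rule of replacing characters that Windows forbids in file names are assumptions made for illustration. Only the `<title>[<count>P]` naming follows the script in this post.

```python
import re
import tempfile
from pathlib import Path


def album_dir(root, title, count):
    """Create and return the folder for one album, named '<title>[<count>P]'."""
    # Replace characters that Windows does not allow in file names
    # (an illustrative rule, not something the original script does).
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    path = Path(root) / "{}[{}P]".format(safe_title, count)
    # exist_ok=True makes repeated runs harmless
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind
    with tempfile.TemporaryDirectory() as tmp:
        print(album_dir(tmp, "holiday: 2015", 12).name)
```

Returning the path instead of a string with a trailing separator lets the caller build file names with `path / "0.jpg"` on any operating system.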

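Notes: it uses no third-party libraries, so it runs directly once Python is installed (Python 2, given the `urllib2` import and the `print` statements). To download other albums, you have to find the URL by hand and edit it into the script.

The code is below. If you ran it, you should come back and upvote! (runs away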

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script, standard library only
import os
import re
import time
import urllib
import urllib2
# Return the HTML source of a page
def getHtml(hUrl):
	print 'Fetching page source'
	user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
	headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
	req = urllib2.Request(hUrl, headers=headers)
	html = urllib2.urlopen(req)
	try:
		srcCode = html.read()
	finally:
		# Close the connection even if the read fails
		html.close()
	# Uncomment to dump the full page source
	# print srcCode
	return srcCode

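# Return every album link found in the page (relative paths; the caller
# prepends the site root later)
# Argument: HTML source of the listing page
def getGalleryUrl(gsrcCode):
	galleryUrl = re.findall(r'<a.*?href="(.*?\.html)">', gsrcCode)
	#print galleryUrl
	return galleryUrl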

# Return a list of (album title, image count) tuples and print each one
def get_Title(TsrcCode):
	#print TsrcCode
	# Group 1: album title; group 2: the text before "images" inside the parentheses
	getNum = re.compile(r'<div style="height:18px">(.*?)</div>.*?<br.*?>\(\W(.*?)\Wimages.*?\)</span></li>')
	title = getNum.findall(TsrcCode)
	for x in title:
		print x[0] + '[' + x[1] + 'P]'
	return title

def getImg(theGalleryUrl, startGallery):
	srcCode = getHtml(theGalleryUrl)
	# Site root
	head = 'http://ihostimg.com'
	# Each element of fullUrl corresponds to one album
	fullUrl = getGalleryUrl(srcCode)
	print fullUrl
	# Prepend the site root to every album path
	for num in xrange(len(fullUrl)):
		fullUrl[num] = head + fullUrl[num]
	# Full album URLs
	print fullUrl
	# Get each album's name and image count, used to build the folder path
	# Each element of title is (album name, image count)
	title = get_Title(srcCode)
	# Album index
	GalleryNum = startGallery
	# Regex for the image links inside an album page
	pattern = re.compile(r'<a.*?href="(.*?\.jpeg)">')
	for imgUrl in fullUrl[startGallery:]:
		# Create the folder named after the album first
		path = mkdir(title[GalleryNum])
		GalleryNum += 1
		# Fetch the album page and match the image download URLs in it
		imgSrcCode = getHtml(imgUrl)
		imgSrc = pattern.findall(imgSrcCode)
		print imgSrc
		ImgNum = 0
		for i in imgSrc:
			# urlretrieve was slow and unreliable
			# urllib.urlretrieve(i, path +'%s.jpg' % ImgNum)
			urlopen = urllib.URLopener()
			# Download the image bytes
			fp = urlopen.open(i)
			data = fp.read()
			fp.close()
			# Truncate the file and write in binary mode
			f = open(path + '%s.jpg' % ImgNum, 'w+b')
			f.write(data)
			f.close()
			ImgNum += 1
			print 'Downloaded'
			print i
			# Pause between requests
			time.sleep(2)

		print 'Album ' + str(GalleryNum) + ' done!'

def mkdir(album):
	# album is (title, image count); the folder is named "<title>[<count>P]"
	name = unicode(album[0], 'utf8').strip() + '[' + album[1] + 'P]'
	# Trailing separator so the caller can append a file name directly
	path = os.path.join('..', 'nothing', name) + os.sep
	if not os.path.exists(path):
		# Create the folder if it does not exist yet
		os.makedirs(path)
		print path + u' created'
	else:
		print path + u' already exists'
	return path

# myUrl = 'http://ihostimg.com/mygallery.php?gallerybelongto=147258'
myUrl = 'http://ihostimg.com/mygallery.php?gallerybelongto=jomiler00'
page = int(raw_input("Start from which album? Enter a number\n"))
getImg(myUrl, page)
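To see what the two listing-page regular expressions capture, here is a small Python 3 sketch that runs them on a made-up HTML fragment. The markup in `SAMPLE` is hypothetical, written only so that the patterns match, and the real pages may look different. `parse_gallery_page` is an illustrative name, and `urllib.parse.urljoin` is swapped in for the script's string concatenation.

```python
import re
from urllib.parse import urljoin

BASE = "http://ihostimg.com"

# Hypothetical markup, shaped only to fit the patterns from the script.
SAMPLE = (
    '<li><a href="/gallery/1.html"><div style="height:18px">Trip</div></a>'
    '<span><br />( 12 images )</span></li>'
    '<li><a href="/gallery/2.html"><div style="height:18px">Cats</div></a>'
    '<span><br />( 7 images )</span></li>'
)

# Same patterns as getGalleryUrl and get_Title in the script
LINK_RE = re.compile(r'<a.*?href="(.*?\.html)">')
TITLE_RE = re.compile(
    r'<div style="height:18px">(.*?)</div>.*?<br.*?>\(\W(.*?)\Wimages.*?\)</span></li>'
)


def parse_gallery_page(html, base=BASE):
    """Return a list of (absolute album URL, (title, image count)) pairs."""
    links = [urljoin(base, href) for href in LINK_RE.findall(html)]
    albums = TITLE_RE.findall(html)
    return list(zip(links, albums))


if __name__ == "__main__":
    for url, (title, count) in parse_gallery_page(SAMPLE):
        print(url, title, count)
```

Pairing links and titles with `zip` only works when both patterns find the same number of items in the same order, which is also what the script relies on when it indexes `title` and `fullUrl` with the same counter.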

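The `myUrl` in the script above is an example of an album page to crawl. Two URLs are included on purpose for illustration, so try them yourself.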
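Since switching albums means editing the URL by hand, one option is to build it from the owner name. This is a minimal Python 3 sketch, assuming the `gallerybelongto` query parameter seen in the two example URLs is all the page needs. `gallery_url` is a hypothetical helper and not part of the script.

```python
from urllib.parse import urlencode

GALLERY_PAGE = "http://ihostimg.com/mygallery.php"


def gallery_url(owner):
    """Return the album listing URL for one owner name or ID."""
    # urlencode escapes characters that are not safe in a query string
    return GALLERY_PAGE + "?" + urlencode({"gallerybelongto": owner})


if __name__ == "__main__":
    for owner in ("147258", "jomiler00"):
        print(gallery_url(owner))
```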

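2. The second one probably interests fewer people: a crawler for Baidu Images. Enter a page number and a keyword, and it downloads the results.

That code is on my blog. It would make this answer too long, so have a look there if you are interested.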