I've recently been learning about web crawlers, and I happened upon a very simple crawler program on CSDN. The program has a flaw, though: except for the first page, every image it downloads is a thumbnail, and the article offers no fix. So I decided to try solving the problem myself; after all, real knowledge comes from practice.
Original article: http://blog.csdn.net/qqzhoufei521/article/details/19570971
Since the program itself is simple and not technically demanding, I will only record the two problems that stumped me, both to share them and as a reminder to myself.
1. authenticity_token
When I first tried to log in automatically, I filled in only the username, password, and remember_me fields shown in the page's form to build the content (body) of the HTTP POST, but the login failed. Comparing packet captures of a normal browser login against my program's request, I found that my request body was missing a field called authenticity_token, which is not visible anywhere in the page UI. Some research revealed that it is a hidden input inside the form; see http://stackoverflow.com/questions/941594/understand-rails-authenticity-token. Once the cause was clear, I extracted the token from /account/sign_in with a regular expression, added it to the body, and POSTed again; the response body showed that the login problem was solved (this does require the client to follow HTTP redirects).
The relevant code:
SignIn = '/account/sign_in' # const
GetAuthKeyExp = r'<input name="authenticity_token" type="hidden" value="(.*?)" />' # const
AuthKeyPattern = re.compile(GetAuthKeyExp) # const

def getAuthKey():
    global headers
    page = myRequest(url = SignIn, headers = headers)
    authKey = AuthKeyPattern.findall(page)
    return authKey[0]
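To sanity-check the regular expression on its own, here is a minimal standalone sketch run against a hypothetical page fragment (the token value here is made up for illustration; a real sign_in page embeds a fresh token on every request):

```python
import re

# Hypothetical fragment of the /account/sign_in page; real token values differ per request.
html = '<form><input name="authenticity_token" type="hidden" value="abc123==" /></form>'

pattern = re.compile(r'<input name="authenticity_token" type="hidden" value="(.*?)" />')
tokens = pattern.findall(html)
print(tokens[0])  # abc123==
```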
2. Staying logged in while visiting other pages
Reading the page source showed that to get the full-size images on pages other than the first, you have to click an image and enter the site's next-level directory, e.g. /items/802. When not logged in, clicking an image redirects you to the /account/sign_in page, so the program has to stay logged in while it visits other pages. Before studying crawlers my grasp of HTTP was superficial, so I went back to Google and learned that HTTP itself is a short-lived, stateless protocol; to give it state so that it can interact as a conversation, the concept of a session was added, and sessions in turn are implemented on top of cookies.
Packet captures showed that during the client-server exchange, the Cookie header of every HTTP request contains an item named _LouDaTui_session, the Set-Cookie header of the response carries the same item, and the next request sends back the session value returned by the previous response. That is exactly how the sessions described in the references work, so this site presumably maintains its session the same way via _LouDaTui_session. After implementing this, the problem described at the start of the article was solved.
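The cookie bookkeeping itself reduces to a small pure function. The following sketch is my own simplification, not the site's code: it takes the first item of a Set-Cookie response header and splices it into an existing Cookie request header, replacing any previous session item:

```python
SESSION_NAME = '_LouDaTui_session'

def merge_session(cookie, set_cookie):
    """Return a Cookie header with the session item from set_cookie spliced in."""
    # Keep only the name=value part, dropping attributes like path or HttpOnly.
    session = set_cookie.split(';')[0]
    # Drop any previous session item from the existing Cookie header.
    items = [i for i in cookie.split('; ') if not i.startswith(SESSION_NAME)]
    items.append(session)
    return '; '.join(items)

print(merge_session('__utma=1; _LouDaTui_session=old',
                    '_LouDaTui_session=new; path=/; HttpOnly'))
# __utma=1; _LouDaTui_session=new
```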
There was one more small snag: the site's login session is established starting from the first visit to /account/sign_in. In other words, if you clear the cookies just before clicking the login button on the /account/sign_in page, the login itself still succeeds, but subsequent requests are redirected back to /account/sign_in. I am not sure whether this counts as a small bug in the site; I would appreciate it if someone with more experience could explain. This also blocked me for quite a while; only after changing the program to record the session value starting from the very first request, the one that fetches the authenticity_token, did everything finally work.
The full code:
# spider
import urllib, httplib
import re
import threading
import os
import sys
import datetime

Pages = 10 # no. of pages to download.
Today = datetime.date.today().isoformat()
S = os.sep # const
Root = "d:" + S + "ludatui" + S # const
Prefix = "/?page=" # const
HOST = "loudatui.com" # const
SignIn = '/account/sign_in' # const
GetZoomLinkExp = r'<a class=".*?" href="(.*?)" title="\D*?">' # const
ZoomLinkPattern = re.compile(GetZoomLinkExp) # const
GetImageExp = r'<img alt=".*?" class=".*?" src="(.*?)" />' # const
ImagePattern = re.compile(GetImageExp) # const
GetAuthKeyExp = r'<input name="authenticity_token" type="hidden" value="(.*?)" />' # const
AuthKeyPattern = re.compile(GetAuthKeyExp) # const

## some sites block spiders; use headers to disguise the client as a browser.
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/28.0.1500.72 Safari/537.36',
            'Connection' : 'Keep-Alive',
            'Cookie' : '__utma=233318019.434537196.1394773301.1394773301.1394773301.1; __utmb=233318019.5.6.1394773301; __utmc=233318019; __utmz=233318019.1394773301.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)' }

def getAuthKey():
    global headers
    page = myRequest(url = SignIn, headers = headers)
    authKey = AuthKeyPattern.findall(page)
    return authKey[0]

## get the path used in a request from a full url.
def parseUrl(url = ''):
    pos = url.find(HOST) + len(HOST)
    return url[pos:]

def setHTTPSession(session = ''):
    global headers
    if not session:
        return
    tmpCookie = str(headers.get('Cookie'))
    sessionName = '_LouDaTui_session'
    pos = tmpCookie.find(sessionName)
    if -1 != pos:
        ## drop the old session item (and the '; ' before it).
        tmpCookie = tmpCookie[0:pos - 2]
    tmpCookie += '; ' + session
    headers['Cookie'] = tmpCookie

## this method returns the response body.
def myRequest(method = 'GET', url = '', body = '', headers = {}):
    ## init connection
    connection = httplib.HTTPConnection(HOST)
    ## send request
    connection.request(method = method, url = url, body = body, headers = headers)
    ## buffer the new cookie and other info
    resp = connection.getresponse()
    HTTPSession = ''
    if resp.getheader('Set-Cookie') != None:
        HTTPSession = resp.getheader('Set-Cookie').split(';')[0]
    status = resp.status
    location = resp.getheader('location')
    text = resp.read()
    ## remember the session for the next request
    setHTTPSession(HTTPSession)
    ## the connection should be closed before the next request,
    ## and the response becomes unusable after close().
    connection.close()
    ## redirect if needed.
    if (httplib.FOUND == status) or (httplib.MOVED_PERMANENTLY == status):
        text = myRequest(method = 'GET', url = parseUrl(location), headers = headers)
    return text

def login(signIn = '', data = '', headers = {}):
    ## login request.
    myRequest(method = 'POST', url = signIn, body = data, headers = headers)

def getIt(url, i, k):
    page = myRequest(url = url, headers = headers)
    picUrl = ImagePattern.findall(page)
    fname = Today + "-" + str(i) + "-" + str(k + 1) + ".jpg"
    if 1 == len(picUrl):
        urllib.urlretrieve(picUrl[0], os.path.join(Root, fname))
    else:
        print 'number of pic urls is %d' % len(picUrl)

def parsePages(indexOfPage = 0):
    global headers
    page = myRequest(url = Prefix + str(indexOfPage), headers = headers)
    items = ZoomLinkPattern.findall(page)
    tasks = []
    for k in range(len(items)):
        try:
            t = threading.Thread(target = getIt, args = (items[k], indexOfPage, k))
            tasks.append(t)
        except:
            print "some error in %sth download." % k
            continue
    for task in tasks:
        task.start()
    for task in tasks:
        task.join(300)
    return 0

def main():
    ## form fields to submit when logging in.
    formsPrev = 'utf8=%E2%9C%93&'
    forms = {
        'authenticity_token' : '',
        'user[login]' : '1234@163.com',
        'user[password]' : '123456',
        'user[remember_me]' : '0',
    }
    formsEnd = '&commit=%E7%99%BB%E9%99%86'
    ## fetch authenticity_token from the sign_in page;
    ## this also records the session the login depends on.
    forms['authenticity_token'] = getAuthKey()
    data = formsPrev + urllib.urlencode(forms) + formsEnd
    ## login.
    login(SignIn, data, headers)
    ## begin to spider the pics.
    if False == os.path.exists(Root):
        os.mkdir(Root)
    for n in range(Pages):
        print "Now page %s" % str(n + 1)
        parsePages(n + 1)
        print "Page %s OK\n" % str(n + 1)

main()
If you spot any problems in this post, please point them out.
Thanks in advance, everyone!