I've recently been learning about web crawlers, and I happened upon a very simple crawler program on CSDN. The program has a flaw, though: except for the first page, every image it downloads is a thumbnail, and the article offers no fix. So I decided to try solving the problem myself; after all, real knowledge comes from practice.
Original article: http://blog.csdn.net/qqzhoufei521/article/details/19570971
Since the program itself is simple and not technically demanding, I will only record the two problems that stumped me, both to share them and as a reminder to myself.
1. authenticity_token
When I first tried to log in automatically, I filled in only the username, password, and remember_me fields shown in the page's form to build the content (body) of the HTTP POST, but the login failed. Comparing packet captures of a normal browser login against my program's request, I found that my request body was missing a field called authenticity_token, which is not visible anywhere in the page UI. Some research revealed that it is a hidden input inside the form; see http://stackoverflow.com/questions/941594/understand-rails-authenticity-token. Once the cause was clear, I extracted the token from /account/sign_in with a regular expression, added it to the body, and POSTed again; the response body showed that the login problem was solved (this does require the client to follow HTTP redirects).
The relevant code:
SignIn = '/account/sign_in' # const
GetAuthKeyExp = r'<input name="authenticity_token" type="hidden" value="(.*?)" />' # const
AuthKeyPattern = re.compile(GetAuthKeyExp) # const

def getAuthKey():
    global headers
    page = myRequest(url = SignIn, headers = headers)
    authKey = AuthKeyPattern.findall(page)
    return authKey[0]
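To sanity-check the regular expression on its own, here is a minimal standalone sketch run against a hypothetical page fragment (the token value here is made up for illustration; a real sign_in page embeds a fresh token on every request):

```python
import re

# Hypothetical fragment of the /account/sign_in page; real token values differ per request.
html = '<form><input name="authenticity_token" type="hidden" value="abc123==" /></form>'

pattern = re.compile(r'<input name="authenticity_token" type="hidden" value="(.*?)" />')
tokens = pattern.findall(html)
print(tokens[0])  # abc123==
```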
2. Staying logged in while visiting other pages
Reading the page source showed that to get the full-size images on pages other than the first, you have to click an image and enter the site's next-level directory, e.g. /items/802. When not logged in, clicking an image redirects you to the /account/sign_in page, so the program has to stay logged in while it visits other pages. Before studying crawlers my grasp of HTTP was superficial, so I went back to Google and learned that HTTP itself is a short-lived, stateless protocol; to give it state so that it can interact as a conversation, the concept of a session was added, and sessions in turn are implemented on top of cookies.
Packet captures showed that during the client-server exchange, the Cookie header of every HTTP request contains an item named _LouDaTui_session, the Set-Cookie header of the response carries the same item, and the next request sends back the session value returned by the previous response. That is exactly how the sessions described in the references work, so this site presumably maintains its session the same way via _LouDaTui_session. After implementing this, the problem described at the start of the article was solved.
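The cookie bookkeeping itself reduces to a small pure function. The following sketch is my own simplification, not the site's code: it takes the first item of a Set-Cookie response header and splices it into an existing Cookie request header, replacing any previous session item:

```python
SESSION_NAME = '_LouDaTui_session'

def merge_session(cookie, set_cookie):
    """Return a Cookie header with the session item from set_cookie spliced in."""
    # Keep only the name=value part, dropping attributes like path or HttpOnly.
    session = set_cookie.split(';')[0]
    # Drop any previous session item from the existing Cookie header.
    items = [i for i in cookie.split('; ') if not i.startswith(SESSION_NAME)]
    items.append(session)
    return '; '.join(items)

print(merge_session('__utma=1; _LouDaTui_session=old',
                    '_LouDaTui_session=new; path=/; HttpOnly'))
# __utma=1; _LouDaTui_session=new
```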
There was one more small snag: the site's login session is established starting from the first visit to /account/sign_in. In other words, if you clear the cookies just before clicking the login button on the /account/sign_in page, the login itself still succeeds, but subsequent requests are redirected back to /account/sign_in. I am not sure whether this counts as a small bug in the site; I would appreciate it if someone with more experience could explain. This also blocked me for quite a while; only after changing the program to record the session value starting from the very first request, the one that fetches the authenticity_token, did everything finally work.
The full code:
# spider
import urllib, httplib
import re
import threading
import os
import sys
import datetime

Pages = 10 # no. of pages to download.
Today = datetime.date.today().isoformat()
S = os.sep # const
Root = "d:" + S + "ludatui" + S # const
Prefix = "/?page=" # const
HOST = "loudatui.com" # const
SignIn = '/account/sign_in' # const
GetZoomLinkExp = r'<a class=".*?" href="(.*?)" title="\D*?">' # const
ZoomLinkPattern = re.compile(GetZoomLinkExp) # const
GetImageExp = r'<img alt=".*?" class=".*?" src="(.*?)" />' # const
ImagePattern = re.compile(GetImageExp) # const
GetAuthKeyExp = r'<input name="authenticity_token" type="hidden" value="(.*?)" />' # const
AuthKeyPattern = re.compile(GetAuthKeyExp) # const

## some sites block spiders; use headers to disguise the client as a browser.
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/28.0.1500.72 Safari/537.36',
            'Connection' : 'Keep-Alive',
            'Cookie' : '__utma=233318019.434537196.1394773301.1394773301.1394773301.1; __utmb=233318019.5.6.1394773301; __utmc=233318019; __utmz=233318019.1394773301.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)' }

def getAuthKey():
    global headers
    page = myRequest(url = SignIn, headers = headers)
    authKey = AuthKeyPattern.findall(page)
    return authKey[0]

## get the path used in a request from a full url.
def parseUrl(url = ''):
    pos = url.find(HOST) + len(HOST)
    return url[pos:]

def setHTTPSession(session = ''):
    global headers
    if not session:
        return
    tmpCookie = str(headers.get('Cookie'))
    sessionName = '_LouDaTui_session'
    pos = tmpCookie.find(sessionName)
    if -1 != pos:
        ## drop the old session item (and the '; ' before it).
        tmpCookie = tmpCookie[0:pos - 2]
    tmpCookie += '; ' + session
    headers['Cookie'] = tmpCookie

## this method returns the response body.
def myRequest(method = 'GET', url = '', body = '', headers = {}):
    ## init connection
    connection = httplib.HTTPConnection(HOST)
    ## send request
    connection.request(method = method, url = url, body = body, headers = headers)
    ## buffer the new cookie and other info
    resp = connection.getresponse()
    HTTPSession = ''
    if resp.getheader('Set-Cookie') != None:
        HTTPSession = resp.getheader('Set-Cookie').split(';')[0]
    status = resp.status
    location = resp.getheader('location')
    text = resp.read()
    ## remember the session for the next request
    setHTTPSession(HTTPSession)
    ## the connection should be closed before the next request,
    ## and the response becomes unusable after close().
    connection.close()
    ## redirect if needed.
    if (httplib.FOUND == status) or (httplib.MOVED_PERMANENTLY == status):
        text = myRequest(method = 'GET', url = parseUrl(location), headers = headers)
    return text

def login(signIn = '', data = '', headers = {}):
    ## login request.
    myRequest(method = 'POST', url = signIn, body = data, headers = headers)

def getIt(url, i, k):
    page = myRequest(url = url, headers = headers)
    picUrl = ImagePattern.findall(page)
    fname = Today + "-" + str(i) + "-" + str(k + 1) + ".jpg"
    if 1 == len(picUrl):
        urllib.urlretrieve(picUrl[0], os.path.join(Root, fname))
    else:
        print 'number of pic urls is %d' % len(picUrl)

def parsePages(indexOfPage = 0):
    global headers
    page = myRequest(url = Prefix + str(indexOfPage), headers = headers)
    items = ZoomLinkPattern.findall(page)
    tasks = []
    for k in range(len(items)):
        try:
            t = threading.Thread(target = getIt, args = (items[k], indexOfPage, k))
            tasks.append(t)
        except:
            print "some error in %sth download." % k
            continue
    for task in tasks:
        task.start()
    for task in tasks:
        task.join(300)
    return 0

def main():
    ## form fields to submit when logging in.
    formsPrev = 'utf8=%E2%9C%93&'
    forms = {
        'authenticity_token' : '',
        'user[login]' : '1234@163.com',
        'user[password]' : '123456',
        'user[remember_me]' : '0',
    }
    formsEnd = '&commit=%E7%99%BB%E9%99%86'
    ## fetch authenticity_token from the sign_in page;
    ## this also records the session the login depends on.
    forms['authenticity_token'] = getAuthKey()
    data = formsPrev + urllib.urlencode(forms) + formsEnd
    ## login.
    login(SignIn, data, headers)
    ## begin to spider the pics.
    if False == os.path.exists(Root):
        os.mkdir(Root)
    for n in range(Pages):
        print "Now page %s" % str(n + 1)
        parsePages(n + 1)
        print "Page %s OK\n" % str(n + 1)

main()
If you spot any problems in this post, please point them out.
Thanks in advance, everyone!