Over the past few days I have been learning the Scrapy framework, and I wanted to practice by crawling Zhihu user profiles. I then discovered that the followers and followees pages of each user require login; without it they cannot be crawled.
Searching online, the advice I found was to POST the username and password to http://www.zhihu.com/login. That didn't work when I tried it: it turns out login is now a POST to http://www.zhihu.com/login/email.
Watching the login process in Chrome's developer tools,
I found that Zhihu now POSTs email, password, and _xsrf to http://www.zhihu.com/login/email.
So I posted those fields to that URL, but the login response still came back as an error.
With no other option, I added the cookies as well, and the login finally succeeded.
The cookies can also be seen in the developer tools; remember to convert them into a dict.
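If you copy the raw Cookie request header out of the developer tools, a small helper can do the conversion for you. This is a sketch (the helper name and the sample values are my own, not part of Scrapy), assuming the usual "name=value; name2=value2" header layout:

```python
def cookie_header_to_dict(header):
    """Split a raw 'Cookie' request-header string into a dict.

    Assumes the 'name=value; name2=value2' layout you get when
    copying the Cookie header from the browser's developer tools.
    """
    cookies = {}
    for pair in header.split(';'):
        if '=' in pair:
            name, _, value = pair.strip().partition('=')
            cookies[name] = value
    return cookies

# Example with made-up values:
print(cookie_header_to_dict('_xsrf=abc123; q_c1=def456'))
# {'_xsrf': 'abc123', 'q_c1': 'def456'}
```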
Logging in with Scrapy is straightforward.
# coding=utf-8
from scrapy.spiders import CrawlSpider
from zhihu_user.items import *
import scrapy


class ZhihuUserSpider(CrawlSpider):
    name = "zhihu_user"
    allowed_domains = ['zhihu.com']
    start_urls = ["http://www.zhihu.com"]
    # The cookies are required!
    cook = {'_za': 'fa9fc68f-11cf-4ac5-988c-c96a71314555',
            'cap_id': 'OTZkYWNhM2U5NjNjNDY0YjhiY2RlZTY5ZWU2YzQxOTM=|1437467003|4e4efe7eac594758447752d643bd2d09a55da003',
            '_ga': 'GA1.2.1564443706.1436181504',
            'q_c1': 'a0fa2a995d2b42508c989033b99f8b59|1438829781000|1436181610000',
            'Hm_lvt_16374ac3e05d67d6deb7eae3487c2345': '1438829813',
            'CNZZDATA1255966030': '2033776647-1438828198-http%253A%252F%252Fwww.zhihu.com%252F%7C1438828198',
            '_xsrf': 'c0fb9d9a1e9fd2d2c13f873c8b632084',
            'tc': 'AQAAAAWHFmdKTAYAMYVscUxEq23ssAJS',
            'z_c0': 'QUFBQVV1Y2tBQUFYQUFBQVlRSlZUWmY2NzFYZ3FydEdUSENERHBvZk12SVRQVFVhVFE2OUJRPT0=|1439198615|9846ac1f6283b21c5ac397c52b5d91dbd8a4ad18',
            'unlock_ticket': 'QUFBQVV1Y2tBQUFYQUFBQVlRSlZUWjkweUZYMHZ3RzJSVjhFR1o2R0thY2RhZkxxOExlajJRPT0=|1439198615|040bb323faec37453cdb5b285feeb58e8162bee1',
            '__utmt': '1',
            '__utma': '51854390.1564443706.1436181504.1439197715.1439198983.2',
            '__utmb': '51854390.9.9.1439199033355',
            '__utmc': '51854390',
            '__utmz': '51854390.1439198983.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',
            '__utmv': '51854390.100-1|2=registration_date=20140128=1^3=entry_date=20140128=1'
            }

    def start_requests(self):  # log in
        return [scrapy.FormRequest(
            "http://www.zhihu.com/login/email",
            formdata={'email': youremail,        # fill in your own email
                      '_xsrf': xsrf,             # taken from the request headers in the developer tools
                      'remember_me': 'true',
                      'password': yourpassword}, # fill in your own password
            cookies=self.cook,
            callback=self.after_login
        )]

    def after_login(self, response):
        print 'after login'
        for url in self.start_urls:
            request = self.make_requests_from_url(url)
            yield request
scrapy.FormRequest POSTs the form data to the URL (here the email, password, and so on) and returns a Request; once the response arrives, the callback is invoked, which then moves on to the pages in start_urls.
And with that, login is done!
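Instead of copying the _xsrf value by hand each time it expires, it can usually be pulled out of the login page's HTML, where Zhihu at the time embedded it as a hidden form input. A minimal sketch (the regex and the markup it assumes are my guess at the page layout, not guaranteed):

```python
import re


def extract_xsrf(html):
    """Pull the _xsrf token out of the login page's HTML.

    Assumes the token appears as a hidden input such as
    <input type="hidden" name="_xsrf" value="..."/>; the exact
    markup may differ, so treat this as a sketch.
    """
    match = re.search(r'name="_xsrf" value="([^"]+)"', html)
    return match.group(1) if match else None


# Example with a made-up token:
sample = '<input type="hidden" name="_xsrf" value="c0fb9d9a1e9fd2d2"/>'
print(extract_xsrf(sample))  # c0fb9d9a1e9fd2d2
```

In the spider, you would first request the login page, call something like extract_xsrf on the response body, and only then build the FormRequest with the fresh token.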