1. Review of the simulated-login methods covered earlier
1.1 How does the requests module simulate login?
- Carry cookies directly when requesting the page
- Find the login URL and send a POST request; the session stores the cookie
1.2 How does selenium simulate login?
- Locate the corresponding input tags, type in the credentials, and click the login button
1.3 scrapy has two ways to simulate login
- Carry cookies directly
- Find the login URL and send a POST request; the middleware stores the cookie
2. Fetching login-protected pages by carrying cookies directly in scrapy
2.1 Use cases
- The cookie has a very long expiry time, common on loosely built sites
- All of the data can be scraped before the cookie expires
- Working together with another program, e.g. using selenium to log in, saving the resulting cookies locally, and having scrapy read the local cookies before sending requests
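The last scenario can be sketched roughly as follows. Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys; assuming those records were dumped to a local `cookies.json` (the file name and helper function are illustrative, not from the original), scrapy can load them back into the dict form that `scrapy.Request(cookies=...)` accepts:

```python
import json

def load_cookies(path='cookies.json'):
    """Convert selenium-style cookie records, [{'name': ..., 'value': ...}, ...],
    into the {name: value} dict that scrapy.Request(cookies=...) accepts."""
    with open(path, encoding='utf-8') as f:
        records = json.load(f)
    return {c['name']: c['value'] for c in records}
```

In the spider, `start_requests` would then yield `scrapy.Request(url, cookies=load_cookies())`.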
2.2 Carrying the cookie by modifying DEFAULT_REQUEST_HEADERS in settings
settings.py
```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    'Cookie': 'ASP.NET_SessionId=n4lwamv5eohaqcorfi3dvzcv; xiaohua_visitfunny=103174; xiaohua_web_userid=120326; xiaohua_web_userkey=r2sYazfFMt/rxUn8LJDmUYetwR2qsFCHIaNt7+Zpsscpp1p6zicW4w==',
}
```
Note: COOKIES_ENABLED must be uncommented and set to False; otherwise Scrapy's cookie middleware takes over cookie handling and the Cookie header set above is ignored:

```python
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
```
Limitations:
This setup does keep you logged in by carrying a cookie, but it can never pick up a new cookie; the cookie stays fixed for the life of the crawl. If the cookie changes frequently, this approach is not suitable.
2.3 Implementation: overriding scrapy's start_requests method
In scrapy, the URLs in start_urls are processed by start_requests, which is implemented as follows:
```python
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
```
Accordingly, if the URLs in start_urls can only be accessed after logging in, override the start_requests method and attach the cookie there manually.
denglu.py (the spider file)
```python
import scrapy


class DengluSpider(scrapy.Spider):
    name = 'denglu'
    # allowed_domains = ['https://user.17k.com/ck/user/mine/readList?page=1']
    start_urls = ['https://user.17k.com/ck/user/mine/readList?page=1&appKey=2406394919']

    def start_requests(self):
        cookies = 'GUID=796e4a09-ba11-4ecb-9cf6-aad19169267d; Hm_lvt_9793f42b498361373512340937deb2a0=1660545196; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F18%252F98%252F90%252F96139098.jpg-88x88%253Fv%253D1650527904000%26id%3D96139098%26nickname%3D%25E4%25B9%25A6%25E5%258F%258BqYx51ZhI1%26e%3D1677033668%26s%3D8e116a403df502ab; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2296139098%22%2C%22%24device_id%22%3A%22181d13acb2c3bd-011f19b55b75a8-1c525635-1296000-181d13acb2d5fb%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%2C%22first_id%22%3A%22796e4a09-ba11-4ecb-9cf6-aad19169267d%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1661483362'
        # Split pairs on '; ', and each pair on the FIRST '=' only,
        # because cookie values can themselves contain '='
        cookie_dic = {}
        for i in cookies.split('; '):
            name, value = i.split('=', 1)
            cookie_dic[name] = value
        # Equivalent dict comprehension:
        # cookie_dic = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ')}
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookie_dic)

    def parse(self, response):
        print(response.text)
```
Note:
- In scrapy, cookies cannot be placed in headers; when constructing a request there is a dedicated cookies parameter, which accepts a dict of cookies
- Remember to set the ROBOTSTXT_OBEY and USER_AGENT options in settings
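Since the cookies parameter wants a dict, the raw Cookie header string has to be parsed first. One pitfall is worth a sketch (the helper name is mine, not from the original): each pair must be split on the first '=' only, because values such as base64 tokens often contain '=' themselves:

```python
def cookies_str_to_dict(cookies_str):
    # Split pairs on '; ', then each pair on the FIRST '=' only,
    # so '=' characters inside values survive intact
    return dict(pair.split('=', 1) for pair in cookies_str.split('; '))
```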
3. Sending POST requests with scrapy.FormRequest
We know a POST request can be sent with scrapy.Request() by specifying the method and body parameters; scrapy.FormRequest() can send a POST request as well, and more conveniently.
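Under the hood the two are equivalent: FormRequest url-encodes the formdata dict into the POST body and sets the form Content-Type for you. A standard-library sketch of that encoding, reusing the placeholder credentials from this article:

```python
from urllib.parse import urlencode

formdata = {'loginName': '17346570232', 'password': 'xlg17346570232'}
# The body FormRequest generates; with plain scrapy.Request you would pass
# this string as body= and set method='POST' yourself
body = urlencode(formdata)
print(body)  # loginName=17346570232&password=xlg17346570232
```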
3.1 Using scrapy.FormRequest()
scrapy.FormRequest sends a POST request; pass the request body through the formdata parameter and supply a callback:
```python
login_url = 'https://passport.17k.com/ck/user/login'
yield scrapy.FormRequest(
    url=login_url,
    formdata={'loginName': '17346570232', 'password': 'xlg17346570232'},
    callback=self.do_login
)
```
3.2 Logging in with scrapy.FormRequest()
3.2.1 Approach
- Find the POST URL: click the login button while capturing traffic, which locates the URL https://passport.17k.com/ck/user/login
- Work out the pattern of the request body: analyse the POST request body; all of its parameters appear in the previous response
- Check whether login succeeded: request the personal home page and see whether it contains the username
3.2.2 Implementation:
```python
import scrapy


class DengluSpider(scrapy.Spider):
    name = 'denglu'
    # allowed_domains = ['17k.com']
    start_urls = ['https://user.17k.com/ck/user/mine/readList?page=1&appKey=2406394919']

    def start_requests(self):
        '''POST the login account and password'''
        login_url = 'https://passport.17k.com/ck/user/login'
        # Plain Request version:
        # yield scrapy.Request(url=login_url, body='loginName=17346570232&password=xlg17346570232',
        #                      callback=self.do_login, method='POST')
        # FormRequest, a Request subclass, sends a POST automatically
        yield scrapy.FormRequest(
            url=login_url,
            formdata={'loginName': '17346570232', 'password': 'xlg17346570232'},
            callback=self.do_login
        )

    def do_login(self, response):
        '''After a successful login, hand off to parse;
        the cookie middleware carries the cookie for us automatically'''
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        print(response.text)
```
Summary
- The URLs in start_urls are handled by start_requests; override start_requests when necessary
- Logging in by carrying cookies directly: cookies can only be passed through the cookies parameter
- scrapy.FormRequest() sends POST requests