Scraping sites that require login with Scrapy
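Log in to the site in your browser, then right-click and choose Inspect (or press F12) and find the cookie value. Because you are already logged in, this is the cookie of an authenticated session. Convert it to dictionary format.
The following code converts the cookie string copied from the browser directly into a dictionary: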
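class DictCookie:
    """Convert a raw browser cookie string into a dict."""

    def __init__(self, cookie):
        self.cookie = cookie

    def stringToDict(self):
        itemDict = {}
        # Cookie pairs are separated by ';'
        for item in self.cookie.split(';'):
            # Skip empty or malformed segments (e.g. after a trailing ';')
            if '=' not in item:
                continue
            # Split on the first '=' only, because values may contain '='
            key, value = item.split('=', 1)
            itemDict[key.strip()] = value.strip()
        return itemDict


if __name__ == "__main__":
    cookie = "paste the cookie string you copied from the site here"
    trans = DictCookie(cookie)
    dict_cookie = trans.stringToDict()
    print("dict_cookie:", dict_cookie)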
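To see what the conversion produces, here is a minimal function-style sketch of the same idea run on a made-up cookie string. The cookie names and values (`sessionid`, `token`, `theme`) are hypothetical, and so is the function name `cookie_string_to_dict`. Real cookies depend entirely on the site.

```python
def cookie_string_to_dict(raw_cookie):
    """Parse 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    result = {}
    for pair in raw_cookie.split(";"):
        # partition splits on the first '=' only, so '=' inside a value survives
        key, sep, value = pair.partition("=")
        if not sep:
            # No '=' found: empty segment or malformed pair, skip it
            continue
        result[key.strip()] = value.strip()
    return result


# Hypothetical cookie string, for illustration only
sample = "sessionid=abc123; token=dG9rZW4=; theme=dark;"
parsed = cookie_string_to_dict(sample)
print(parsed)
# {'sessionid': 'abc123', 'token': 'dG9rZW4=', 'theme': 'dark'}
```

Note that the `token` value keeps its trailing `=`, and the trailing `;` does not cause an error.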
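Then, in your crawl.py (the spider), define the cookie and the headers as attributes of the spider class:
cookie = {
    "key1": "values1",
    "key2": "values2",
    # ....
}
headers = {
    "Accept": "",
    "Accept-Encoding": "",
    "Accept-Language": "",
    "Cache-Control": "",
    "Connection": "",
    "Host": "",
    "Referer": "",
    "User-Agent": "",
}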
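(A note on the headers: when I was scraping product links, my headers had no Referer, and every request was redirected with a 302 status. After I added Referer, scraping worked again.)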
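One way to add the Referer is sketched below. The note above does not say which Referer value was used, so this assumes the site accepts its own home page as the referring page. The helper name `with_referer` and the example URL are hypothetical. In practice, copy the Referer your browser actually sends for that request.

```python
from urllib.parse import urlsplit


def with_referer(headers, url):
    """Return a copy of headers with Referer set to the origin of url."""
    parts = urlsplit(url)
    merged = dict(headers)  # copy, so the original dict is left untouched
    # Assumption: the site's own home page is an acceptable Referer
    merged["Referer"] = f"{parts.scheme}://{parts.netloc}/"
    return merged


base_headers = {"User-Agent": "Mozilla/5.0"}
headers = with_referer(base_headers, "https://example.com/item/12345")
print(headers["Referer"])  # https://example.com/
```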
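Pass the cookie and headers to each request:
yield scrapy.Request(url=url, callback=self.parse, headers=self.headers, cookies=self.cookie)
This tells the site that the request comes from a session that is already logged in.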