先说一下核心思想,从一个大v开始,抓取他的关注和粉丝,然后再遍历这两个群体,再抓关注和粉丝,层层抓下去,就会覆盖知乎的所有用户。
好,让我们先分析分析知乎这个网站,提示一下知乎访问是需要一些请求头的,
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}
以大v轮子哥为例:
1.获取一个用户的信息
我们在轮子哥的关注列表,用光标轻触用户的名字就可以得到该用户的信息
2.再分析他关注的人吧
这些都是他关注的人,也就是说我们可以从这个url获取他关注的人,我们注意到
意思是每一页显示20个,我们可以用offset来控制翻页。
3.粉丝
这个我就不分析了,和轮子哥关注的列表是一样的获取方法。
全部代码:
# -*- coding: utf-8 -*-
import json
from scrapy import Spider, Request
from zhihuuser.items import UserItem
class ZhihuSpider(Spider):
name = "zhihu"
allowed_domains = ["www.zhihu.com"]
#用户的详情页 这个是你触碰某个用户的名字时会返回的
user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
#用户的关注列表
follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
#粉丝列表
followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
start_user = 'excited-vczh'
user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
def start_requests(self):
yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
self.parse_follows)
yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0),
self.parse_followers)
def parse_user(self, response):
result = json.loads(response.text)
item = UserItem()
for field in item.fields:
if field in result.keys():
item[field] = result.get(field)
yield item
yield Request(
self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
self.parse_follows)
yield Request(
self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
self.parse_followers)
def parse_follows(self, response):
results = json.loads(response.text)
if 'data' in results.keys():
for result in results.get('data'):
yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
self.parse_user)
if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
next_page = results.get('paging').get('next')
yield Request(next_page,
self.parse_follows)
def parse_followers(self, response):
results = json.loads(response.text)
if 'data' in results.keys():
for result in results.get('data'):
yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
self.parse_user)
if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
next_page = results.get('paging').get('next')
yield Request(next_page,
self.parse_followers)
我主要想分析一下运行流程。
首先,start_requests函数执行,发出第一个request,即请求轮子哥的用户详情,然后回调到parse_user函数,返回轮子哥的信息,然后接着start_requests再发送request请求,是请求轮子哥的关注列表的,回调parse_follows函数,请求第一个关注者的详细信息,好了,这里生成了1个新的request请求,姑且称为请求x,请求内容就是以关注列表的第一个人的url_token为唯一标识请求该用户的详细信息。
def parse_user(self, response):
result = json.loads(response.text)
item = UserItem()
for field in item.fields:
if field in result.keys():
item[field] = result.get(field)
yield item
返回了这个关注的人的个人信息,然后发起另一个request
yield Request(
self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
self.parse_follows)
是请求这个人的关注的人,简称为小明。
然后开始循环这个过程,继续请求这个小明的关注的人的详细信息,
循环往复直到x请求的for循环完毕,然后开始执行粉丝:
yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0),
self.parse_followers)
这个过程和关注的人走的是一样的流程,就好比深度优先搜索,一条路走到黑然后再横向遍历。
最终覆盖所有人(不包括零粉丝零关注的人)。