Python Crawlers from Getting Started to Giving Up (19): Crawling All Zhihu User Information with Scrapy (Part 2)...

Latest recommended article published on 2020-12-01 09:38:39

weixin_39644494


Tags: crawling Zhihu Live with Python

import json

import scrapy
from scrapy import Request

# UserItem is the Item class this spider fills in. The relative import is an
# assumption based on the default Scrapy project layout (items.py one level
# above spiders/); adjust it to match your project.
from ..items import UserItem


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['http://www.zhihu.com/']

    # start_user holds the url_token of the high-profile ("big V") account the crawl starts from
    start_user = "excited-vczh"

    # user_url is the user-detail endpoint; its query parameters are kept separately in user_query
    user_url = "https://www.zhihu.com/api/v4/members/{user}?include={include}"
    user_query = "locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,avatar_hue,answer_count,articles_count,pins_count,question_count,columns_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_bind_phone,is_force_renamed,is_bind_sina,is_privacy_protected,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics"

    # follows_url is the endpoint for the list of accounts a user follows; follows_query holds its
    # query parameters. offset and limit control paging: offset=0, limit=20 requests the first page.
    follows_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}"
    follows_query = "data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics"

    # followers_url is the endpoint for a user's follower list; followers_query holds its query parameters
    followers_url = "https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}"
    followers_query = "data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics"

    def start_requests(self):
        '''
        Override start_requests to request the start user's profile,
        the list of accounts they follow, and their follower list.
        :return:
        '''
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query),
                      callback=self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20),
                      callback=self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20),
                      callback=self.parse_followers)

    def parse_user(self, response):
        '''
        The response body is JSON, so it is parsed directly with json.loads.
        :param response:
        :return:
        '''
        result = json.loads(response.text)
        item = UserItem()
        # Copy over only the fields that are declared on UserItem
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        # Yield the item, then keep recursing: request this user's
        # followee list and follower list as well
        yield item
        yield Request(self.follows_url.format(user=result.get("url_token"), include=self.follows_query, offset=0, limit=20),
                      callback=self.parse_follows)
        yield Request(self.followers_url.format(user=result.get("url_token"), include=self.followers_query, offset=0, limit=20),
                      callback=self.parse_followers)

    def parse_follows(self, response):
        '''
        Parse the list of accounts a user follows. The response is JSON with
        two keys, data and paging; paging carries the pagination information.
        :param response:
        :return:
        '''
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get("url_token"), include=self.user_query),
                              callback=self.parse_user)
        # If paging is present and paging.is_end is False, this is not the last page.
        # (The check must read is_end from inside paging, the same object that holds next.)
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get("next")
            # Request the next page and handle it with this same callback
            yield Request(next_page, self.parse_follows)

    def parse_followers(self, response):
        '''
        Parse a user's follower list. This is handled exactly like the followee
        list: the response is JSON with two keys, data and paging, and paging
        carries the pagination information.
        :param response:
        :return:
        '''
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get("url_token"), include=self.user_query),
                              callback=self.parse_user)
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get("next")
            # Request the next page and handle it with this same callback
            yield Request(next_page, self.parse_followers)
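
Both list callbacks make the same two decisions on every response: which users to queue next, and whether there is another page to fetch. The snippet below is a minimal standalone sketch of that logic without Scrapy, so it can be tried in a plain Python shell. It assumes the response carries a top-level data list and a paging object holding is_end and next, which is inferred from how the spider reads the response; the helper name extract_tokens_and_next and the sample body are made up for illustration and are not real API output.

```python
import json

# Hypothetical, trimmed-down response body. The keys mirror what the spider
# reads (data, url_token, paging, is_end, next); the values are invented.
SAMPLE_BODY = json.dumps({
    "data": [{"url_token": "user-a"}, {"url_token": "user-b"}],
    "paging": {
        "is_end": False,
        "next": "https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20",
    },
})


def extract_tokens_and_next(body):
    """Return (url_tokens, next_page_url_or_None) for one list response."""
    results = json.loads(body)
    # Every entry in data is a user whose profile should be requested next
    tokens = [entry.get("url_token") for entry in results.get("data", [])]
    # Follow the next link only when paging exists and says this is not the last page
    paging = results.get("paging") or {}
    next_page = paging.get("next") if paging.get("is_end") is False else None
    return tokens, next_page


tokens, next_page = extract_tokens_and_next(SAMPLE_BODY)
print(tokens)
print(next_page)
```

In the spider, each token would become a Request to user_url with parse_user as the callback, and a non-empty next_page would become a Request handled by the same list callback. Reading is_end from inside paging matters: if it is looked up on the top-level object, the lookup returns None and the crawl never moves past the first page.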
