Scraping all Zhihu user information with Scrapy

CtrlZ1

Categories: python, crawlers. Tags: python, scrapy, Zhihu crawler

Article link: https://blog.csdn.net/qq_41076797/article/details/97393228

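First, the core idea: start from one "big V" (a high-profile account), fetch the people he follows and his followers, then go through both groups and fetch their followees and followers in turn. Repeating this layer by layer eventually covers every Zhihu user reachable through follow links.

Now let's look at the Zhihu site itself. One hint up front: requests to Zhihu need a few request headers: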
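The layer-by-layer idea can be sketched on a toy graph before touching the network. This is only an illustration: the user names and the `crawl_all` helper are made up, and the two dictionaries stand in for the followee and follower lists that the real crawl fetches from Zhihu.

```python
from collections import deque


def crawl_all(start, followees, followers):
    """Visit every user reachable from `start` through follow links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        # Expand both groups: the people this user follows and their followers.
        for other in followees.get(user, []) + followers.get(user, []):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return order


# Hypothetical toy graph: who each user follows.
followees = {"vczh": ["alice"], "alice": ["bob"], "carol": ["vczh"], "dave": []}

# Derive the reverse mapping (who follows each user).
followers = {}
for user, targets in followees.items():
    for target in targets:
        followers.setdefault(target, []).append(user)

visited = crawl_all("vczh", followees, followers)
print(visited)
```

The `seen` set plays the role that Scrapy's built-in duplicate filter plays in a real crawl. Note that `dave`, who follows no one and has no followers, is never reached.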

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}

Take the big V 轮子哥 as an example:

1. Getting one user's information

In 轮子哥's following list, hovering the cursor over a user's name brings up that user's information.

The corresponding URL is https://www.zhihu.com/api/v4/members/Yefeng7?include=allow_message%2Cis_followed%2Cis_following%2Cis_org%2Cis_blocking%2Cemployments%2Canswer_count%2Cfollower_count%2Carticles_count%2Cgender%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics
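The percent-encoded `include` parameter is easier to read once decoded. A minimal sketch using only the standard library; the variable names are my own, and the URL is the one quoted above.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://www.zhihu.com/api/v4/members/Yefeng7?include=allow_message%2Cis_followed"
       "%2Cis_following%2Cis_org%2Cis_blocking%2Cemployments%2Canswer_count"
       "%2Cfollower_count%2Carticles_count%2Cgender"
       "%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics")

parts = urlsplit(url)
# The last path segment identifies the user.
url_token = parts.path.rsplit("/", 1)[-1]
# parse_qs percent-decodes the value, so %2C becomes a comma.
include = parse_qs(parts.query)["include"][0]
fields = include.split(",")

print(url_token)
for field in fields:
    print(field)
```

So `include` is simply a comma-separated list of the fields the API should return for that user.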

2. Next, the people he follows

The URL is https://www.zhihu.com/api/v4/members/Talyer-Wei/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20 and it returns JSON.

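The entries in that JSON are all people he follows, so this URL is how we get a user's followees. Note the limit=20 and offset=20 parameters at the end of the URL: each page holds 20 entries, and offset controls paging.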
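As a rough sketch of offset-based paging, the helper below builds one followees URL per page. The `page_urls` function and its `total` argument (the number of followees, assumed to be known) are hypothetical; the URL template is the one used in the spider code.

```python
FOLLOWS_URL = ("https://www.zhihu.com/api/v4/members/{user}/followees"
               "?include={include}&offset={offset}&limit={limit}")


def page_urls(user, include, total, limit=20):
    """Yield one followees URL per page, stepping offset by `limit`."""
    for offset in range(0, total, limit):
        yield FOLLOWS_URL.format(user=user, include=include,
                                 offset=offset, limit=limit)


# 45 followees at 20 per page means offsets 0, 20 and 40.
urls = list(page_urls("excited-vczh", "data[*].answer_count", total=45))
for page_url in urls:
    print(page_url)
```

The spider in this article does not compute offsets itself. It follows the `next` link found in each response's `paging` object, which avoids needing the total in advance.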

3. Followers

I won't analyze this one separately; followers are fetched the same way as 轮子哥's following list.

Full code:

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request
from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    # User detail endpoint: this is what is returned when you hover over a user's name
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    # The list of people a user follows
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    # The list of a user's followers
    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()

        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
            self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_followers)
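The logic shared by parse_follows and parse_followers can be tried offline against a hand-written response. The sample body below is hypothetical and contains only the keys the spider actually reads (`data`, `url_token`, `paging`, `is_end`, `next`); the `extract` helper is my own name, not part of the spider.

```python
import json

# Hypothetical response body, shaped like what parse_follows reads.
sample = json.dumps({
    "data": [{"url_token": "alice"}, {"url_token": "bob"}],
    "paging": {
        "is_end": False,
        "next": "https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20",
    },
})


def extract(body):
    """Return (url_tokens on this page, next page URL or None)."""
    results = json.loads(body)
    tokens = [entry.get("url_token") for entry in results.get("data", [])]
    paging = results.get("paging", {})
    # Only follow the next link when the API says this is not the last page.
    next_page = paging.get("next") if paging.get("is_end") is False else None
    return tokens, next_page


tokens, next_page = extract(sample)
print(tokens)
print(next_page)
```

In the spider, each token becomes a request handled by parse_user, and a non-empty next page becomes a request handled by the same list callback.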

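What I mainly want to walk through is the execution flow.

First, start_requests runs and issues the first request, for 轮子哥's user details; its callback, parse_user, yields 轮子哥's information. start_requests then sends the next request, for 轮子哥's following list, whose callback is parse_follows. parse_follows requests the details of the first person in that list, which creates one new request; call it request x. Request x fetches that user's details, using the url_token of the first person in the following list as the unique identifier.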

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()

        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
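The field-copying loop can be tried without Scrapy. In this sketch a plain set stands in for `item.fields`; the field names and the payload are hypothetical, since the article does not show the UserItem definition.

```python
# Hypothetical subset of the fields a UserItem might declare.
ITEM_FIELDS = {"name", "url_token", "gender", "follower_count", "headline"}


def copy_known_fields(result, fields=ITEM_FIELDS):
    """Keep only keys that are declared fields and present in the response."""
    return {field: result[field] for field in fields if field in result}


# Hypothetical API payload with one undeclared key and one missing field.
payload = {"name": "vczh", "url_token": "excited-vczh", "gender": 1,
           "follower_count": 100, "is_org": False}
item = copy_known_fields(payload)
print(item)
```

Keys the item does not declare (`is_org` here) are dropped, and declared fields missing from the response (`headline` here) are simply left unset.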

parse_user yields this followed person's profile and then issues another request:

yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

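This requests the people that this person follows; let's call this person Xiao Ming.

The process then repeats: the spider goes on to request the details of each person Xiao Ming follows, and so on, until the for loop that produced request x has finished. After that the followers request runs: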

yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0),
                      self.parse_followers)

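Followers go through the same flow as followees. The whole thing resembles depth-first search: follow one path as far as it goes, then traverse sideways.

In the end the crawl covers everyone (except users who have zero followers and follow no one).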
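The depth-first behaviour comes from Scrapy's default scheduler, which keeps pending requests in a LIFO queue. If breadth-first order is preferred, the Scrapy documentation describes switching to FIFO queues with the settings below. Whether that suits this crawl is a judgement call; the article itself relies on the default order.

```python
# settings.py (fragment): crawl in breadth-first order instead of the default.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```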
