Crawling Zhihu Users with Python

Note: this tutorial is for technical learning and reference only!

1. Today I will show how to use Scrapy to crawl the user information of a Zhihu user's circle. This article assumes you already know the Scrapy framework; if you don't, please learn it first. The crawl result looks like this:
(screenshot: crawled user records stored in the database)

2. Implementation
2.1 Create the table for storing the data.

CREATE TABLE `zhihu_user` (
  `id` int(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `zhihu_id` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'Zhihu internal user id',
  `user_image` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'avatar URL',
  `follower_count` int(10) DEFAULT '0' COMMENT 'number of followers',
  `gender` int(10) DEFAULT '0' COMMENT 'gender, 1 = male; 2 = female',
  `description` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'personal description',
  `business` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'industry',
  `school` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'school',
  `major` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'major',
  `company` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'company',
  `job` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'job title',
  `headline` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'headline',
  `locations` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'locations (places of residence)',
  `name` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'name',
  `url_token` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'url_token',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci COMMENT='Zhihu user table';

2.2 items.py: define the fields to scrape, kept consistent with the table columns above.

import scrapy

class MytestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    zhihu_id = scrapy.Field()
    user_image = scrapy.Field()
    follower_count = scrapy.Field()
    gender = scrapy.Field()
    description = scrapy.Field()
    business = scrapy.Field()
    school = scrapy.Field()
    major = scrapy.Field()
    company = scrapy.Field()
    job = scrapy.Field()
    headline = scrapy.Field()
    locations = scrapy.Field()
    name = scrapy.Field()
    url_token = scrapy.Field()

2.3 Analyze the crawling APIs
We want to crawl a Zhihu user's circle, so we need a starting user: first crawl that person's profile, and then crawl the people they follow.

Then crawl the profile of each followee, and then the followees of each followee; recursing like this walks through every user in the starting person's circle.

As the starting user I picked https://www.zhihu.com/people/wang-jing-15-15-28/activities . Open the page, press F12, and look through the network requests until you find the one below.

That is exactly the user detail API we are looking for:

https://www.zhihu.com/api/v4/members/wang-jing-15-15-28?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_bind_phone%2Cis_force_renamed%2Cis_bind_sina%2Cis_privacy_protected%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics

There is only one dynamic parameter here: wang-jing-15-15-28. Zhihu calls it the url_token; it identifies the user.

Next, find the followees API:

https://www.zhihu.com/api/v4/members/wang-jing-15-15-28/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20

The dynamic parameters here are wang-jing-15-15-28 (the url_token), offset, and limit.
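
Before wiring these into Scrapy, it helps to poke the two endpoints manually with requests. A minimal sketch (assuming the same authorization header that is configured in settings.py further down, and a shortened include parameter list; the token value may differ for you):

import requests

# Headers captured from the browser's DevTools (see settings.py below);
# the authorization token here is an assumption and may need to be refreshed.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}

DETAIL_URL = 'https://www.zhihu.com/api/v4/members/{person_name}?include=locations,employments,gender,educations,business,follower_count,description'
FOLLOWEES_URL = 'https://www.zhihu.com/api/v4/members/{person_name}/followees?include=data[*].answer_count,follower_count&offset={offset}&limit={limit}'

# Fetch the starting user's profile.
detail = requests.get(DETAIL_URL.format(person_name='wang-jing-15-15-28'), headers=HEADERS).json()
print(detail.get('name'), detail.get('follower_count'))

# Fetch the first page of followees; paging works through offset/limit (0, 20, 40, ...).
followees = requests.get(FOLLOWEES_URL.format(person_name='wang-jing-15-15-28', offset=0, limit=20), headers=HEADERS).json()
for person in followees.get('data', []):
    print(person.get('url_token'))
print('last page?', followees.get('paging', {}).get('is_end'))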

2.4 The main part: writing the spider. [First request the starting user's profile; the profile callback then requests the paginated followees list, and the followees callback in turn requests each followee's profile, recursively. That is the overall logic; if the description is hard to follow, just read the code.]

# -*- coding: utf-8 -*-
import scrapy
import logging
import json
from mytest.items import MytestItem

###### Global variable: user profile API; the dynamic parameter is person_name
person_detail = 'https://www.zhihu.com/api/v4/members/{person_name}?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_bind_phone%2Cis_force_renamed%2Cis_bind_sina%2Cis_privacy_protected%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'

###### Global variable: followees API; the dynamic parameters are person_name, offset and limit
his_about_person = 'https://www.zhihu.com/api/v4/members/{person_name}/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={offset}&limit={limit}'

class TestspiderSpider(scrapy.Spider):
    name = 'testSpider'
    allowed_domains = ['zhihu.com']
    start_urls = ['https://www.zhihu.com']


    ### Initial entry point
    def start_requests(self):
        ### url_token of the starting user
        init_url_token = 'wang-jing-15-15-28'
        # Crawl this user's profile
        yield scrapy.Request(url=person_detail.format(person_name=init_url_token), callback=self.parse_person_detail)

    ## Parse the users he/she follows
    def parse_his_followees(self, response):
        ## Convert the response to a dict
        detail_dict = json.loads(response.text)
        ## Parse the url_token of each followee on the current page
        if 'data' in detail_dict.keys():
            for item in detail_dict.get('data'):
                # Crawl this followee's profile
                print('url_token:', item.get('url_token'))
                yield scrapy.Request(url=person_detail.format(person_name=item.get('url_token')), callback=self.parse_person_detail)

        ### Parse the next page of followees
        if 'paging' in detail_dict:
            if 'is_end' in detail_dict.get('paging'):
                ## Is this the last page?
                if detail_dict.get('paging').get('is_end'):
                    ## Last page, no need to request the next one
                    return
                ## Request the next page
                print('Requesting next page: ', detail_dict.get('paging').get('next'))
                if 'next' in detail_dict.get('paging'):
                    yield scrapy.Request(url=detail_dict.get('paging').get('next'),callback=self.parse_his_followees)

    ### Parse a user's profile details
    def parse_person_detail(self,response):
        ## Convert the response to a dict
        detail_dict = json.loads(response.text)
        name = detail_dict.get('name')
        my_item = MytestItem()
        print(name)
        ####### Data extraction below
        if 'id' in detail_dict:
            my_item['zhihu_id'] = detail_dict.get('id')
        if 'avatar_url_template' in detail_dict:
            my_item['user_image'] = detail_dict.get('avatar_url_template').replace('{size}','xl')
        else:
            my_item['user_image'] = ''
        if 'follower_count' in detail_dict:
            my_item['follower_count'] = detail_dict.get('follower_count')
        else:
            my_item['follower_count'] = 0
        if 'gender' in detail_dict:
            my_item['gender'] = detail_dict.get('gender')
        else:
            my_item['gender'] = 1
        if 'description' in detail_dict:
            my_item['description'] = detail_dict.get('description')
        else:
            my_item['description'] = ''
        if detail_dict.get('business') and 'name' in detail_dict.get('business'):
            my_item['business'] = detail_dict.get('business').get('name')
        else:
            my_item['business'] =''
        if 'educations' in detail_dict:
            school_list = ''
            for item in detail_dict.get('educations'):
                if 'school' in item and 'name' in item.get('school'):
                    school_list = school_list + item.get('school').get('name') + ','
            my_item['school'] = school_list
        if 'educations' in detail_dict:
            major_list = ''
            for item in detail_dict.get('educations'):
                if 'major' in item and 'name' in item.get('major'):
                    major_list = major_list + item.get('major').get('name') + ','
            my_item['major'] = major_list
        if 'employments' in detail_dict:
            company_list = ''
            for item in detail_dict.get('employments'):
                if 'company' in item and 'name' in item.get('company'):
                    company_list = company_list + item.get('company').get('name') + ','
            my_item['company'] = company_list
        if 'employments' in detail_dict:
            job_list = ''
            for item in detail_dict.get('employments'):
                if 'job' in item and 'name' in item.get('job'):
                    job_list = job_list + item.get('job').get('name') + ','
            my_item['job'] = job_list
        if 'headline' in detail_dict:
            my_item['headline'] = detail_dict.get('headline')
        if 'locations' in detail_dict:
            location_list = ''
            for item in detail_dict.get('locations'):
                if 'name' in item:
                    location_list = location_list + item.get('name') + ','
            my_item['locations'] = location_list
        if 'name' in detail_dict:
            my_item['name'] = detail_dict.get('name')
        else:
            my_item['name'] = ''
        if 'url_token' in detail_dict:
            my_item['url_token'] = detail_dict.get('url_token')

        yield my_item
        ## Now request this user's followees
        url_token = detail_dict.get('url_token')
        ## Page 1 is offset 0, page 2 is offset 20, page 3 is offset 40, ...
        yield scrapy.Request(url=his_about_person.format(person_name=url_token, offset=0, limit=20),
                             callback=self.parse_his_followees)
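
To launch the crawl, run scrapy crawl testSpider from the project root. If you prefer starting it from a Python script, a minimal sketch (a hypothetical run.py, not part of the original project) looks like this:

# run.py -- equivalent to running `scrapy crawl testSpider` on the command line.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('testSpider')  # the spider name defined in TestspiderSpider.name
process.start()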

2.5 pipeline: save to the database. [First query by the Zhihu id to check whether the record already exists; if it does, skip it, otherwise insert it.]

# -*- coding: utf-8 -*-

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
import logging

class MytestPipeline(object):
    def __init__(self):
        dbargs = dict(
            host='127.0.0.1',
            db='your_database_name',
            user='root',
            passwd='your_password',
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        res = self.dbpool.runInteraction(self.insert_into_table, item)
        return item

    def insert_into_table(self, conn, item):
        # First check whether the record already exists
        if self.exists(item, conn):
            return
        sql = 'insert into zhihu_user (zhihu_id,user_image,follower_count,gender,description,business,school,major,company,job,headline,locations,name,url_token) VALUES ('
        sql = sql + '"' + item['zhihu_id'] + '",'
        sql = sql + '"' + item['user_image'] + '",'
        sql = sql + '"' + str(item['follower_count']) + '",'
        sql = sql + '"' + str(item['gender']) + '",'
        sql = sql + '"' + item['description'] + '",'
        sql = sql + '"' + item['business'] + '",'
        sql = sql + '"' + item['school'] + '",'
        sql = sql + '"' + item['major'] + '",'
        sql = sql + '"' + item['company'] + '",'
        sql = sql + '"' + item['job'] + '",'
        sql = sql + '"' + item['headline'] + '",'
        sql = sql + '"' + item['locations'] + '",'
        sql = sql + '"' + item['name'] + '",'
        sql = sql + '"' + item['url_token'] + '",'
        sql = sql[0:-1]
        sql = sql + ')'
        try:
            conn.execute(sql)
            print(sql)
        except Exception as e:
            print( "sqlsqlsqlsqlsqlsqlsql error>> " ,str(e))

    def exists(self, item, conn):
        sql = 'select id from zhihu_user where zhihu_id="' + item['zhihu_id'] + '"'
        try:
            # Execute the SQL statement
            conn.execute(sql)
            # Fetch all matching rows
            results = conn.fetchall()
            if len(results) > 0:  ## the record already exists
                return True
        except:
            return False
        return False
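
One thing to be aware of: building the INSERT by string concatenation breaks as soon as a field contains a double quote (descriptions and headlines often do), and it raises a KeyError when the spider never filled an optional field. A more robust drop-in variant of insert_into_table using a parameterized query might look like this (a sketch, not the original code):

    def insert_into_table(self, conn, item):
        # Skip users we have already stored.
        if self.exists(item, conn):
            return
        sql = ('insert into zhihu_user (zhihu_id,user_image,follower_count,gender,description,'
               'business,school,major,company,job,headline,locations,name,url_token) '
               'values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)')
        fields = ('zhihu_id', 'user_image', 'follower_count', 'gender', 'description', 'business',
                  'school', 'major', 'company', 'job', 'headline', 'locations', 'name', 'url_token')
        # item.get() tolerates fields the spider never set; the driver escapes the values itself.
        values = [item.get(f, '') for f in fields]
        try:
            conn.execute(sql, values)
        except Exception as e:
            print('insert error >> ', str(e))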

2.6 settings.py [the important part here is the authorization header]

# -*- coding: utf-8 -*-

BOT_NAME = 'mytest'

SPIDER_MODULES = ['mytest.spiders']
NEWSPIDER_MODULE = 'mytest.spiders'

DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
   ###### Zhihu-specific header; requests without it return 401
   'authorization':'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}
ITEM_PIPELINES = {
    'mytest.pipelines.MytestPipeline': 300,
}
ROBOTSTXT_OBEY = False
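
Since Zhihu starts returning 403 after a while (see below), it can also help to slow the crawl down. The following are standard Scrapy settings, not part of the original configuration; the values are reasonable guesses to tune as needed:

# Optional: crawl more gently to reduce the chance of being blocked.
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 4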

Result: after crawling for a while the server recognizes the crawler and starts returning 403. At that point you can use an IP proxy pool to keep crawling; see my other tutorial: http://blog.csdn.net/luozheng4698729/article/details/77966752 . That's all, thanks everyone.
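
If you do plug in a proxy pool, a minimal downloader middleware for this project could look like the sketch below. PROXY_POOL and get_random_proxy() are hypothetical stand-ins for whatever the proxy-pool tutorial provides; enabling it would require adding something like 'mytest.middlewares.RandomProxyMiddleware': 543 to DOWNLOADER_MIDDLEWARES in settings.py.

# middlewares.py -- hypothetical sketch of routing requests through a proxy pool.
import random

# Stand-in for the proxy pool from the linked tutorial; replace with your own source.
PROXY_POOL = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']

def get_random_proxy():
    return random.choice(PROXY_POOL)

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy sends the request through whatever proxy is set in request.meta.
        request.meta['proxy'] = get_random_proxy()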

Note: this tutorial is for technical learning and reference only!

As usual: if you are in Shenzhen, love music and can play drums, guitar, keyboard, bass, etc. (programmer or not), come jam with our band. My QQ: 657455400
