Crawling Zhihu User Profiles with Scrapy (Proxy Pool, MongoDB, Non-distributed)

Runtime environment and main components:

  • macOS 10.13.4
  • Chrome with the JSON-handle extension
  • Scrapy 1.5.0
  • Abuyun HTTP tunnel (server: http-dyn.abuyun.com, port: 9020)
  • MongoDB shell version v4.0.0


Target Site Analysis

[Screenshot: vczh's profile page]
[Screenshot: inspecting the profile page with browser DevTools]

  • Crawl strategy
    • Use the profile of a user with a large follower count as the seed;
    • Crawl the user info of everyone in that user's followees and followers lists;
    • For each user found, fetch their profile and repeat the steps above, iterating outward;
  • Things to watch
    • Pagination of the user lists (a sketch of the paging payload follows this list);
    • The termination condition of the recursion;
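Both list endpoints page their results and report pagination state in a "paging" block. Each page looks roughly like the sketch below; the field names are exactly the ones the spider code later in this post reads, but the real payload carries many more fields:

# Rough shape of one page from the followees/followers API (illustrative values only)
page = {
    "data": [
        {"url_token": "excited-vczh", "name": "vczh"},
        # ... one dict per user on this page
    ],
    "paging": {
        "is_end": False,  # True on the last page: the stop condition for paging
        "next": "https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20",
    },
}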

Code Walkthrough

  • Define an Item class covering the user fields to be crawled
# items.py

import scrapy
from scrapy import Field

class UserItem(scrapy.Item):
    id = Field()
    name = Field()
    account_status = Field()
    answer_count = Field()
    articles_count = Field()
    avatar_url = Field()
    badge = Field()
    # ...
    vote_to_count = Field()
    voteup_count = Field()
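As a quick sanity check, a Scrapy item behaves like a dict restricted to its declared fields; a sketch, run from the project root so the zhihuuser package is importable:

# Scrapy items accept only declared fields; undeclared keys raise KeyError
from zhihuuser.items import UserItem

item = UserItem()
item['name'] = 'vczh'
print(dict(item))    # {'name': 'vczh'}
# item['foo'] = 1    # would raise KeyError: UserItem does not support field: foo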
  • Use Abuyun's dynamic proxy tunnel so the crawler's IP doesn't get banned with 403 responses
# middlewares.py

import base64

# Abuyun dynamic HTTP tunnel; replace the placeholders with your own credentials
proxyServer = "http://http-dyn.abuyun.com:9020"
proxyUser = "myUser"
proxyPass = "myPassword"
proxyAuth = "Basic " + base64.urlsafe_b64encode(
    bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every outgoing request through the tunnel and attach the auth header
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
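Before wiring the middleware into Scrapy, the tunnel credentials can be sanity-checked on their own; a minimal sketch, assuming the third-party requests library and the placeholder credentials above:

# check_proxy.py - verify the Abuyun tunnel outside Scrapy
import requests

proxy = "http://myUser:myPassword@http-dyn.abuyun.com:9020"
resp = requests.get("http://httpbin.org/ip",
                    proxies={"http": proxy, "https": proxy},
                    timeout=10)
print(resp.text)  # the reported origin IP should change between runs as the tunnel rotates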
  • Connect to the MongoDB server and upsert each item
# pipelines.py

import pymongo

class MongoPipeline(object):
    collection_name = 'users'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on url_token so a re-crawled user overwrites instead of duplicating;
        # the item must be converted to a plain dict before pymongo can encode it
        self.db[self.collection_name].update_one(
            {'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
        return item
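Once the spider has run for a while, the stored data can be inspected from a REPL; a quick sketch, assuming mongod is running locally and pymongo 3.7+ (for count_documents):

# check_db.py - inspect what the pipeline has stored so far
import pymongo

db = pymongo.MongoClient('localhost')['zhihu']
print(db['users'].count_documents({}))                      # distinct users stored
print(db['users'].find_one({'url_token': 'excited-vczh'}))  # the seed user's document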
  • For readability, only the settings actually in use are listed here, in a slightly rearranged order
# settings.py (excerpt)

# Basic configuration + headers
BOT_NAME = 'zhihuuser'
SPIDER_MODULES = ['zhihuuser.spiders']
NEWSPIDER_MODULE = 'zhihuuser.spiders'
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable ProxyMiddleware; set the download delay to match how fast the tunnel rotates IPs
DOWNLOADER_MIDDLEWARES = {
    'zhihuuser.middlewares.ProxyMiddleware': 543,
}
DOWNLOAD_DELAY = 0.1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.1

# Enable MongoPipeline; set the MongoDB URI and database
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'localhost'
MONGO_DATABASE = 'zhihu'
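That completes the configuration. Once the spider below is in place, the crawl is started with scrapy crawl zhihu from the project root; equivalently, a small driver script can launch it programmatically. A sketch, assuming the standard Scrapy project layout:

# run.py - programmatic equivalent of `scrapy crawl zhihu`
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('zhihu')  # the spider name registered as ZhihuSpider.name
process.start()         # blocks until the crawl finishes or is interrupted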
  • The spider itself
# spiders/zhihu.py

import json
import scrapy
from scrapy import Request
from ..items import UserItem

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_user = 'excited-vczh'
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,avatar_hue,answer_count,articles_count,pins_count,question_count,columns_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_bind_phone,is_force_renamed,is_bind_sina,is_privacy_protected,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    followees_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    followees_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    # followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    # followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(url=self.user_url.format(user=self.start_user, include=self.user_query), callback=self.parse_user)
        yield Request(url=self.followees_url.format(user=self.start_user, include=self.followees_query, offset=0, limit=20), callback=self.parse_followees)
        # yield Request(url=self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20), callback=self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
        # yield Request(url=self.followees_url.format(user=result.get('url_token'), include=self.followees_query, offset=0, limit=20), callback=self.parse_followees)
        # yield Request(url=self.followers_url.format(user=result.get('url_token'), include=self.followers_query, offset=0, limit=20), callback=self.parse_followers)

    def parse_followees(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(url=self.user_url.format(user=result.get('url_token'), include=self.user_query), callback=self.parse_user)
        if 'paging' in results.keys() and not results.get('paging').get('is_end'):
            yield Request(url=results.get('paging').get('next'), callback=self.parse_followees)

    # def parse_followers(self, response):
    #     results = json.loads(response.text)
    #     if 'data' in results.keys():
    #         for result in results.get('data'):
    #             yield Request(url=self.user_url.format(user=result.get('url_token'), include=self.user_query), callback=self.parse_user)
    #     if 'paging' in results.keys() and not results.get('paging').get('is_end'):
    #         yield Request(url=results.get('paging').get('next'), callback=self.parse_followers)
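A note on the recursion's termination: within a single user's list, paging stops once is_end turns true; across users, Scrapy's default dupefilter (RFPDupeFilter) drops requests whose fingerprint has already been seen, so visited profiles are not re-queued, and the MongoDB upsert keeps storage deduplicated in any case. In principle the crawl winds down on its own once the reachable user graph is exhausted, or it can simply be interrupted.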

References

Key references for this crawler, collected here for future reading:

Write items to MongoDB (official Scrapy docs)
Abuyun integration guide (official)
