Session Sharing for Distributed Long-Lived Connections

Finchley

Last modified 2022-08-23 10:21:47


Tags: distributed, java, programming language, websocket

First published 2022-08-22 17:08:25

Article link: https://blog.csdn.net/ypw384556909/article/details/126469034


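Background

In Spring's WebSocket integration, every ws connection has a corresponding session (the underlying implementation is currently Tomcat's org.apache.tomcat.websocket.WsSession). socketio can also be used.

Once a ws connection is established in Spring WebSocket, we can communicate with the client through the methods that the implementation class provides.

A WebSocket long-lived connection relies on many low-level operating system facilities, and unlike HTTP it is a stateful long connection. It cannot be statically serialized and stored the way an HTTP session can, so once the connection has been established it cannot be moved to another server for the rest of its lifetime.

I once considered whether a long connection could be copied, so that when it is scheduled onto a new machine it is rebuilt from the key connection information. I could not find a single successful example anywhere, and the research cost was too high, so I dropped the idea.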

Architecture overview

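Distributed messaging

In short: when a client connects, record the relationship between the client and the unique identifier of the machine that holds its connection. The next time a message has to be pushed over that long connection, deliver it to the message queue that belongs to that machine (every machine has its own queue), so that the message is always forwarded to the machine holding the user's long connection.

Advantage: precise delivery

Drawback: the implementation is somewhat more complex, because each machine node and the client-user information it holds have to be maintained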
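The routing idea above can be sketched in a few lines. This is a minimal in-memory illustration, not the author's implementation: the class name `NodeRoutingSketch`, its methods, and the use of in-process maps in place of shared storage and real per-machine queues are all assumptions made for the example.

```java
import java.util.Deque;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

/** Hypothetical sketch: route a message to the queue of the node that holds the user's connection. */
class NodeRoutingSketch {
    // userId -> node name; assumed to live in shared storage in a real deployment
    private final Map<String, String> userToNode = new ConcurrentHashMap<>();
    // node name -> that node's own queue (stand-in for one message queue per machine)
    private final Map<String, Deque<String>> nodeQueues = new ConcurrentHashMap<>();

    /** Called when a client connects: remember which node owns the connection. */
    void register(String userId, String nodeName) {
        nodeQueues.computeIfAbsent(nodeName, n -> new ConcurrentLinkedDeque<>());
        userToNode.put(userId, nodeName);
    }

    /** Called when the connection closes: forget the mapping. */
    void unregister(String userId) {
        userToNode.remove(userId);
    }

    /** Enqueue on the owning node's queue; returns the node name, or empty if the user is not connected. */
    Optional<String> send(String userId, String message) {
        String node = userToNode.get(userId);
        if (node == null) {
            return Optional.empty();
        }
        nodeQueues.get(node).addLast(message);
        return Optional.of(node);
    }

    /** The node's consumer takes messages in arrival order; null when the queue is empty. */
    String poll(String nodeName) {
        Deque<String> queue = nodeQueues.get(nodeName);
        return queue == null ? null : queue.pollFirst();
    }

    public static void main(String[] args) {
        NodeRoutingSketch router = new NodeRoutingSketch();
        router.register("user-1", "node-a");
        System.out.println(router.send("user-1", "hello")); // Optional[node-a]
        System.out.println(router.poll("node-a"));          // hello
    }
}
```

Only the node that owns the connection ever sees the message, which is the "precise delivery" advantage. The cost is the bookkeeping in `register` and `unregister`.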

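Technology selection

So how do we share messages across machines?

Business scenario analysis:
Session messages in this scenario are highly real-time. They do not need retries or repeated consumption, but they must be consumed in order.

Performance requirements analysis:
Take 1 KB per message as an example (the message body is small, so compression is ignored for now; the best compression ratio in the industry is currently about 2.8). 100 MB of Redis memory can then buffer about 100,000 messages, and Redis's reference QPS reaches the 100,000 level, so messages can be consumed in real time and Redis will not become the performance bottleneck.

Messages are dequeued as soon as they are consumed, and the consumption logic is simple: usually the message is just pushed back through the session. Assuming each user produces messages at 10 KB/s, that is 10 messages per second per user, so 100,000 QPS can in theory already support 10,000 users online at the same time.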
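The estimate above is simple arithmetic and can be checked with a short calculation. The class and method names are made up for this example. Like the text, it assumes fixed 1 KB messages and counts one queue operation per message.

```java
/** Back-of-the-envelope capacity estimate using the figures quoted in the text. */
class CapacityEstimate {
    /** Number of fixed-size messages that fit into a buffer of the given size. */
    static long bufferedMessages(long bufferBytes, long messageBytes) {
        return bufferBytes / messageBytes;
    }

    /** Number of users a queue can serve when every user emits userBytesPerSecond of fixed-size messages. */
    static long supportedUsers(long queueQps, long userBytesPerSecond, long messageBytes) {
        long messagesPerUserPerSecond = userBytesPerSecond / messageBytes;
        return queueQps / messagesPerUserPerSecond;
    }

    public static void main(String[] args) {
        long kb = 1024;
        // 100 MB of memory, 1 KB per message
        System.out.println(bufferedMessages(100 * kb * kb, kb));   // 102400
        // 100,000 QPS, 10 KB/s per user, 1 KB per message
        System.out.println(supportedUsers(100_000, 10 * kb, kb));  // 10000
    }
}
```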

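1. Redis

A Redis list (List) is a linear, ordered structure that stores elements in the order they were pushed, so it satisfies the "first in, first out" requirement. Elements can be text or binary data.

Write messages: LPUSH

Pull messages: BRPOP (a blocking pull, which keeps the CPU from spinning)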
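To see why pushing on the left and popping on the right yields FIFO order, here is a small in-memory stand-in for one Redis list. It is not a Redis client: the class `ListQueueSketch` and its two methods are assumptions that only mimic the LPUSH and BRPOP semantics with a `LinkedBlockingDeque`.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

/** In-memory stand-in for a single Redis list used as a FIFO queue (not a Redis client). */
class ListQueueSketch {
    private final BlockingDeque<String> list = new LinkedBlockingDeque<>();

    /** Mirrors LPUSH: insert the value at the head (left end) of the list. */
    void lpush(String value) {
        list.addFirst(value);
    }

    /**
     * Mirrors BRPOP with a timeout: block until an element is available at the
     * tail (right end), or return null once the timeout elapses.
     */
    String brpop(long timeoutMillis) {
        try {
            return list.pollLast(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return null;
        }
    }

    public static void main(String[] args) {
        ListQueueSketch queue = new ListQueueSketch();
        queue.lpush("first");
        queue.lpush("second");
        System.out.println(queue.brpop(100)); // first
        System.out.println(queue.brpop(100)); // second
        System.out.println(queue.brpop(100)); // null, after waiting 100 ms
    }
}
```

Because the consumer blocks inside the pop instead of polling in a tight loop, an empty queue costs no CPU while it waits.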

The following single-replica Redis performance figures are for reference: https://help.aliyun.com/document_detail/145227.html

Spec               InstanceClass                 CPU cores  New connections/s (max)  Max connections  Bandwidth (MB/s)  QPS reference
2 GB single-node   redis.basic.mid.default       2          10,000                   10,000           16                80,000
4 GB single-node   redis.basic.stand.default     2          10,000                   10,000           24                80,000
8 GB single-node   redis.basic.large.default     2          10,000                   10,000           24                80,000
16 GB single-node  redis.basic.2xlarge.default   2          10,000                   10,000           32                80,000
32 GB single-node  redis.basic.4xlarge.default   2          10,000                   10,000           32                80,000

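2. RocketMQ | Kafka

Rich message types cover the advanced features that demanding scenarios need, and address asynchronous notification, decoupling between systems (microservices), peak shaving, cache synchronization, real-time computation, and similar problems.

Ordered messages are consumed in the order they were published (FIFO), with support for both global ordering and partition ordering.
The community edition's theoretical single-machine TPS is 70,000.
Reference: https://help.aliyun.com/document_detail/49319.html

Choose the implementation that fits your situation. Note that when Redis is used as a message queue, pub/sub is not recommended for buffering messages, because pub/sub does not retain them: a message published while no subscriber is listening is lost. Simple business scenarios can use a Redis list directly. For complex scenarios, or scenarios that require at-least-once consumption, a real message queue is still the better choice.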

Key business code

import java.util.List;

/**
 * Global session sharing service
 *
 * @author ypw
 */
public interface GlobalSessionService {
    /**
     * Initializes a newly established connection and records the mapping between user, machine, and room
     *
     * @param currentSession session
     * @param message        message
     */
    void initConnectSession(Session currentSession, HandshakeMessage message);

    /**
     * Sends a global room message, skipping sessions held on the local machine
     *
     * @param currentSessionUsers users whose sessions are on the local machine (to be filtered out)
     * @param session             session
     * @param message             message content
     * @param messageTypeEnum     messageTypeEnum
     * @param id                  id
     */
    void sendGlobalRoomMessage(List<SessionUser> currentSessionUsers, Session session, Object message, MessageTypeEnum messageTypeEnum, Long id);

    /**
     * Sends a global user message
     *
     * @param userId          user ID
     * @param response        message content
     * @param messageTypeEnum message type
     * @param id              id
     */
    void sendGlobalUserMessage(String userId, Response<?> response, MessageTypeEnum messageTypeEnum, Long id);

    /**
     * Refreshes the user information of a room
     *
     * @param session   session
     */
    void refreshRoomUserInfo(Session session);
}

Key message-pulling EventLoop

Runnable runnable = () -> {
    // Keep pulling until the thread is interrupted
    while (!Thread.currentThread().isInterrupted()) {
        try {
            // Each node pulls from its own queue, which is named after the node
            String message = redisCacheHelper.popMsgFromList(systemProperty.getNodeName());
            if (StringUtils.isBlank(message)) {
                // Nothing to consume: sleep so that, together with the blocking pop, the CPU does not spin
                Thread.sleep(configProperty.getStopPullMsgInterval());
            } else {
                GlobalRedisEventWrapper redisEventWrapper = JSON.parseObject(message, GlobalRedisEventWrapper.class);
                log.info("Pulled from queue {}, message {}", systemProperty.getNodeName(), JSON.toJSONString(redisEventWrapper));
                String globalSessionId = redisEventWrapper.getGlobalSessionId();
                // Push the message to the session held on this machine
                SessionUser sessionUserByGlobalSessionId = sessionUserStorage.getSessionUserByGlobalSessionId(globalSessionId);
                if (Objects.nonNull(sessionUserByGlobalSessionId)) {
                    handleMessage(sessionUserByGlobalSessionId, redisEventWrapper);
                } else {
                    // The session may not be registered in local memory yet when the message is pulled,
                    // so the message has to be sent back for a retry
                    retryEvent(redisEventWrapper);
                }
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and leave the loop so the thread can stop cleanly
            Thread.currentThread().interrupt();
            break;
        } catch (Exception e) {
            log.warn("Exception while pulling msg from redis queue, queueName={}", systemProperty.getNodeName(), e);
        }
    }
};
// Create a daemon thread
ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(
        1,
        1,
        0L,
        TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(16),
        (new ThreadFactoryBuilder()).setDaemon(true).setNameFormat("pull room msg thread ".concat(systemProperty.getNodeName())).build(),
        new ThreadPoolExecutor.CallerRunsPolicy());
threadPoolExecutor.execute(runnable);
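The loop above calls `retryEvent` without showing it. One thing such a method has to guard against is a message for a session that never appears, which would otherwise be re-queued forever. The sketch below shows one possible bounded-retry approach. It is not the author's code: the `Event` envelope, the retry counter, and the in-memory queue standing in for the node's Redis list are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical bounded-retry sketch; the queue is an in-memory stand-in for the node's list. */
class BoundedRetrySketch {
    /** Hypothetical event envelope: payload plus the number of times it has been re-queued. */
    static final class Event {
        final String globalSessionId;
        final String payload;
        final int retries;

        Event(String globalSessionId, String payload, int retries) {
            this.globalSessionId = globalSessionId;
            this.payload = payload;
            this.retries = retries;
        }
    }

    private final Deque<Event> queue = new ArrayDeque<>();
    private final int maxRetries;

    BoundedRetrySketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Re-queue the event with an incremented counter; returns false once the retry budget is spent. */
    boolean retryEvent(Event event) {
        if (event.retries >= maxRetries) {
            return false; // give up: the session never showed up on this node
        }
        queue.addLast(new Event(event.globalSessionId, event.payload, event.retries + 1));
        return true;
    }

    /** Next event in arrival order, or null when the queue is empty. */
    Event poll() {
        return queue.pollFirst();
    }

    public static void main(String[] args) {
        BoundedRetrySketch sketch = new BoundedRetrySketch(2);
        Event event = new Event("session-1", "hello", 0);
        while (sketch.retryEvent(event)) {
            event = sketch.poll();
        }
        System.out.println("dropped after " + event.retries + " retries"); // dropped after 2 retries
    }
}
```

A re-queued message goes to the back of the queue, so it can be overtaken by later messages for the same session. Whether that is acceptable depends on how strictly ordering has to hold during the short window before the session is registered.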

Performance load test

Load test script (locust)

import json
import random
import time

import requests
from locust import User, task, events
from websocket import create_connection


# request_success / request_failure are locust's legacy event hooks;
# newer locust releases report results through events.request instead.
def success_call(name, recvText, total_time):
    events.request_success.fire(
        request_type="[Success]",
        name=name,
        response_time=total_time,
        response_length=len(recvText)
    )


def fail_call(name, total_time, e):
    events.request_failure.fire(
        request_type="[Fail]",
        name=name,
        response_time=total_time,
        response_length=0,
        exception=e,
    )


class WebSocketClient(object):
    def __init__(self, host):
        self.host = host
        self.ws = None

    def connect(self, burl):
        self.ws = create_connection(burl)

    def recv(self):
        return self.ws.recv()

    def send(self, msg):
        self.ws.send(msg)


def get_room_id(host):
    token = requests.get(
        'http://' + host + '/xxxx/getToken',
        params={"param1": "param"},
        headers={"param1": "param"}).json()
    print("*-*/-*", token)
    return token["token"]


class RoomUser(object):
    """Mini-program user: holds one WebSocket connection for a room."""

    def __init__(self, host, room_id):
        self.room_id = room_id
        self.client = WebSocketClient(host)


class ManagerUser(RoomUser):
    """Consultant who joins the room created for the user."""


class ApiUser(User):
    hostList = ['your host 1',
                'your host 2']

    @task(1)
    def send_message(self):
        # Set up front so the except branch can always compute an elapsed time
        start_time = time.time()
        host = random.choice(self.hostList)
        url = 'ws://' + host + '/websocket'
        try:
            # Create the room
            print("Creating room")
            room_id = get_room_id(host)
            print("Created room ID: " + str(room_id))
            # Create the mini-program user
            user_1 = RoomUser(host, room_id)
            # The user initiates the call
            num = random.randint(0, 100)
            userHandShakeMsg = {
                "param1": "param"
            }
            user_1.client.connect(url)
            user_1.client.send(json.dumps(userHandShakeMsg))
            print(f"↑user_1: {json.dumps(userHandShakeMsg)}")
            # Wait 5 s
            time.sleep(5)
            # The consultant enters the room
            consultant = ManagerUser(host, room_id)
            appHandShakeMsg = {
                "param1": "param"
            }
            consultant.client.connect(url)
            consultant.client.send(json.dumps(appHandShakeMsg))
            print(f"↑consultant: {json.dumps(appHandShakeMsg)}")
            print(f"↓consultant: {consultant.client.recv()}")
            # Wait 5 s
            time.sleep(5)
            while 1:
                # Send ten messages per round
                for i in range(10):
                    # Timing starts here and overrides the start_time set above
                    start_time = time.time()
                    # Message payload
                    feData = "%wfye!smj?so~&+lbtdfhnp@wsuhwb$u$aksozkkkdf*jm@fnso*$lrhv!rirce%amgy#&mn?#bh#+ca=&bo~cnzdm" \
                             "#vx?fd_=v!jlf*+vsn$~fip-kzoep-rtyfcim!%v&g@nidb-hrzwvtmajw$b==b-yxpvk$_qmen$qrobqx$bn&ak" \
                             "+mkhbxoo@m_b^z%mzrak$sxv?w?gnk&@gunszqs~itmxl!+vd-#gn#coasfjnaa%@fhn%^~b=lw-$l "
                    transmitMsg = {
                        "param1": "param"
                    }
                    print(f"↑consultant: {json.dumps(transmitMsg)}")
                    consultant.client.send(json.dumps(transmitMsg))
                    while 1:
                        message = user_1.client.recv()
                        print(f"↓user_1 received message: {message}")
                        # Skip empty frames (the original `or` test was always true)
                        if message:
                            if json.loads(message)["data"] == feData:
                                print("Message received successfully")
                                # Report the successful round trip
                                total_time = int((time.time() - start_time) * 1000)
                                success_call("Send", "success", total_time)
                                break
                # time.sleep(1)

        except Exception as e:
            print(f"Exception occurred: {e}")
            total_time = int((time.time() - start_time) * 1000)
            fail_call("Send", total_time, e)

Load test results


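After scaling out to 6 pods, a small number of requests failed (failure rate 0.06%). Message throughput reached at least 5,000 messages/s with 10,000 long connections held, and the average response time was 3 ms, which meets expectations.

Connections  Pods  QPS   RT   P99   Success rate
100          2     900   1ms  19ms  100%
1000         2     900   1ms  20ms  99.5%
10000        6     5000  3ms  3ms   99.94%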