问题描述 及 错误信息:
同样都是Redshift connection termination, 如下报错:
- terminating connection due to administrator command
- unexpected EOF on client connection
- Timeout 报错:SSL connection has been closed unexpectedly
有什么区别?触发条件都有哪些?从哪些视图中有所记录?
分析过程:
三个报错都是session会话意外结束:
- terminating connection due to administrator command session会话被管理员终止(被redshift管理员kill)。
- unexpected EOF on client connection session会话被意外终止(客户端没有正常调用退出)。
- SSL connection has been closed unexpectedly (Session会话因超时终止)
我们可以用psql 连接session 模拟一下这三种报错的触发:
模拟1:
登录一个session会话:
psql -Uawsuser -p5439 -htestred.ca27z26yo8om.cn-north-1.redshift.amazonaws.com.cn dev
查询当前session会话的pid:
dev=# select pg_backend_pid();
pg_backend_pid
----------------
1081047653
(1 row)
使用管理员账号,杀掉此会话pid(1081047653)
dev=# SELECT pg_terminate_backend(1081047653);
pg_terminate_backend
----------------------
1
(1 row)
即可触发:terminating connection due to administrator command
select convert_timezone('Asia/Shanghai', recordtime),userid,pid,error from stl_error order by recordtime desc limit 1;
2022-06-06 15:12:14.214337 | 100 | 1081047653 | terminating connection due to administrator command
而pg_terminate_backend 命令,记录在SVL_TERMINATE 视图中,不记录在一般查询视图中(stl_query)。
示例:
dev=# select * from SVL_TERMINATE order by eventtime desc;
pid | eventtime | userid | type
------------+----------------------------+--------+-----------
1081047653 | 2022-06-06 07:12:14.213929 | 100 | TERMINATE
1076708170 | 2022-06-06 06:43:02.624787 | 100 | TERMINATE
(2 rows)
上面的视图的意思是: userid=100的这个用户,在2022-06-06 07:12:14.213929这个时间点(UTC时间),把pid=1081047653的会话kill了。
模拟2:
同样正常登录一个session会话: psql -Uawsuser -p5439 -htestred.ca27z26yo8om.cn-north-1.redshift.amazonaws.com.cn dev psql (13.3, server 8.0.2)
从客户端操作系统层杀掉此会话:
[root@ip-172-31-30-200 ~]# ps -ef|grep psql
root 21021 21014 0 15:14 pts/3 00:00:00 psql -h jiangred.abcdefg.cn-north-1.redshift.amazonaws.com.cn -p 5439 -U root -d dev
[root@ip-172-31-30-200 ~]# kill -9 21021
即可触发:unexpected EOF on client connection
select convert_timezone('Asia/Shanghai', recordtime),userid,pid,error from stl_error order by recordtime desc limit 1;
2022-06-06 15:15:11.873907 | 100 | 1080859777 | unexpected EOF on client connection
unexpected EOF on client connection,可以通过stl_error 找到报警:
select convert_timezone('Asia/Shanghai', recordtime),userid,pid,error from stl_error order by recordtime desc limit 1;
2022-06-06 17:23:03.223771 | 104 | 1081262480 | unexpected EOF on client connection
模拟3:
对root4这个用户设置60秒超时: dev=# alter user root4 session timeout 60; ALTER USER
登录root4并等待60秒,看到超时:
[root@ip-172-31-30-200 ~]# psql -h shiranred.abcdefg.cn-north-1.redshift.amazonaws.com.cn -p 5439 -U root4 -d dev
Password for user root4:
dev=> select pg_backend_pid();
SSL connection has been closed unexpectedly
timeout并没有记录在SVL_TERMINATE视图中,通过视图STL_CONNECTION_LOG 可以找到disconnecting session记录
dev=# select trim(event),recordtime,pid,trim(username),duration,trim(application_name) from STL_CONNECTION_LOG
dev-# where
dev-# pid in (1082705442,1082762757,1082721769,1082877350,1082770508,1082828180,1082656094,1082697015,1077534105)
dev-# order by pid,recordtime desc;
btrim | recordtime | pid | username | duration | application_name
-----------------------+----------------------------+------------+----------+-------------+-------
disconnecting session | 2022-06-08 19:30:34.587893 | 1077534105 | user0601 | 21600100274 | psql
set application_name | 2022-06-08 13:30:34.488644 | 1077534105 | user0601 | 1026 | psql
initiating session | 2022-06-08 13:30:34.488532 | 1077534105 | user0601 | 0 |
authenticated | 2022-06-08 13:30:34.488467 | 1077534105 | user0601 | 0 |
...........
总结:
如果我们在使用AWS Redshift时候遇到自己的会话被杀或者中断的情况,需要细致排查或者解释,可以参考如上几项细节解释和测试步骤。