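One day, the logs of a production Impala job started reporting this error intermittently:

ERROR: No backends configured

Could not execute command: xxx

The error did not occur on every run. The statestore log showed the following: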

I0223 16:18:34.338698 10924 state-store.cc:194] Creating new topic: ''impala-membership' on behalf of subscriber: 'xxxhostname:22000
I0223 16:18:34.338739 10924 state-store.cc:200] Registering: xxxhostname:22000
I0223 16:18:34.339864 10904 state-store.cc:355] Unable to update subscriber at xxxhostname:23000,  received error Couldn't open transport for xxxhostname:23000 (connect() failed: Connection refused)
I0223 16:18:34.840463 10904 state-store.cc:355] Unable to update subscriber at xxxhostname:23000,  received error Couldn't open transport for xxxhostname:23000 (connect() failed: Connection refused)
I0223 16:18:35.341156 10904 state-store.cc:355] Unable to update subscriber at xxxhostname:23000,  received error Couldn't open transport for xxxhostname:23000 (connect() failed: Connection refused)
I0223 16:18:36.843724 10904 state-store.cc:365] Subscriber: xxxhostname:22000 has either failed or disconnected.
I0223 16:18:47.536650 10911 state-store.cc:355] Unable to update subscriber at xxxhostname:23000,  received error Couldn't open transport for xxxhostname:23000 (connect() failed: Connection refused)
I0223 16:18:54.343158 10924 state-store.cc:200] Registering: xxxhostname:22000
W0223 16:18:54.343209 10924 state-store.cc:215] Duplicate registration of subscriber: xxxhostname:22000, possible duplicate subscriber IDs or recovering subscriber
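
To see how quickly the statestore gave up on the subscriber, the timestamps in this log can be compared directly. The following is a minimal Python sketch, not part of Impala: `parse_glog_time` is a hypothetical helper, the first log message is truncated for brevity, and the year is an arbitrary placeholder because glog does not print one. It assumes the glog prefix has the usual form of severity letter plus month and day, a space, then the time.

```python
from datetime import datetime


def parse_glog_time(line: str, year: int = 2000) -> datetime:
    """Parse the 'Lmmdd hh:mm:ss.uuuuuu' prefix of a glog line.

    glog does not log the year, so `year` is a placeholder (assumption);
    it only matters if the compared lines span a year boundary.
    """
    prefix, clock = line.split()[:2]
    mmdd = prefix[1:]  # drop the severity letter (I, W, E or F)
    return datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")


# First failed update and the "failed or disconnected" line from the log above
# (the first message is truncated here).
first_failed_update = "I0223 16:18:34.339864 10904 state-store.cc:355] Unable to update subscriber"
declared_failed = (
    "I0223 16:18:36.843724 10904 state-store.cc:365] "
    "Subscriber: xxxhostname:22000 has either failed or disconnected."
)

elapsed = (
    parse_glog_time(declared_failed) - parse_glog_time(first_failed_update)
).total_seconds()
print(f"{elapsed:.3f} s")  # prints "2.504 s"
```

So the statestore declared the subscriber failed about 2.5 seconds after the first failed update.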

The log on xxxhostname confirmed that it kept failing to register as well:

I0224 21:32:38.173282 33088 state-store-subscriber.cc:169] Trying to register...
I0224 21:32:38.173743 33088 state-store-subscriber.cc:172] Reconnected to state-store. Exiting recovery mode
I0224 21:32:48.173959 33088 state-store-subscriber.cc:166] xxxhostname:22000: Connection with state-store lost, entering recovery mode
I0224 21:32:48.174018 33088 state-store-subscriber.cc:169] Trying to register...
W0224 21:32:48.174432 33088 state-store-subscriber.cc:181] Failed to re-register with state-store: Duplicate registration of subscriber: xxxhostname:22000

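My guess was a heartbeat timeout. I could not find any documentation that explains the relevant parameters in detail, so I went through the source code. In the end, setting statestore_subscriber_timeout_seconds to 60 (seconds) and restarting resolved the problem. The relevant parameters, with their default values and descriptions as they appear in the code, are: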

statestore_subscriber_timeout_seconds, 10, "The amount of time (in seconds) that may elapse before the connection with the statestore is considered lost.";
statestore_subscriber_cnxn_attempts, 10, "The number of times to retry an RPC connection to the statestore. A setting of 0 means retry indefinitely";
statestore_subscriber_cnxn_retry_interval_ms, 3000, "The interval, in ms, to wait between attempts to make an RPC connection to the statestore.";
statestore_max_missed_heartbeats, 5, "Maximum number of consecutive heartbeats an impalad can miss before being declared failed by the statestore.";
statestore_suspect_heartbeats, 2, "(Advanced) Number of consecutive heartbeats an impalad can miss before being suspected of failure by the statestore";
statestore_num_heartbeat_threads, 10, "(Advanced) Number of threads used to  send heartbeats in parallel to all registered subscribers.";
statestore_heartbeat_frequency_ms, 500, "(Advanced) Frequency (in ms) with which the statestore sends heartbeats to subscribers.";
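
The defaults above line up with the timings in the logs. The sketch below is only a back-of-the-envelope reading of the flag descriptions, not a statement about Impala's exact internal logic. `StatestoreFlags` and `statestore_failure_window_s` are hypothetical names introduced for illustration; only the flag names and default values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatestoreFlags:
    """Subset of the flags listed above, with their default values."""

    statestore_heartbeat_frequency_ms: int = 500
    statestore_max_missed_heartbeats: int = 5
    statestore_subscriber_timeout_seconds: int = 10

    def statestore_failure_window_s(self) -> float:
        # Assumption: the statestore gives up on a subscriber after roughly
        # (missed heartbeats) x (heartbeat interval).
        return (
            self.statestore_max_missed_heartbeats
            * self.statestore_heartbeat_frequency_ms
            / 1000
        )


defaults = StatestoreFlags()
tuned = StatestoreFlags(statestore_subscriber_timeout_seconds=60)

print(defaults.statestore_failure_window_s())          # prints 2.5
print(defaults.statestore_subscriber_timeout_seconds)  # prints 10
print(tuned.statestore_subscriber_timeout_seconds)     # prints 60
```

The 2.5 seconds matches the gap in the statestore log between the first failed update (16:18:34.339864) and the "failed or disconnected" line (16:18:36.843724). The 10 seconds matches the gap in the subscriber log between "Reconnected" (21:32:38.173743) and "Connection with state-store lost" (21:32:48.173959). Raising statestore_subscriber_timeout_seconds to 60 gives the subscriber six times longer before it considers the connection lost and tries to re-register.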