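Project scenario:
Adding a coprocessor to a table brought down the production HBase cluster, and after the restart regions were stuck in transition (RIT).
What a thrill!
Problem description
Adding the coprocessor
Adding the AggregateImplementation coprocessor to a table caused the entire RegionServer to go down.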
2022-07-11 17:10:35,890 ERROR [RS_OPEN_REGION-regionserver/node119:16020-0] coprocessor.CoprocessorHost: The coprocessor org.apache.Hadoop.hbase.coprocessor.AggregateImplementation threw java.io.IOException: No jar path specified for org.apache.Hadoop.hbase.coprocessor.AggregateImplementation
2022-07-11 17:10:35,890 ERROR [RS_OPEN_REGION-regionserver/node119:16020-2] coprocessor.CoprocessorHost: The coprocessor org.apache.Hadoop.hbase.coprocessor.AggregateImplementation threw java.io.IOException: No jar path specified for org.apache.Hadoop.hbase.coprocessor.AggregateImplementation
2022-07-11 17:10:35,937 ERROR [RS_OPEN_REGION-regionserver/node119:16020-0] regionserver.HRegionServer: ***** ABORTING region server node119,16020,1655708005885: The coprocessor org.apache.Hadoop.hbase.coprocessor.AggregateImplementation threw java.io.IOException: No jar path specified for org.apache.Hadoop.hbase.coprocessor.AggregateImplementation *****
2022-07-11 17:10:35,937 ERROR [RS_OPEN_REGION-regionserver/node119:16020-2] regionserver.HRegionServer: ***** ABORTING region server node119,16020,1655708005885: The coprocessor org.apache.Hadoop.hbase.coprocessor.AggregateImplementation threw java.io.IOException: No jar path specified for org.apache.Hadoop.hbase.coprocessor.AggregateImplementation *****
2022-07-11 17:10:35,937 ERROR [RS_OPEN_REGION-regionserver/node119:16020-0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.hadoop.hbase.coprocessor.AggregateImplementation, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.PhoenixTTLRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ChildLinkMetaDataEndpoint, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.phoenix.hbase.index.IndexRegionObserver]
2022-07-11 17:10:35,937 ERROR [RS_OPEN_REGION-regionserver/node119:16020-2] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.hadoop.hbase.coprocessor.AggregateImplementation, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.PhoenixTTLRegionObserver, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ChildLinkMetaDataEndpoint, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.phoenix.hbase.index.IndexRegionObserver]
2022-07-11 17:10:36,105 INFO [RS_OPEN_REGION-regionserver/node119:16020-2] regionserver.HRegionServer: s": 0,
2022-07-11 17:10:36,111 INFO [RS_OPEN_REGION-regionserver/node119:16020-0] regionserver.HRegionServer: s": 0,
2022-07-11 17:10:36,346 INFO [RS_OPEN_REGION-regionserver/node119:16020-0] regionserver.HRegionServer: ***** STOPPING region server 'node119,16020,1655708005885' *****
2022-07-11 17:10:36,346 INFO [RS_OPEN_REGION-regionserver/node119:16020-2] regionserver.HRegionServer: ***** STOPPING region server 'node119,16020,1655708005885' *****
HBase log
After the restart, the table that the coprocessor had been added to could be neither deleted nor disabled.
2022-07-11 19:30:55,572 INFO [PEWorker-4] procedure.DisableTableProcedure: Not ENABLED, state=ENABLING, skipping disable; pid=128898, state=RUNNABLE:DISABLE_TABLE_PREPARE, locked=true; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST
2022-07-11 19:30:55,589 INFO [PEWorker-4] procedure2.ProcedureExecutor: Rolled back pid=128898, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.TableNotEnabledException via master-disable-table:org.apache.hadoop.hbase.TableNotEnabledException: tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST exec-time=1 hrs, 58 mins, 13.627 sec
2022-07-11 19:30:55,594 INFO [PEWorker-4] procedure.DisableTableProcedure: Not ENABLED, state=ENABLING, skipping disable; pid=139703, state=RUNNABLE:DISABLE_TABLE_PREPARE, locked=true; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST
2022-07-11 19:30:55,598 INFO [PEWorker-4] procedure2.ProcedureExecutor: Rolled back pid=139703, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.TableNotEnabledException via master-disable-table:org.apache.hadoop.hbase.TableNotEnabledException: tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST exec-time=1 hrs, 37 mins, 54.003 sec
2022-07-11 19:30:55,602 INFO [PEWorker-4] procedure.DisableTableProcedure: Not ENABLED, state=ENABLING, skipping disable; pid=139706, state=RUNNABLE:DISABLE_TABLE_PREPARE, locked=true; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST
2022-07-11 19:30:55,605 INFO [PEWorker-4] procedure2.ProcedureExecutor: Rolled back pid=139706, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.TableNotEnabledException via master-disable-table:org.apache.hadoop.hbase.TableNotEnabledException: tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST exec-time=1 hrs, 30 mins, 9.834 sec
2022-07-11 19:30:55,613 INFO [PEWorker-4] procedure2.ProcedureExecutor: Rolled back pid=139712, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.TableNotDisabledException via master-delete-table:org.apache.hadoop.hbase.TableNotDisabledException: Not DISABLED; tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; DeleteTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST exec-time=1 hrs, 28 mins, 20.029 sec
2022-07-11 19:30:55,620 INFO [PEWorker-4] procedure2.ProcedureExecutor: Rolled back pid=139723, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.TableNotDisabledException via master-delete-table:org.apache.hadoop.hbase.TableNotDisabledException: Not DISABLED; tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; DeleteTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST exec-time=1 hrs, 10 mins, 41.716 sec
2022-07-11 19:30:55,623 INFO [PEWorker-4] procedure.DisableTableProcedure: Not ENABLED, state=ENABLING, skipping disable; pid=139732, state=RUNNABLE:DISABLE_TABLE_PREPARE, locked=true; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST
2022-07-11 19:30:55,627 INFO [PEWorker-4] procedure2.ProcedureExecutor: Rolled back pid=139732, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.TableNotEnabledException via master-disable-table:org.apache.hadoop.hbase.TableNotEnabledException: tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST exec-time=1 hrs, 8 mins, 55.011 sec
2022-07-11 19:30:55,631 INFO [PEWorker-4] procedure.DisableTableProcedure: Not ENABLED, state=ENABLING, skipping disable; pid=139776, state=RUNNABLE:DISABLE_TABLE_PREPARE, locked=true; DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST
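With this many rolled-back procedures in the master log, it can help to condense them into one line per procedure. The following is only a rough sketch: the class name `RolledBackProcedures` and the method `summarize` are made up for illustration, and the regular expression assumes each log entry sits on a single line in the `Rolled back pid=...` layout shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: condenses "Rolled back" master log entries.
public class RolledBackProcedures {
    // Captures: 1 = pid, 2 = table name, 3 = table state, 4 = procedure type.
    private static final Pattern ROLLED_BACK = Pattern.compile(
        "Rolled back pid=(\\d+),.*?tableName=([^,\\s]+), state=(\\w+); (\\w+) table=");

    /** Returns one summary string per rolled-back procedure found in the given lines. */
    static List<String> summarize(List<String> logLines) {
        List<String> out = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = ROLLED_BACK.matcher(line);
            if (m.find()) {
                out.add("pid=" + m.group(1) + " " + m.group(4)
                    + " table=" + m.group(2) + " state=" + m.group(3));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One entry copied from the master log above (joined onto a single line).
        List<String> sample = Arrays.asList(
            "2022-07-11 19:30:55,589 INFO [PEWorker-4] procedure2.ProcedureExecutor: "
            + "Rolled back pid=128898, state=ROLLEDBACK, "
            + "exception=org.apache.hadoop.hbase.TableNotEnabledException via "
            + "master-disable-table:org.apache.hadoop.hbase.TableNotEnabledException: "
            + "tableName=TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING; "
            + "DisableTableProcedure table=TEST:SAMPLE_TASK_SCAN_NEWEST "
            + "exec-time=1 hrs, 58 mins, 13.627 sec");
        for (String s : summarize(sample)) {
            System.out.println(s);
        }
    }
}
```

Every entry in the log above would come out with the same table and `state=ENABLING`, which is what points at the stuck table state rather than at any individual procedure.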
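Root cause analysis:
Because the coprocessor can never be found, the RegionServer aborts every time it is restarted and cannot come up. This has to be fixed first, and the table problem after that.
The table TEST:SAMPLE_TASK_SCAN_NEWEST looks like the culprit, so once the restart problem was solved we planned to delete it.
But the delete failed with this error: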
org.apache.hadoop.hbase.TableNotDisabledException: org.apache.hadoop.hbase.TableNotDisabledException: test
	at org.apache.hadoop.hbase.master.HMaster.checkTableModifiable(HMaster.java:1740)
	at org.apache.hadoop.hbase.master.handler.TableEventHandler.prepare(TableEventHandler.java:86)
	at org.apache.hadoop.hbase.master.HMaster.deleteTable(HMaster.java:1576)
	at org.apache.hadoop.hbase.master.MasterRpcServices.deleteTable(MasterRpcServices.java:463)
	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:44229)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)
On top of that, the log keeps reporting "TEST:SAMPLE_TASK_SCAN_NEWEST, state=ENABLING;": the table is stuck in ENABLING, so we cannot disable it either.
Solution:
The CoprocessorHost source code in HBase shows:
// If we got here, e is not an IOException. A loaded coprocessor has a
// fatal bug, and the server (master or regionserver) should remove the
// faulty coprocessor from its set of active coprocessors. Setting
// 'hbase.coprocessor.abortonerror' to true will cause abortServer(),
// which may be useful in development and testing environments where
// 'failing fast' for error analysis is desired.
if (env.getConfiguration().getBoolean(ABORT_ON_ERROR_KEY, DEFAULT_ABORT_ON_ERROR)) {
  // server is configured to abort.
  abortServer(env, e);
} else {
  // If available, pull a table name out of the environment
  if (env instanceof RegionCoprocessorEnvironment) {
    String tableName = ((RegionCoprocessorEnvironment) env).getRegionInfo().getTable().getNameAsString();
    LOG.error("Removing coprocessor '" + env.toString() + "' from table '" + tableName + "'", e);
  } else {
    LOG.error("Removing coprocessor '" + env.toString() + "' from " +
        "environment", e);
  }
  // ... (the rest of the else branch is not shown in this excerpt)
}
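Adding this configuration to hbase-site.xml lets the RegionServer start even when a coprocessor cannot be loaded, instead of aborting (the default value is true).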
<property>
<name>hbase.coprocessor.abortonerror</name>
<value>false</value>
</property>
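To make the effect of this switch concrete, here is a simplified, standalone model of the decision. It is not HBase code: the class `AbortOnErrorSketch`, the `Action` enum, and the plain `Map` standing in for the Hadoop `Configuration` are all assumptions made for illustration. Only the key name and the default of `true` come from HBase itself.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the abort-on-error switch; NOT the real CoprocessorHost.
public class AbortOnErrorSketch {
    static final String ABORT_ON_ERROR_KEY = "hbase.coprocessor.abortonerror";
    static final boolean DEFAULT_ABORT_ON_ERROR = true; // HBase's default

    enum Action { ABORT_SERVER, REMOVE_COPROCESSOR }

    /** Decides what happens when a coprocessor fails, given the configured value. */
    static Action onCoprocessorFailure(Map<String, String> conf) {
        String raw = conf.get(ABORT_ON_ERROR_KEY);
        boolean abort = DEFAULT_ABORT_ON_ERROR;
        if (raw != null) {
            // Assumed to mimic Configuration.getBoolean: only "true"/"false" override the default.
            String v = raw.trim();
            if ("true".equalsIgnoreCase(v)) {
                abort = true;
            } else if ("false".equalsIgnoreCase(v)) {
                abort = false;
            }
        }
        return abort ? Action.ABORT_SERVER : Action.REMOVE_COPROCESSOR;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("unset -> " + onCoprocessorFailure(conf));
        conf.put(ABORT_ON_ERROR_KEY, "false");
        System.out.println("false -> " + onCoprocessorFailure(conf));
    }
}
```

With the property unset, the model picks `ABORT_SERVER`, which matches the crash seen in the RegionServer log; with `false` it picks `REMOVE_COPROCESSOR`, so the server keeps running without the faulty coprocessor.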
Use HBCK2 to resolve the series of RIT problems that followed the restart
HBCK2 options
setTableState <TABLENAME> <STATE>
Possible table states: ENABLED, DISABLED, DISABLING, ENABLING
To read current table state, in the hbase shell run:
hbase> get 'hbase:meta', '<TABLENAME>', 'table:state'
A value of \x08\x00 == ENABLED, \x08\x01 == DISABLED, etc.
Can also run a 'describe "<TABLENAME>"' at the shell prompt.
An example making table name 'user' ENABLED:
$ HBCK2 setTableState users ENABLED
Returns whatever the previous table state was.
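The raw `table:state` value mentioned in the help text can be decoded by hand. The sketch below assumes the value is always two bytes: `0x08` (the protobuf tag for field 1, varint) followed by the state number. The help text only confirms 0 = ENABLED and 1 = DISABLED; mapping 2 and 3 to DISABLING and ENABLING is an assumption based on the order of HBase's TableState enum. The class name `TableStateDecoder` is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper: decodes the two-byte table:state value read from hbase:meta.
public class TableStateDecoder {
    // Index = state number. 0 and 1 are confirmed by the HBCK2 help text;
    // 2 and 3 are assumed from the TableState enum order.
    private static final String[] STATES = {"ENABLED", "DISABLED", "DISABLING", "ENABLING"};

    static String decode(byte[] value) {
        if (value == null || value.length != 2 || value[0] != 0x08) {
            throw new IllegalArgumentException(
                "Unexpected table:state bytes: " + Arrays.toString(value));
        }
        int n = value[1];
        if (n < 0 || n >= STATES.length) {
            throw new IllegalArgumentException("Unknown state number: " + n);
        }
        return STATES[n];
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] {0x08, 0x00}));
        System.out.println(decode(new byte[] {0x08, 0x01}));
    }
}
```

Under that assumption, a table stuck in ENABLING like the one in this incident would show up in hbase:meta as `\x08\x03`.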
Run the command (it may not work on the first attempt)
/bin/hbase --config /etc/hbase-conf \
  hbck -j ./hbase-operator-tools-1.2.0/hbase-hbck2/target/hbase-hbck2-1.2.0.jar \
  setTableState TEST:SAMPLE_TASK_SCAN_NEWEST DISABLED