HBase Error Notes and Fixes

     1. HBase runs into all kinds of problems during operation; most of them can be resolved by adjusting configuration files, and some by patching the source code.
When the concurrency on HBase grows, it frequently fails with a "Too many open files" error. The log looks like this:
2012-06-01 16:05:22,776 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
2012-06-01 16:05:22,776 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3790131629645188816_18192

2012-06-01 16:13:01,966 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-299035636445663861_7843 file=/hbase/SendReport/83908b7af3d5e3529e61b870a16f02dc/data/17703aa901934b39bd3b2e2d18c671b4.9a84770c805c78d2ff19ceff6fecb972
     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1695)
     at java.io.DataInputStream.readBoolean(DataInputStream.java:242)
     at org.apache.hadoop.hbase.io.Reference.readFields(Reference.java:116)
     at org.apache.hadoop.hbase.io.Reference.read(Reference.java:149)
     at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:216)
     at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:282)
     at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2510)
     at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:449)
     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3228)
     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3176)
     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:331)
     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:107)
     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
     at java.lang.Thread.run(Thread.java:722)

Cause and fix: the default per-process open-file limit on Linux is usually 1024. Running `ulimit -n 65535` raises it immediately, but the change is lost after a reboot. To make it persistent, use one of the following three approaches:

1. Add the line `ulimit -SHn 65535` to /etc/rc.local
2. Add the line `ulimit -SHn 65535` to /etc/profile
3. Append the following two lines to the end of /etc/security/limits.conf:
* soft nofile 65535
* hard nofile 65535
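The steps above can be sketched as follows. For illustration the two limits.conf lines are appended to a scratch file in the current directory; on a real node you would append them (as root) to /etc/security/limits.conf and log in again for them to take effect.

```shell
# Raise the limit for the current shell only -- lost on reboot.
# The command can fail if the hard limit is lower, so we guard it.
ulimit -n 65535 2>/dev/null || true
ulimit -n                             # show the limit now in effect

# What the persistent limits.conf entries look like
# (written to a demo file here, not to /etc/security/limits.conf):
printf '%s\n' '* soft nofile 65535' '* hard nofile 65535' >> ./limits.conf.demo
cat ./limits.conf.demo
```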
          
     2. HDFS writes are governed by two timeout settings: dfs.socket.timeout and dfs.datanode.socket.write.timeout. Some sources suggest that only the latter, dfs.datanode.socket.write.timeout, needs to be changed, but the error reported below is actually a READ_TIMEOUT. The corresponding default values are:

  // Timeouts for communicating with DataNode for streaming writes/reads
  public static int READ_TIMEOUT = 60 * 1000;             // this is the value being exceeded
  public static int READ_TIMEOUT_EXTENSION = 3 * 1000;
  public static int WRITE_TIMEOUT = 8 * 60 * 1000;
  public static int WRITE_TIMEOUT_EXTENSION = 5 * 1000;   // for write pipeline

Log:
  11/10/12 10:50:44 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_8540857362443890085_4343699470java.net.SocketTimeoutException: 66000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.*.*.*:14707 remote=/*.*.*.24:80010] 
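The 66000 ms in this message is consistent with the defaults above if the extension is added once per datanode in the pipeline (my interpretation of the `*_EXTENSION` naming; a pipeline of 2 downstream nodes is assumed here):

```shell
# 60s base read timeout plus one 3s extension per pipeline datanode
READ_TIMEOUT=$((60 * 1000))
READ_TIMEOUT_EXTENSION=$((3 * 1000))
PIPELINE_NODES=2
echo $((READ_TIMEOUT + PIPELINE_NODES * READ_TIMEOUT_EXTENSION))   # prints 66000
```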
Cause and fix: since the failure is a timeout, add the following to the hadoop-site.xml configuration file (values are in milliseconds):

   <property>
     <name>dfs.datanode.socket.write.timeout</name>
     <value>3000000</value>
   </property>

   <property>
     <name>dfs.socket.timeout</name>
     <value>3000000</value>
   </property>

     3. A DataNode can also refuse to start when its namespaceID no longer matches the NameNode's (typically after the NameNode has been reformatted). Two workarounds:

Workaround 1: Start from scratch

I can testify that the following steps solve this error, but the side effects won't make you happy (me neither). The crude workaround I have found is to:

1.     stop the cluster

2.     delete the data directory on the problematic datanode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data

3.     reformat the namenode (NOTE: all HDFS data is lost during this process!)

4.     restart the cluster
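As a sketch, the four steps map to the commands below (assuming the Hadoop 1.x-era start/stop scripts and the tutorial's data directory; adjust DATA_DIR to your own dfs.data.dir). The sketch is deliberately a dry run that only prints the commands, because step 3 destroys all HDFS data:

```shell
# Dry run: set DRY_RUN= (empty) to actually execute. DANGER: destroys HDFS data.
DRY_RUN=echo
DATA_DIR=/usr/local/hadoop-datastore/hadoop-hadoop/dfs/data
$DRY_RUN stop-all.sh                   # 1. stop the cluster
$DRY_RUN rm -rf "$DATA_DIR"            # 2. delete the datanode's data directory
$DRY_RUN hadoop namenode -format       # 3. reformat the namenode (all HDFS data lost!)
$DRY_RUN start-all.sh                  # 4. restart the cluster
```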

If deleting all the HDFS data and starting from scratch does not sound like a good idea (it might be ok during the initial setup/testing), you might give the second approach a try.

Workaround 2: Updating namespaceID of problematic datanodes

Big thanks to Jared Stehler for the following suggestion. I have not tested it myself yet, but feel free to try it out and send me your feedback. This workaround is "minimally invasive" as you only have to edit one file on the problematic datanodes:

1.     stop the datanode

2.     edit the value of namespaceID in <dfs.data.dir>/current/VERSION to match the value of the current namenode

3.     restart the datanode

If you followed the instructions in my tutorials, the full path of the relevant file is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set hadoop.tmp.dir to /usr/local/hadoop-datastore/hadoop-hadoop).
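The edit in step 2 can be scripted. The sketch below works on a mock VERSION file in the current directory (on a real node the file is <dfs.data.dir>/current/VERSION and the datanode must be stopped first); the target namespaceID is taken from the example VERSION contents shown below, and GNU sed's `-i` is assumed:

```shell
# namespaceID the namenode reports (example value from the VERSION dump below)
NAMENODE_NS=393514426

# Build a mock datanode VERSION file with a stale namespaceID.
mkdir -p ./dfs-data-demo/current
cat > ./dfs-data-demo/current/VERSION <<'EOF'
namespaceID=123456789
storageID=DS-1706792599-10.10.10.1-50010-1204306713481
cTime=1215607609074
storageType=DATA_NODE
layoutVersion=-13
EOF

# Rewrite only the namespaceID line in place (GNU sed).
sed -i "s/^namespaceID=.*/namespaceID=${NAMENODE_NS}/" ./dfs-data-demo/current/VERSION
grep namespaceID ./dfs-data-demo/current/VERSION
```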

If you wonder what the contents of VERSION look like, here's one of mine:

#contents of <dfs.data.dir>/current/VERSION
namespaceID=393514426
storageID=DS-1706792599-10.10.10.1-50010-1204306713481
cTime=1215607609074
storageType=DATA_NODE
layoutVersion=-13


