Resource Overload Analysis on Hadoop Job Failure

Author: Alison Shu


We see some repeated Hadoop jobs go well in most situations but fail sporadically.  Excluding obvious Hadoop platform/service failures or bugs, we find that half of the cases are due to resource overload.

 

The eBay analytics platform has 3 shared Hadoop clusters with 6000+ nodes, as well as 6 shared Hadoop clients serving 400+ batch users and 2900+ individual users, so resource competition is common.  Hadoop client servers often hit storage, open process, and connection limits.  HDFS has name space quotas at the user/directory level.  The Yarn capacity scheduler constrains memory usage via queues and the user limit factor.  HDFS data nodes have limited space/memory, which may trap the tasks running on them.  I'll analyze the resource overload issue from these four aspects.

Resource Overload on Hadoop Client

A simple Hadoop shell command takes a long time to complete.  This usually happens when the server's average load is high.  The resolution is to find the top consumer and kill it, or to move to a free client server.

 

$ time hadoop fs -ls
real 10m30.461s 
user 0m3.514s 
sys 0m5.527s

$ uptime
21:08:35 up 43 days, 6:05, 269 users, load average: 219.03, 213.97, 214.75
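To find the top consumers, standard Linux tools are usually enough; a minimal sketch (options assume GNU procps/coreutils, and the pid is a placeholder):

$ ps -eo user,pid,pcpu,pmem,etime,comm --sort=-pcpu | head -20   # top CPU consumers by user and process
$ ps -eo user,pid,pcpu,pmem,etime,comm --sort=-pmem | head -20   # top memory consumers
$ kill <pid>                                                     # kill the offending process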

The batch account has reached its max process limit. This is a Linux server setting, but it will block Hadoop job submission.  The resolution is either increasing the limit or killing some open processes.

$ ulimit -u
1024
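To check how close the batch account is to the limit, and where the limit is configured, a sketch (the account name is a placeholder; changing nproc in /etc/security/limits.conf requires root):

$ ps -u b_xxxx --no-headers | wc -l      # current number of processes owned by the batch account
$ grep nproc /etc/security/limits.conf   # soft/hard process limits, e.g. a line like "b_xxxx soft nproc 4096"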

 

Disk space quota also contributes to Hadoop job failures. It will block Hadoop job submission and can even prevent users from sudo-ing into the specific batch account, reporting errors like the following.

 

Sudosh [sudosh.c, line 455]: Disk quota exceeded

java.io.IOException: Mkdirs failed to create /tmp/hadoop/xxxx/hadoop-unjar4861499463567804873
	at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:111)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:177)

write failed, user block limit reached.
Exception in thread "main" java.io.IOException: Disk quota exceeded
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)

Disk quota is set at the user level, and sometimes users can't find the big files on first inspection if they don't manage their files well.  Then they have to ask an admin to scan the whole disk to figure out the big files, for example with the du sketch after the quota output below.

 

Command example to check disk quota:

$quota -uvs b_xxxx 
Disk quotas for user b_xxxx(uid 741): 
     Filesystem blocks quota limit grace files quota limit grace 
/dev/cciss/c0d0p1 
                   401M 977M 977M 38306 0 0 
xxxx.lvs.ebay.com:/vol/ares/home 
                   684M 5120M 5120M 2096 4295m 4295m 
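A sketch for locating the big files under the account's home directory (the path is a placeholder; sort -h assumes GNU coreutils):

$ du -ah /home/b_xxxx 2>/dev/null | sort -rh | head -20   # largest files and directories first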

 

HDFS Name Space Quota

 

Hadoop makes large storage cheap, but with big data and high user adoption, eBay Hadoop clusters are facing capacity issues; the storage is occasionally nearly full.  The resolution is to enforce HDFS name space quota. 

 

Errors due to running out of HDFS name space quota:

 

DSQuotaExceededException: The DiskSpace quota of /user/xxxx is exceeded: quota=3298534883328 diskspace consumed=3072.3g
OutputChannel - can not move hdfs://xxx.vip.ebay.com:8020/tmp/xxxxx/tmp-out-step_1_of_1/ch30-23 to /user/xxxx/xxx

The NameSpace quota (directories and files) of directory /user/xxx is exceeded: quota=60000000 file count=60000001
        at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyNamespaceQuota(DirectoryWithQuotaFeature.java:138)

Command example to show HDFS name space quota status:

 

$ hadoop fs -count -q /user/xxxx 
    70000000 9289978 5629499534213120 778950194745572 8426351 52283671 1608683619286097 /user/xxxx
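For reference, the columns of hadoop fs -count -q are, in order: name quota, remaining name quota, space quota, remaining space quota, directory count, file count, content size, and path.  Quotas themselves are set by an HDFS admin, roughly as sketched below (the values are placeholders; the space quota counts raw bytes including replication):

$ hdfs dfsadmin -setQuota 70000000 /user/xxxx    # name quota: max number of files and directories
$ hdfs dfsadmin -setSpaceQuota 5t /user/xxxx     # space quota: max raw disk space, replication included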

HDFS space quota is easy to understand, but the name quota is often ignored.  Large volumes of small files usually result in running out of name quota (one common mitigation is to archive them, as sketched after the list below).  The negative impact of small files is listed below:

·        A file smaller than the HDFS block size (128MB) still takes up a whole block entry in the name node's metadata, even though it does not consume 128MB of disk space.

·        Every file, directory, and block in HDFS is represented as an object in the name node's memory and occupies about 200-300 bytes.  So a large volume of small files wastes name node memory, possibly beyond the whole system's capacity.

·        HDFS is not geared up for efficiently accessing small files.  Reading through small files causes lots of seeks and hopping from data node to data node, which is inefficient.
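A sketch of consolidating small files with Hadoop Archives (the paths and archive name are placeholders); files stay readable through the har:// scheme, and the originals can be deleted once the archive is verified:

$ hadoop archive -archiveName small.har -p /user/xxxx/small_files /user/xxxx/archives
$ hadoop fs -ls har:///user/xxxx/archives/small.har   # list the archived files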

 

Yarn Capacity Scheduler

Yarn offers two main schedulers: the fair scheduler and the capacity scheduler.  The fair scheduler allocates an equal share of cluster resources to jobs; the capacity scheduler assigns resources through configured hierarchical queues.  The capacity scheduler is used by large clusters with multiple tenants, and eBay uses it.

 

Queue capacity usage is easy to observe: besides the various cluster monitoring tools, the built-in resource manager web UI provides a vivid view.

The screenshot of the resource manager web UI below shows 3 statuses of queue capacity:

·        Green – used less than the absolute capacity of the queue

·        Orange – used over the absolute capacity but less than the absolute max capacity of the queue

·        Red – reached the absolute max capacity of the queue



Expanding the queue name shows the configured capacity parameters:


The queue absolute capacity and absolute max capacity are easy to understand and check.  However, users may complain about why they can't use the queue's absolute max capacity while the cluster has free resources.  User-level capacity is controlled by the configured user limit factor.  E.g., the value is 2 in the screenshot above, i.e. one user could hit 2 x queue capacity (0.2%), totaling 0.4%, instead of the absolute max capacity (9.8%).  This setup considers that a queue is shared by multiple users and needs to ensure one user does not take all the queue's resources.  When jobs get stuck due to user capacity outage, we usually see dozens of running jobs under one account in one queue.  So it's reasonable to split the jobs across multiple accounts, multiple queues, or different time windows based on job attributes and priority.  For SLA (service level agreement) jobs asking for immediate resolution, temporarily increasing the user limit factor in the resource manager to let the user (account) take the absolute max capacity of the queue is also feasible.  In eBay, given the complex demand and resource competition, we try to change some capacity configurations by time window.
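For reference, the user limit factor is the per-queue property yarn.scheduler.capacity.<queue-path>.user-limit-factor in capacity-scheduler.xml, and a temporary change can be applied without restarting the resource manager.  A sketch, with the queue name as a placeholder:

$ yarn rmadmin -refreshQueues    # reload capacity-scheduler.xml on the resource manager
$ yarn queue -status xxxx        # show the queue's configured capacity, maximum capacity and current usage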

HDFS Data Node Resource Outage

A job may get stuck in a specific task.  When checking the data node running that task, in addition to bad node issues, we also find resource outages in some situations. 

One is no space left on the local disks.  Usually we see some jobs write huge stderr output into their logs. The sudden growth of scratch files will trap other tasks already launched on the same data node. The resolution to save the stuck job is to fail the task, so that a new task attempt will launch on a data node with free resources.  Meanwhile, figure out which job is generating the huge log, kill it, and purge the log.

Errors due to a data node running out of space:

Can't create directory application_1454552186040_202442 in /hadoop/5/scratch/local/usercache/xxxx/application_1454552186040_202442 - No space left on device
java.io.FileNotFoundException: File file:/hadoop/5/scratch/local/usercache/xxxx/application_1454552186040_202442 does not exist
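A sketch for spotting which application is filling the local disks on the affected data node (the mount points follow the paths in the errors above and are otherwise placeholders):

$ df -h /hadoop/*                                                                  # which local disks are full
$ du -sh /hadoop/*/scratch/local/usercache/*/* 2>/dev/null | sort -rh | head -20   # largest application scratch dirs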

The other is low available memory.  If we see a job task stuck on such a node, the resolution is to fail the task, and the AM will automatically launch a new attempt on another data node. The Hadoop resource manager web UI shows each data node's memory usage.
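A sketch of the manual intervention, assuming a MapReduce job (the task attempt ID and node ID are placeholders):

$ mapred job -fail-task <task-attempt-id>   # fail the stuck attempt; a new attempt is scheduled on another node
$ yarn node -list                           # list data nodes and their state
$ yarn node -status <node-id>               # show memory used vs. memory capacity on a node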


Conclusion

Resource overload is one frequent root cause of Hadoop job failure, especially in large shared clusters.  This underscores the critical role of professional Hadoop capacity management. 


