Author: Alison Shu
We see some repeated Hadoop jobs run well in most situations but fail sporadically. Excluding obvious Hadoop platform/service failures or bugs, we find that half of these cases are due to resource overload.
The eBay analytics platform has 3 shared Hadoop clusters with 6,000+ nodes, as well as 6 shared Hadoop client servers for 400+ batch users and 2,900+ individual users, so resource competition is common. Hadoop client servers often hit storage, open-process, and connection limits. HDFS has name space quotas at the user/directory level. The YARN capacity scheduler constrains memory usage via queues and the user limit factor. An HDFS data node has limited space and memory, which may trap the tasks running on it. I'll analyze resource overload issues from these four aspects.
Resource Overload on Hadoop Client
A simple Hadoop shell command takes a long time to complete. Usually this happens when the server's average load is high. The resolution is to find the top consumers and kill them, or turn to a free client server.
$ time hadoop fs -ls
real 10m30.461s
user 0m3.514s
sys 0m5.527s
$ uptime
21:08:35 up 43 days, 6:05, 269 users, load average: 219.03, 213.97, 214.75
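To find the top consumers, a minimal sketch using standard Linux tools (the PID below is hypothetical) is to sort processes by CPU usage and then kill the offender:
# List the top CPU consumers on the client server
$ ps -eo pid,user,pcpu,pmem,etime,cmd --sort=-pcpu | head -20
# Kill an offending process once identified (PID is hypothetical)
$ kill -9 12345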
The batch account has reached the max process limit. It's a Linux server setting, but it will block Hadoop job submission. The resolution is either increasing the limit or killing some open processes.
$ ulimit -u
1024
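To see how close the account is to that limit, a simple sketch is to count its current processes (the account name follows the examples in this post; the count shown is illustrative):
# Count the processes currently owned by the batch account
$ ps -u b_xxxx --no-headers | wc -l
1019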
Disk space quota also contributes to Hadoop job failures. It will block Hadoop job submission, and it can block users from sudoing to a specific batch account, reporting errors like the following.
Sudosh [sudosh.c, line 455]: Disk quota exceeded
java.io.IOException: Mkdirs failed to create /tmp/hadoop/xxxx/hadoop-unjar4861499463567804873
at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:111)
at org.apache.hadoop.util.RunJar.main(RunJar.java:177)
write failed, user block limit reached.
Exception in thread "main" java.io.IOException: Disk quota exceeded
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
Disk quota is set at the user level, and sometimes users can't find the big files at first if they don't manage their files well. Then they have to ask an admin to scan the whole disk to figure out the big files.
Command example to check disk quota:
$ quota -uvs b_xxxx
Disk quotas for user b_xxxx(uid 741):
Filesystem blocks quota limit grace files quota limit grace
/dev/cciss/c0d0p1
401M 977M 977M 38306 0 0
xxxx.lvs.ebay.com:/vol/ares/home
684M 5120M 5120M 2096 4295m 4295m
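Before involving an admin, users can often locate the big files themselves. A minimal sketch with standard Linux tools (the path is illustrative):
# Find the largest files and directories under the home directory
$ du -ah ~ 2>/dev/null | sort -rh | head -20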
HDFS Name Space Quota
Hadoop makes large storage cheap, but with big data and high user adoption, eBay Hadoop clusters are facing capacity issues, and the storage is occasionally nearly full. The resolution is to enforce HDFS name space quotas.
Error due to running out of HDFS name space quota:
DSQuotaExceededException: The DiskSpace quota of /user/xxxx is exceeded: quota=3298534883328 diskspace consumed=3072.3g
OutputChannel - can not move hdfs://xxx.vip.ebay.com:8020/tmp/xxxxx/tmp-out-step_1_of_1/ch30-23 to /user/xxxx/xxx
The NameSpace quota (directories and files) of directory /user/xxx is exceeded: quota=60000000 file count=60000001 at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyNamespaceQuota(DirectoryWithQuotaFeature.java:138) at
Command example to show HDFS name space quota status:
$ hadoop fs -count -q /user/xxxx
70000000 9289978 5629499534213120 778950194745572 8426351 52283671 1608683619286097 /user/xxxx
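The eight columns are the name quota, remaining name quota, space quota, remaining space quota, directory count, file count, content size, and path name. The quotas themselves are set by an administrator with hdfs dfsadmin; a sketch, with illustrative values and path:
# Name quota: the maximum number of files and directories under the path
$ hdfs dfsadmin -setQuota 70000000 /user/xxxx
# Space quota: the maximum raw disk space, which counts replication
$ hdfs dfsadmin -setSpaceQuota 5t /user/xxxx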
HDFS space quota is easy to understand, but name quota is often ignored. Large volumes of small files usually result in running out of name quota. The negative impacts of small files are listed below, followed by a sketch of one way to spot and mitigate them:
· If a file is smaller than the HDFS block size (128 MB by default), it still occupies a whole block entry in the name node's metadata, even though it doesn't fill the block on disk.
· Every file, directory, and block in HDFS is represented as an object in the name node's memory and occupies about 200-300 bytes. So a large volume of small files wastes name node memory, possibly beyond the whole system's capacity.
· HDFS is not geared up to access small files efficiently. Reading through many small files causes lots of seeks and hopping from data node to data node, which is inefficient.
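A minimal sketch for spotting and mitigating small files: list files below the block size to gauge the problem, then pack cold small files into a Hadoop archive (HAR) so that many files collapse into a few name node objects. The paths and archive name are illustrative.
# Count files smaller than the 128 MB block size (column 5 of -ls output is the file size in bytes)
$ hadoop fs -ls -R /user/xxxx/logs | awk '$1 !~ /^d/ && $5 < 134217728' | wc -l
# Pack the directory of small files into a Hadoop archive to reduce name node objects
$ hadoop archive -archiveName logs.har -p /user/xxxx/logs /user/xxxx/archive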
YARN Capacity Scheduler
YARN has two main schedulers: the fair scheduler and the capacity scheduler. The fair scheduler allocates an equal share of cluster resources to jobs; the capacity scheduler uses hierarchical queues configured to assign resources. The capacity scheduler is used by large clusters with multiple tenants, and eBay uses it.
Queue capacity usage is easy to check; while there are various cluster monitoring tools, the built-in resource manager web UI provides a vivid view.
The screenshot of the resource manager web UI below shows 3 statuses of queue capacity:
· Green – used less than the absolute capacity of the queue
· Orange – used over the absolute capacity but less than the absolute max capacity of the queue
· Red – reached the absolute max capacity of the queue
Expand the queue name to see the configured capacity parameters:
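If the web UI is not handy, the same queue parameters can also be pulled from the command line; a minimal sketch, with a hypothetical queue name:
# Prints the queue state, configured capacity, current capacity, and maximum capacity
$ yarn queue -status analytics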
The queue absolute capacity and absolute max capacity are easy to understand and check. However, users may complain that they can't use the queue's absolute max capacity while the cluster has free resources. User-level capacity is controlled by the configured user limit factor. E.g. the value is 2 in the screenshot above, i.e. one user could hit 2 x queue capacity (0.2%), totaling 0.4%, instead of the absolute max capacity (9.8%). This setup considers that a queue is shared by multiple users and needs to make sure one user does not take all the queue's resources. When jobs are stuck due to running out of user capacity, we usually see dozens of running jobs under one account in one queue. So it's reasonable to split the jobs into multiple accounts, multiple queues, or different time windows based on job attributes and priority. For SLA (service level agreement) jobs asking for immediate resolution, temporarily increasing the user limit factor in the resource manager to allow the user (account) to take the absolute max capacity of the queue is also feasible. At eBay, given the complex demand and resource competition, we try to change some capacity configurations through time windows.
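For reference, the user limit factor corresponds to the yarn.scheduler.capacity.<queue-path>.user-limit-factor property in capacity-scheduler.xml. A minimal sketch of checking and refreshing it (the queue path and config file location are assumptions):
# Inspect the user limit factor configured for a queue
$ grep -A1 "analytics.user-limit-factor" /etc/hadoop/conf/capacity-scheduler.xml
  <name>yarn.scheduler.capacity.root.analytics.user-limit-factor</name>
  <value>2</value>
# After editing the value, apply it without restarting the ResourceManager
$ yarn rmadmin -refreshQueues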
HDFS Data Node Resource Outage
A job may be stuck in a specific task. Checking the data node running the task, in addition to bad node issues, we also find resource outages in some situations.
One is no space left. Usually we see some jobs write huge stderr output into their logs. The sudden growth of scratch files will trap other tasks already launched on the same data node. The resolution to save the stuck job is to fail the task, and a new task attempt will launch on a data node with free resources. Meanwhile, figure out the job generating the huge log, kill it, and purge the log.
Error due to data node out of space:
Can't create directory application_1454552186040_202442 in /hadoop/5/scratch/local/usercache/xxxx/application_1454552186040_202442 - No space left on device
java.io.FileNotFoundException: File file:/hadoop/5/scratch/local/usercache/xxxx/application_1454552186040_202442 does not exist
The other is low available memory. If we see a job task stuck on such a node, the resolution is to fail the task, and the AM will automatically launch a new one on another data node. The Hadoop resource manager web UI shows data node memory usage.
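Failing a stuck task attempt can be done from the command line; a minimal sketch, with a hypothetical task attempt ID:
# Fail the stuck task attempt; the ApplicationMaster will retry it on another data node
$ mapred job -fail-task attempt_1454552186040_202442_m_000017_0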
Conclusion
Resource overload is one frequent root cause of Hadoop job failures, especially in large shared clusters. It underscores the critical role of professional Hadoop capacity management.