I Like Tez, DevOps Edition (WIP)

I work on Tez, so it would be hard not to like Tez. There's a reason for that, too: whenever Tez does something I don't like, I can put my back into it and shove Tez back towards that straight & narrow path.

Just before Hortonworks, I was part of the ZCloud division at Zynga - the casual disregard devs had for operations hurt my sleep cycle and general peace of mind. I know they're chasing features, but whenever someone puts in a change that takes actual work to roll back, I cringe. And I like how Tez doesn't make the same mistakes here.

First of all, you don't install "Tez" on a cluster. The cluster runs YARN, which means two very important things.

There is no "installing Tez" on your 350 nodes and waiting for it to start up. You throw a few jars into an HDFS directory and write tez-site.xml on exactly one machine, pointing to that HDFS path.
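A minimal sketch of what that single file looks like (the HDFS path and tarball name here are hypothetical; `tez.lib.uris` is the property that points clients at the HDFS copy of the jars):

```xml
<!-- tez-site.xml on the one client/gateway machine -->
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <!-- hypothetical HDFS location where the Tez jars were uploaded -->
    <value>${fs.defaultFS}/apps/tez/tez.tar.gz</value>
  </property>
</configuration>
```

An upgrade then amounts to uploading a new tarball to a new HDFS directory and flipping this one value.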

This means several important things for a professional deployment of the platform. There's no real pain around rolling upgrades, because there is nothing to restart - all existing queries use the old version, all new queries automatically use the new version. This is particularly relevant for a 24-hour round-the-clock data insertion pipeline, but perhaps not for a BI-centric service where you can bounce it pretty quickly after emailing a few people.

Letting you run different versions of Tez at the same time is very different from how MR used to behave. Personally, on a day-to-day basis, this helps a lot with sharing a multi-tenant dev environment & with the overall quality of my work - I test everything I write on a big cluster, without worrying about whether I'll nuke anyone else's Tez builds.

Next up, I like how Tez handles failure. You can lose connectivity to half your cluster and the tasks will keep running, perhaps a bit slowly. YARN takes care of bad nodes - cases where nodes are having disk failures or any such hiccup that is normal when you're maxing out 400+ nodes all day long. And coming from the MR school of thought, the task failure scenario is pretty much covered with re-execution mechanisms.
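The re-execution idea is simple enough to sketch outside of Hadoop entirely - a failed attempt is just retried until it succeeds or a per-task attempt limit is hit. This is purely illustrative (not Tez's code), and the limit of 4 is an assumption mirroring MR-style defaults:

```python
# Illustrative sketch of MR-style task re-execution (not Tez's implementation).
MAX_ATTEMPTS = 4

def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
    """Run `task` (a zero-arg callable), retrying on failure.

    Returns (attempt_number, result); a real scheduler would also
    blacklist flaky nodes and pick a fresh container per attempt.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, task()
        except Exception as e:
            failures.append(e)
    raise RuntimeError(f"task failed after {max_attempts} attempts: {failures}")

# A flaky task that fails twice before succeeding:
state = {"calls": 0}
def flaky_task():
    state["calls"] += 1
    if state["calls"] < 3:
        raise IOError("lost a node")
    return "done"

attempt, result = run_with_retries(flaky_task)
print(attempt, result)  # 3 done
```

The point is that the failure handling lives in the framework, not in the job's own code.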

There's something important to be covered here with failure. For any task attempt that accidentally kills a container (like a bad UDF with a memory leak), there is no real loss of previous data, because the data already committed by a task is not served out of a container at all. The NodeManager serves all the data across the cluster with its own secure shuffle handlers. As long as the NodeManager is running, you could kill the existing containers on that node and hand off that capacity to another task.

This is very important for busy clusters, because as the aphorism goes, "The difference between time and space is that you can re-use space". I guess the same applies to a container holding onto an in-memory structure, waiting for its data to be pulled off to another task.

And any hadoop-2 installation already has NodeManager alerts/restarts coded in, without needing any new devops work to bring errant nodes back online.

This brings me to the next bit of error tolerance in the system - the ApplicationMaster. The old problem with hadoop-1.x was that the JobTracker was a single point of failure for every job. With YARN, that went away entirely, with an ApplicationMaster coded specifically for each application type.

Now, most applications do not want to write all the bits and bobs required to run their own ApplicationMaster. Something like Hive could've built its own ApplicationMaster (rather, we could've built it as part of our perf effort) - after all, Storm did, HBase did, and so did Giraph.

The vision of Tez is that there's a possible generalization for the problem. Just like MR was a simple distribution mechanism for a bipartite graph which spawned a huge variety of tools, there exists a way to express more complex graphs generically, building a new assembly language for data-driven applications.

Make no mistake, Tez is an assembly language at its very core. It is raw and expressive, but it is an expert's tool, meant to be wielded by compiler developers catering to a tool userland. Pig and Hive already have compilers targeting this new backend. Cascading and then Scalding will add some API fun to the mix, but the framework sits below all of those and consolidates everyone's efforts into a common, rich baseline for performance. And there's a secret MapReduce compiler hidden away in Tez as well, which often gets ignored.
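The generalization is easy to picture with a toy model (purely illustrative - this is not the Tez API): MR is the special case of a two-vertex graph, map → reduce, while a compiler like Hive's can emit arbitrary vertices and edges and let the framework worry about when each vertex becomes runnable:

```python
# Toy DAG model (illustrative only, not the Tez API): vertices are named
# processing stages, edges move data from producer to consumer.
from collections import defaultdict, deque

class Dag:
    def __init__(self, name):
        self.name = name
        self.edges = defaultdict(list)   # producer -> [consumers]
        self.indegree = defaultdict(int)
        self.vertices = set()

    def add_vertex(self, v):
        self.vertices.add(v)
        return self

    def add_edge(self, producer, consumer):
        self.edges[producer].append(consumer)
        self.indegree[consumer] += 1
        return self

    def schedule_order(self):
        """Kahn's algorithm: the order in which vertices become runnable."""
        ready = deque(sorted(v for v in self.vertices if self.indegree[v] == 0))
        indeg = dict(self.indegree)
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for c in self.edges[v]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        return order

# Classic MR is the bipartite special case...
mr = Dag("wordcount")
mr.add_vertex("map").add_vertex("reduce").add_edge("map", "reduce")

# ...while a Hive-style plan is a richer graph: two scans feeding a join,
# then an aggregation, in one DAG instead of a chain of separate MR jobs.
q = Dag("join-query")
for v in ("scan_a", "scan_b", "join", "aggregate"):
    q.add_vertex(v)
q.add_edge("scan_a", "join").add_edge("scan_b", "join").add_edge("join", "aggregate")

print(mr.schedule_order())  # ['map', 'reduce']
print(q.schedule_order())   # ['scan_a', 'scan_b', 'join', 'aggregate']
```

A compiler emits the graph; the framework owns scheduling, locality and recovery.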

A generalization is fine, but it is often a limitation as well - nearly every tool listed above wants to write small parts of the scheduling mechanisms, which allows for custom data routing and connecting up task outputs to task inputs manually (like a bucketed map-join). Tez is meant to be a good generalization to build each application's custom components on top of, without actually writing any of the complex YARN code required to have error tolerance, rack/host locality and recovery from AM crashes. The VertexManager plugin API is one classic example of how an application can now interfere with how a DAG is scheduled and how its individual tasks are managed.

And last of all, I like how Tez is not self-centered: it works towards the global utilization ratio of a cluster, not just its own latency figures. It can be built to respond elastically to queue/cluster pressure from other tasks running on the cluster.

People are doing Tez a disservice by comparing it to frameworks which rely on keeping slaves running not just to execute CPU tasks but to hold onto temporary storage as well. On a production cluster, getting 4 fewer containers than you asked for will not stall Tez, because of the way it uses the Shuffle mechanism as a temporary data store between DAG vertices - it is designed to be all-or-best-effort, instead of waiting for the perfect moment to run your entire query. A single stalling reducer doesn't require any of the other JVMs to stay resident and wait. This isn't a problem for a daemon-based multi-tenant cluster, because if there is another job for that cluster it will execute, but for a hadoop ecosystem cluster built on YARN, it means that your cluster utilization takes a nose-dive due to the inability to acquire or release cluster resources incrementally/elastically during your actual data operation.

Between the frameworks I've played with, that is the real differentiating feature of Tez - Tez does not require containers to be kept running to do anything, just the AM running in the idle periods between different queries. You can hold onto containers, but that is an optimization, not a requirement, during the idle periods of a session.

I might not exactly be a fan of the user-friendliness of this assembly-language layer for hadoop, but the flexibility more than compensates.
