Running Hadoop MapReduce on Tachyon

This guide describes how to get Tachyon running with Hadoop MapReduce, so that you can easily use your MapReduce programs with files stored on Tachyon.

Prerequisites

The prerequisite for this part is that you have Java installed. We also assume that you have set up Tachyon and Hadoop in accordance with these guides: Local Mode or Cluster Mode.

If running a Hadoop 1.x cluster, ensure that the core-site.xml file in your Hadoop installation's conf directory has the following properties added:

<property>
  <name>fs.tachyon.impl</name>
  <value>tachyon.hadoop.TFS</value>
</property>
<property>
  <name>fs.tachyon-ft.impl</name>
  <value>tachyon.hadoop.TFSFT</value>
</property>

This will allow your MapReduce jobs to use Tachyon for their input and output files. If you are using HDFS as the underlying store for Tachyon, it may be necessary to add these properties to the hdfs-site.xml conf file as well.

If the cluster is a Hadoop 2.x cluster, these properties are not needed.
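Once the Tachyon client jar is on the Hadoop classpath (covered in the next section), a quick way to confirm that the tachyon:// scheme resolves is to list a path through the Hadoop shell. This is only a sketch; the host and port assume a Tachyon master running locally on the default port 19998:

# List the root of the Tachyon filesystem through Hadoop's FileSystem interface.
$ cd $HADOOP_HOME
$ ./bin/hadoop fs -ls tachyon://localhost:19998/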

Distributing Tachyon Executables

In order for the MapReduce job to be able to use files via Tachyon, we will need to distribute the Tachyon jar amongst all the nodes in the cluster. This will allow the TaskTracker and JobClient to have all the requisite executables to interface with Tachyon.

We are presented with three options for distributing the jar, as outlined in this guide from Cloudera.

Assuming that Tachyon will be used heavily, it is best to ensure that the Tachyon jar permanently resides on each node, so that we neither rely on the Hadoop DistributedCache and pay the network cost of distributing the jar for every job (Option 1), nor significantly increase the job jar size by packaging Tachyon with it (Option 2). For this reason, of the three options laid out, it is highly recommended to take the third route and install the Tachyon jar on each node.

  • For installing Tachyon on each node, you must place the tachyon-client-0.6.3-jar-with-dependencies.jar, located in the tachyon/client/target directory, in the $HADOOP_HOME/lib directory of each node, and then restart all of the TaskTrackers (see the sketch after this list). One downside of this approach is that the jars must be installed again whenever you update to a new release.

  • You can also run a job by using the -libjars command line option when using hadoop jar ..., specifying /pathToTachyon/client/target/tachyon-client-0.6.3-jar-with-dependencies.jar as the argument. This will place the jar in the Hadoop DistributedCache, and is desirable only if you are updating the Tachyon jar a non-trivial number of times.

  • For those interested in the second option, please revisit the Cloudera guide for more assistance. One must simply package the Tachyon jar in the lib subdirectory of the job jar. This option is the least desirable: for every change to Tachyon, we must recreate the job jar, and the larger job jar incurs an additional network cost for every job.
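For the recommended third option, the per-node installation amounts to copying the client jar into Hadoop's lib directory on every node and restarting the TaskTrackers. The sketch below assumes two hypothetical worker nodes named worker1 and worker2 and an identical $HADOOP_HOME path on every machine:

# Copy the Tachyon client jar into Hadoop's lib directory on each node.
$ scp /pathToTachyon/client/target/tachyon-client-0.6.3-jar-with-dependencies.jar worker1:$HADOOP_HOME/lib/
$ scp /pathToTachyon/client/target/tachyon-client-0.6.3-jar-with-dependencies.jar worker2:$HADOOP_HOME/lib/
# Restart the MapReduce daemons so the TaskTrackers pick up the new classpath.
$ cd $HADOOP_HOME
$ ./bin/stop-mapred.sh
$ ./bin/start-mapred.sh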

In order to make the Tachyon executables available to the JobClient, one can also install the Tachyon jar in the $HADOOP_HOME/lib directory, or modify HADOOP_CLASSPATH by adding the following to hadoop-env.sh:

$ export HADOOP_CLASSPATH=/pathToTachyon/client/target/tachyon-client-0.6.3-jar-with-dependencies.jar

This will allow the code that creates the Job and submits it to reference Tachyon if necessary.
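As a quick check that the JobClient can actually see the jar (assuming your Hadoop version provides the classpath subcommand, which Hadoop 1.x does), inspect the resolved classpath:

# Print Hadoop's effective classpath and look for the Tachyon client jar.
$ ./bin/hadoop classpath | tr ':' '\n' | grep tachyon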

Example

For simplicity, we will assume a pseudo-distributed Hadoop cluster.

$ cd $HADOOP_HOME
$ ./bin/stop-all.sh
$ ./bin/start-all.sh

Because we have a pseudo-distributed cluster, copying the Tachyon jar into $HADOOP_HOME/lib makes the Tachyon executables available to both the TaskTrackers and the JobClient. We can now verify that it is working as follows:

$ cd $HADOOP_HOME
$ ./bin/hadoop jar hadoop-examples-1.0.4.jar wordcount -libjars /pathToTachyon/client/target/tachyon-client-0.6.3-jar-with-dependencies.jar tachyon://localhost:19998/X tachyon://localhost:19998/X-wc

Here X is some file in Tachyon, and the results of the wordcount job are written to the X-wc directory.
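If you do not yet have a file X in Tachyon, one way to create it is with the Tachyon command line shell; the sketch below assumes the tfs commands shipped with this Tachyon release and uses the LICENSE file as arbitrary input:

$ cd /pathToTachyon
# Copy a local file into Tachyon as /X.
$ ./bin/tachyon tfs copyFromLocal LICENSE /X
# After the wordcount job finishes, list its output directory.
$ ./bin/tachyon tfs ls /X-wc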

For example, say you have text files in HDFS directory /user/hduser/gutenberg/. You can run the following:

$ cd $HADOOP_HOME
$ ./bin/hadoop jar hadoop-examples-1.0.4.jar wordcount -libjars /pathToTachyon/client/target/tachyon-client-0.6.3-jar-with-dependencies.jar tachyon://localhost:19998/user/hduser/gutenberg tachyon://localhost:19998/user/hduser/output

The above command tells wordcount to load the files from the HDFS directory /user/hduser/gutenberg/ into Tachyon and then save the output results to /user/hduser/output/ in Tachyon.
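Once the job completes, the output should be visible through the same tachyon:// scheme, for example (host, port, and paths as in the command above):

# List the wordcount output directory stored in Tachyon.
$ cd $HADOOP_HOME
$ ./bin/hadoop fs -ls tachyon://localhost:19998/user/hduser/output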

