Configuring Alluxio with HDFS

Initial Setup

To run an Alluxio cluster on a set of machines, the Alluxio binaries must be deployed to each of those machines. You can either compile Alluxio from source yourself, or download the pre-built binaries.

Note that, by default, the pre-built Alluxio binaries are compiled against HDFS 2.2.0. To use another version of Hadoop, you need to recompile Alluxio from source, setting the Hadoop version at build time in one of the following ways. Assume ${ALLUXIO_HOME} is the root directory of the Alluxio source tree.

  • Modify the hadoop.version tag in the ${ALLUXIO_HOME}/pom.xml configuration file. For example, to work with Hadoop 2.6.0, change "<hadoop.version>2.2.0</hadoop.version>" in that pom file to "<hadoop.version>2.6.0</hadoop.version>", then recompile with maven:
$ mvn clean package
  • Alternatively, you can specify the Hadoop version on the maven command line when compiling. For example, for Hadoop HDFS 2.6.0:
$ mvn -Dhadoop.version=2.6.0 clean package

If everything succeeds, you should see the alluxio-assemblies-1.0.1-jar-with-dependencies.jar file in the assembly/target directory; this jar is used to run the Alluxio Master and Worker.

Configuring Alluxio

To run the Alluxio binaries, you must first create a configuration file from the template:

$ cp conf/alluxio-env.sh.template conf/alluxio-env.sh

Then edit the alluxio-env.sh file and set the under storage address to the address of your HDFS namenode (e.g., hdfs://localhost:9000 if the HDFS namenode is running locally on the default port):

export ALLUXIO_UNDERFS_ADDRESS=hdfs://NAMENODE:PORT

Running Alluxio Locally with HDFS

Once the configuration is done, you can start Alluxio locally and verify that everything runs correctly:

$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local

This should start one Alluxio master and one Alluxio worker. You can visit http://localhost:19999 in your browser to see the master Web UI.

Next, you can run a simple example program:

$ ./bin/alluxio runTests

After this succeeds, visit the HDFS Web UI at http://localhost:50070 and confirm that it contains the files and directories created by Alluxio. For this test, the created files should have names like: /alluxio/data/default_tests_files/BasicFile_STORE_SYNC_PERSIST

To stop Alluxio, run:

$ ./bin/alluxio-stop.sh all

Running Spark on Alluxio

This guide describes how to run Apache Spark on Alluxio. HDFS is used as an example of a distributed under storage system. Note that Alluxio supports many other under storage systems in addition to HDFS, and enables frameworks like Spark to read data from or write data to any of those systems.

Compatibility

Alluxio works with Spark 1.1 or later out of the box.

Prerequisites

General Setup

  • An Alluxio cluster has been set up in accordance with these guides for either Local Mode or Cluster Mode.

  • The Alluxio client needs to be compiled with the Spark-specific profile. Build the entire project from the top-level alluxio directory with the following command:

mvn clean package -Pspark -DskipTests
  • Add the following line to spark/conf/spark-env.sh.
export SPARK_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:$SPARK_CLASSPATH

Additional Setup for HDFS

  • If Alluxio is run on top of a Hadoop 1.x cluster, create a new file spark/conf/core-site.xml with the following content:
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
  • If you are running Alluxio in fault tolerant mode with Zookeeper and the Hadoop cluster is 1.x, additionally add the following entry to the previously created spark/conf/core-site.xml:
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
</property>

and the following line to spark/conf/spark-env.sh:

export SPARK_JAVA_OPTS="
  -Dalluxio.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
  -Dalluxio.zookeeper.enabled=true
  $SPARK_JAVA_OPTS
"
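Putting the two settings together, a complete spark/conf/spark-env.sh for a fault-tolerant setup might look like the following sketch (the jar path and Zookeeper hostnames are placeholders; adjust them for your cluster):

```shell
# Sketch of spark/conf/spark-env.sh for a fault-tolerant Alluxio setup.
# /pathToAlluxio and the zookeeperHost* names must be replaced with the
# real paths and hostnames of your deployment.
export SPARK_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:$SPARK_CLASSPATH
export SPARK_JAVA_OPTS="
  -Dalluxio.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
  -Dalluxio.zookeeper.enabled=true
  $SPARK_JAVA_OPTS
"
```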

Use Alluxio as Input and Output

This section shows how to use Alluxio as input and output sources for your Spark applications.

Use Data Already in Alluxio

First, we will copy some local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE /LICENSE

Run the following commands from spark-shell, assuming Alluxio Master is running on localhost:

> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE.
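The map step in the example simply concatenates each line with itself. As a quick sanity check of what LICENSE2 should contain, the same transformation can be sketched in plain shell, with no Spark or Alluxio required:

```shell
# Each output line is the input line repeated twice, mirroring
# s.map(line => line + line) from the spark-shell example above.
printf 'alpha\nbeta\n' | awk '{ print $0 $0 }'
# → alphaalpha
#   betabeta
```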

Use Data from HDFS

Alluxio supports transparently fetching the data from the under storage system, given the exact path. Put a file LICENSE into HDFS, assuming the namenode is running on localhost and the Alluxio project directory is /alluxio:

$ hadoop fs -put -f /alluxio/LICENSE hdfs://localhost:9000/LICENSE

Note that Alluxio initially has no notion of this file; you can verify this by going to the web UI. Run the following commands from spark-shell, assuming Alluxio Master is running on localhost:

> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE. Also, the LICENSE file now appears in the Alluxio file system space.

NOTE: It is possible that the LICENSE file is not in Alluxio storage (Not In-Memory). This is because Alluxio only stores fully read blocks, and if the file is too small, the Spark job will have each executor read a partial block. To avoid this behavior, you can specify the partition count in Spark. For this example, we would set it to 1 as there is only 1 block.

> val s = sc.textFile("alluxio://localhost:19998/LICENSE", 1)
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

Using Fault Tolerant Mode

When running Alluxio with fault tolerant mode, you can point to any Alluxio master:

> val s = sc.textFile("alluxio-ft://standbyHost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio-ft://activeHost:19998/LICENSE2")

Data Locality

If Spark task locality is ANY while it should be NODE_LOCAL, it is probably because Alluxio and Spark use different network address representations; perhaps one of them uses hostname while the other uses IP address. Please refer to this JIRA ticket for more details, where you can find solutions from the Spark community.

Note: Alluxio uses hostname to represent network addresses, except in version 0.7.1 where IP address is used. Spark v1.5.x ships with Alluxio v0.7.1 by default; in this case, Spark and Alluxio both use IP addresses by default, so data locality should work out of the box. But since release 0.8.0, to be consistent with HDFS, Alluxio represents network addresses by hostname. There is a workaround when launching Spark to achieve data locality: users can explicitly specify hostnames using the following script offered by Spark. Start a Spark Worker on each slave node with slave-hostname:

$ $SPARK_HOME/sbin/start-slave.sh -h <slave-hostname> <spark master uri>

For example:

$ $SPARK_HOME/sbin/start-slave.sh -h simple30 spark://simple27:7077

You can also set the SPARK_LOCAL_HOSTNAME in $SPARK_HOME/conf/spark-env.sh to achieve this. For example:

SPARK_LOCAL_HOSTNAME=simple30

Either way, the Spark Worker addresses become hostnames and the Locality Level becomes NODE_LOCAL, as shown in the Spark WebUI below.

(Screenshots: Spark WebUI showing the Worker hostnames and the NODE_LOCAL locality level.)

Running Hadoop MapReduce on Alluxio

This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.

Initial Setup

The prerequisite for this part is that you have Java installed. We also assume that you have set up Alluxio and Hadoop in accordance with these guides: Local Mode or Cluster Mode. In order to run some simple MapReduce examples, we also recommend downloading the MapReduce examples jar, or, if you are using Hadoop 1, this examples jar.

Compiling the Alluxio Client

In order to use Alluxio with your version of Hadoop, you will have to re-compile the Alluxio client jar, specifying your Hadoop version. You can do this by running the following in your Alluxio directory:

$ mvn install -Dhadoop.version=<YOUR_HADOOP_VERSION> -DskipTests

Many different distributions of Hadoop are supported for <YOUR_HADOOP_VERSION>. For example, mvn install -Dhadoop.version=2.7.1 -DskipTests would compile Alluxio for Apache Hadoop version 2.7.1. Please visit the Building Alluxio Master Branch page for more information about support for other distributions.

After the compilation succeeds, the new Alluxio client jar can be found at:

core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar

This is the jar that you should use for the rest of this guide.

Configuring Hadoop

You need to add the following three properties to core-site.xml file in your Hadoop installation conf directory:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
<property>
  <name>fs.AbstractFileSystem.alluxio.impl</name>
  <value>alluxio.hadoop.AlluxioFileSystem</value>
  <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>

This will allow your MapReduce jobs to use Alluxio for their input and output files. If you are using HDFS as the under storage system for Alluxio, it may be necessary to add these properties to the hdfs-site.xml file as well.

In order for the Alluxio client jar to be available to the JobClient, you can modify HADOOP_CLASSPATH by changing hadoop-env.sh to:

$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:${HADOOP_CLASSPATH}

This allows the code that creates and submits the Job to use URIs with the Alluxio scheme.

Distributing the Alluxio Client Jar

In order for the MapReduce job to be able to read and write files in Alluxio, the Alluxio client jar must be distributed to all the nodes in the cluster. This allows the TaskTracker and JobClient to have all the requisite executables to interface with Alluxio.

This guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way to distribute the client jar is to manually distribute it to all the Hadoop nodes. Below are instructions for the two main alternatives:

1. Using the -libjars command line option. You can run a job by using the -libjars command line option when using hadoop jar ..., specifying /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar as the argument. This will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:

$ hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar <INPUT FILES> <OUTPUT DIRECTORY>

2. Distributing the jars to all nodes manually. To install Alluxio on each node, place the client jar alluxio-core-client-1.0.1-jar-with-dependencies.jar (located in the /<PATH_TO_ALLUXIO>/core/client/target/ directory) in the $HADOOP_HOME/lib directory (which may be $HADOOP_HOME/share/hadoop/common/lib for different versions of Hadoop) of every MapReduce node, and then restart all of the TaskTrackers. One caveat of this approach is that the jars must be installed again for each update to a new release. On the other hand, once the jar is on every node, the -libjars command line option is no longer needed.

Running Hadoop wordcount with Alluxio Locally

First, compile Alluxio with the appropriate Hadoop version:

$ mvn clean install -Dhadoop.version=<YOUR_HADOOP_VERSION>

For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running:

$ cd $HADOOP_HOME
$ ./bin/stop-all.sh
$ ./bin/start-all.sh

Configure Alluxio to use the local HDFS cluster as its under storage system. You can do this by modifying conf/alluxio-env.sh to include:

export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000

Start Alluxio locally:

$ ./bin/alluxio-stop.sh all
$ ./bin/alluxio-start.sh local

You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:

$ ./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt

This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.

Now we can run a MapReduce job for wordcount.

$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output

After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:

$ ./bin/alluxio fs ls /wordcount/output
$ ./bin/alluxio fs cat /wordcount/output/part-r-00000
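Each line of part-r-00000 holds a word and its count. As an illustration only (MapReduce distributes this work across the cluster, and its output is word-first rather than count-first), the same counting can be sketched in plain shell:

```shell
# Split words onto separate lines, then count occurrences of each,
# similar in spirit to what the wordcount job computes.
printf 'alluxio hdfs alluxio\n' | tr ' ' '\n' | sort | uniq -c
# prints each unique word with its count, e.g. "2 alluxio" and "1 hdfs"
```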