# Initial Steps

```bash
$ mvn clean package
```

Alternatively, you can specify the Hadoop version on the command line when compiling with Maven. For example, for Hadoop HDFS 2.6.0:

```bash
$ mvn -Dhadoop.version=2.6.0 clean package
```


```bash
$ cp conf/alluxio-env.sh.template conf/alluxio-env.sh
```

Then edit alluxio-env.sh and set the under storage address to the address of the HDFS namenode (for example, hdfs://localhost:9000 if your HDFS namenode is running locally on the default port):

```bash
export ALLUXIO_UNDERFS_ADDRESS=hdfs://NAMENODE:PORT
```

# Running Alluxio Locally with HDFS

After the configuration is complete, you can start Alluxio locally and check that everything runs correctly:

```bash
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local
```

This should start one Alluxio master and one Alluxio worker; you can visit http://localhost:19999 in your browser to see the master Web UI. Next, you can run a simple example program:

```bash
$ ./bin/alluxio runTests
```


When you are done, stop Alluxio:

```bash
$ ./bin/alluxio-stop.sh all
```

# Running Spark on Alluxio

This guide describes how to run Apache Spark on Alluxio. HDFS is used as an example of a distributed under storage system. Note that Alluxio supports many other under storage systems in addition to HDFS, and enables frameworks like Spark to read data from and write data to any of those systems.

## Compatibility

Alluxio works together with Spark 1.1 or later out-of-the-box.

## Prerequisites

### General Setup

- The Alluxio cluster has been set up in accordance with these guides for either Local Mode or Cluster Mode.
- The Alluxio client will need to be compiled with the Spark-specific profile. Build the entire project from the top-level alluxio directory with the following command:

  ```bash
  mvn clean package -Pspark -DskipTests
  ```

- Add the following line to spark/conf/spark-env.sh:

  ```bash
  export SPARK_CLASSPATH=/pathToAlluxio/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:$SPARK_CLASSPATH
  ```


- If Alluxio is run on top of a Hadoop 1.x cluster, create a new file spark/conf/core-site.xml with the following content:

  ```xml
  <configuration>
    <property>
      <name>fs.alluxio.impl</name>
      <value>alluxio.hadoop.FileSystem</value>
    </property>
  </configuration>
  ```

- If you are running Alluxio in fault tolerant mode with ZooKeeper and the Hadoop cluster is 1.x, add the following additional entry to the previously created spark/conf/core-site.xml:

  ```xml
  <property>
    <name>fs.alluxio-ft.impl</name>
    <value>alluxio.hadoop.FaultTolerantFileSystem</value>
  </property>
  ```


  and the following lines to spark/conf/spark-env.sh (the alluxio.zookeeper.address entry, pointing at your ZooKeeper ensemble, is also required in ZooKeeper mode; adjust the placeholder to your deployment):

  ```bash
  export SPARK_JAVA_OPTS="
    -Dalluxio.zookeeper.enabled=true
    -Dalluxio.zookeeper.address=[zookeeper_hostname]:2181
    $SPARK_JAVA_OPTS
  "
  ```

## Use Alluxio as Input and Output

This section shows how to use Alluxio as input and output sources for your Spark applications.

### Use Data Already in Alluxio

First, we will copy some local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:

```bash
$ bin/alluxio fs copyFromLocal LICENSE /LICENSE
```


Run the following commands from spark-shell, assuming the Alluxio Master is running on localhost:

```scala
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
```


Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE.
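The same flow can also run as a standalone application instead of from spark-shell. Below is a minimal Scala sketch; the object name is illustrative, and it assumes the Alluxio client jar is on the Spark classpath as configured above and that the master runs on localhost:19998.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: the spark-shell example above as a standalone Spark application.
// Assumes the Alluxio client jar is on the classpath (see spark-env.sh above).
object AlluxioExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AlluxioExample")
    val sc = new SparkContext(conf)

    // Read through the alluxio:// scheme, double each line, write back to Alluxio.
    val s = sc.textFile("alluxio://localhost:19998/LICENSE")
    val double = s.map(line => line + line)
    double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")

    sc.stop()
  }
}
```

Submit it with spark-submit as you would any other Spark application.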

### Use Data from HDFS

Alluxio supports transparently fetching the data from the under storage system, given the exact path. Put a file LICENSE into HDFS, assuming the namenode is running on localhost and the Alluxio project directory is /alluxio:

```bash
$ hadoop fs -put -f /alluxio/LICENSE hdfs://localhost:9000/LICENSE
```

Note that Alluxio has no notion of this file yet; you can verify this by going to the web UI. Run the following commands from spark-shell, assuming the Alluxio Master is running on localhost:

```scala
> val s = sc.textFile("alluxio://localhost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
```

Open your browser and check http://localhost:19999/browse. There should be an output file LICENSE2 which doubles each line in the file LICENSE. Also, the LICENSE file now appears in the Alluxio file system space.

NOTE: It is possible that the LICENSE file is not in Alluxio storage (Not In-Memory). This is because Alluxio only stores fully read blocks, and if the file is too small, the Spark job will have each executor read a partial block. To avoid this behavior, you can specify the partition count in Spark. For this example, we would set it to 1, as there is only 1 block:

```scala
> val s = sc.textFile("alluxio://localhost:19998/LICENSE", 1)
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")
```

### Using Fault Tolerant Mode

When running Alluxio in fault tolerant mode, you can point to any Alluxio master:

```scala
> val s = sc.textFile("alluxio-ft://standbyHost:19998/LICENSE")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio-ft://activeHost:19998/LICENSE2")
```

## Data Locality

If Spark task locality is ANY when it should be NODE_LOCAL, it is probably because Alluxio and Spark use different network address representations: one of them may use a hostname while the other uses an IP address. Please refer to this JIRA ticket for more details, where you can find solutions from the Spark community.

Note: Alluxio uses hostnames to represent network addresses, except in version 0.7.1, where IP addresses are used. Spark v1.5.x ships with Alluxio v0.7.1 by default; in this case, Spark and Alluxio both use IP addresses by default, so data locality should work out of the box. But since release 0.8.0, to be consistent with HDFS, Alluxio represents network addresses by hostname. There is a workaround when launching Spark to achieve data locality: users can explicitly specify hostnames by using the following script offered by Spark. Start the Spark Worker on each slave node with slave-hostname:

```bash
$ $SPARK_HOME/sbin/start-slave.sh -h <slave-hostname> <spark master uri>
```

For example:

```bash
$ $SPARK_HOME/sbin/start-slave.sh -h simple30 spark://simple27:7077
```

You can also set SPARK_LOCAL_HOSTNAME in $SPARK_HOME/conf/spark-env.sh to achieve this. For example:

```bash
SPARK_LOCAL_HOSTNAME=simple30
```


Either way, the Spark Worker addresses become hostnames and the Locality Level becomes NODE_LOCAL, as shown in the Spark Web UI.
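If you need to check which address representation Spark has registered, one quick way from spark-shell is to print the executor addresses; getExecutorMemoryStatus is keyed by executor address in Spark 1.x:

```scala
// Print the addresses Spark registered for its executors ("host:port" keys).
// If these are IP addresses while the Alluxio workers report hostnames (or
// vice versa), task locality degrades from NODE_LOCAL to ANY.
sc.getExecutorMemoryStatus.keys.foreach(println)
```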

# Running Hadoop MapReduce on Alluxio

This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.

# Initial Setup

The prerequisites for this part are that you have Java, and that you have set up Alluxio and Hadoop in accordance with these guides for either Local Mode or Cluster Mode. In order to run some simple map-reduce examples, we also recommend you download the map-reduce examples jar, or, if you are using Hadoop 1, this examples jar.

# Compiling the Alluxio Client

In order to use Alluxio with your version of Hadoop, you will have to re-compile the Alluxio client jar, specifying your Hadoop version. You can do this by running the following in your Alluxio directory:

```bash
$ mvn install -Dhadoop.version=<YOUR_HADOOP_VERSION> -DskipTests
```

The <YOUR_HADOOP_VERSION> placeholder supports many different distributions of Hadoop. For example, mvn install -Dhadoop.version=2.7.1 -DskipTests would compile Alluxio for Apache Hadoop version 2.7.1. Please visit the Building Alluxio Master Branch page for more information about support for other distributions.

After the compilation succeeds, the new Alluxio client jar can be found at:

```
core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar
```

This is the jar that you should use for the rest of this guide.

# Configuring Hadoop

You need to add the following three properties to the core-site.xml file in your Hadoop installation's conf directory:

```xml
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
<property>
  <name>fs.AbstractFileSystem.alluxio.impl</name>
  <value>alluxio.hadoop.AlluxioFileSystem</value>
  <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
```

This will allow your MapReduce jobs to use Alluxio for their input and output files. If you are using HDFS as the under storage system for Alluxio, it may be necessary to add these properties to the hdfs-site.xml file as well.

In order for the Alluxio client jar to be available to the JobClient, you can modify HADOOP_CLASSPATH by changing hadoop-env.sh to:

```bash
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar:${HADOOP_CLASSPATH}
```

This allows the code that creates and submits the Job to use URIs with the Alluxio scheme; a short verification sketch follows the distribution instructions below.

# Distributing the Alluxio Client Jar

In order for the MapReduce job to be able to read and write files in Alluxio, the Alluxio client jar must be distributed to all the nodes in the cluster. This allows the TaskTracker and JobClient to have all the requisite classes to interface with Alluxio.

This guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way is to manually distribute the client jar to all the Hadoop nodes. Below are instructions for the two main alternatives:

1. Using the -libjars command line option. You can run a job by using the -libjars command line option when using hadoop jar ..., specifying /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar as the argument. This will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:

```bash
$ hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar <INPUT FILES> <OUTPUT DIRECTORY>
```


2. Distributing the jar to all nodes manually. To install Alluxio on each node, place the client jar alluxio-core-client-1.0.1-jar-with-dependencies.jar (located in the /<PATH_TO_ALLUXIO>/core/client/target/ directory) in the $HADOOP_HOME/lib directory (which may be $HADOOP_HOME/share/hadoop/common/lib for other versions of Hadoop) of every MapReduce node, and then restart all of the TaskTrackers. One caveat of this approach is that the jars must be installed again for each update to a new release. On the other hand, when the jar is already on every node, the -libjars command line option is not needed.
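With the three core-site.xml properties in place and the client jar on the classpath, the alluxio:// scheme resolves through the standard Hadoop FileSystem API. The Scala sketch below illustrates this; the object name and path are illustrative assumptions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: verify that the alluxio:// scheme resolves through the Hadoop
// FileSystem API. Assumes core-site.xml (with fs.alluxio.impl set) and the
// Alluxio client jar are on the classpath; the path is illustrative.
object AlluxioSchemeCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up core-site.xml from the classpath
    val path = new Path("alluxio://localhost:19998/wordcount/input.txt")
    val fs = FileSystem.get(path.toUri, conf) // resolved via fs.alluxio.impl
    println(s"Scheme resolved to: ${fs.getClass.getName}")
    println(s"Path exists: ${fs.exists(path)}")
  }
}
```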

# Running Hadoop wordcount with Alluxio Locally

First, compile Alluxio with the appropriate Hadoop version:

```bash
$ mvn clean install -Dhadoop.version=<YOUR_HADOOP_VERSION>
```

For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running:

```bash
$ cd $HADOOP_HOME
$ ./bin/stop-all.sh
$ ./bin/start-all.sh
```

Configure Alluxio to use the local HDFS cluster as its under storage system. You can do this by modifying conf/alluxio-env.sh to include:

```bash
export ALLUXIO_UNDERFS_ADDRESS=hdfs://localhost:9000
```

Start Alluxio locally:

```bash
$ ./bin/alluxio-stop.sh all
$ ./bin/alluxio-start.sh local
```

You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:

```bash
$ ./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
```


This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.

Now we can run a MapReduce job for wordcount.

```bash
$ bin/hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/core/client/target/alluxio-core-client-1.0.1-jar-with-dependencies.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output
```

After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:

```bash
$ ./bin/alluxio fs ls /wordcount/output
$ ./bin/alluxio fs cat /wordcount/output/part-r-00000
```
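If you prefer to inspect the result programmatically rather than with the Alluxio CLI, here is a small Scala sketch using the Hadoop FileSystem API; it assumes the same classpath setup as above, and the paths mirror this example run.

```scala
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: read the wordcount result back through the alluxio:// scheme.
object ReadWordcountOutput {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val out = new Path("alluxio://localhost:19998/wordcount/output/part-r-00000")
    val fs = FileSystem.get(out.toUri, conf)
    val in = fs.open(out)
    // Print the first few "<word>\t<count>" lines of the reducer output.
    Source.fromInputStream(in).getLines().take(10).foreach(println)
    in.close()
  }
}
```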