Spark 1.6.0 Official Documentation Translation 01: Spark Overview

http://spark.apache.org/docs/latest/

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Downloading

Get Spark from the downloads page of the project website. This documentation is for Spark version 1.6.0. Spark uses Hadoop’s client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions. Users can also download a “Hadoop free” binary and run Spark with any Hadoop version by augmenting Spark’s classpath.


Using Spark's "Hadoop Free" Build (http://spark.apache.org/docs/latest/hadoop-provided.html)

Spark uses Hadoop client libraries for HDFS and YARN. Starting in Spark 1.4, the project packages “Hadoop free” builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop’s package jars. The most convenient way to do this is to add an entry in conf/spark-env.sh.

This page describes how to connect Spark to Hadoop for different types of distributions.

Apache Hadoop

For Apache distributions, you can use Hadoop’s ‘classpath’ command. For instance:

### in conf/spark-env.sh ### Three ways to specify the Hadoop configuration

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
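
The entries above assume the hadoop command can be found. As a minimal sketch, the lookup can be guarded so that SPARK_DIST_CLASSPATH is only exported when the command actually resolves. The helper name resolve_hadoop_classpath is hypothetical and is not part of Spark or Hadoop; only the 'hadoop classpath' call comes from the text above.

```shell
# Sketch of a guarded entry for conf/spark-env.sh.
# resolve_hadoop_classpath is a hypothetical helper, not part of Spark or Hadoop.
resolve_hadoop_classpath() {
  # $1: optional path to the hadoop binary; defaults to 'hadoop' looked up on PATH
  hadoop_bin="${1:-hadoop}"
  if command -v "$hadoop_bin" >/dev/null 2>&1; then
    "$hadoop_bin" classpath
  else
    echo "hadoop binary not found: $hadoop_bin" >&2
    return 1
  fi
}

# Export the variable only when the lookup succeeded.
if hadoop_cp="$(resolve_hadoop_classpath)"; then
  export SPARK_DIST_CLASSPATH="$hadoop_cp"
fi
```

Passing an explicit path, for example resolve_hadoop_classpath /path/to/hadoop/bin/hadoop, covers the second variant shown above.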

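You can download a build that bundles the Hadoop libraries; such a build can use HDFS and YARN, but it is tied to a specific Hadoop version. Alternatively, you can download a build without pre-packaged Hadoop, which works with any Hadoop version as long as you supply that version's classpath when running Spark.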

If you’d like to build Spark from source, visit Building Spark.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
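
A quick way to confirm this prerequisite before launching anything is sketched below. It is only an illustrative pre-flight check under the assumption of a UNIX-like shell: find_java is a hypothetical helper, not a Spark script, and the order in which it looks (JAVA_HOME first, then PATH) is a choice made for this sketch.

```shell
# Hypothetical pre-flight check; find_java is not a Spark script or function.
find_java() {
  if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    # JAVA_HOME points at a Java installation
    echo "$JAVA_HOME/bin/java"
  elif command -v java >/dev/null 2>&1; then
    # fall back to whatever 'java' resolves to on PATH
    command -v java
  else
    return 1
  fi
}

# Print the Java binary that was found, or a hint when none was found.
find_java || echo "No Java found: install Java or set JAVA_HOME" >&2
```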

Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).

Running the Examples and Shell

Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory, where <class> is the class of the program to run and [params] is the list of arguments it needs. (Behind the scenes, this invokes the more general spark-submit script for launching applications.) For example:

./bin/run-example SparkPi 10

You can also run Spark interactively through a modified version of the Scala shell, spark-shell. This is a great way to learn the framework.

./bin/spark-shell --master local[2]

Here --master specifies the master URL; local[2] selects local mode with two threads.

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.
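
The three forms of the --master value described above can be told apart with a small shell function. This is only an illustrative sketch: describe_master is a hypothetical helper, not part of Spark, and spark://host:7077 is a placeholder standing in for a real cluster master URL.

```shell
# describe_master is a hypothetical helper for illustration; it is not part of Spark.
describe_master() {
  case "$1" in
    local)
      echo "local mode, 1 thread" ;;
    local\[*\])
      # strip the 'local[' prefix and the ']' suffix to get N
      n="${1#local\[}"
      n="${n%\]}"
      echo "local mode, $n threads" ;;
    *)
      # anything else is treated as the master URL of a cluster
      echo "cluster master URL: $1" ;;
  esac
}

describe_master "local[2]"            # prints: local mode, 2 threads
describe_master "spark://host:7077"   # placeholder host and port
```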

Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark:

./bin/pyspark --master local[2]

Example applications are also provided in Python. For example:

./bin/spark-submit examples/src/main/python/pi.py 10

Spark also provides an experimental R API since 1.4 (only DataFrames APIs included). To run Spark interactively in an R interpreter, use bin/sparkR:

./bin/sparkR --master local[2]

Example applications are also provided in R. For example:

./bin/spark-submit examples/src/main/r/dataframe.R

Launching on a Cluster

The Spark cluster mode overview explains the key concepts in running on a cluster. Spark can run either by itself or on top of several existing cluster managers. It currently provides several options for deployment: Amazon EC2, standalone mode, Mesos, and Hadoop YARN. The choice mainly determines how resources are scheduled for Spark at run time.

Where to Go from Here

Programming Guides:

API Docs:

Deployment Guides:

Other Documents:

External Resources:
