Setting Up a Spark Development Environment

Article link: https://blog.csdn.net/flash8627/article/details/51636919

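1.1 Setting up and configuring the IDEA development environment

IDE: IntelliJ IDEA

Path: /usr/local/idea

Download page for all IDEA versions:

https://confluence.jetbrains.com/display/IntelliJIDEA/Previous+IntelliJ+IDEA+Releases

Version: 13.1.7

A desktop environment is required; see Chapter 2 for instructions.

IDEA Scala plugin download:

https://confluence.jetbrains.com/display/SCA/Scala+Plugin+for+IntelliJ+IDEA

http://plugins.jetbrains.com/plugin/?idea&id=1347

1.1.1 Creating a project:

If there is no Scala option, create a file first to activate it.

Once the project is created, its structure is generated automatically, as shown in the figure below:

If the Scala plugin is missing, create a new file and follow the prompts to set up the JDK and Scala. I closed the welcome page too hastily at the start, so I had no Configure entry point for setting up the JDK.

Theme settings:

Create a new Scala class

The resulting Scala class:

The code entered: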

/**
 * Created by root on 16-6-12.
 */
object firstScalaApp {
  def main(args: Array[String]): Unit = {
    print("20160612 HelloScala!!! ")
  }
}

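Right-click and run

After Run 'firstScalaApp', the output is printed directly:

You can also start the Scala console directly.

After Run Scala Console, you get the scala prompt:

Setting up the Spark development environment:

New project --> import libraries -->

Create a Scala class of kind object; this is the class that Spark will run.

Create the project FirstsparkApp

Select the class that contains the main method to run

The result after deletion:

Class source code: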

package com.spark.firstapp

import org.apache.spark.{SparkContext, SparkConf}
import scala.math.random

/**
 * Created by root on 16-6-12.
 */
object HelloSpark {
  def main(args: Array[String]): Unit = {
    // Alternative: build the context from a SparkConf instead.
    //val conf = new SparkConf().setAppName("Spark Pi").setMaster("spark://hadoop:7070").setJars(List("out\\artifacts\\sparkTest_jar\\sparkTest.jar"))
    //val spark = new SparkContext(conf)
    val spark = new SparkContext("spark://Master:7077", "Spark Pi", "/usr/local/spark/spark-1.6.1-bin-hadoop2.6", List("out\\artifacts\\sparkTest_jar\\sparkTest.jar"))
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Sample n random points in the square [-1, 1] x [-1, 1] and count those inside the unit circle.
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}

Next, build:

Go to the build output directory and check:

root@Master:~/IdeaProjects/FirstSparkApp/out/artifacts/FirstSparkApp_jar# cd /root/IdeaProjects/FirstSparkApp/out/artifacts/FirstSparkApp_jar

root@Master:~/IdeaProjects/FirstSparkApp/out/artifacts/FirstSparkApp_jar# ls

FirstSparkApp.jar

Run the program with spark-submit:

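1.1 Testing the IDEA environment

Develop the first Spark program. Open the Examples directory that ships with Spark:

root@Master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6/examples/src/main/scala/org/apache/spark/examples#

It contains many files; these are the examples that Spark provides.

Under src in our first Scala project, create a Scala object named SparkPi:

Now open the SparkPi file under Spark's bundled Examples: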

package main.scala.com.spark.firstapp

import scala.math.random
import org.apache.spark._

/**
 * Created by root on 16-6-19.
 */
object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Pi")
    conf.setMaster("spark://Master:7077")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    // Sample random points in the square [-1, 1] x [-1, 1] and count those inside the unit circle.
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}

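*** What was actually run is the first test program from the exercises.

If we select SparkPi and run it directly at this point, the following error appears:

The message shows that the Master machine on which the Spark program should run cannot be found.

So the execution environment for SparkPi needs to be configured:

Select "Edit Configurations" to open the configuration dialog:

In Program arguments, enter "local":

This setting means the program runs in local mode. Save the configuration.

Now run the program again.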
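
The SparkPi listing above sets its master with conf.setMaster and parses args(0) as the slice count, so a master passed through Program arguments needs explicit handling. The following is a minimal sketch of one way to do that. It assumes a convention that is not in the original: the first program argument is the master URL and the optional second one is the slice count. The names SparkPiWithMaster and parseArgs are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SparkPiWithMaster {
  // Hypothetical helper (assumed convention, not from the original post):
  // args(0) = master URL such as "local" or "spark://Master:7077",
  // args(1) = number of slices. Both fall back to local defaults when absent.
  def parseArgs(args: Array[String]): (String, Int) = {
    val master = if (args.length > 0) args(0) else "local[2]"
    val slices = if (args.length > 1) args(1).toInt else 2
    (master, slices)
  }

  def main(args: Array[String]): Unit = {
    val (master, slices) = parseArgs(args)
    val conf = new SparkConf().setAppName("Spark Pi").setMaster(master)
    val spark = new SparkContext(conf)
    try {
      val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
      // Count random points of the square [-1, 1] x [-1, 1] that fall inside the unit circle.
      val count = spark.parallelize(1 to n, slices).map { _ =>
        val x = Random.nextDouble() * 2 - 1
        val y = Random.nextDouble() * 2 - 1
        if (x * x + y * y < 1) 1 else 0
      }.reduce(_ + _)
      println("Pi is roughly " + 4.0 * count / n)
    } finally {
      spark.stop() // release the context even if the job fails
    }
  }
}
```

With this convention, entering local in Program arguments selects local mode, and entering spark://Master:7077 4 would target the standalone master with four slices.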

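1.1.1 Building the project source package

Add dependencies:

1.1.2 Packaging the project

Select HelloSpark.

Rename it to FirstSparkJar

Every machine already has Spark and Scala installed, so remove the redundant jars.

Build

Build success:

The compiled files:

1.1.3 Deploying the project

Submit the program with spark-submit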

root@Master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin# spark-submit

root@Master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin# cp /root/IdeaProjects/FirstSparkApp/out/artifacts/FirstSparkAppJar/FirstSparkApp.jar ./

root@Master:/usr/local/hadoop/input# hadoop fs -put SogouQ.sample /data/SogouQ.sample

root@Master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin# spark-submit --master spark://Master:7077 --class com.spark.firstapp.HelloSpark --executor-memory 1g ./FirstSparkApp.jar hdfs://Master:9000/data/SogouQ.sample hdfs://Master:9000/data/SogouResult

1.1.3.1 More options:

Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output
  --version,                  Print the version of current Spark

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

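Annotated summary of the main options:

Options:

  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Where the driver runs: "client" runs it on the local machine, "cluster" runs it inside the cluster
  --class CLASS_NAME          The class in the application package to run
  --name NAME                 Application name
  --jars JARS                 Comma-separated list of local jars to include on the driver and executor classpaths
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH of Python apps
  --files FILES               Comma-separated list of files to place in each executor's working directory
  --properties-file FILE      File from which application properties are loaded; defaults to conf/spark-defaults.conf
  --driver-memory MEM         Driver memory (Default: 1024M in the Spark 1.6.1 help output above; 512M in older releases)
  --driver-java-options       Extra Java options for the driver
  --driver-library-path       Extra library path entries to pass to the driver
  --driver-class-path         Driver class path; jars added with --jars are included automatically
  --executor-memory MEM       Memory per executor (Default: 1G)

Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver (Default: 1)
  --supervise                 If given, the driver is restarted on failure

Spark standalone and Mesos only:
  --total-executor-cores NUM  Total number of cores for all executors

YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1)
  --queue QUEUE_NAME          The YARN queue to submit the application to (Default: "default")
  --num-executors NUM         Number of executors to launch (Default: 2)
  --archives ARCHIVES         Comma-separated list of archives to be extracted into each executor's working directory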

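A few points about the spark-submit help output above deserve emphasis:

With options such as --master spark://host:port --deploy-mode cluster, the driver is submitted to the cluster, and the worker has then been observed to be killed.

If --properties-file is used, properties defined in that file do not have to be repeated on the spark-submit command line. For example, if conf/spark-defaults.conf defines spark.master, --master can be omitted. The precedence of Spark properties is: SparkConf > command-line arguments > properties file. See the Spark 1.0.0 property configuration documentation for details.

Unlike earlier versions, Spark 1.0.0 automatically ships its own jar and the jars listed in --jars to the cluster.

Spark uses the following URI schemes to distribute files:

file:// With file:// and an absolute path, the file is served by the driver's HTTP server, and each executor pulls it from the driver.
hdfs:, http:, https:, ftp: Each executor pulls the file directly from the URL.
local: The file already exists locally on each executor and does not need to be pulled; it can also be a file shared over NFS.

To see where configuration options come from, turn on --verbose to get more detailed run information.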
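
The precedence rule above (SparkConf > command-line arguments > properties file) can be pictured as three layers of key/value pairs in which later layers override earlier ones. The sketch below models this with plain Scala maps. It is only an illustration of the ordering, not Spark's actual resolution code, and the object name, function name, and all property values are made up for the example.

```scala
object PropertyPrecedence {
  // Layers are merged left to right, so the right-hand side of each ++ wins:
  // properties file < command line < SparkConf set in code.
  def resolve(fileProps: Map[String, String],
              cliProps: Map[String, String],
              confProps: Map[String, String]): Map[String, String] =
    fileProps ++ cliProps ++ confProps

  def main(args: Array[String]): Unit = {
    // Hypothetical values, chosen only to show which layer wins.
    val fromFile = Map("spark.master" -> "spark://Master:7077",
                       "spark.executor.memory" -> "1g")
    val fromCli  = Map("spark.executor.memory" -> "2g")
    val fromConf = Map("spark.master" -> "local[2]")

    resolve(fromFile, fromCli, fromConf).toSeq.sorted.foreach {
      case (key, value) => println(key + "=" + value)
    }
  }
}
```

In this model spark.master resolves to the value set in code and spark.executor.memory to the command-line value, which mirrors the ordering described in the text.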

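1.1.3.2 Source of the experiment data

The data used in the experiments comes from Sogou Labs. Download page: http://download.labs.sogou.com/resources.html?v=1

Internet corpus (SogouT)
Web search result evaluation (SogouE)
Link relationship database (SogouT-Link)
SogouRank database (SogouT-Rank)
User query logs (SogouQ)
Internet lexicon (SogouW)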

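1.2 Exercises

1.2.1 What is the difference between executing a transformation and executing an action on an RDD?

1. A transformation produces a new RDD. There are many ways to do so, for example creating a new RDD from a data source or deriving a new RDD from an existing one.

2. An action produces a value or a result, which is returned to the driver or written to external storage. All transformations are lazy: submitting a transformation alone does not run any computation; computation is triggered only when an action is submitted.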
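
The lazy behaviour described above can be observed directly. In the sketch below, a map whose function always fails is defined without any error, and the failure only surfaces when an action (count) forces the computation. This is an illustrative sketch that assumes a local master (local[2]); the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object LazyTransformationDemo {
  // Returns (did the action on the broken RDD fail?, sum of squares of 1..10).
  def demo(sc: SparkContext): (Boolean, Int) = {
    val numbers = sc.parallelize(1 to 10)

    // Transformation: only the lineage is recorded, so the division by zero
    // inside the function is not executed here.
    val broken = numbers.map(x => x / 0)

    // Action: the job runs now, and the failure appears only at this point.
    val actionFailed = Try(broken.count()).isFailure

    // A healthy pipeline: map is lazy, reduce is the action that triggers it.
    val sumOfSquares = numbers.map(x => x * x).reduce(_ + _)

    (actionFailed, sumOfSquares)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf().setAppName("LazyTransformationDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(demo(sc)) finally sc.stop()
  }
}
```

The failing job writes error messages to the log when count runs; that noise is expected here and is itself evidence that nothing was computed before the action.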

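1.2.2 Explain the difference between a narrow dependency and a wide dependency.

Explain it in terms of both computation and fault tolerance.

In Spark, each RDD represents a data set in a particular state, and that state may have been derived from an earlier one. In other words, an RDD may depend on one or more earlier RDDs. By the kind of dependency, these relationships fall into two types: Narrow Dependency and Wide Dependency.

A Narrow Dependency means that each partition of the child RDD depends on only a fixed number of partitions of the parent RDD(s).

A Wide Dependency means that each partition of the child RDD depends on all partitions of the parent RDD(s).

The difference is illustrated in the figure below:

Based on the RDD dependencies, Spark splits each job into stages, and the dependencies between stages form a DAG. For narrow dependencies, Spark places as many RDD transformations as possible in the same stage. A wide dependency usually implies a shuffle, so Spark defines that stage as a ShuffleMapStage in order to register the shuffle with the MapOutputTracker. For the division into stages, see the figure below; Spark generally treats shuffle operations as stage boundaries.

For fault tolerance, a lost partition behind a narrow dependency can be recomputed from just the parent partitions it depends on, whereas recovering a partition behind a wide dependency may require recomputing partitions across all of the parents.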
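
The two kinds of dependency can be inspected on real RDDs through the dependencies member. The sketch below builds a map (expected to be narrow) followed by a reduceByKey (expected to need a shuffle) and checks the dependency classes; toDebugString prints the lineage, in which the shuffle shows up as a separate indentation level. It assumes a local master and Spark 1.3 or later, where the pair-RDD implicits need no extra import; the object and method names are hypothetical.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyDemo {
  // Returns (map has a narrow dependency?, reduceByKey has a shuffle dependency?).
  def inspect(sc: SparkContext): (Boolean, Boolean) = {
    val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)

    // Narrow: partition i of pairs is computed from partition i of words only.
    val pairs = words.map(w => (w, 1))

    // Wide: records with equal keys must be brought together, which needs a shuffle.
    val counts = pairs.reduceByKey(_ + _)

    // Lineage of the final RDD, including the shuffle boundary.
    println(counts.toDebugString)

    (pairs.dependencies.head.isInstanceOf[NarrowDependency[_]],
     counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf().setAppName("DependencyDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(inspect(sc)) finally sc.stop()
  }
}
```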

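1.2.3 What is the default StorageLevel of RDD cache?

1) The RDD cache() method simply calls persist(); the default storage level of both is MEMORY_ONLY;

2) persist can be called with an explicit StorageLevel to get the storage level the project needs;

3) neither cache nor persist is an action;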
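
The three points above can be checked with getStorageLevel. The sketch below marks one RDD with cache() and another with an explicit persist level, then reads the levels back; no job runs until the final count. It assumes a local master, and the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLevelDemo {
  // Returns the storage levels recorded for a cache()d RDD and an explicitly persisted RDD.
  def levels(sc: SparkContext): (StorageLevel, StorageLevel) = {
    // cache() only marks the RDD; it is shorthand for persist() with the default level.
    val cached = sc.parallelize(1 to 100).cache()

    // persist with an explicit level, here memory first and disk as overflow.
    val spilled = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)

    // Neither call above ran a job. The first action materializes the cached data.
    cached.count()

    (cached.getStorageLevel, spilled.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf().setAppName("CacheLevelDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (cachedLevel, spilledLevel) = levels(sc)
      println(cachedLevel == StorageLevel.MEMORY_ONLY)
      println(spilledLevel == StorageLevel.MEMORY_AND_DISK)
    } finally sc.stop()
  }
}
```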

Which language APIs does Spark currently support?

Scala API / Java API / Python API. Spark 1.4 and later also ship an R API (SparkR).

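1.2.4 Practice

Download the reduced version of the Sogou Labs user query log from http://www.sogou.com/labs/dl/q.html (63M) and answer the following queries:

How many queries did users make between 00:00:00 and 12:00:00?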

package cn.chinahadoop.scala

import org.apache.spark.{SparkContext, SparkConf}

object SogouA {
  def main(args: Array[String]): Unit = {
    // Both the input path (args(0)) and the output path (args(1)) are required.
    if (args.length < 2) {
      System.err.println("Usage: SogouA <input> <output>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SogouA")
    val sc = new SparkContext(conf)
    val sgRDD = sc.textFile(args(0))
    // Keep the time field (first tab-separated column) of records in the window and save it.
    sgRDD.map(_.split('\t')(0)).filter(x => x >= "00:00:00" && x <= "12:00:00").saveAsTextFile(args(1))
    sc.stop()
  }
}

Client run command:

./spark-submit \
  --master spark://SparkMaster:7077 \
  --name chinahadoop \
  --class cn.chinahadoop.scala.SogouA \
  /home/chinahadoop.jar \
  hdfs://SparkMaster:9000/data/SogouQ.reduced \
  hdfs://SparkMaster:9000/data/a
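
The SogouA filter relies on the fact that zero-padded HH:mm:ss strings sort lexicographically in time order, so plain string comparison is enough. The sketch below shows the same predicate on an ordinary Scala collection and returns a count. The sample lines are made up for illustration and only mimic a tab-separated layout with the time in the first field; the object and function names are hypothetical.

```scala
object TimeWindowFilter {
  // Zero-padded HH:mm:ss strings compare in time order, so no date parsing is needed.
  def inWindow(time: String): Boolean =
    time >= "00:00:00" && time <= "12:00:00"

  // Count the lines whose first tab-separated field falls inside the window.
  def countInWindow(lines: Seq[String]): Int =
    lines.map(_.split('\t')(0)).count(inWindow)

  def main(args: Array[String]): Unit = {
    // Made-up records: time, user id, query, "rank order", url.
    val sample = Seq(
      "00:00:01\tuser1\t[query a]\t1 1\texample.com/a",
      "11:59:59\tuser2\t[query b]\t1 2\texample.com/b",
      "12:00:01\tuser3\t[query c]\t3 1\texample.com/c")
    println(countInWindow(sample))
  }
}
```

On an RDD, the same predicate can be passed to filter and followed by count() to obtain the number of matching queries directly instead of saving them to a file.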

1.2.4.1 How many records rank first in the search results but second in click order?

package cn.chinahadoop.scala

import org.apache.spark.{SparkContext, SparkConf}

object SogouB {
  def main(args: Array[String]): Unit = {
    if (args.length == 0) {
      System.err.println("Usage: SogouB <file1>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SogouB")
    val sc = new SparkContext(conf)
    val sgRDD = sc.textFile(args(0))
    // Field 3 holds "rank order" separated by a space; keep rank 1 with click order 2.
    println(sgRDD.map(_.split('\t'))
      .filter(_.length == 5)
      .map(_(3).split(' '))
      .filter(_(0).toInt == 1)
      .filter(_(1).toInt == 2)
      .count)
    sc.stop()
  }
}

Client run command: same as above
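
The core of SogouB is the test on the fourth tab-separated field, which holds the result rank and the click order separated by a space. The sketch below isolates that test as a plain function so it can be tried without a cluster. It compares the two tokens as strings, which sidesteps number parsing on malformed rows; that choice, the sample lines, and the names are illustrative assumptions rather than the original author's code.

```scala
object RankOrderFilter {
  // True when the line has 5 tab-separated fields and field 3 is "<rank> <order>"
  // with rank 1 and click order 2.
  def isRank1Click2(line: String): Boolean = {
    val fields = line.split('\t')
    if (fields.length != 5) false
    else {
      val rankAndOrder = fields(3).trim.split(' ')
      rankAndOrder.length == 2 && rankAndOrder(0) == "1" && rankAndOrder(1) == "2"
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up records: time, user id, query, "rank order", url.
    val sample = Seq(
      "00:00:01\tuser1\t[query a]\t1 1\texample.com/a",
      "11:59:59\tuser2\t[query b]\t1 2\texample.com/b",
      "12:00:01\tuser3\t[query c]\t2 1\texample.com/c",
      "malformed line without tabs")
    println(sample.count(isRank1Click2))
  }
}
```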

1.2.4.2 Which user sessions issued the most queries, and how many queries did each of them issue?

package cn.chinahadoop.scala

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object SogouC {
  def main(args: Array[String]): Unit = {
    if (args.length == 0) {
      System.err.println("Usage: SogouC <file1>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SogouC")
    val sc = new SparkContext(conf)
    val sgRDD = sc.textFile(args(0))
    // Count queries per session id (field 1), sort by count descending, print the top 10.
    sgRDD.map(_.split('\t'))
      .filter(_.length == 5)
      .map(x => (x(1), 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1))
      .sortByKey(false)
      .map(x => (x._2, x._1))
      .take(10)
      .foreach(println)
    sc.stop()
  }
}

Client run command: same as above
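
The SogouC pipeline is a count-per-key followed by a descending sort and a take. The sketch below reproduces that shape on a plain Scala collection so the logic can be checked on a few lines of data. The sample records are made up, the tie-break on session id is an added assumption to keep the output deterministic, and the names are hypothetical.

```scala
object TopSessions {
  // Count records per session id (second tab-separated field) and return the
  // n largest counts, highest first; ties are broken by session id.
  def topSessions(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines.map(_.split('\t'))
      .filter(_.length == 5)
      .groupBy(fields => fields(1))
      .map { case (sessionId, rows) => (sessionId, rows.size) }
      .toSeq
      .sortBy { case (sessionId, count) => (-count, sessionId) }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up records: time, session id, query, "rank order", url.
    val sample = Seq(
      "00:00:01\tuser1\t[a]\t1 1\texample.com/a",
      "00:00:02\tuser1\t[b]\t1 1\texample.com/b",
      "00:00:03\tuser1\t[c]\t2 1\texample.com/c",
      "00:00:04\tuser2\t[d]\t1 1\texample.com/d",
      "00:00:05\tuser3\t[e]\t1 1\texample.com/e",
      "00:00:06\tuser3\t[f]\t1 2\texample.com/f")
    topSessions(sample, 2).foreach(println)
  }
}
```

groupBy plays the role that reduceByKey plays on an RDD; on a cluster, reduceByKey is the better fit because it combines counts on each partition before the shuffle.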

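The images were not carried over when this post was copied. If you need them, leave a comment, or join the QQ group "大数据交流" (Big Data Exchange), 208881891.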
