Basic Installation
Official documentation
http://spark.apache.org/docs/2.3.0/quick-start.html
Extract the archive and create a symlink
tar -xzvf spark-2.3.0-bin-hadoop2.6.tgz
ln -s spark-2.3.0-bin-hadoop2.6 spark
Spark shell
[root@cdh1 bin]# ./spark-shell
scala> val line = sc.textFile("/root/app/test")
line: org.apache.spark.rdd.RDD[String] = /root/app/test MapPartitionsRDD[1] at textFile at <console>:24
scala> line.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
(spark,3)
(hadoop,3)
scala> :quit
# _.split(" "): split each line on spaces; _ stands for the current line
# (_,1): pair each word with a count of 1; _ stands for the current word
# reduceByKey(_+_): for each word, add the counts together pairwise
# collect(): an action that gathers the results back to the driver
# foreach: print each (word, count) pair to the console
Spark's bundled example programs
[root@cdh1 bin]# ./run-example SparkPi
Pi is roughly 3.140955704778524
Java version of WordCount
pom
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.1</version>
            <executions>
                <!-- Run shade goal on package phase -->
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <!-- add Main-Class to manifest file -->
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.pcitc.sparkbegin.MyJavaWordCount</mainClass>
                            </transformer>
                        </transformers>
                        <createDependencyReducedPom>false</createDependencyReducedPom>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Fixing slow jar downloads
In the Maven installation directory, open settings.xml and add the Aliyun repository mirrors inside the <mirrors> tag
<mirror>
    <id>nexus</id>
    <mirrorOf>*</mirrorOf>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>
<mirror>
    <id>nexus-public-snapshots</id>
    <mirrorOf>public-snapshots</mirrorOf>
    <url>http://maven.aliyun.com/nexus/content/repositories/snapshots/</url>
</mirror>
Fixing Eclipse failing to reopen after being force-closed during a slow pom download
Delete D:\esplisegit\.metadata\.plugins\org.eclipse.e4.workbench\workbench.xmi
MyJavaWordCount
import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class MyJavaWordCount {
    public static void main(String[] args) {
        // check arguments
        if (args.length < 2) {
            System.err.println("Usage: MyJavaWordCount <input> <output>");
            System.exit(1);
        }
        // input path
        String inputPath = args[0];
        // output path
        String outputPath = args[1];
        // create the SparkContext
        SparkConf conf = new SparkConf().setAppName("MyJavaWordCount");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // read the data
        JavaRDD<String> inputRDD = jsc.textFile(inputPath);
        // flatMap: split each line into words on one or more whitespace characters
        JavaRDD<String> words = inputRDD.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split("\\s+")).iterator();
            }
        });
        // map: pair each word with a count of 1
        JavaPairRDD<String, Integer> pairRDD = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        // reduce: sum the counts for each key
        JavaPairRDD<String, Integer> result = pairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            public Integer call(Integer x, Integer y) throws Exception {
                return x + y;
            }
        });
        // save the result
        result.saveAsTextFile(outputPath);
        // close the context
        jsc.close();
    }
}
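The same flatMap / pair / reduce-by-key logic can be traced on plain Java 8 Streams without a cluster, which is handy for checking the word-count logic before submitting to Spark. This is only an illustrative sketch (the sample input is made up); groupingBy + counting plays the role of mapToPair + reduceByKey:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        // three lines, mirroring the (spark,3) / (hadoop,3) result above
        List<String> lines = Arrays.asList("spark hadoop", "spark hadoop", "spark hadoop");
        Map<String, Long> counts = lines.stream()
                // like the Spark flatMap step: split every line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // groupingBy + counting: group equal words and count them
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        counts.forEach((word, n) -> System.out.println("(" + word + "," + n + ")"));
    }
}
```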
Running WordCount locally in Eclipse
Program arguments:
D:\spark\file\test.txt D:\spark\file\output
VM arguments:
-Dspark.master=local
Package the project
Submit to Spark
[root@cdh1 bin]# ./spark-submit --master local[2] --class com.pcitc.sparkbegin.MyJavaWordCount /root/app/sparktest/javawordcount.jar /root/app/sparktest/test.txt /root/app/sparktest/out
[root@cdh1 sparktest]# ls
javawordcount.jar out test.txt
[root@cdh1 sparktest]# cd out
[root@cdh1 out]# ls
part-00000 part-00001 _SUCCESS
[root@cdh1 out]# cat ./*
(spark,3)
(hadoop,3)
# local[2] in --master local[2]: run Spark locally with 2 threads
Scala version of WordCount
Installing the Scala IDE plugin
Check the Eclipse version
Version: Mars.1 Release (4.5.1)
Build id: 20150924-1200
Download the plugin release matching Eclipse 4.5
http://scala-ide.org/download/prev-stable.html
Unzip the archive and copy its folders into the matching directory
Create a scala folder under the target directory [D:\esplisegit\eclipse\dropins\]:
Copy the two directories from the archive into D:\esplisegit\eclipse\dropins\scala
Restart Eclipse
After a successful install, Scala entries appear in the New menu
Create a Scala project
Add Scala Nature
First create a regular Maven project, then choose Configure -> Add Scala Nature
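If the Scala project is built with the same pom as the Java one, it also needs the Scala library matching Spark's Scala version (2.11 for spark-core_2.11; 2.11.8 matches the runtime shown below). This dependency is a typical setup, not taken from the original notes:

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>

Compiling Scala sources with mvn package would additionally need a Scala compiler plugin such as scala-maven-plugin; inside Eclipse, Add Scala Nature handles the build.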
New Scala Object
Scala Object
object test {
def main(args: Array[String]): Unit = {
println("hello,scala")
}
}
Run As -> Scala Application to run locally
Run on Linux
[root@cdh1 bin]# ./scala -version
Scala code runner version 2.11.8
[root@cdh1 bin]# ./scala -cp /root/app/sparktest/sparkscalatest-0.0.1-SNAPSHOT.jar com.pcitc.scala.sparkscalatest.test
hello,scala
MyScalaWordCout
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object MyScalaWordCout {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: MyScalaWordCout <input> <output>")
      System.exit(1)
    }
    // input path
    val input = args(0)
    // output path
    val output = args(1)
    val conf = new SparkConf().setAppName("MyScalaWordCout")
    val sc = new SparkContext(conf)
    // read the data
    val lines = sc.textFile(input)
    val resultRDD = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    // save the result
    resultRDD.saveAsTextFile(output)
    sc.stop()
  }
}
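Both the Java and the Scala version split on the regex \s+ rather than a literal space. Scala strings delegate to Java's String.split, so the difference can be checked in plain Java (a small sketch with a made-up input line):

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "spark  hadoop\thive"; // two spaces and a tab
        // a literal " " leaves an empty token at the double space and does not split on the tab
        System.out.println(Arrays.toString(line.split(" ")));
        // \s+ treats any run of whitespace as a single separator
        System.out.println(Arrays.toString(line.split("\\s+")));
    }
}
```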
Running WordCount locally in Eclipse
Program arguments:
D:\spark\file\test.txt D:\spark\file\output2
VM arguments:
-Dspark.master=local
Submit to Spark
./spark-submit --master local[2] --class com.pcitc.scala.sparkscalatest.MyScalaWordCout /root/app/sparktest/sparkscalatest-0.0.1-SNAPSHOT.jar /root/app/sparktest/test.txt /root/app/sparktest/out2