Spark Quick Start

Basic installation

Official documentation

http://spark.apache.org/docs/2.3.0/quick-start.html

Unpack the tarball and create a symlink

tar -xzvf spark-2.3.0-bin-hadoop2.6.tgz 

ln -s spark-2.3.0-bin-hadoop2.6 spark
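Pointing scripts and PATH entries at the spark symlink rather than the versioned directory means a later upgrade only needs the link re-pointed, e.g. with ln -sfn <new-version-dir> spark.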

Spark shell

[root@cdh1 bin]# ./spark-shell 

 


scala> val line = sc.textFile("/root/app/test")

line: org.apache.spark.rdd.RDD[String] = /root/app/test MapPartitionsRDD[1] at textFile at <console>:24



scala> line.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)

(spark,3)                                                                       

(hadoop,3)



scala> :quit



#_.split(" ") splits each line on spaces; _ stands for the current line
#(_,1) pairs each word with a count of 1; _ stands for the current word
#reduceByKey(_+_) adds up the counts for each distinct word, two at a time
#collect() returns the result to the driver as a local array
#foreach prints each (word, count) pair to the console
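As a quick sanity check, the same pipeline can be run on an in-memory collection instead of a file (a minimal sketch for spark-shell; the sample lines below are made up to reproduce the counts above):

val sample = sc.parallelize(Seq("spark hadoop", "spark hadoop", "spark hadoop"))
sample.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
// prints (spark,3) and (hadoop,3), in either order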


Running a bundled example program

[root@cdh1 bin]# ./run-example SparkPi

Pi is roughly 3.140955704778524
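SparkPi estimates π by Monte Carlo sampling (scattering random points in a square and counting how many land inside the inscribed circle), so the printed value differs slightly on every run.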

Java version of WordCount

pom.xml

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.0</version>
    </dependency>
  </dependencies>


   <build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>2.4.1</version>
				<executions>
					<!-- Run shade goal on package phase -->
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<transformers>
								<!-- add Main-Class to manifest file -->
								<transformer
									implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
									<mainClass>com.pcitc.sparkbegin</mainClass>
								</transformer>
							</transformers>
							<createDependencyReducedPom>false</createDependencyReducedPom>
						</configuration>
					</execution>
				</executions>
			</plugin>
		</plugins>
	</build> 
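Bound to the package phase, the shade plugin bundles the project classes and all of their dependencies into a single runnable "fat" jar and writes the configured mainClass into the jar's manifest, so the artifact can be handed straight to spark-submit.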


Fixing slow dependency downloads

In the Maven installation directory, open settings.xml and add Aliyun mirror entries inside the <mirrors> tag:

<mirror>
  <id>nexus</id>
  <mirrorOf>*</mirrorOf>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>
<mirror>
  <id>nexus-public-snapshots</id>
  <mirrorOf>public-snapshots</mirrorOf>
  <url>http://maven.aliyun.com/nexus/content/repositories/snapshots/</url>
</mirror>
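The first entry's <mirrorOf>*</mirrorOf> routes requests for every repository through the Aliyun mirror; the second applies only to a repository whose id is public-snapshots.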


Fixing Eclipse failing to reopen after being force-closed during a slow pom download

Delete D:\esplisegit\.metadata\.plugins\org.eclipse.e4.workbench\workbench.xmi

MyJavaWordCount

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class MyJavaWordCount {
	public static void main(String[] args) {
		//argument check
		if(args.length<2){
			System.err.println("Usage:MyJavaWordCount input output");
			System.exit(1);
		}		
		//input path
		String inputPath = args[0];
		//output path
		String outputPath = args[1];		
		//create the SparkConf and JavaSparkContext
		SparkConf conf = new SparkConf().setAppName("MyJavaWordCount");
		JavaSparkContext jsc = new JavaSparkContext(conf);		
		//read the input data
		JavaRDD<String> inputRDD = jsc.textFile(inputPath);		
		//flatMap: split each line into words on one or more whitespace characters
		JavaRDD<String> words = inputRDD.flatMap(new FlatMapFunction<String, String>() {
			private static final long serialVersionUID = 1L;
			public Iterator<String> call(String line) throws Exception {
				return Arrays.asList(line.split("\\s+")).iterator();
			}
		});
		//mapToPair: turn each word into a (word, 1) pair
		JavaPairRDD<String, Integer> pairRDD = words.mapToPair(new PairFunction<String, String, Integer>() {
			private static final long serialVersionUID = 1L;
			public Tuple2<String, Integer> call(String word) throws Exception {
				return new Tuple2<String, Integer>(word,1);
			}
		});
		//reduceByKey: add up the counts for identical keys
		JavaPairRDD<String, Integer> result = pairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
			private static final long serialVersionUID = 1L;
			public Integer call(Integer x, Integer y) throws Exception {
				return x+y;
			}
		});
		//save the result
		result.saveAsTextFile(outputPath);
		//close the JavaSparkContext
		jsc.close();
	}
}
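One Java-specific detail: since Spark 2.0, FlatMapFunction.call returns an Iterator rather than an Iterable, which is why the word list above is converted with .iterator().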


Running WordCount locally from Eclipse

Program arguments:
  D:\spark\file\test.txt D:\spark\file\output
VM arguments:
  -Dspark.master=local
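-Dspark.master=local works because SparkConf automatically picks up JVM system properties starting with spark., so no master has to be hard-coded. An equivalent, for local testing only, would be calling conf.setMaster("local") on the SparkConf in code (remove it before cluster submission, where spark-submit supplies the master).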


Package the project
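With the shade plugin above bound to the package phase, running mvn package builds the fat jar under target/; that jar, copied to the server (here presumably renamed javawordcount.jar), is what gets submitted next.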

 

Submit to Spark


[root@cdh1 bin]# ./spark-submit --master local[2] --class com.pcitc.sparkbegin.MyJavaWordCount /root/app/sparktest/javawordcount.jar /root/app/sparktest/test.txt /root/app/sparktest/out



[root@cdh1 sparktest]# ls

javawordcount.jar  out  test.txt

[root@cdh1 sparktest]# cd out

[root@cdh1 out]# ls

part-00000  part-00001  _SUCCESS

[root@cdh1 out]# cat ./*

(spark,3)

(hadoop,3)



--master local[2]: run Spark locally with 2 worker threads. The output contains two part files because the result RDD has two partitions (with local[2], textFile defaults to a minimum of two splits); the empty _SUCCESS file marks a completed job.


Scala version of WordCount

Installing the Scala IDE plugin

Check the Eclipse version:

Version: Mars.1 Release (4.5.1)

Build id: 20150924-1200

Download the plugin build that matches Eclipse 4.5:

http://scala-ide.org/download/prev-stable.html

Unpack the archive and place its folders in the right directory.

Create a scala folder under the target directory [D:\esplisegit\eclipse\dropins\]:

Copy the two directories from the archive into D:\esplisegit\eclipse\dropins\scala

Restart Eclipse.

After a successful install, the Scala wizards show up under New.

Creating a Scala project

Add Scala Nature

First create a plain Maven project, then choose Configure -> Add Scala Nature.

New Scala Object

Scala Object:

object test {
  def main(args: Array[String]): Unit = {
    println("hello,scala")
  }
}


Run As Scala Application to run it locally.

Running on Linux

[root@cdh1 bin]# ./scala -version
Scala code runner version 2.11.8  
[root@cdh1 bin]# ./scala -cp /root/app/sparktest/sparkscalatest-0.0.1-SNAPSHOT.jar com.pcitc.scala.sparkscalatest.test
hello,scala


MyScalaWordCout


import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object MyScalaWordCout {
  def main(args: Array[String]): Unit = {
    if(args.length<2){
      System.err.println("Usage:MyScalaWordCout <input> <output>")
      System.exit(1)
    }
    
    //input path
    val input = args(0)
    //output path
    val output = args(1)
    
    val conf = new SparkConf().setAppName("MyScalaWordCout")
    
    val sc = new SparkContext(conf)
    
    //read the input data
    val lines = sc.textFile(input)
    
    val resultRDD = lines.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_)
    
    //save the result
    resultRDD.saveAsTextFile(output)
    
    sc.stop()
    
  }
}
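Functionally this is identical to the Java version: the three anonymous inner classes collapse into the one-line flatMap, map and reduceByKey stages chained on the RDD.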

Running WordCount locally from Eclipse

Program arguments:
  D:\spark\file\test.txt D:\spark\file\output2
VM arguments:
  -Dspark.master=local

Submit to Spark

./spark-submit --master local[2] --class com.pcitc.scala.sparkscalatest.MyScalaWordCout /root/app/sparktest/sparkscalatest-0.0.1-SNAPSHOT.jar /root/app/sparktest/test.txt /root/app/sparktest/out2
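Note that saveAsTextFile will not overwrite an existing directory: if /root/app/sparktest/out2 is left over from a previous run, the job fails with a FileAlreadyExistsException, so delete the output directory before re-running.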

