Writing a Spark WordCount Program in IDEA
1: The spark shell is mostly used for testing and verifying programs. In a production environment you usually write the program in an IDE, package it into a jar, and submit it to the cluster. The most common approach is to create a Maven project and let Maven manage the jar dependencies.
2: Configure the Maven pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bie</groupId>
    <artifactId>sparkWordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.10.6</scala.version>
        <scala.compat.version>2.10</scala.compat.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.2</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.bie.WordCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Note: once the pom.xml is configured, click Enable Auto-Import when IDEA prompts you to import the changes.
3: Rename src/main/java and src/test/java to src/main/scala and src/test/scala respectively, so they match the sourceDirectory and testSourceDirectory settings in the pom.xml.
4: Create a new Scala class of kind Object and write the Spark program, as shown below:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf and set the application name
    val conf = new SparkConf().setAppName("wordCount")
    // Create the SparkContext, the entry point for submitting a Spark app
    val sc = new SparkContext(conf)
    // Build the RDD from the input path and apply the transformations and the action:
    // split each line into words, map each word to (word, 1), sum the counts per word,
    // sort by count in descending order, and write the result to the output path
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, false)
      .saveAsTextFile(args(1))
    // Stop the SparkContext to end the job
    sc.stop()
  }
}
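For quick iteration inside IDEA, before packaging the jar, the same logic can also be run in local mode. The sketch below is only an illustration: the LocalWordCount object name, the local[*] master, and the local file paths are assumptions for testing, not part of the project above.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local-mode sketch for testing in the IDE; master URL and
// file paths are placeholder assumptions for illustration only.
object LocalWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wordCountLocal")
      .setMaster("local[*]") // run inside the IDE instead of on the cluster
    val sc = new SparkContext(conf)

    sc.textFile("data/wordcount.txt")          // hypothetical local input file
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, ascending = false)
      .saveAsTextFile("data/wordcount-output") // hypothetical local output dir
    sc.stop()
  }
}

Running this variant directly from the IDE avoids the package/upload/submit cycle while the logic is still changing; the cluster version above stays unchanged.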
5: Package with Maven. First make sure the mainClass in the maven-shade-plugin section of the pom.xml matches the fully qualified name of your own class:
Then open the Maven Projects panel on the right side of IDEA, expand Lifecycle, select clean and package, and click Run Maven Build (the same as running mvn clean package on the command line):
Wait for the build to finish, take the jar that was built successfully, and upload it to one of the nodes of the Spark cluster:
Remember to start HDFS and the Spark cluster first, then submit the Spark application with the spark-submit command (pay attention to the order of the arguments):
Just a few simple lines of code, yet the packaged jar is close to a hundred megabytes, because all of the dependencies have been bundled (shaded) into it; you have to admire the people who built all of this.
Then run the spark-submit step, with the command shown below:
[root@master spark-1.6.1-bin-hadoop2.6]# bin/spark-submit \
    --class com.bie.WordCount \
    --master spark://master:7077 \
    --executor-memory 512M \
    --total-executor-cores 2 \
    /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar \
    hdfs://master:9000/wordcount.txt \
    hdfs://master:9000/output
Or, as a single line:
bin/spark-submit --class com.bie.WordCount --master spark://master:7077 --executor-memory 512M --total-executor-cores 2 /home/hadoop/data_hadoop/sparkWordCount-1.0-SNAPSHOT.jar hdfs://master:9000/wordcount.txt hdfs://master:9000/output
The operation looks like this:
In the Spark web UI you can see that a new Application has appeared:
And then it failed. When you are learning, it would almost feel abnormal if nothing went wrong:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.removeExecutor(CoarseGrainedSchedulerBackend.scala:359)
    at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.executorRemoved(SparkDeploySchedulerBackend.scala:144)
    at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:186)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 12 more
18/02/23 01:28:46 WARN NettyRpcEndpointRef: Error sending message [message = UpdateBlockInfo(BlockManagerId(driver, 192.168.3.129, 60565),broadcast_1_piece0,StorageLevel(false, true, false, false, 1),2358,0,0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
    at scala.util.Try$.apply(Try.scala:161)
    at scala.util.Failure.recover(Try.scala:185)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
    at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class ...
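The log itself points at spark.rpc.askTimeout as the setting that controls this timeout. One possible mitigation, shown below as a rough sketch rather than a confirmed fix, is to raise that timeout (together with the related spark.network.timeout) when building the SparkConf; the 300s value and the object name are assumptions for illustration, and the underlying cause (such as an unreachable worker or too little executor memory) still needs to be checked.

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch, not a confirmed fix: raise the RPC timeout the stack trace points at.
// "300s" is an arbitrary assumption; the same keys can also be passed to
// spark-submit via --conf instead of being hard-coded here.
object WordCountWithLongerTimeout {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wordCount")
      .set("spark.rpc.askTimeout", "300s")   // the setting named in the error message (120s in the log)
      .set("spark.network.timeout", "300s")  // related network-level timeout
    val sc = new SparkContext(conf)
    // ... same WordCount logic as above ...
    sc.stop()
  }
}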