First, create a Scala project with Maven; the detailed steps are described at https://docs.scala-lang.org/tutorials/scala-with-maven.html
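The linked tutorial drives project creation through a Maven archetype. As a sketch, the skeleton can be generated non-interactively like this (the groupId/artifactId values here are illustrative; the archetype coordinates are the ones the tutorial uses):

```shell
# Generate a Scala project skeleton from the scala-archetype-simple archetype
mvn archetype:generate \
  -DarchetypeGroupId=net.alchim31.maven \
  -DarchetypeArtifactId=scala-archetype-simple \
  -DgroupId=org.twenz \
  -DartifactId=spark-streaming-demo \
  -DinteractiveMode=false
```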
Then open the Maven project in IntelliJ IDEA. The root directory contains a pom.xml file, to which we add the dependencies our project needs. Since we are building a Spark Streaming project, we must add the corresponding Spark packages. Pay attention to the `scope` element: in my experiments, when a dependency's scope is set to `provided`, that dependency's jar is excluded from the final with-dependencies jar produced by packaging.
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.scalatest</groupId>
    <artifactId>scalatest_2.11</artifactId>
    <version>2.2.4</version>
    <scope>test</scope>
  </dependency>
</dependencies>
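The snippet above references ${spark.version} without defining it, and building a "with dependencies" jar requires a plugin such as maven-assembly-plugin. A possible pom.xml fragment covering both is sketched below; the version numbers are assumptions and should match your Databricks cluster's Spark/Scala runtime:

```xml
<!-- Assumed versions; adjust to match your cluster's Spark/Scala runtime -->
<properties>
  <spark.version>2.4.3</spark.version>
</properties>

<!-- maven-assembly-plugin produces the with-dependencies jar during package -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```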
The `repositories` section tells Maven where to look for dependencies: either a local folder that already holds the jars, or a remote server. Some remote repositories require authentication, which is configured in the settings.xml file inside the .m2 folder under your user profile directory.
<repositories>
  <repository>
    <id>my-local-repo</id>
    <url>file://${basedir}/repo</url>
  </repository>
  <repository>
    <id>apache.snapshots</id>
    <name>Apache Development Snapshot Repository</name>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>xxx</id>
    <url>https://xxx.visualstudio.com/_packaging/xx/maven/v1</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
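For a repository that requires authentication, the credentials go into ~/.m2/settings.xml under a `server` whose id matches the repository id in pom.xml. A minimal sketch, with placeholder credentials you must replace:

```xml
<!-- ~/.m2/settings.xml — the server id must match the repository id in pom.xml -->
<settings>
  <servers>
    <server>
      <id>xxx</id>
      <username>your-username</username>
      <password>your-personal-access-token</password>
    </server>
  </servers>
</settings>
```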
Once the project is ready, we can add source files under the scala folder. Here is a simple program based on the familiar NetworkWordCount example. Because Databricks does not support `new SparkContext`, the code uses `SparkContext.getOrCreate` instead. The program below has been verified to run on Azure Databricks.
package org.twenz

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object NetworkWordCount extends App {
  val checkpointDirectory = "/usr/twenz/test"
  val sentences = Array(
    "This is a probe for test spark streaming",
    "The cow jumped over the moon",
    "An apple a day keeps the doctor away",
    "Four score and seven years ago",
    "Snow white and the seven dwarfs",
    "I am at two with nature")

  // Function to create and set up a new StreamingContext
  def functionToCreateContext(): StreamingContext = {
    val sparkConf = new SparkConf(true)
      .setAppName("NetworkWordCount")
      .set("spark.streaming.unpersist", "true")
    val ssc = new StreamingContext(SparkContext.getOrCreate(sparkConf), Seconds(30)) // new context
    ssc.checkpoint(checkpointDirectory) // set checkpoint directory
    ssc
  }

  // Recover the context from the checkpoint if one exists, otherwise create it fresh
  val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
  val rdd = ssc.sparkContext.parallelize(sentences)
  val lines = new ConstantInputDStream(ssc, rdd) // emits the same RDD every batch
  val words = lines.flatMap(_.split(" "))
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)

  // Print the first ten elements of each RDD generated in this DStream to the console
  wordCounts.print()

  ssc.start()            // Start the computation
  ssc.awaitTermination() // Wait for the computation to terminate
  ssc.stop()
}
In IntelliJ IDEA, open View -> Tool Windows -> Maven Projects and run `package` under Lifecycle. This produces two jar files in the target folder, one with dependencies and one without. Next, create a job on the Databricks site, upload the with-dependencies jar, and enter the entry point org.twenz.NetworkWordCount as the Main class. Once the job is running, check its log to see the streaming job's output.
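The same packaging step can be run from the command line instead of the IDEA Maven window. A sketch, assuming the assembly plugin described earlier is configured (the *-jar-with-dependencies.jar name comes from that plugin's default descriptor):

```shell
# Command-line equivalent of the IDEA "package" lifecycle step
mvn clean package

# Both jars should appear under target/: the plain jar and the with-dependencies jar
ls target/*.jar
```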