First, create a Scala project with Maven; the detailed steps are described at https://docs.scala-lang.org/tutorials/scala-with-maven.html
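The linked tutorial drives project creation through a Maven archetype. As a sketch, the skeleton can be generated non-interactively like this (the groupId/artifactId values here are illustrative; the archetype coordinates are the ones the tutorial uses):

```shell
# Generate a Scala project skeleton from the scala-archetype-simple archetype
mvn archetype:generate \
  -DarchetypeGroupId=net.alchim31.maven \
  -DarchetypeArtifactId=scala-archetype-simple \
  -DgroupId=org.twenz \
  -DartifactId=spark-streaming-demo \
  -DinteractiveMode=false
```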
Then open the Maven project in IntelliJ IDEA. The root directory contains a pom.xml file, to which we add the dependencies our project needs. Since we are building a Spark Streaming project, we must add the corresponding Spark packages. Pay attention to the `scope` element: in my experiments, when a dependency's scope is set to `provided`, that dependency's jar is excluded from the final with-dependencies jar produced by packaging.
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.scalatest</groupId>
    <artifactId>scalatest_2.11</artifactId>
    <version>2.2.4</version>
    <scope>test</scope>
  </dependency>
</dependencies>
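The snippet above references ${spark.version} without defining it, and building a "with dependencies" jar requires a plugin such as maven-assembly-plugin. A possible pom.xml fragment covering both is sketched below; the version numbers are assumptions and should match your Databricks cluster's Spark/Scala runtime:

```xml
<!-- Assumed versions; adjust to match your cluster's Spark/Scala runtime -->
<properties>
  <spark.version>2.4.3</spark.version>
</properties>

<!-- maven-assembly-plugin produces the with-dependencies jar during package -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```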
The `repositories` section tells Maven where to look for dependencies: either a local folder that already holds the jars, or a remote server. Some remote repositories require authentication, which is configured in the settings.xml file inside the .m2 folder under your user profile directory.
<repositories>
  <repository>
    <id>my-local-repo</id>
    <url>file://${basedir}/repo</url>
  </repository>
  <repository>
    <id>apache.snapshots</id>
    <name>Apache Development Snapshot Repository</name>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>xxx</id>
    <url>https://xxx.visualstudio.com/_packaging/xx/maven/v1</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
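For a repository that requires authentication, the credentials go into ~/.m2/settings.xml under a `server` whose id matches the repository id in pom.xml. A minimal sketch, with placeholder credentials you must replace:

```xml
<!-- ~/.m2/settings.xml — the server id must match the repository id in pom.xml -->
<settings>
  <servers>
    <server>
      <id>xxx</id>
      <username>your-username</username>
      <password>your-personal-access-token</password>
    </server>
  </servers>
</settings>
```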
Once the project is ready, we can add source files under the scala folder. Here is a simple program based on the familiar NetworkWordCount example. Because Databricks does not support `new SparkContext`, the code uses `SparkContext.getOrCreate` instead. The program below has been verified to run on Azure Databricks.
package org.twenz

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object NetworkWordCount extends App {
  val checkpointDirectory = "/usr/twenz/test"
  val sentences = Array(
    "This is a probe for test spark streaming",
    "The cow jumped over the moon",
    "An apple a day keeps the doctor away",
    "Four score and seven years ago",
    "Snow white and the seven dwarfs",
    "I am at two with nature")

  // Function to create and set up a new StreamingContext
  def functionToCreateContext(): StreamingContext = {
    val sparkConf = new SparkConf(true)
      .setAppName("NetworkWordCount")
      .set("spark.streaming.unpersist", "true")
    val ssc = new StreamingContext(SparkContext.getOrCreate(sparkConf), Seconds(30)) // new context
    ssc.checkpoint(checkpointDirectory) // set checkpoint directory
    ssc
  }

  // Recover the context from the checkpoint if one exists, otherwise create it fresh
  val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
  val rdd = ssc.sparkContext.parallelize(sentences)
  val lines = new ConstantInputDStream(ssc, rdd) // emits the same RDD every batch
  val words = lines.flatMap(_.split(" "))
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)

  // Print the first ten elements of each RDD generated in this DStream to the console
  wordCounts.print()

  ssc.start()            // Start the computation
  ssc.awaitTermination() // Wait for the computation to terminate
  ssc.stop()
}
In IntelliJ IDEA, open View -> Tool Windows -> Maven Projects and run `package` under Lifecycle. This produces two jar files in the target folder, one with dependencies and one without. Next, create a job on the Databricks site, upload the with-dependencies jar, and enter the entry point org.twenz.NetworkWordCount as the Main class. Once the job is running, check its log to see the streaming job's output.
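The same packaging step can be run from the command line instead of the IDEA Maven window. A sketch, assuming the assembly plugin described earlier is configured (the *-jar-with-dependencies.jar name comes from that plugin's default descriptor):

```shell
# Command-line equivalent of the IDEA "package" lifecycle step
mvn clean package

# Both jars should appear under target/: the plain jar and the with-dependencies jar
ls target/*.jar
```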