Example Self-Contained Spark Application

For the past few weeks we've shown some simple examples of how to use Hive and Impala with different file formats along with partitioning. Those approaches exercised two very popular interfaces into Hadoop – Hive (running on Map/Reduce) and Impala.

We're now shifting gears to introduce a new up and comer – namely Spark. The goal of this post is to get you up and running with a very simple self-contained Spark application. ("Self-contained" is the important concept here: instead of simply pulling open a Scala shell, we show you how to use an IDE locally and produce an artifact you can publish to a Spark cluster.)

Before you get started, make sure to go through the first two steps of loading some CSV data into Hadoop: unzip the file and add it to HDFS, then load some initial data. You technically don't even need the second step, but the example code references the same data location used by Hive, so we stuck with the same source.

Oh, and while you could do this in Java or Python, we picked Scala since it's a bit new to many folks and it's good to get some run time with a new language – if only to play.

Step 1) Install a Scala IDE. There are a bunch of them, but here is a very common one: http://scala-ide.org/

Step 2) Install a Scala build tool. Here is a good one: http://www.scala-sbt.org/

Step 3) Open the IDE, create a Scala Project, and create a /src/main/scala folder structure. (See here for the recommended directory structure from Spark themselves.) Make the "scala" folder a Source Folder. Then add a Scala class file called Temp.scala along with one called TempsUtil.scala, and copy and paste the contents below into them. Replace the "<namenode>" value with the one that fits your environment. Your Hadoop administrator should know this.
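
For reference, the layout after this step should look roughly like this (the project name here is just the example used later in Step 6):

MyScalaProject/
  src/
    main/
      scala/
        Temp.scala
        TempsUtil.scala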

Here is the TempsUtil.scala file:


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

object TempsUtil {
    def main(args: Array[String]) {
      // Point this at your cluster's NameNode (your Hadoop administrator will know the host name).
      val tempFile = "hdfs://<namenode>:8020/user/hive/warehouse/temps_txt/hourly_TEMP_2014.csv"
      val conf = new SparkConf().setAppName("Get Average Temperatures")
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
      // Implicit conversion from an RDD of case classes to a SchemaRDD (Spark 1.2 SQL API).
      import sqlContext.createSchemaRDD
      // Read the raw CSV lines from HDFS.
      val temps = sc.textFile(tempFile, 2).cache()
      // Split each line on commas and map the 22 positional fields into a Temp record.
      val tempsrdd = temps.map (_.split(",")).map(t => Temp(
          t(0),
          t(1),
          t(2),
          t(3),
          t(4),
          t(5),
          t(6),
          t(7),
          t(8),
          t(9),
          t(10),
          t(11),
          t(12),
          t(13).toDouble,
          t(14),
          t(15),
          t(16),
          t(17),
          t(18),
          t(19),
          t(20),
          t(21)
      ))
      
      // Register the RDD as a temporary table so it can be queried with SQL.
      tempsrdd.registerTempTable("temps_txt")
      val avgTemps = sqlContext.sql("select avg(degrees) from temps_txt")
      // Format and print the single result row.
      avgTemps.map(t => "Average Temp is " + t(0)).collect().foreach(println)
  }
}
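
One caveat on the parsing: split(",") assumes every line has all 22 fields and that column 14 is numeric, so a header row or a malformed line will make the toDouble call throw. Entirely optional, but a minimal defensive sketch (not something the code above requires) is to filter before mapping; this would stand in for the tempsrdd definition above:

      // Optional sketch: skip lines that don't have all 22 fields or whose
      // measurement column isn't numeric (e.g. a header row).
      val tempsrdd = temps
        .map(_.split(","))
        .filter(t => t.length >= 22 && scala.util.Try(t(13).toDouble).isSuccess)
        .map(t => Temp(t(0), t(1), t(2), t(3), t(4), t(5), t(6), t(7), t(8), t(9), t(10), t(11),
          t(12), t(13).toDouble, t(14), t(15), t(16), t(17), t(18), t(19), t(20), t(21)))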

Here is the Temp.scala file:

// Field names become the column names of the temps_txt table, which is why the query above can say avg(degrees).
case class Temp(
  statecode:String,
  countrycode: String,
  sitenum: String,
  paramcode: String,
  poc: String,
  latitude: String,
  longitude: String,
  datum: String,
  param: String,
  datelocal: String,
  timelocal: String,
  dategmt: String,
  timegmt: String,
  degrees: Double,
  uom: String,
  mdl: String,
  uncert: String,
  qual: String,
  method: String,
  methodname: String,
  state: String,
  county: String
)

Step 4) In the IDE, create a file called temp.sbt at the root of the project with the following contents. This file is used by the Scala build tool you installed in Step 2 – think of it like a Maven POM file, just much simpler. (Note the blank lines between the settings: older versions of sbt require them.)

name := "Temperature Application"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.2.1"
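
A quick note on the %% operator: sbt appends the Scala binary version from scalaVersion to the artifact name, so these two lines resolve to spark-core_2.10 and spark-sql_2.10 – the same artifact IDs you'll see in the Maven links in Step 5, and the same _2.10 suffix that shows up in the jar name in Step 7. Spelled out with a single %, the first dependency would be the equivalent:

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.1"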

Step 5) Convert the project to a Maven project (Right Click | Configure | Convert to Maven Project) and add the following two dependencies. If you’re not familiar with Maven, simply right click on the project and select Maven | Add Dependencies. Then fill out the Group ID, Artifact ID and Version information that you get from these two links.

http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.2.1

http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.2.1
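
With those dependencies on the project classpath, you can also (optionally) sanity-check the job inside the IDE before packaging. This is only a sketch under a couple of assumptions – that you keep a local copy of hourly_TEMP_2014.csv and temporarily point the SparkConf at a local master – and these two lines would stand in for their counterparts in TempsUtil.scala (revert them before building the jar for the cluster):

      val conf = new SparkConf().setAppName("Get Average Temperatures").setMaster("local[2]") // in-process Spark instead of YARN
      val tempFile = "file:///path/to/hourly_TEMP_2014.csv" // local copy of the CSV – adjust the path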

Step 6) Now go out to the command line and change to the root directory of your project, e.g. C:\Users\Admin\workspace_scala\MyScalaProject, then issue "sbt.bat package". You might have to fully qualify the path to "sbt.bat". Under the "target" directory (typically one level deeper, in target\scala-2.10) you should find your jar file. (Don't laugh that we're using Windows for this example. We're doing that on purpose. Or at least that is the story we're sticking with…)

Step 7) Move this jar file out to your Spark cluster and issue the following command to kick off the job.

spark-submit --master yarn --class TempsUtil temperature-application_2.10-1.0.jar

Or, to run in client mode:

spark-submit --master yarn-client --class TempsUtil temperature-application_2.10-1.0.jar

If you run into any issues, feel free to add a comment and we'll try to resolve them for you. The goal is to make this as simple as possible so you can quickly get a perspective on what writing a standalone Spark job entails.
