[ Hadoop | Spark | Scala ] Setting Up a Scoobi Development Environment

Scoobi: an open source Scala library for Hadoop MapReduce. It combines the simplicity of functional programming with the power of distributed data processing on Hadoop, and it can dramatically increase development productivity by reducing the effort needed to write Hadoop MapReduce programs.


In this article, we will show how to set up the Scoobi development environment on Unix/Linux by creating the classic Hadoop WordCount project.


A Scoobi project is basically an sbt project and follows the standard sbt project layout. To learn more about sbt, please refer to my separate article introducing sbt.


Step by Step Guide

1. Prerequisites

  • A Unix or Linux operating system
  • A Hadoop cluster, on which to run the Scoobi program

2. Install sbt


sbt (Simple Build Tool, for building Scala code, though it does not seem that simple to me...)


Step 1: Download the jar

[jinmfeng@localhost Downloads]$wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.2/sbt-launch.jar -O sbt-launch.jar

--2014-04-25 11:15:48--  http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.2/sbt-launch.jar
Resolving repo.typesafe.com... 54.236.87.147, 107.21.30.253, 54.236.91.228
Connecting to repo.typesafe.com|54.236.87.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1183053 (1.1M) [application/java-archive]
Saving to: “sbt-launch.jar”


100%[===================================================================================================================>] 1,183,053    373K/s   in 3.1s    


2014-04-25 11:15:54 (373 KB/s) - “sbt-launch.jar” saved [1183053/1183053]


Step 2: Copy the jar to a bin directory

[jinmfeng@localhost ~]$mkdir -p ~/bin
[jinmfeng@localhost ~]$cp ~/Downloads/sbt-launch.jar ~/bin


Step 3: Create a file named sbt in the same directory and make it executable:

[jinmfeng@localhost bin]$vi sbt

SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
[jinmfeng@localhost bin]$chmod u+x sbt


Step 4: Make sure ~/bin is on your PATH. If it is not, add the line below to your ~/.bash_profile, then run source ~/.bash_profile (or open a new shell) for it to take effect:

export PATH=~/bin:$PATH


Step 5: Verify that the configuration works

[jinmfeng@localhost ~]$sbt sbt-version

Getting org.scala-sbt sbt 0.13.2 ...
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.2/jars/sbt.jar ...
[SUCCESSFUL ] org.scala-sbt#sbt;0.13.2!sbt.jar (3863ms)

..................

..................

[info] Set current project to jinmfeng (in build file:/home/jinmfeng/)
[info] 0.13.2

You now have two new directories, ~/.sbt and ~/.ivy2. ~/.sbt holds global sbt configuration such as sbt plugins; ~/.ivy2 is the local artifact cache, just like Maven's ~/.m2.


3. Create the WordCount Scala project

Project Structure:

wordCount/
  build.sbt
  src/
    main/
      scala/             -- Scala code goes in this directory
        WordCount.scala
    test/
      scala/
        WordCountTest.scala
  project/               -- project-level build configuration
    build.properties
    plugins.sbt          -- the project-level plugin configuration


Step 1: Create the project directory, the source tree, and the WordCount source file:

[jinmfeng@localhost ~]$mkdir wordCount
[jinmfeng@localhost ~]$cd wordCount
[jinmfeng@localhost wordCount]$mkdir -p src/main/scala
[jinmfeng@localhost wordCount]$vi src/main/scala/WordCount.scala

import com.nicta.scoobi.Scoobi._
import Reduction._

object WordCount extends ScoobiApp {
  def run() {
    // Read the input file (args(0)) as a distributed list of lines
    val lines: DList[String] = fromTextFile(args(0))

    // Split each line into words, pair each word with a count of 1,
    // group by word, and sum the counts with the Sum.int reduction
    val counts: DList[(String, Int)] = lines.mapFlatten(_.split(" "))
                                            .map(word => (word, 1))
                                            .groupByKey
                                            .combine(Sum.int)

    // Execute the pipeline and write the results to the output path (args(1))
    persist(counts.toTextFile(args(1)))
  }
}
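To see what this pipeline computes before touching a cluster, here is a minimal local sketch using plain Scala collections (no Scoobi, no Hadoop; the object name and sample data are illustrative only). Each step mirrors the corresponding DList operation above:

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val lines = List("a b a", "b c")
    val counts = lines
      .flatMap(_.split(" "))         // mirrors DList.mapFlatten
      .map(word => (word, 1))        // mirrors DList.map
      .groupBy(_._1)                 // mirrors groupByKey
      .mapValues(_.map(_._2).sum)    // mirrors combine(Sum.int)
    counts.foreach(println)          // e.g. (a,2), (b,2), (c,1); map order may vary
  }
}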

Step 2: Create the sbt configuration for the project

[jinmfeng@localhost wordCount]$mkdir project
[jinmfeng@localhost wordCount]$vi project/plugins.sbt

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")

Notes: This is the project-level plugin configuration file. You can also create the file at ~/.sbt/0.13/plugins/plugins.sbt so the plugin is available to all of your projects.
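The project tree above also lists project/build.properties. No step here creates it, but it is worth adding: it pins the sbt version used to build the project so every machine builds with the same sbt. For example, a file containing a single line:

[jinmfeng@localhost wordCount]$vi project/build.properties

sbt.version=0.13.2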

[jinmfeng@localhost wordCount]$vi build.sbt

import AssemblyKeys._

assemblySettings

name := "ScoobiWordCount"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
    "com.nicta" %% "scoobi" % "0.8.3" intransitive(),
    "org.apache.hadoop" % "hadoop-common" % "2.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-hdfs" % "2.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-mapreduce-client-app" % "2.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-mapreduce-client-jobclient" % "2.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-annotations" % "2.2.0" % "provided",
    "org.apache.avro" % "avro-mapred" % "1.7.4" classifier "hadoop2",
    "org.scala-lang" % "scala-compiler" % "2.10.3",
    "org.scalaz" %% "scalaz-core" % "7.0.2",
    "com.thoughtworks.xstream" % "xstream" % "1.4.4" intransitive(),
    "javassist" % "javassist" % "3.12.1.GA",
    "com.googlecode.kiama" %% "kiama" % "1.5.2",
    "com.chuusai" % "shapeless_2.10.2" % "2.0.0-M1")

resolvers ++= Seq(Resolver.sonatypeRepo("releases"), Resolver.sonatypeRepo("snapshots"))

Notes: This file is just like Maven's pom.xml; it is used to define the project.

Please note that you have to keep a blank line between each statement in a .sbt file, or you will get a compile error.

libraryDependencies defines the dependencies. The Hadoop jars are marked "provided" because they are supplied by the Hadoop cluster at runtime, so you do not need to bundle them.

resolvers is just like pom.xml's repositories section; it defines where to download the dependent jars from.
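One detail worth noting: the %% operator appends the Scala binary version to the artifact name, while % uses the name verbatim. A small illustration using dependencies from the build.sbt above:

// "%%" appends the Scala binary version to the artifact name,
// so this resolves the artifact scoobi_2.10:
"com.nicta" %% "scoobi" % "0.8.3"

// "%" uses the artifact name exactly as written (typical for Java libraries):
"org.apache.hadoop" % "hadoop-common" % "2.2.0"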


Step 3: Compile the project

[jinmfeng@localhost wordCount]$sbt compile
[info] Loading project definition from /home/jinmfeng/wordCount/project
[info] Updating {file:/home/jinmfeng/wordCount/project/}wordcount-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.10.2/jars/sbt-assembly.jar ...
[info] [SUCCESSFUL ] com.eed3si9n#sbt-assembly;0.10.2!sbt-assembly.jar (7635ms)

...............

...............

[info] Done updating.
[info] Compiling 1 Scala source to /home/jinmfeng/wordCount/target/scala-2.10/classes...
[success] Total time: 765 s, completed Apr 25, 2014 11:21:25 AM

Step 4: Assemble the project to generate a jar that can be run on the cluster with hadoop jar

sbt-assembly is an sbt plugin that packages the project and its dependencies into a single runnable (fat) jar.

[jinmfeng@localhost wordCount]$sbt assembly

[info] Loading project definition from /home/jinmfeng/wordCount/project
[info] Set current project to ScoobiWordCount (in build file:/home/jinmfeng/wordCount/)
[info] Including: scala-compiler.jar
[info] Including: scala-library.jar

..................

..................

[info] Strategy 'deduplicate' was applied to 3 files (Run the task at debug level to see details)
[warn] Strategy 'discard' was applied to 2 files
[warn] Strategy 'rename' was applied to 7 files
[info] SHA-1: 90d2a74d3034ab58287e1a861fb8f72f2a948c12
[info] Packaging /home/jinmfeng/wordCount/target/scala-2.10/ScoobiWordCount-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 50 s, completed Apr 25, 2014 11:24:38 AM

Notes: This step is the one most likely to fail, usually because of dependency conflicts. If it does, adjust the libraryDependencies configuration or add a merge strategy to resolve the conflicts.
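For reference, a custom merge strategy with sbt-assembly 0.10.x looks roughly like the sketch below, appended to build.sbt. The case patterns are illustrative only, not tuned for this build; adapt them to whatever conflicts sbt assembly actually reports:

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    // Keep the first copy of conflicting servlet classes (example pattern)
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    // Concatenate config files instead of failing on duplicates
    case "application.conf" => MergeStrategy.concat
    // Fall back to the plugin's default strategy for everything else
    case x => old(x)
  }
}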


Step 5: Run the jar on the Hadoop cluster. In my case, I ran the jar on my Hadoop 2.2.0 cluster.

[jinmfeng@jimmy1-197873 ~]$hadoop jar ScoobiWordCount-assembly-1.0.jar input output

14/04/25 17:05:39 INFO Configuration.deprecation: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.map.child.log.level is deprecated. Instead, use mapreduce.map.log.level
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.reduce.child.log.level is deprecated. Instead, use mapreduce.reduce.log.level
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
[INFO] ScoobiApp - the URL of Java (evidenced with the java.lang.String class) is jar:file:/home/jinmfeng/bin/jdk1.7.0_51/jre/lib/rt.jar!/java/lang/String.class
[INFO] ScoobiApp - the URL of Scala (evidenced with the scala.collection.immutable.Range class) is file:/home/jinmfeng/file:/home/jinmfeng/temp/hadoop-unjar7738534783898350189/scala/collection/immutable/Range.class
[INFO] ScoobiApp - the URL of Hadoop (evidenced with the org.apache.hadoop.io.Writable class) is jar:file:/home/jinmfeng/lib/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar!/org/apache/hadoop/io/Writable.class
[INFO] ScoobiApp - the URL of Avro (evidenced with the org.apache.avro.Schema class) is jar:file:/home/jinmfeng/lib/hadoop-2.2.0/share/hadoop/common/lib/avro-1.7.4.jar!/org/apache/avro/Schema.class
[INFO] ScoobiApp - the URL of Kiama (evidenced with the org.kiama.rewriting.Rewriter class) is file:/home/jinmfeng/file:/home/jinmfeng/temp/hadoop-unjar7738534783898350189/org/kiama/rewriting/Rewriter.class
[INFO] ScoobiApp - the URL of Scoobi (evidenced with the com.nicta.scoobi.core.ScoobiConfiguration class) is file:/home/jinmfeng/file:/home/jinmfeng/temp/hadoop-unjar7738534783898350189/com/nicta/scoobi/core/ScoobiConfiguration.class
[INFO] Sink - Output path: output
[INFO] Source - Input path: input (1.33 KiB)
[INFO] HadoopMode - ======================================================================
[INFO] HadoopMode - ===== START OF SCOOBI JOB 'WordCount$-0425-170539-1139641331' ========
[INFO] HadoopMode - ======================================================================

[INFO] HadoopMode - Executing layers
Layer(1
  ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229) ,
  GroupByKey (18)[String,Int] (bridge 9e16e) ,
  Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)])

Mscr(1

  inputs: + GbkInputChannel(Load (1)[String] (TextSource(1)
input
) )

          mappers
          ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)

          last mappers
          ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)

  outputs: + GbkOutputChannel(GroupByKey (18)[String,Int] (bridge 9e16e) , combiner = Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)]))

[INFO] HadoopMode - Executing layer 1
Layer(1
  ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229) ,
  GroupByKey (18)[String,Int] (bridge 9e16e) ,
  Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)])

[INFO] HadoopMode - executing map reduce jobs
Mscr(1

  inputs: + GbkInputChannel(Load (1)[String] (TextSource(1)
input
) )

          mappers
          ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)

          last mappers
          ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)

  outputs: + GbkOutputChannel(GroupByKey (18)[String,Int] (bridge 9e16e) , combiner = Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)]))


14/04/25 17:05:45 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.output.value.groupfn.class is deprecated. Instead, use mapreduce.job.output.group.comparator.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/04/25 17:05:46 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
14/04/25 17:05:47 INFO Configuration.deprecation: fs.checkpoint.edits.dir is deprecated. Instead, use dfs.namenode.checkpoint.edits.dir
14/04/25 17:05:47 INFO Configuration.deprecation: fs.checkpoint.dir is deprecated. Instead, use dfs.namenode.checkpoint.dir
14/04/25 17:05:47 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
14/04/25 17:05:47 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
14/04/25 17:05:47 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/04/25 17:05:47 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/04/25 17:05:47 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/04/25 17:05:47 INFO Configuration.deprecation: dfs.name.edits.dir is deprecated. Instead, use dfs.namenode.edits.dir
[INFO] MapReduceJob - Total input size: 1.33 KiB
[INFO] MapReduceJob - Number of reducers: 1
[INFO] RMProxy - Connecting to ResourceManager at Jimmy1/10.9.241.97:8032
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
[INFO] FileInputFormat - Total input paths to process : 1
[INFO] JobSubmitter - number of splits:1
14/04/25 17:06:02 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/04/25 17:06:02 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/04/25 17:06:02 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
[INFO] JobSubmitter - Submitting tokens for job: job_1397115520850_0020
[INFO] Job - The url to track the job: http://Jimmy1:8088/proxy/application_1397115520850_0020/
[INFO] MapReduceJob - MapReduce job 'job_1397115520850_0020' submitted. Please see http://Jimmy1:8088/proxy/application_1397115520850_0020/ for more info.
[INFO] MapReduceJob - Map 100%    Reduce   0%
[INFO] MapReduceJob - Map 100%    Reduce 100%
[INFO] HadoopMode - ===== END OF MAP-REDUCE JOB 1 of 1 (for the Scoobi job 'WordCount$-0425-170539-1139641331') ======

[INFO] HadoopMode - Layer sinks:  List(TextFileSink: output)
[INFO] HadoopMode - ===== END OF LAYER 1 ======

[INFO] HadoopMode - ======================================================================
[INFO] HadoopMode - ===== END OF SCOOBI JOB 'WordCount$-0425-170539-1139641331'   ========
[INFO] HadoopMode - ======================================================================
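A tip on logging: ScoobiApp also interprets arguments of its own when they appear after the literal word scoobi on the command line. Based on Scoobi's documented argument handling (verify against your Scoobi version), something like the following enables more detailed logging:

[jinmfeng@jimmy1-197873 ~]$hadoop jar ScoobiWordCount-assembly-1.0.jar input output scoobi verbose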


Step 6: Check the result:

[jinmfeng@jimmy1-197873 ~]$ hadoop fs -ls output

Found 2 items
-rw-r--r--   2 jinmfeng supergroup          0 2014-04-25 17:06 output/_SUCCESS
-rw-r--r--   2 jinmfeng supergroup       1574 2014-04-25 17:06 output/ch18out17-r-00000

[jinmfeng@jimmy1-197873 ~]$ hadoop fs -text output/ch18out17-r-00000
(,18)
((BIS),,1)
((ECCN),1)
((TSU),1)
((see,1)
(5D002.C.1,,1)
(740.13),1)
(<http://www.wassenaar.org/>,1)

........


Note the (,18) entry in the output: splitting each line with split(" ") produces empty tokens for consecutive or leading spaces, so the empty string is counted as a "word" 18 times. A more robust tokenizer (for example, splitting on the regular expression \s+ and dropping empty tokens) would avoid this.

Now you have successfully run your first Scoobi project. Following the same steps, you can go on to try more of Scoobi...


[About the Author]

Jimmy, software engineer, currently working at a large multinational Internet company on data products built on a big data platform.
