Scoobi is an open-source Scala library for Hadoop MapReduce. It combines the expressiveness of functional programming with the power of distributed data processing on Hadoop, and it can dramatically increase productivity by reducing the effort needed to develop MapReduce applications.
In this article, we will show how to set up a Scoobi development environment on Unix/Linux by building the classic Hadoop WordCount project.
Basically, a Scoobi project is an sbt project and follows the standard sbt project layout. To learn more about sbt, please refer to another article, an introduction to sbt.
Step by Step Guide
1. Prerequisites
- A Unix or Linux operating system
- A Hadoop cluster, on which to run the Scoobi program
2. Install sbt
sbt (Simple Build Tool) is used to build Scala code (though to me it does not always seem that simple...).
Step 1: Download the jar
[jinmfeng@localhost Downloads]$wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.2/sbt-launch.jar -O sbt-launch.jar
--2014-04-25 11:15:48-- http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.2/sbt-launch.jar
Resolving repo.typesafe.com... 54.236.87.147, 107.21.30.253, 54.236.91.228
Connecting to repo.typesafe.com|54.236.87.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1183053 (1.1M) [application/java-archive]
Saving to: “sbt-launch.jar”
100%[===================================================================================================================>] 1,183,053 373K/s in 3.1s
2014-04-25 11:15:54 (373 KB/s) - “sbt-launch.jar” saved [1183053/1183053]
Step 2: Copy the jar to a bin directory
[jinmfeng@localhost ~]$cp ~/Downloads/sbt-launch.jar ~/bin
Step 3: Create a file named sbt in the same directory and make it executable:
[jinmfeng@localhost bin]$vi sbt
# JVM options for the sbt launcher
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
# run the launcher jar that sits next to this script, passing all arguments through
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
[jinmfeng@localhost bin]$chmod u+x sbt
Step 4: Make sure ~/bin is in your PATH. If it is not, add the line below to your ~/.bash_profile:
export PATH=~/bin:$PATH
Step 5: Check if the configuration is OK
[jinmfeng@localhost ~]$sbt sbt-version
Getting org.scala-sbt sbt 0.13.2 ...
downloading http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.2/jars/sbt.jar ...
[SUCCESSFUL ] org.scala-sbt#sbt;0.13.2!sbt.jar (3863ms)
..................
..................
[info] Set current project to jinmfeng (in build file:/home/jinmfeng/)
[info] 0.13.2
Now you have two new directories, ~/.sbt and ~/.ivy2: ~/.sbt holds global sbt configuration such as global plugins, and ~/.ivy2 is the dependency cache, much like Maven's ~/.m2.
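For example, once the build below has run, the downloaded Scoobi jar will sit in the Ivy cache at a path like the following (illustrative, assuming the versions used in this article):
~/.ivy2/cache/com.nicta/scoobi_2.10/jars/scoobi_2.10-0.8.3.jar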
3. Create the WordCount Scala project
Project Structure:
wordCount/
  build.sbt
  src/
    main/
      scala/                --Scala code goes in this directory
        WordCount.scala
    test/
      scala/
        WordCountTest.scala
  project/                  --Project-level configuration
    build.properties
    plugins.sbt             --The project-level plugin configuration
Step 1: Create the project directories as below:
[jinmfeng@localhost ~]$mkdir wordCount
[jinmfeng@localhost ~]$cd wordCount
[jinmfeng@localhost wordCount]$mkdir -p src/main/scala
[jinmfeng@localhost wordCount]$vi src/main/scala/WordCount.scala
import com.nicta.scoobi.Scoobi._
import Reduction._

object WordCount extends ScoobiApp {
  def run() {
    // Read the input text file given as the first command-line argument
    val lines: DList[String] = fromTextFile(args(0))

    // Split each line into words, pair each word with 1,
    // then group by word and sum the counts
    val counts: DList[(String, Int)] = lines.mapFlatten(_.split(" "))
                                            .map(word => (word, 1))
                                            .groupByKey
                                            .combine(Sum.int)

    // Write the (word, count) pairs to the output path (second argument)
    persist(counts.toTextFile(args(1)))
  }
}
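To see what the DList pipeline computes, here is a rough local analogue using ordinary Scala collections (a sketch for intuition only, not part of the project; DList operations are distributed and lazily evaluated, unlike the eager collections below):
object LocalWordCount extends App {
  val lines = List("hello world", "hello scoobi")
  val counts = lines.flatMap(_.split(" "))    // like mapFlatten
                    .map(word => (word, 1))   // pair each word with 1
                    .groupBy(_._1)            // like groupByKey
                    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // like combine(Sum.int)
  println(counts)  // e.g. Map(hello -> 2, world -> 1, scoobi -> 1)
}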
Step 2: Create the sbt configuration for the project
[jinmfeng@localhost wordCount]$mkdir project
[jinmfeng@localhost wordCount]$vi project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")
Notes: This is the project-level plugin configuration file. You can also put the same line in ~/.sbt/0.13/plugins/plugins.sbt so that the plugin is available to all projects.
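For example, a global plugin file would contain exactly the same line (the path is the standard global plugins location for sbt 0.13):
// ~/.sbt/0.13/plugins/plugins.sbt -- applies to every sbt 0.13 project on this machine
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")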
[jinmfeng@localhost wordCount]$vi build.sbt
import AssemblyKeys._

assemblySettings

name := "ScoobiWordCount"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "com.nicta" %% "scoobi" % "0.8.3" intransitive(),
  "org.apache.hadoop" % "hadoop-common" % "2.2.0" % "provided",
  "org.apache.hadoop" % "hadoop-hdfs" % "2.2.0" % "provided",
  "org.apache.hadoop" % "hadoop-mapreduce-client-app" % "2.2.0" % "provided",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.2.0" % "provided",
  "org.apache.hadoop" % "hadoop-mapreduce-client-jobclient" % "2.2.0" % "provided",
  "org.apache.hadoop" % "hadoop-annotations" % "2.2.0" % "provided",
  "org.apache.avro" % "avro-mapred" % "1.7.4" classifier "hadoop2",
  "org.scala-lang" % "scala-compiler" % "2.10.3",
  "org.scalaz" %% "scalaz-core" % "7.0.2",
  "com.thoughtworks.xstream" % "xstream" % "1.4.4" intransitive(),
  "javassist" % "javassist" % "3.12.1.GA",
  "com.googlecode.kiama" %% "kiama" % "1.5.2",
  "com.chuusai" % "shapeless_2.10.2" % "2.0.0-M1")

resolvers ++= Seq(Resolver.sonatypeRepo("releases"),
                  Resolver.sonatypeRepo("snapshots"))
Notes: This file is like Maven's pom.xml; it defines the project.
Please note that with this version of sbt you have to keep a blank line between each statement in a .sbt file, or you will get a compile error.
libraryDependencies declares the dependencies; the Hadoop jars are marked "provided" because the Hadoop cluster supplies them at runtime, so you do not need to bundle them.
resolvers is like the pom.xml's repositories section: it defines where to download the dependent jars from.
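One detail worth knowing: %% appends the Scala binary version to the artifact name, while % uses the name verbatim. For example:
// these two dependency declarations are equivalent when scalaVersion is 2.10.x:
"com.nicta" %% "scoobi" % "0.8.3"
"com.nicta" % "scoobi_2.10" % "0.8.3"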
Step 3: Compile the project
[jinmfeng@localhost wordCount]$sbt compile
[info] Loading project definition from /home/jinmfeng/wordCount/project
[info] Updating {file:/home/jinmfeng/wordCount/project/}wordcount-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.10.2/jars/sbt-assembly.jar ...
[info] [SUCCESSFUL ] com.eed3si9n#sbt-assembly;0.10.2!sbt-assembly.jar (7635ms)
...............
...............
[info] Done updating.
[info] Compiling 1 Scala source to /home/jinmfeng/wordCount/target/scala-2.10/classes...
[success] Total time: 765 s, completed Apr 25, 2014 11:21:25 AM
Step 4: Assemble the project to generate a jar that can be run on the cluster with hadoop jar ...
sbt-assembly is an sbt plugin used to build a runnable fat jar (the application plus its non-provided dependencies).
[jinmfeng@localhost wordCount]$sbt assembly
[info] Loading project definition from /home/jinmfeng/wordCount/project
[info] Set current project to ScoobiWordCount (in build file:/home/jinmfeng/wordCount/)
[info] Including: scala-compiler.jar
[info] Including: scala-library.jar
..................
..................
[info] Strategy 'deduplicate' was applied to 3 files (Run the task at debug level to see details)
[warn] Strategy 'discard' was applied to 2 files
[warn] Strategy 'rename' was applied to 7 files
[info] SHA-1: 90d2a74d3034ab58287e1a861fb8f72f2a948c12
[info] Packaging /home/jinmfeng/wordCount/target/scala-2.10/ScoobiWordCount-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 50 s, completed Apr 25, 2014 11:24:38 AM
Notes: Most likely the build will fail at this step because of dependency conflicts. If so, adjust the libraryDependencies configuration or add a MergeStrategy to resolve them.
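As a sketch of what such a MergeStrategy customization might look like with sbt-assembly 0.10.x (the pattern below is illustrative; adapt it to whichever files actually conflict in your build), you could append the following to build.sbt, separated by a blank line:
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    // drop conflicting META-INF metadata files
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    // fall back to the default strategy for everything else
    case x => old(x)
  }
}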
Step 5: Run the jar on the Hadoop cluster. In my case, I ran it on my Hadoop 2.2.0 cluster. (This assumes a directory named input, containing some text files, already exists on HDFS.)
[jinmfeng@jimmy1-197873 ~]$hadoop jar ScoobiWordCount-assembly-1.0.jar input output
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.map.child.log.level is deprecated. Instead, use mapreduce.map.log.level
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.reduce.child.log.level is deprecated. Instead, use mapreduce.reduce.log.level
14/04/25 17:05:39 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
[INFO] ScoobiApp - the URL of Java (evidenced with the java.lang.String class) is jar:file:/home/jinmfeng/bin/jdk1.7.0_51/jre/lib/rt.jar!/java/lang/String.class
[INFO] ScoobiApp - the URL of Scala (evidenced with the scala.collection.immutable.Range class) is file:/home/jinmfeng/file:/home/jinmfeng/temp/hadoop-unjar7738534783898350189/scala/collection/immutable/Range.class
[INFO] ScoobiApp - the URL of Hadoop (evidenced with the org.apache.hadoop.io.Writable class) is jar:file:/home/jinmfeng/lib/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar!/org/apache/hadoop/io/Writable.class
[INFO] ScoobiApp - the URL of Avro (evidenced with the org.apache.avro.Schema class) is jar:file:/home/jinmfeng/lib/hadoop-2.2.0/share/hadoop/common/lib/avro-1.7.4.jar!/org/apache/avro/Schema.class
[INFO] ScoobiApp - the URL of Kiama (evidenced with the org.kiama.rewriting.Rewriter class) is file:/home/jinmfeng/file:/home/jinmfeng/temp/hadoop-unjar7738534783898350189/org/kiama/rewriting/Rewriter.class
[INFO] ScoobiApp - the URL of Scoobi (evidenced with the com.nicta.scoobi.core.ScoobiConfiguration class) is file:/home/jinmfeng/file:/home/jinmfeng/temp/hadoop-unjar7738534783898350189/com/nicta/scoobi/core/ScoobiConfiguration.class
[INFO] Sink - Output path: output
[INFO] Source - Input path: input (1.33 KiB)
[INFO] HadoopMode - ======================================================================
[INFO] HadoopMode - ===== START OF SCOOBI JOB 'WordCount$-0425-170539-1139641331' ========
[INFO] HadoopMode - ======================================================================
[INFO] HadoopMode - Executing layers
Layer(1
ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229) ,
GroupByKey (18)[String,Int] (bridge 9e16e) ,
Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)])
Mscr(1
inputs: + GbkInputChannel(Load (1)[String] (TextSource(1)
input
) )
mappers
ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)
last mappers
ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)
outputs: + GbkOutputChannel(GroupByKey (18)[String,Int] (bridge 9e16e) , combiner = Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)]))
[INFO] HadoopMode - Executing layer 1
Layer(1
ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229) ,
GroupByKey (18)[String,Int] (bridge 9e16e) ,
Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)])
[INFO] HadoopMode - executing map reduce jobs
Mscr(1
inputs: + GbkInputChannel(Load (1)[String] (TextSource(1)
input
) )
mappers
ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)
last mappers
ParallelDo (17)[String,(String,Int),((Unit,Unit),Unit)] (bridge d7229)
outputs: + GbkOutputChannel(GroupByKey (18)[String,Int] (bridge 9e16e) , combiner = Combine (19)[String,Int] (bridge fb8a1) [sinks: Some(output)]))
14/04/25 17:05:45 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.output.value.groupfn.class is deprecated. Instead, use mapreduce.job.output.group.comparator.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/04/25 17:05:46 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/04/25 17:05:46 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
14/04/25 17:05:47 INFO Configuration.deprecation: fs.checkpoint.edits.dir is deprecated. Instead, use dfs.namenode.checkpoint.edits.dir
14/04/25 17:05:47 INFO Configuration.deprecation: fs.checkpoint.dir is deprecated. Instead, use dfs.namenode.checkpoint.dir
14/04/25 17:05:47 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
14/04/25 17:05:47 INFO Configuration.deprecation: mapred.system.dir is deprecated. Instead, use mapreduce.jobtracker.system.dir
14/04/25 17:05:47 INFO Configuration.deprecation: mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class
14/04/25 17:05:47 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/04/25 17:05:47 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/04/25 17:05:47 INFO Configuration.deprecation: dfs.name.edits.dir is deprecated. Instead, use dfs.namenode.edits.dir
[INFO] MapReduceJob - Total input size: 1.33 KiB
[INFO] MapReduceJob - Number of reducers: 1
[INFO] RMProxy - Connecting to ResourceManager at Jimmy1/10.9.241.97:8032
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
[INFO] FileInputFormat - Total input paths to process : 1
[INFO] JobSubmitter - number of splits:1
14/04/25 17:06:02 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/04/25 17:06:02 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/04/25 17:06:02 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/04/25 17:06:02 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
[INFO] JobSubmitter - Submitting tokens for job: job_1397115520850_0020
[INFO] Job - The url to track the job: http://Jimmy1:8088/proxy/application_1397115520850_0020/
[INFO] MapReduceJob - MapReduce job 'job_1397115520850_0020' submitted. Please see http://Jimmy1:8088/proxy/application_1397115520850_0020/ for more info.
[INFO] MapReduceJob - Map 100% Reduce 0%
[INFO] MapReduceJob - Map 100% Reduce 100%
[INFO] HadoopMode - ===== END OF MAP-REDUCE JOB 1 of 1 (for the Scoobi job 'WordCount$-0425-170539-1139641331') ======
[INFO] HadoopMode - Layer sinks: List(TextFileSink: output)
[INFO] HadoopMode - ===== END OF LAYER 1 ======
[INFO] HadoopMode - ======================================================================
[INFO] HadoopMode - ===== END OF SCOOBI JOB 'WordCount$-0425-170539-1139641331' ========
[INFO] HadoopMode - ======================================================================
Step 6: Check the result:
[jinmfeng@jimmy1-197873 ~]$ hadoop fs -ls output
Found 2 items
-rw-r--r-- 2 jinmfeng supergroup 0 2014-04-25 17:06 output/_SUCCESS
-rw-r--r-- 2 jinmfeng supergroup 1574 2014-04-25 17:06 output/ch18out17-r-00000
[jinmfeng@jimmy1-197873 ~]$ hadoop fs -text output/ch18out17-r-00000
(,18)
((BIS),,1)
((ECCN),1)
((TSU),1)
((see,1)
(5D002.C.1,,1)
(740.13),1)
(<http://www.wassenaar.org/>,1)
........
Now you have successfully run your first Scoobi project. Following the same steps, you can go on to explore more of Scoobi...
[About the author]
Jimmy, software engineer, currently working at a large multinational Internet company, developing data products on a big data platform.