IntelliJ IDEA构建基于maven的spark+hbase工程（scala语言）

最新推荐文章于 2022-04-24 17:01:31 发布

烫烫烫口

最新推荐文章于 2022-04-24 17:01:31 发布

阅读量4.9k

点赞数 3

分类专栏：数据库文章标签： spark hbase idea scala

本文链接：https://blog.csdn.net/fzuzhanghao1993/article/details/78479889

版权

数据库专栏收录该内容

12 篇文章 0 订阅

订阅专栏

摘要

利用IDEA来编写基于maven的scala程序，主要功能用来支持从hbase中拉取数据供spark进行mapreduce运算。

软件准备

首先下载安装IntelliJ IDEA
https://www.jetbrains.com/idea/download/#section=windows

不需要javaee支持的话，直接选择Community版本就行了，毕竟免费，也足够支持maven,scala,git,spark,hbase了。

安装过程中选择scala支持

安装完成后，配置全局的maven，指定自己安装的maven也可以使用idea默认自带maven。

工程构建

新建maven project，类似eclipse的simple project，不需要其他附属，scala支持后续添加

新工程删除java文件夹，新建scala文件夹。

配置pom，因为需要编译scala，所以plugin选择maven-scala-plugin

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.test</groupId>
    <artifactId>LdSparkHbase</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <jdk.version>1.8</jdk.version>
        <logback.version>1.1.2</logback.version>
        <slf4j.version>1.7.7</slf4j.version>
        <junit.version>4.11</junit.version>
        <spark.version>2.1.0</spark.version>
        <hadoop.version>2.6.5</hadoop.version>
        <hbase.version>1.2.6</hbase.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>${hbase.version}</version>
            <type>pom</type>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.test.SparkCount</mainClass>
                                </transformer>
                            </transformers>

                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

工程右键，选择add framwork support，在打开的选项中添加scala支持
这里写图片描述

先看看工程能不能用，建议新建一个hello work类，run一下，看看工程构建是否正常。

新建SparkCount类，这里遍历采用官方demo，mapreduce自己修改

SparkCount

package com.test

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark._
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io._

object SparkCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("spark://testserverip:7077")
      .setAppName("reduce")
    val sc = new SparkContext(sparkConf)

    val tablename = "apos_status"
    val conf = HBaseConfiguration.create()
    //设置zooKeeper集群地址，也可以通过将hbase-site.xml导入classpath，但是建议在程序里这样设置
    conf.set("hbase.zookeeper.quorum", "localhost")
    //设置zookeeper连接端口，默认2181
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set(TableInputFormat.INPUT_TABLE, tablename)
    conf.set(TableInputFormat.SCAN_COLUMNS, "apos:type")


    //读取数据并转化成rdd
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    val count = hBaseRDD.count()
    println(count)
    hBaseRDD.foreach { case (_, result) => {
      //获取行键
      val key = Bytes.toString(result.getRow)
      //通过列族和列名获取列
      val typenames = Bytes.toString(result.getValue("apos".getBytes, "type".getBytes))
      if (key != null && typenames != null) {
        println(key + ":" + typenames);
      }

    }
    }
    println("map begin");
    val result = hBaseRDD.map(tuple=>Bytes.toString(tuple._2.getValue("apos".getBytes, "type".getBytes))).map(s=>(s,1)).reduceByKey((a,b)=>a+b)
    println("map end");

//最终结果写入hdfs，也可以写入hbase   result.saveAsTextFile("hdfs://localhost:9070/user/root/aposStatus-out")

//也可以选择写入hbase,写入配置
    var resultConf = HBaseConfiguration.create()
    //设置zooKeeper集群地址，也可以通过将hbase-site.xml导入classpath，但是建议在程序里这样设置
    resultConf.set("hbase.zookeeper.quorum", "localhost")
    //设置zookeeper连接端口，默认2181
    resultConf.set("hbase.zookeeper.property.clientPort", "2181")
    //注意这里是output
    resultConf.set(TableOutputFormat.OUTPUT_TABLE, "count-result")
    var job = Job.getInstance(resultConf)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[org.apache.hadoop.hbase.client.Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    val hbaseOut = result.map(tuple=>{
      val put = new Put(Bytes.toBytes(UUID.randomUUID().toString))
      put.addColumn(Bytes.toBytes("result"), Bytes.toBytes("type"), Bytes.toBytes(tuple._1))
      //直接写入整型会以十六进制存储
      put.addColumn(Bytes.toBytes("result"), Bytes.toBytes("count"), Bytes.toBytes(tuple._2+""))
      (new ImmutableBytesWritable, put)
    })
    hbaseOut.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()

  }

}