Big Data with Spark: Operating on an HBase Database from Spark

sxjlinux

Article link: https://blog.csdn.net/sunxiaoju/article/details/102134914

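I. Reading from the HBase Database

1. First, insert some data into the database. For how to do this, see: https://blog.csdn.net/sunxiaoju/article/details/101908533

2. Then open IDEA, create a Maven project, and fill in pom.xml with the following content: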

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark-operate-hbase</groupId>
    <artifactId>sparkoperatehbase</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.0.4</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>SparkOperateHBase</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

3. Create a new Scala source file, SparkOperateHBase.scala, with the following content:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object SparkOperateHBase {
  def main(args: Array[String]): Unit = {
    // Create an HBase configuration; it carries, among other things, the name of the table to read
    val conf = HBaseConfiguration.create()
    // Create the SparkContext (the master is left to spark-submit)
    val sc = new SparkContext(new SparkConf().setAppName("SparkOperateHBase"))
    // Tell TableInputFormat which HBase table to scan
    conf.set(TableInputFormat.INPUT_TABLE, "student")
    // Read the table into stuRdd as (ImmutableBytesWritable, Result) key-value pairs
    val stuRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    // Mark the RDD for caching before the first action, so the count below fills the cache
    stuRdd.cache()
    // Count the rows that were read
    val count = stuRdd.count()
    println("Students RDD Count:" + count)
    // Iterate over stuRdd
    stuRdd.foreach {
      case (_, result) => // _ is the ImmutableBytesWritable key, result is the Result of one row
        val key = Bytes.toString(result.getRow) // row key
        val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))) // name column of the info family
        val gender = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("gender"))) // gender column of the info family
        val age = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"))) // age column of the info family
        println("Row key:" + key + " Name:" + name + " Gender:" + gender + " Age:" + age)
    }
    sc.stop()
  }
}
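
One detail worth keeping in mind when reading rows this way: Result.getValue returns null when a row has no cell for the requested column, and Bytes.toString passes that null through, so the printed line would show "null" for the missing field. The sketch below illustrates one way to make that case explicit with Option. It is only an illustration, not part of the original program: the helper names decode and formatRow are hypothetical, the sample values are made up, and plain JDK UTF-8 decoding stands in for Bytes.toString so that the snippet runs without HBase on the classpath.

```scala
import java.nio.charset.StandardCharsets

object NullSafeDecode {
  // Hypothetical helper: wraps a possibly-null byte array in an Option and
  // decodes it as UTF-8 (a plain-JDK stand-in for Bytes.toString).
  def decode(bytes: Array[Byte]): Option[String] =
    Option(bytes).map(b => new String(b, StandardCharsets.UTF_8))

  // Hypothetical helper: formats one row in the same layout the job prints,
  // substituting a placeholder for any column that is missing.
  def formatRow(key: Array[Byte], name: Array[Byte], gender: Array[Byte], age: Array[Byte]): String = {
    def show(b: Array[Byte]): String = decode(b).getOrElse("<missing>")
    "Row key:" + show(key) + " Name:" + show(name) + " Gender:" + show(gender) + " Age:" + show(age)
  }

  def main(args: Array[String]): Unit = {
    val utf8 = StandardCharsets.UTF_8
    // Made-up row whose age cell is absent; null stands in for what getValue would return
    println(formatRow("1".getBytes(utf8), "Alice".getBytes(utf8), "F".getBytes(utf8), null))
  }
}
```

Inside the foreach, the same idea would amount to wrapping each result.getValue(...) call in Option before decoding it.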

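4. Then package the project with mvn package, as shown in the figure below:

5. Then submit it with spark-submit target/sparkoperatehbase-1.0-SNAPSHOT-jar-with-dependencies.jar, as shown in the figure below:

6. The results are then printed in the terminal, as shown in the figure below:

II. Writing Records to HBase

1. Create a new file, SparkWriteHBase.scala, with the following code: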

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SparkWriteHBase {
  def getStuRdd(sc: SparkContext, tablename: String): RDD[(ImmutableBytesWritable, Result)] = {
    // Create an HBase configuration; it carries, among other things, the name of the table to read
    val conf = HBaseConfiguration.create()
    // Tell TableInputFormat which HBase table to scan
    conf.set(TableInputFormat.INPUT_TABLE, tablename)
    // Read the table as an RDD of (ImmutableBytesWritable, Result) key-value pairs
    sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  }

  def getCount(sc: SparkContext, tablename: String): Long = {
    // Count the rows currently in the table
    getStuRdd(sc, tablename).count()
  }

  def printHBase(sc: SparkContext, tablename: String): Unit = {
    val stuRdd = getStuRdd(sc, tablename)
    // Mark the RDD for caching before the first action, so the count below fills the cache
    stuRdd.cache()
    // Count the rows that were read
    val count = stuRdd.count()
    println("Students RDD Count:" + count)
    // Iterate over stuRdd
    stuRdd.foreach {
      case (_, result) => // _ is the ImmutableBytesWritable key, result is the Result of one row
        val key = Bytes.toString(result.getRow) // row key
        val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))) // name column of the info family
        val gender = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("gender"))) // gender column of the info family
        val age = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"))) // age column of the info family
        println("Row key:" + key + " Name:" + name + " Gender:" + gender + " Age:" + age)
    }
  }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkWriteHBase").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val tablename = "student"
    // Tell TableOutputFormat which table to write to
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)
    // Create a job from the configuration (Job.getInstance replaces the deprecated constructor)
    val job = Job.getInstance(sc.hadoopConfiguration)
    // Set the output key type
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    // Set the output value type; the RDD built below emits Put objects
    job.setOutputValueClass(classOf[Put])
    // Set the output format to the HBase table output format
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    // Get the number of rows already in the table; it is used to derive the new row keys
    val count = getCount(sc, tablename)
    // Build an RDD of two lines in the form "rowKey,name,gender,age"
    val indataRdd = sc.makeRDD(Array("%d,Rongcheng%d,M,26".format(count + 1, count + 1), "%d,Guanhua%d,M,27".format(count + 2, count + 2)))
    // Split each line on commas into an array, then map each array to a Put
    val rdd = indataRdd.map(_.split(",")).map { arr =>
      val put = new Put(Bytes.toBytes(arr(0))) // create a row whose row key is arr(0)
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(arr(1))) // add the name column to the row
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes(arr(2))) // add the gender column to the row
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(arr(3))) // add the age column to the row
      (new ImmutableBytesWritable, put) // emit an (ImmutableBytesWritable, Put) pair
    }
    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration) // write all the Puts to HBase
    printHBase(sc, tablename) // print the table contents
    sc.stop()
  }
}
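
The write path above builds its input as comma-separated lines and indexes the split array directly, which assumes every line has exactly four fields. The following sketch shows how the two sample lines are produced from the current row count and how a line could be validated before a Put is built from it. It is a minimal illustration in plain Scala with no Spark or HBase dependency; StudentRecord, sampleLines, and parse are hypothetical names that do not appear in the original program.

```scala
object StudentLines {
  // Hypothetical record type for one parsed input line
  final case class StudentRecord(rowKey: String, name: String, gender: String, age: String)

  // Builds the two sample lines the same way main does, given the current row count
  def sampleLines(count: Long): Seq[String] = Seq(
    "%d,Rongcheng%d,M,26".format(count + 1, count + 1),
    "%d,Guanhua%d,M,27".format(count + 2, count + 2)
  )

  // Parses "rowKey,name,gender,age"; returns None unless the line has
  // exactly four non-empty fields
  def parse(line: String): Option[StudentRecord] =
    line.split(",", -1).map(_.trim) match {
      case Array(k, n, g, a) if Seq(k, n, g, a).forall(_.nonEmpty) =>
        Some(StudentRecord(k, n, g, a))
      case _ => None
    }

  def main(args: Array[String]): Unit =
    // Assume the table currently holds 2 rows
    sampleLines(2L).foreach(line => println(line + " -> " + parse(line)))
}
```

In the Spark job, one possible use would be indataRdd.flatMap(line => parse(line)) ahead of the map that creates the Put objects, so that malformed lines are dropped instead of failing with an index error.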

2. The pom.xml file is as follows (identical to the one in Part I except that mainClass is now SparkWriteHBase):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark-operate-hbase</groupId>
    <artifactId>sparkoperatehbase</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>2.0.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.0.4</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>SparkWriteHBase</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

3. Then run mvn package to build the jar.

4. Submit it with: spark-submit target/sparkoperatehbase-1.0-SNAPSHOT-jar-with-dependencies.jar. The first run is shown in the figure below: