Spark + HBase

Software environment: Spark 2.3.1 + HBase 2.0.1

Here we use the Hortonworks Spark HBase Connector (SHC).

1. Download the SHC source code from GitHub

2. Open the source in IDEA and download the dependencies

Some of the dependencies come from Hortonworks and cannot be downloaded from the Maven central repository, so I made a small change to Maven's settings.xml, adding the following mirror:

<mirror>
  <id>aa</id>
  <mirrorOf>*</mirrorOf>
  <!--url>http://repo.hortonworks.com/content/groups/public/</url-->
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>

First use the Aliyun mirror to pull down most of the dependencies, then switch the mirror URL to the Hortonworks repository for the ones Aliyun cannot serve.

3. Package and test

Many of the project's dependencies cannot be found on our classpath, so I packaged with maven-assembly to extract all of the jars SHC depends on, then attached them with spark-submit --class xxx --jars. The submit command is below (some of the jars may be redundant):
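As a sketch of the maven-assembly step: a dependency-set descriptor along the following lines copies every dependency jar into a corelib/ directory. The descriptor id and output directory here are assumptions, chosen only to match the corelib/ paths used in the submit command below; adapt them to your own build.

```xml
<!-- Hypothetical assembly descriptor (e.g. src/assembly/deps.xml),
     referenced from the maven-assembly-plugin configuration in pom.xml. -->
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0">
  <id>deps</id>
  <!-- "dir" writes the jars to a plain directory instead of an archive -->
  <formats>
    <format>dir</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <!-- collect every runtime dependency jar under corelib/ -->
      <outputDirectory>corelib</outputDirectory>
      <useProjectArtifact>false</useProjectArtifact>
      <scope>runtime</scope>
    </dependencySet>
  </dependencySets>
</assembly>
```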

/usr/local/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.sql.execution.datasources.hbase.examples.CompositeKey \
--master local \
--files /usr/local/hbase-2.1.0/conf/hbase-site.xml \
--jars corelib/activation-1.1.jar,corelib/aircompressor-0.8.jar,corelib/antlr4-runtime-4.7.jar,corelib/antlr-runtime-3.5.2.jar,corelib/aopalliance-1.0.jar,corelib/aopalliance-repackaged-2.5.0-b32.jar,corelib/apacheds-i18n-2.0.0-M15.jar,corelib/apacheds-kerberos-codec-2.0.0-M15.jar,corelib/api-asn1-api-1.0.0-M20.jar,corelib/api-util-1.0.0-M20.jar,corelib/arrow-format-0.8.0.jar,corelib/arrow-memory-0.8.0.jar,corelib/arrow-vector-0.8.0.jar,corelib/asm-all-5.0.2.jar,corelib/audience-annotations-0.5.0.jar,corelib/avro-1.7.6.jar,corelib/avro-ipc-1.7.7.jar,corelib/avro-ipc-1.7.7-tests.jar,corelib/avro-mapred-1.7.7-hadoop2.jar,corelib/chill_2.11-0.8.4.jar,corelib/chill-java-0.8.4.jar,corelib/commons-beanutils-core-1.8.0.jar,corelib/commons-cli-1.2.jar,corelib/commons-codec-1.10.jar,corelib/commons-collections-3.2.2.jar,corelib/commons-compiler-3.0.8.jar,corelib/commons-compress-1.4.1.jar,corelib/commons-configuration-1.6.jar,corelib/commons-crypto-1.0.0.jar,corelib/commons-csv-1.0.jar,corelib/commons-daemon-1.0.13.jar,corelib/commons-digester-1.8.jar,corelib/commons-httpclient-3.1.jar,corelib/commons-io-2.5.jar,corelib/commons-lang-2.6.jar,corelib/commons-lang3-3.6.jar,corelib/commons-logging-1.1.1.jar,corelib/commons-math3-3.6.1.jar,corelib/commons-net-3.1.jar,corelib/compress-lzf-1.0.3.jar,corelib/curator-client-2.7.1.jar,corelib/curator-framework-2.7.1.jar,corelib/curator-recipes-2.7.1.jar,corelib/disruptor-3.3.6.jar,corelib/fastutil-6.5.6.jar,corelib/findbugs-annotations-1.3.9-1.jar,corelib/flatbuffers-1.2.0-3f79e055.jar,corelib/gson-2.2.4.jar,corelib/guava-13.0.1.jar,corelib/guice-3.0.jar,corelib/guice-assistedinject-3.0.jar,corelib/guice-servlet-3.0.jar,corelib/hadoop-annotations-2.7.4.jar,corelib/hadoop-auth-2.7.4.jar,corelib/hadoop-client-2.6.5.jar,corelib/hadoop-common-2.7.4.jar,corelib/hadoop-distcp-2.7.4.jar,corelib/hadoop-hdfs-2.7.4.jar,corelib/hadoop-mapreduce-client-app-2.7.4.jar,corelib/hadoop-mapreduce-client-common-2.7.4.jar,corelib/hadoop-mapreduce-client-core-2.7.4.jar,corelib/hadoop-mapreduce-client-jobclient-2.7.4.jar,corelib/hadoop-mapreduce-client-shuffle-2.7.4.jar,corelib/hadoop-yarn-api-2.7.4.jar,corelib/hadoop-yarn-client-2.7.4.jar,corelib/hadoop-yarn-common-2.7.4.jar,corelib/hadoop-yarn-server-common-2.7.4.jar,corelib/hamcrest-core-1.3.jar,corelib/hbase-annotations-2.0.0-beta-1.jar,corelib/hbase-client-2.0.1.jar,corelib/hbase-common-2.0.1.jar,corelib/hbase-common-2.0.1-tests.jar,corelib/hbase-hadoop2-compat-2.0.1.jar,corelib/hbase-hadoop-compat-2.0.1.jar,corelib/hbase-http-2.0.1.jar,corelib/hbase-mapreduce-2.0.1.jar,corelib/hbase-metrics-2.0.1.jar,corelib/hbase-metrics-api-2.0.1.jar,corelib/hbase-procedure-2.0.1.jar,corelib/hbase-protocol-2.0.1.jar,corelib/hbase-protocol-shaded-2.0.1.jar,corelib/hbase-replication-2.0.1.jar,corelib/hbase-server-2.0.1.jar,corelib/hbase-shaded-miscellaneous-2.1.0.jar,corelib/hbase-shaded-netty-2.1.0.jar,corelib/hbase-shaded-protobuf-2.1.0.jar,corelib/hbase-zookeeper-2.0.1.jar,corelib/hk2-api-2.5.0-b32.jar,corelib/hk2-locator-2.5.0-b32.jar,corelib/hk2-utils-2.5.0-b32.jar,corelib/hppc-0.7.2.jar,corelib/htrace-core-3.2.0-incubating.jar,corelib/htrace-core4-4.2.0-incubating.jar,corelib/httpclient-4.0.1.jar,corelib/httpcore-4.0.1.jar,corelib/i18n-util-1.0.1.jar,corelib/icu4j-60.1.jar,corelib/icu4j-charset-60.1.jar,corelib/icu4j-localespi-60.1.jar,corelib/ivy-2.4.0.jar,corelib/jackson-annotations-2.9.0.jar,corelib/jackson-core-2.9.2.jar,corelib/jackson-core-asl-1.9.2.jar,corelib/jackson-databind-2.9.2.jar,corelib/jackson-jaxrs-1.8.3.jar,corelib/jackson-mapper-asl-1.9.2.jar,corelib/jackson-module-paranamer-2.7.9.jar,corelib/jackson-module-scala_2.11-2.6.7.1.jar,corelib/jackson-xc-1.8.3.jar,corelib/jamon-runtime-2.4.1.jar,corelib/janino-3.0.8.jar,corelib/javassist-3.20.0-GA.jar,corelib/javax.annotation-api-1.2.jar,corelib/javax.el-3.0.1-b10.jar,corelib/javax.inject-1.jar,corelib/javax.inject-2.5.0-b32.jar,corelib/java-xmlbuilder-0.4.jar,corelib/javax.servlet-api-3.1.0.jar,corelib/javax.servlet.jsp-2.3.2.jar,corelib/javax.servlet.jsp-api-2.3.1.jar,corelib/javax.ws.rs-api-2.0.1.jar,corelib/jaxb-api-2.2.2.jar,corelib/jaxb-impl-2.2.3-1.jar,corelib/jcip-annotations-1.0-1.jar,corelib/jcl-over-slf4j-1.7.16.jar,corelib/jcodings-1.0.18.jar,corelib/jdk.tools-1.8.jar,corelib/jersey-client-1.9.jar,corelib/jersey-client-2.22.2.jar,corelib/jersey-common-2.22.2.jar,corelib/jersey-container-servlet-2.22.2.jar,corelib/jersey-container-servlet-core-2.22.2.jar,corelib/jersey-core-1.9.jar,corelib/jersey-guava-2.25.1.jar,corelib/jersey-guice-1.9.jar,corelib/jersey-json-1.9.jar,corelib/jersey-media-jaxb-2.25.1.jar,corelib/jersey-server-1.9.jar,corelib/jersey-server-2.22.2.jar,corelib/jets3t-0.9.0.jar,corelib/jettison-1.3.8.jar,corelib/jetty-6.1.26.jar,corelib/jetty-http-9.3.19.v20170502.jar,corelib/jetty-io-9.3.19.v20170502.jar,corelib/jetty-security-9.3.19.v20170502.jar,corelib/jetty-server-9.3.19.v20170502.jar,corelib/jetty-servlet-9.3.19.v20170502.jar,corelib/jetty-sslengine-6.1.26.jar,corelib/jetty-util-6.1.26.jar,corelib/jetty-util-9.3.19.v20170502.jar,corelib/jetty-util-ajax-9.3.19.v20170502.jar,corelib/jetty-webapp-9.3.19.v20170502.jar,corelib/jetty-xml-9.3.19.v20170502.jar,corelib/jline-2.11.jar,corelib/joda-time-1.6.jar,corelib/joni-2.1.2.jar,corelib/jsch-0.1.54.jar,corelib/json4s-ast_2.11-3.2.11.jar,corelib/json4s-core_2.11-3.2.11.jar,corelib/json4s-jackson_2.11-3.2.11.jar,corelib/jsr305-2.0.1.jar,corelib/jul-to-slf4j-1.7.16.jar,corelib/junit-4.12.jar,corelib/kryo-shaded-3.0.3.jar,corelib/leveldbjni-all-1.8.jar,corelib/libthrift-0.9.0.jar,corelib/log4j-1.2.17.jar,corelib/lz4-java-1.4.0.jar,corelib/metrics-core-3.2.1.jar,corelib/metrics-graphite-3.1.5.jar,corelib/metrics-json-3.1.5.jar,corelib/metrics-jvm-3.1.5.jar,corelib/minlog-1.3.0.jar,corelib/netty-3.9.9.Final.jar,corelib/netty-all-4.0.23.Final.jar,corelib/objenesis-2.1.jar,corelib/orc-core-1.4.4-nohive.jar,corelib/orc-mapreduce-1.4.4-nohive.jar,corelib/oro-2.0.8.jar,corelib/osgi-resource-locator-1.0.1.jar,corelib/paranamer-2.3.jar,corelib/parquet-column-1.8.3.jar,corelib/parquet-common-1.8.3.jar,corelib/parquet-encoding-1.8.3.jar,corelib/parquet-format-2.3.1.jar,corelib/parquet-hadoop-1.8.3.jar,corelib/parquet-jackson-1.8.3.jar,corelib/phoenix-core-5.0.0-alpha-HBase-2.0.jar,corelib/protobuf-java-2.5.0.jar,corelib/py4j-0.10.7.jar,corelib/pyrolite-4.13.jar,corelib/RoaringBitmap-0.5.11.jar,corelib/scala-compiler-2.11.0.jar,corelib/scala-library-2.11.8.jar,corelib/scalap-2.11.0.jar,corelib/scala-parser-combinators_2.11-1.0.4.jar,corelib/scala-reflect-2.11.8.jar,corelib/servlet-api-2.5.jar,corelib/shc-core-1.1.3-2.3-s_2.11.jar,corelib/slf4j-api-1.7.25.jar,corelib/slf4j-log4j12-1.7.25.jar,corelib/snappy-0.3.jar,corelib/snappy-java-1.0.5.jar,corelib/spark-catalyst_2.11-2.3.1.jar,corelib/spark-core_2.11-2.3.1.jar,corelib/spark-kvstore_2.11-2.3.1.jar,corelib/spark-launcher_2.11-2.3.1.jar,corelib/spark-network-common_2.11-2.3.1.jar,corelib/spark-network-shuffle_2.11-2.3.1.jar,corelib/spark-sketch_2.11-2.3.1.jar,corelib/spark-sql_2.11-2.3.1.jar,corelib/spark-tags_2.11-2.3.1.jar,corelib/spark-unsafe_2.11-2.3.1.jar,corelib/sqlline-1.2.0.jar,corelib/stax-api-1.0-2.jar,corelib/stream-2.7.0.jar,corelib/tephra-api-0.13.0-incubating.jar,corelib/tephra-core-0.13.0-incubating.jar,corelib/tephra-hbase-compat-1.3-0.13.0-incubating.jar,corelib/twill-api-0.8.0.jar,corelib/twill-common-0.8.0.jar,corelib/twill-core-0.8.0.jar,corelib/twill-discovery-api-0.8.0.jar,corelib/twill-discovery-core-0.8.0.jar,corelib/twill-zookeeper-0.8.0.jar,corelib/univocity-parsers-2.5.9.jar,corelib/unused-1.0.0.jar,corelib/validation-api-1.1.0.Final.jar,corelib/xbean-asm5-shaded-4.4.jar,corelib/xmlenc-0.52.jar,corelib/xz-1.0.jar,corelib/zookeeper-3.4.10.jar,corelib/zstd-jni-1.3.2-2.jar \
shc-examples-1.1.3-2.3-s_2.11.jar

 

4. Test source code

package org.apache.spark.sql.execution.datasources.hbase.examples

import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

case class HBaseCompositeRecord(
    col00: String,
    col01: String,
    col1: String,
    col2: String,
    col3: String,
    col4: String,
    col5: String,
    col6: String,
    col7: String,
    col8: String)

object HBaseCompositeRecord {
  def apply(i: Int): HBaseCompositeRecord = {
    val s = i.toString
    HBaseCompositeRecord(s"row${"%03d".format(i)}",
      s,
      s,
      s,
      if(i%2==0) s else null,
      s,
      s,
      s,
      s,
      s)
  }
}

object CompositeKey {
  def cat = s"""{
                    |"table":{"namespace":"jason", "name":"ppp", "tableCoder":"PrimitiveType"},
                    |"rowkey":"key1:key2",
                    |"columns":{
                      |"col00":{"cf":"rowkey", "col":"key1", "type":"string", "length":"6"},
                      |"col01":{"cf":"rowkey", "col":"key2", "type":"string"},
                      |"col1":{"cf":"cf1", "col":"col1", "type":"string"},
                      |"col2":{"cf":"cf2", "col":"col2", "type":"string"},
                      |"col3":{"cf":"cf3", "col":"col3", "type":"string"},
                      |"col4":{"cf":"cf4", "col":"col4", "type":"string"},
                      |"col5":{"cf":"cf5", "col":"col5", "type":"string"},
                      |"col6":{"cf":"cf6", "col":"col6", "type":"string"},
                      |"col7":{"cf":"cf7", "col":"col7", "type":"string"},
                      |"col8":{"cf":"cf8", "col":"col8", "type":"string"}
                      |}
                    |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CompositeKeyExample")
      .getOrCreate()

    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext

    import sqlContext.implicits._

    def withCatalog(cat: String): DataFrame = {
      sqlContext
        .read
        .options(Map(HBaseTableCatalog.tableCatalog->cat))
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .load()
    }

    // populate the table with composite-key rows
    val data = (0 to 255).map { i =>
      HBaseCompositeRecord(i)
    }
    sc.parallelize(data).toDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .mode(SaveMode.Overwrite)
      .save()

    // full scan of the table
    val df = withCatalog(cat)
    df.show
    spark.stop()
  }
}

5. Understanding the catalog

1) "table":{"namespace":"jason", "name":"ppp", "tableCoder":"PrimitiveType"}

Specifies the HBase namespace (roughly analogous to a database) and the table name: namespace is the namespace, name is the table name.

2) "rowkey":"key1:key2"

Specifies the HBase row key; here key1 and key2 are concatenated to form each row's key.
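The concatenation can be sketched in plain Scala. This is only an illustration of the key layout implied by the catalog (a fixed-width key1 declared with "length":"6", followed by key2 as the final, variable-length part); it is not SHC's actual encoder.

```scala
object CompositeKeySketch {
  // key1 is declared with "length":"6" in the catalog, so it occupies a
  // fixed-width 6-byte prefix; key2, the last key part, may vary in length.
  def compositeKey(key1: String, key2: String): Array[Byte] = {
    require(key1.getBytes("UTF-8").length == 6, "key1 must be exactly 6 bytes")
    key1.getBytes("UTF-8") ++ key2.getBytes("UTF-8")
  }

  def main(args: Array[String]): Unit = {
    // "row005" (6 bytes) + "5" -> the stored row-key bytes for record 5
    println(new String(compositeKey("row005", "5"), "UTF-8"))  // row0055
  }
}
```

Fixing the width of every key part except the last is what makes the concatenated key unambiguous to split back into its components.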

3) "col00":{"cf":"rowkey", "col":"key1", "type":"string", "length":"6"},
   "col01":{"cf":"rowkey", "col":"key2", "type":"string"},
   "col1":{"cf":"cf1", "col":"col1", "type":"string"},

col00, col01, and col1 are the corresponding DataFrame column names; length is the length of key1 (6 characters); cf is the column family; col is the column name stored in HBase; and type is the data type. HBase can store any data that can be converted to bytes, including serialized objects.

As shown above, the HBase row key is formed by concatenating col00 and col01 from the DataFrame.
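The "anything convertible to bytes" point can be illustrated with a minimal round-trip of a Double through a byte array using plain java.nio; a catalog type coder does something analogous for each typed column (this sketch is not SHC's implementation):

```scala
import java.nio.ByteBuffer

object BytesRoundTrip {
  // Serialize a Double to 8 bytes, the form a cell value takes in HBase.
  def toBytes(d: Double): Array[Byte] =
    ByteBuffer.allocate(java.lang.Double.BYTES).putDouble(d).array()

  // Read the Double back out of the byte array.
  def fromBytes(b: Array[Byte]): Double =
    ByteBuffer.wrap(b).getDouble

  def main(args: Array[String]): Unit = {
    println(fromBytes(toBytes(3.14)))  // 3.14
  }
}
```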

 

Reposted from: https://www.cnblogs.com/jason-dong/p/9707618.html
