I. Development environment:
1. JDK 1.8
2. Scala 2.11.7
3. Hadoop 3.0.0, HBase 2.1.0, Spark 2.4.0
Operating system: CentOS 7.6; IDE: IntelliJ IDEA 2019
II. Implementation steps:
1. Add the HBase jars
Since I am using the CDH distribution here, the HBase jars are located at /opt/cloudera/parcels/CDH/lib/hbase/lib/.
Copy the jar files in that directory to a dedicated directory, for example:
# mkdir /data/lib/hbase
# cp /opt/cloudera/parcels/CDH/lib/hbase/lib/* /data/lib/hbase
2. Create a Scala project in IDEA and add the following Maven dependencies to the pom:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
  <hadoop.version>3.0.0</hadoop.version>
</properties>

<dependencyManagement>
  <dependencies>
    <!-- Camel BOM -->
    <dependency>
      <groupId>org.apache.camel</groupId>
      <artifactId>camel-parent</artifactId>
      <version>2.24.1</version>
      <scope>import</scope>
      <type>pom</type>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-core</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-scala</artifactId>
  </dependency>
  <!-- scala -->
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.7</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang.modules</groupId>
    <artifactId>scala-xml_2.11</artifactId>
    <version>1.0.6</version>
  </dependency>
  <!-- logging -->
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <scope>runtime</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <scope>runtime</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <scope>runtime</scope>
  </dependency>
  <!-- testing -->
  <dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-test</artifactId>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.6.6</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
  </dependency>
  <!--
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.1.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>2.1.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-common</artifactId>
    <version>2.1.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>2.1.0</version>
    <type>pom</type>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-mapreduce</artifactId>
    <version>2.1.0</version>
  </dependency>
  -->
  <dependency>
    <groupId>org.glassfish</groupId>
    <artifactId>javax.el</artifactId>
    <version>3.0.1-b08</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
      <exclusion>
        <artifactId>jackson-databind</artifactId>
        <groupId>com.fasterxml.jackson.core</groupId>
      </exclusion>
    </exclusions>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
      <exclusion>
        <artifactId>jackson-databind</artifactId>
        <groupId>com.fasterxml.jackson.core</groupId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.gdal</groupId>
    <artifactId>gdal</artifactId>
    <version>2.4.0</version>
  </dependency>
</dependencies>
Note: the Hadoop version declared here should match the version installed on the cluster as closely as possible, to avoid classpath conflicts at runtime.
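To see which Hadoop version is actually installed (and therefore what to put into the hadoop.version property), you can simply query the Hadoop CLI on a cluster node, for example:
# hadoop version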
3. Add the HBase jars to IDEA
Open File -> Project Structure and add the jar directory created in step 1 as a project library.
4. Add the Maven assembly plugin. This step is optional; if the packaging tool built into IDEA does not produce a usable jar, you can package with this plugin instead:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.5.5</version>
  <configuration>
    <archive>
      <manifest>
        <mainClass>**this is your mainclass**</mainClass>
      </manifest>
    </archive>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assemble</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
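With this plugin configured, the fat jar can be built from the command line (a sketch, assuming the standard Maven project layout):
# mvn clean package
The assembly plugin writes the result to target/<artifactId>-<version>-jar-with-dependencies.jar; that is the jar to pass to spark-submit after step 5.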
5. Write the code
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.SparkSession

object HbaseOpe {

  def writeTxtToHbase(): Unit = {
    val spark = SparkSession.builder().appName("SparkHBaseRDD").getOrCreate()
    val sc = spark.sparkContext

    // Target HBase table (namespace:table); it must already exist with column family "static".
    val tablename = "tb:table3"
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tablename)
    val jobConf = new JobConf(hbaseConf)
    jobConf.setOutputFormat(classOf[TableOutputFormat])

    // Input text file on HDFS: one comma-separated record per line,
    // rowkey,col1,col2,col3,col4,col5,col6,col7
    val txtpath = "/ZF/2018001.txt"
    val txtRdd = sc.textFile(txtpath)

    txtRdd.map(_.split(",")).map(arr => {
      // The first field is the row key; the remaining fields go into column family "static".
      val put = new Put(Bytes.toBytes(arr(0)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col1"), Bytes.toBytes(arr(1)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col2"), Bytes.toBytes(arr(2)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col3"), Bytes.toBytes(arr(3)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col4"), Bytes.toBytes(arr(4)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col5"), Bytes.toBytes(arr(5)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col6"), Bytes.toBytes(arr(6)))
      put.addColumn(Bytes.toBytes("static"), Bytes.toBytes("col7"), Bytes.toBytes(arr(7)))
      (new ImmutableBytesWritable, put)
    }).saveAsHadoopDataset(jobConf)

    spark.stop()
  }

  def main(args: Array[String]): Unit = {
    writeTxtToHbase()
  }
}
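Note that the code assumes the namespace tb, the table table3 and the column family static already exist. They can be created in the hbase shell (create_namespace 'tb', then create 'tb:table3', 'static') or programmatically; the following is only a minimal sketch using the HBase Admin API, assuming hbase-site.xml is on the classpath:

import org.apache.hadoop.hbase.{HBaseConfiguration, NamespaceDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

object CreateHbaseTable {
  def main(args: Array[String]): Unit = {
    // Reads hbase-site.xml from the classpath for the ZooKeeper quorum etc.
    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    val admin = conn.getAdmin
    try {
      // Create the "tb" namespace if it does not exist yet.
      if (!admin.listNamespaceDescriptors().exists(_.getName == "tb")) {
        admin.createNamespace(NamespaceDescriptor.create("tb").build())
      }
      // Create tb:table3 with the single column family "static".
      val table = TableName.valueOf("tb:table3")
      if (!admin.tableExists(table)) {
        val descriptor = TableDescriptorBuilder.newBuilder(table)
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("static"))
          .build()
        admin.createTable(descriptor)
      }
    } finally {
      admin.close()
      conn.close()
    }
  }
}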
Submit the job with spark-submit. In my test, writing 10,000,000 records of 7 columns each on three hosts took about 4 minutes in total.
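For reference, the submit command looks roughly like this (the jar name is only an example; --jars points at the HBase jar directory from step 1 so the driver and executors can load the HBase classes):
# spark-submit --master yarn --class HbaseOpe \
    --jars $(echo /data/lib/hbase/*.jar | tr ' ' ',') \
    your-project-1.0-jar-with-dependencies.jar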