Big Data Study Notes (3): Installing Kudu, Impala, and Spark 2.3.4 on Ubuntu 16.04 Server, and Accessing HBase from Scala with Spark RDDs

1. Kudu installation (it is recommended to do the whole installation as root)

Under /etc/apt/sources.list.d, first back up and remove ambari-hdp1.list and any other HDP-related repository files,
then create a new file cloudera.list with the following content:

# Packages for Cloudera's Distribution for Hadoop, Version 5, on Ubuntu 16.04 amd64       
deb [arch=amd64] http://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu xenial-kudu5 contrib
deb-src http://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu xenial-kudu5 contrib
Run the following four commands:
#>cd /opt
#>wget https://archive.cloudera.com/kudu/ubuntu/xenial/amd64/kudu/archive.key -O archive.key
#>sudo apt-key add archive.key
#>apt-get update

 

 

Install online as root:
apt-get install kudu # Base Kudu files
apt-get install kudu-master # Service scripts for managing kudu-master
apt-get install kudu-tserver # Service scripts for managing kudu-tserver
apt-get install libkuduclient0 # Kudu C++ client shared library
apt-get install libkuduclient-dev # Kudu C++ client SDK

1.1 Start the services
sudo service kudu-master start
sudo service kudu-tserver start

1.2 Check the web UI
Open http://localhost:8051/ in a browser.

The Kudu installed above is version 1.4.

Reference links: https://blog.csdn.net/weijiasheng/article/details/104796332

https://docs.cloudera.com/documentation/kudu/5-12-x/topics/kudu_installation.html

2. Impala installation (it installs, but I could not get it to run; this cost too much time, so I suggest experimenting on CentOS instead)

Add the following to cloudera.list:

deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala2 contrib
deb-src http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala2 contrib


apt-get install bigtop-utils
apt-get install impala
apt-get install impala-server
apt-get install impala-state-store
apt-get install impala-catalog
apt-get install python-setuptools
apt-get install impala-shell


# cd /usr/local/hadoop-2.0.0-cdh4.1.0/etc/hadoop/
# cp core-site.xml hdfs-site.xml /etc/impala/conf
# cd /etc/impala/conf
# vi hdfs-site.xml

Add:

<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>
<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.client.use.legacy.blockreader.local</name>
    <value>true</value>
</property>
<property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>750</value>
</property>
<property>
    <name>dfs.block.local-path-access.user</name>
    <value>impala</value>
</property>
<property>
    <name>dfs.client.file-block-storage-locations.timeout</name>
    <value>3000</value>
</property>


vim /etc/default/impala

Add the following content:
IMPALA_CATALOG_SERVICE_HOST=127.0.0.1
IMPALA_STATE_STORE_HOST=127.0.0.1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_ARGS=" -log_dir=${IMPALA_LOG_DIR} "
IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -use_statestore \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -be_port=${IMPALA_BACKEND_PORT} "

ENABLE_CORE_DUMPS=false

# LIBHDFS_OPTS=-Djava.library.path=/usr/lib/impala/lib
MYSQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java-8.0.13.jar
IMPALA_BIN=/usr/lib/impala/sbin
IMPALA_HOME=/usr/lib/impala
# HIVE_HOME=/usr/lib/hive
# HBASE_HOME=/usr/lib/hbase
IMPALA_CONF_DIR=/etc/impala/conf
HADOOP_CONF_DIR=/etc/impala/conf
HIVE_CONF_DIR=/etc/impala/conf

Delete the setuptools packages shipped under dist-packages:
cd /usr/lib/python2.7/dist-packages/
rm -rf setuptools*

service impala-state-store restart --kudu_master_hosts=master:7051
service impala-catalog restart --kudu_master_hosts=master:7051
service impala-server restart --kudu_master_hosts=master:7051

Two days were spent on this and it all failed; Impala failed to take port 21000. I will look into the exact cause later when I have time.

I recommend using CentOS to install impala-kudu instead; see: https://www.51dev.com/javascript/14324

3. Install Spark. For installing Spark 2.4.5, refer to a post I wrote earlier; 2.3.4 is similar:

https://blog.csdn.net/penker_zhao/article/details/102568564

Download: https://archive.apache.org/dist/spark/spark-2.3.4/
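After unpacking Spark, a quick sanity check is to run a trivial job in spark-shell. A minimal sketch, assuming the tarball was extracted to /opt/spark-2.3.4-bin-hadoop2.7 (the path used again further below):

// run inside /opt/spark-2.3.4-bin-hadoop2.7/bin/spark-shell, where sc is the SparkContext created by the shell
val nums = sc.parallelize(1 to 100)
println(nums.reduce(_ + _))  // should print 5050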

4. Using Scala and hbase-spark to access HBase

Scala version 2.11.12

Spark 2.3.4

HBase 2.0.2

4.1 Enter the client with ./hbase shell and create a table

hbase(main):019:0>create 'Student','info'

4.2 Create a Scala Maven project in IDEA

Install the Scala plugin

Configure the Scala SDK

For details see the links below (a quick smoke test follows them):

https://blog.csdn.net/hr786250678/article/details/86229959

https://www.cnblogs.com/chuhongyun/p/11400884.html

https://www.cnblogs.com/wangjianwei/articles/9722234.html
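Before adding the Spark and HBase code, it is worth confirming that the Scala SDK and the scala-maven-plugin are wired up correctly. A minimal smoke test (the object name HelloScala is illustrative only and is not part of the project below):

package com.example

// If this compiles with mvn compile and prints a version when run,
// the Scala toolchain of the project is set up correctly.
object HelloScala extends App {
  println("Scala library version: " + scala.util.Properties.versionNumberString)
}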

The code and the project layout are as follows.

The pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>scalahbasetest</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <hadoop.version>2.7.5</hadoop.version>
    <spark.version>2.3.4</spark.version>
    <scala.version>2.11.12</scala.version>
    <junit.version>4.12</junit.version>
    <netty.version>4.1.42.Final</netty.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>${junit.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>

    <!-- Spark core dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase.connectors.spark</groupId>
      <artifactId>hbase-spark</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
      <!--
          java.lang.NoClassDefFoundError: org/apache/spark/streaming/dstream/DStream
      -->
      <!-- <scope>provided</scope>-->
    </dependency>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-all</artifactId>
      <version>${netty.version}</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <!-- Plugin that compiles Scala -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.4.6</version>
      </plugin>
      <!-- Plugin that compiles Java -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <source>${maven.compiler.source}</source>
          <target>${maven.compiler.target}</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <includes>
                <include>**/*.scala</include>
              </includes>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

HBaseBulkPutExample.scala

package com.example

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Writes an RDD of (rowKey, qualifier, value) records into the HBase table
 * "Student" using hbaseBulkPut. The implicit methods imported from
 * org.apache.hadoop.hbase.spark.HBaseRDDFunctions._ are what add
 * hbaseBulkPut to the RDD.
 *
 * @author 王天赐
 * @create 2019-11-29 9:28
 */
object HBaseBulkPutExample extends App {

  val tableName = "Student"

  val sparkConf = new SparkConf()
    .setAppName("HBaseBulkPutExample " + tableName)
    .setMaster("local[*]")
  val sc = new SparkContext(sparkConf)

  try {
    // Each element is Array(rowKey, qualifier, value), all as byte arrays.
    val rdd = sc.parallelize(Array(
      Array(Bytes.toBytes("B1001"), Bytes.toBytes("name"), Bytes.toBytes("张飞")),
      Array(Bytes.toBytes("B1002"), Bytes.toBytes("name"), Bytes.toBytes("李白")),
      Array(Bytes.toBytes("B1003"), Bytes.toBytes("name"), Bytes.toBytes("韩信"))))

    // Connection parameters of the HBase cluster (ZooKeeper quorum and port).
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "192.168.51.32")
    conf.set("hbase.zookeeper.property.clientPort", "2181")

    val hbaseContext = new HBaseContext(sc, conf)

    // Build one Put per record: row key, column family "info", qualifier, value.
    // hbaseBulkPut returns Unit, so the call is made for its side effect only.
    rdd.hbaseBulkPut(hbaseContext, TableName.valueOf(tableName),
      record => {
        val put = new Put(record(0))
        put.addColumn(Bytes.toBytes("info"), record(1), record(2))
        put
      }
    )
  } finally {
    sc.stop()
  }
}
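The same write can also be expressed through HBaseContext itself rather than through the RDD extension method. Below is a sketch reusing the rdd, hbaseContext, and imports from above; it assumes the HBaseContext.bulkPut(rdd, tableName, recordToPut) form documented for hbase-spark:

// equivalent bulk write going through HBaseContext.bulkPut
hbaseContext.bulkPut[Array[Array[Byte]]](rdd, TableName.valueOf("Student"),
  record => {
    val put = new Put(record(0))
    put.addColumn(Bytes.toBytes("info"), record(1), record(2))
    put
  })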
HBaseBulkGetExampleByRDD.scala

package com.example

import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{Cell, CellUtil, HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Uses an RDD of row keys and qualifiers as input and bulk-gets the matching
 * records from HBase.
 * Note: the implicit methods in org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
 * must be imported, otherwise hbaseBulkGet is not available on the RDD.
 *
 * @author 王天赐
 * @create 2019-11-29 19:35
 */
object HBaseBulkGetExampleByRDD extends App{

  // 1. Create SparkConf and SparkContext, running in local mode
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("HBase")
  val sc = new SparkContext(conf)
  // Set the log level to WARN
  sc.setLogLevel("WARN")

  try {
    // 2. Create an HBaseConfiguration object for the connection parameters
    val hbaseConf = HBaseConfiguration.create()
    // ZooKeeper quorum and client port of the HBase cluster
    hbaseConf.set("hbase.zookeeper.quorum", "192.168.51.32")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")

    // 3. Create the HBaseContext
    val hc = new HBaseContext(sc, hbaseConf)

    // 4. Wrap the row keys and qualifiers of the records to fetch into an RDD
    val rowKeyAndQualifier = sc.parallelize(Array(
      Array(Bytes.toBytes("B1001"), Bytes.toBytes("name")),
      Array(Bytes.toBytes("B1002"), Bytes.toBytes("name")),
      Array(Bytes.toBytes("B1003"), Bytes.toBytes("name"))
    ))

    // 5. Fetch the specified row keys, restricted to the specified qualifier
    val result = rowKeyAndQualifier.hbaseBulkGet(hc, TableName.valueOf("Student"), 2,
      (info) => {
        val rowKey = info(0)
        // qualifier (column name) to read
        val qualifier = info(1)
        val get = new Get(rowKey)
        // fetch only the info:<qualifier> column instead of the whole row
        get.addColumn(Bytes.toBytes("info"), qualifier)
        get
      }
    )
    // 6. Iterate over the results
    result.foreach(data => {
      // each element is a tuple of (ImmutableBytesWritable, Result)
      val hbaseResult: Result = data._2
      // get the array of Cells in this row
      val cells: Array[Cell] = hbaseResult.rawCells()
      for (cell <- cells) {
        // extract the row key, qualifier, and value of each cell
        val rowKey = Bytes.toString(CellUtil.cloneRow(cell))
        val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
        val value = Bytes.toString(CellUtil.cloneValue(cell))
        // print in the form [ rowKey , qualifier , value ]
        println("[ " + rowKey + " , " + qualifier + " , " + value + " ]")
      }
    })

  } finally {
    sc.stop()
  }

}
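Bulk put and bulk get cover writes and keyed reads. hbase-spark can also expose a whole table to Spark as an RDD through HBaseContext.hbaseRDD; the sketch below could be appended inside the try block above (it reuses hc and assumes an additional import of org.apache.hadoop.hbase.client.Scan):

    // Full-table scan of Student, returned as an RDD of (rowKey, Result) pairs.
    val scan = new Scan()
    scan.addFamily(Bytes.toBytes("info"))
    val scanRdd = hc.hbaseRDD(TableName.valueOf("Student"), scan)
    println("rows in Student: " + scanRdd.count())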

Package with Maven: mvn clean install -DskipTests

Upload scalahbasetest-1.0-SNAPSHOT.jar to the Spark installation directory, e.g. /opt/spark***/jars

4.3 Copy the HBase jars to the Spark installation directory (e.g. /opt/spark***/jars)

4.4 Run the two Scala programs with spark-submit

One inserts the data:

./spark-submit --class com.example.HBaseBulkPutExample --master spark://master:7077 /opt/spark-2.3.4-bin-hadoop2.7/jars/scalahbasetest-1.0-SNAPSHOT.jar

The other fetches the data:

./spark-submit --class com.example.HBaseBulkGetExampleByRDD --master spark://master:7077 /opt/spark-2.3.4-bin-hadoop2.7/jars/scalahbasetest-1.0-SNAPSHOT.jar

4.5 Converting Scala to Java

You can compile the Scala sources to .class files first and then use a decompilation tool to turn them into Java; I have not formally tried this yet.
