Compiling and Installing Apache CarbonData 1.5.0

I. Build Environment

  • Five virtual machines created on OpenStack: one master node (hostname bigdatamaster) and four worker nodes (hostnames bigdataslave1, bigdataslave2, bigdataslave3, and bigdataslave4)

  • OS: CentOS 7.2 (1511)

  • JDK: Oracle JDK 1.8.0_191

  • Maven: 3.5.2

  • Hadoop: Apache Hadoop 2.7.1

  • Hive: 0.13.1

  • Scala: 2.11.8

  • Spark: 2.3.2

  • CarbonData: 1.5.0

II. Build Process

1. Get the source code

Download the source from the CarbonData archive (http://archive.apache.org/dist/carbondata/1.5.0/ or https://dist.apache.org/repos/dist/release/carbondata/):

[root@bigdatamaster Desktop]# wget https://dist.apache.org/repos/dist/release/carbondata/1.5.0/apache-carbondata-1.5.0-source-release.zip
...
[root@bigdatamaster Desktop]# ls
apache-carbondata-1.5.0-source-release.zip
[root@bigdatamaster Desktop]# unzip apache-carbondata-1.5.0-source-release.zip
[root@bigdatamaster Desktop]# ls
apache-carbondata-1.5.0-source-release.zip  carbondata-parent-1.5.0
[root@bigdatamaster Desktop]# cd carbondata-parent-1.5.0
[root@bigdatamaster carbondata-parent-1.5.0]# ls
assembly  build   conf  datamap       dev   examples  hadoop       LICENSE          NOTICE   processing  store      tools
bin       common  core  DEPENDENCIES  docs  format    integration  licenses-binary  pom.xml  README.md   streaming

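Note: if the underlying Hadoop version is 2.7.2, the Scala version is 2.11.8, and the Spark version is 2.2.1, there is no need to build from source. The cluster in this article runs Hadoop 2.7.1, Scala 2.11.8, and Spark 2.3.2, so the source has to be downloaded and rebuilt.

2. Build the source code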

[root@bigdatamaster carbondata-parent-1.5.0]# mvn -DskipTests -Pspark-2.3 -Dspark.version=2.3.2 -Dhadoop.version=2.7.1 clean package

...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [  8.404 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [ 15.636 s]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [01:00 min]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 27.459 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 13.402 s]
[INFO] Apache CarbonData :: Streaming ..................... SUCCESS [03:11 min]
[INFO] Apache CarbonData :: Store SDK ..................... SUCCESS [ 37.462 s]
[INFO] Apache CarbonData :: Spark Datasource .............. SUCCESS [01:31 min]
[INFO] Apache CarbonData :: Spark Common .................. SUCCESS [01:32 min]
[INFO] Apache CarbonData :: Search ........................ SUCCESS [ 34.174 s]
[INFO] Apache CarbonData :: Lucene Index DataMap .......... SUCCESS [01:35 min]
[INFO] Apache CarbonData :: Bloom Index DataMap ........... SUCCESS [ 13.619 s]
[INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [02:35 min]
[INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [01:09 min]
[INFO] Apache CarbonData :: DataMap Examples .............. SUCCESS [  3.621 s]
[INFO] Apache CarbonData :: Assembly ...................... SUCCESS [ 15.694 s]
[INFO] Apache CarbonData :: CLI ........................... SUCCESS [ 24.015 s]
[INFO] Apache CarbonData :: Hive .......................... SUCCESS [ 32.317 s]
[INFO] Apache CarbonData :: presto ........................ SUCCESS [01:06 min]
[INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [ 52.898 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:22 min
[INFO] Finished at: 2019-04-20T21:36:16+08:00
[INFO] Final Memory: 201M/1411M
[INFO] ------------------------------------------------------------------------
...

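The pom.xml file shows that the default Scala version for the build is already 2.11.8, so there is no need to add "-Pscala-2.1 -Dscala.version=2.11.8" to the build command to specify the Scala version.

Once the build finishes, the CarbonData jar built for the specified Hadoop and Spark versions is in the target directory of the assembly module (assembly/target/scala-2.11/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar).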
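The build command and the resulting jar name both follow from the Spark and Hadoop versions, so they can be derived from two variables. The sketch below is only an illustration: the function names are made up for this article, and the rule "profile = spark-<major.minor>" is an assumption generalized from the single -Pspark-2.3 case used above.

```shell
#!/usr/bin/env bash
# Sketch: derive the Maven command line and the expected jar path from versions.
# carbon_mvn_cmd and carbon_jar_path are illustrative names, not CarbonData tools.

CARBON_VERSION="1.5.0"

# carbon_mvn_cmd SPARK_VERSION HADOOP_VERSION
# Prints the build command. Assumption: the profile is spark-<major.minor>.
carbon_mvn_cmd() {
  local spark="$1" hadoop="$2"
  local profile="spark-${spark%.*}"   # 2.3.2 -> spark-2.3
  echo "mvn -DskipTests -P${profile} -Dspark.version=${spark} -Dhadoop.version=${hadoop} clean package"
}

# carbon_jar_path SPARK_VERSION HADOOP_VERSION
# Prints where the assembly jar is expected, relative to the source root.
carbon_jar_path() {
  echo "assembly/target/scala-2.11/apache-carbondata-${CARBON_VERSION}-bin-spark${1}-hadoop${2}.jar"
}

carbon_mvn_cmd 2.3.2 2.7.1
carbon_jar_path 2.3.2 2.7.1
```

For the versions in this article the two calls print the mvn command from step 2 and the jar path given above.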

III. Installation

1. Follow the official documentation at https://carbondata.apache.org/quick-start-guide.html to install and configure CarbonData on the Spark cluster (see the section "Installing and Configuring CarbonData on Standalone Spark Cluster").

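Prerequisites:

  • Hadoop HDFS and YARN are both running normally (a Hadoop 2.7.1 cluster is installed and running)

  • Spark is running normally (a Spark 2.3.2 cluster is installed and running)

  • The CarbonData user must have access to HDFS (here everything runs as root)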
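Before continuing, it can help to confirm that the required command-line tools are at least on the PATH. The following is a minimal sketch with hypothetical helper names; it only checks that the commands exist, not that the HDFS, YARN, or Spark daemons are actually healthy.

```shell
#!/usr/bin/env bash
# Sketch: check that the tools the installation relies on are on PATH.
# require_cmd and check_prereqs are illustrative helper names.

# require_cmd NAME: succeed only if NAME can be found on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing command: $1" >&2; return 1; }
}

# check_prereqs NAME...: report every missing tool instead of stopping at the first.
check_prereqs() {
  local status=0 cmd
  for cmd in "$@"; do
    require_cmd "$cmd" || status=1
  done
  return "$status"
}

# Example call on the master node:
#   check_prereqs java hdfs spark-shell
```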

1) Create a carbonlib directory under the Spark installation directory:

[root@bigdatamaster ~]# cd $SPARK_HOME
[root@bigdatamaster spark-2.3.2]# ls
bin   data      jars        LICENSE   NOTICE  R          RELEASE  sparkdata
conf  examples  kubernetes  licenses  python  README.md  sbin     yarn
[root@bigdatamaster spark-2.3.2]# mkdir carbonlib
[root@bigdatamaster spark-2.3.2]# ls
bin        conf  examples  kubernetes  licenses  python  README.md  sbin       yarn
carbonlib  data  jars      LICENSE     NOTICE    R       RELEASE    sparkdata

2) Copy the apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar built in Section II, step 2 into the carbonlib directory

[root@bigdatamaster spark-2.3.2]# cp ~/Desktop/carbondata-parent-1.5.0/assembly/target/scala-2.11/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar ./carbonlib/
[root@bigdatamaster spark-2.3.2]# ls carbonlib/
apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar

3) Edit conf/spark-env.sh under the Spark installation directory and add carbonlib to the Spark classpath

[root@bigdatamaster spark-2.3.2]# vim conf/spark-env.sh
(append the following line to the end of the file)
export SPARK_CLASSPATH=$SPARK_CLASSPATH:${SPARK_HOME}/carbonlib/*
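If this step is scripted rather than done in vim, the line should be appended only once so that rerunning the script does not duplicate it. A minimal sketch, where append_once is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: append a line to a config file only if it is not already there.

# append_once FILE LINE: add LINE to FILE unless an identical line exists.
append_once() {
  local file="$1" line="$2"
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Single quotes keep $SPARK_CLASSPATH and ${SPARK_HOME} literal in the file:
#   append_once "$SPARK_HOME/conf/spark-env.sh" \
#     'export SPARK_CLASSPATH=$SPARK_CLASSPATH:${SPARK_HOME}/carbonlib/*'
```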

4) Copy carbon.properties.template into the conf directory of the Spark installation and name it carbon.properties

[root@bigdatamaster carbondata-parent-1.5.0]# ls
assembly  common  datamap       docs      hadoop       licenses-binary  processing  streaming
bin       conf    DEPENDENCIES  examples  integration  NOTICE           README.md   target
build     core    dev           format    LICENSE      pom.xml          store       tools
[root@bigdatamaster carbondata-parent-1.5.0]# ls conf/
carbon.properties.template  dataload.properties.template

[root@bigdatamaster carbondata-parent-1.5.0]# cp conf/carbon.properties.template $SPARK_HOME/conf/carbon.properties
[root@bigdatamaster carbondata-parent-1.5.0]# ls $SPARK_HOME/conf
carbon.properties           metrics.properties.template   spark-env.sh
docker.properties.template  slaves                        spark-env.sh.template
fairscheduler.xml.template  slaves.template
log4j.properties.template   spark-defaults.conf.template

5) Configure carbon.properties

[root@bigdatamaster spark-2.3.2]# vim conf/carbon.properties
(add the following lines)
carbon.storelocation=hdfs://bigdatamaster:9000/carbon/Store
carbon.ddl.base.hdfs.url=hdfs://bigdatamaster:9000/carbon/Data
carbon.badRecords.location=hdfs://bigdatamaster:9000/carbon/BadRecords
carbon.lock.type=HDFSLOCK

For the meaning of these four settings, see the official site: https://carbondata.apache.org/configuration-parameters.html
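All three location settings share the same HDFS prefix, so on a cluster with a different NameNode address only that prefix changes. The sketch below generates the four lines from a base URL; carbon_props is a hypothetical helper name, and the /carbon/... sub-paths are simply the ones chosen in this article.

```shell
#!/usr/bin/env bash
# Sketch: print the four carbon.properties settings for a given HDFS base URL.

# carbon_props BASE_URL
carbon_props() {
  local base="${1%/}"   # drop one trailing slash, if present
  cat <<EOF
carbon.storelocation=${base}/carbon/Store
carbon.ddl.base.hdfs.url=${base}/carbon/Data
carbon.badRecords.location=${base}/carbon/BadRecords
carbon.lock.type=HDFSLOCK
EOF
}

carbon_props hdfs://bigdatamaster:9000
```

The output can be appended to $SPARK_HOME/conf/carbon.properties with a shell redirect.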

6) Configure spark-defaults.conf in the conf directory of the Spark installation

[root@bigdatamaster spark-2.3.2]# cd conf/
[root@bigdatamaster conf]# ls
carbon.properties           metrics.properties.template   spark-env.sh
docker.properties.template  slaves                        spark-env.sh.template
fairscheduler.xml.template  slaves.template
log4j.properties.template   spark-defaults.conf.template
[root@bigdatamaster conf]# cp spark-defaults.conf.template spark-defaults.conf
[root@bigdatamaster conf]# ls
carbon.properties           metrics.properties.template  spark-defaults.conf.template
docker.properties.template  slaves                       spark-env.sh
fairscheduler.xml.template  slaves.template              spark-env.sh.template
log4j.properties.template   spark-defaults.conf

[root@bigdatamaster conf]# vim spark-defaults.conf
(add the following 2 lines)
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=/root/data/spark-2.3.2/conf/carbon.properties
spark.driver.extraJavaOptions   -Dcarbon.properties.filepath=/root/data/spark-2.3.2/conf/carbon.properties
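Both lines hard-code the absolute path of carbon.properties on this cluster (/root/data/spark-2.3.2). As far as I know, spark-defaults.conf does not expand environment variables, so on another installation the path has to be written out again. A small sketch that prints the two lines for any Spark directory, with carbon_spark_defaults as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: print the two extraJavaOptions lines for a given Spark install directory.

# carbon_spark_defaults SPARK_DIR
carbon_spark_defaults() {
  local props="${1%/}/conf/carbon.properties"
  printf 'spark.executor.extraJavaOptions -Dcarbon.properties.filepath=%s\n' "$props"
  printf 'spark.driver.extraJavaOptions   -Dcarbon.properties.filepath=%s\n' "$props"
}

carbon_spark_defaults /root/data/spark-2.3.2
```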

7) Add hive-site.xml to the conf directory of the Spark installation

[root@bigdatamaster spark-2.3.2]# cp ~/data/hive-0.13.1/conf/hive-site.xml conf/

Note: if this step is skipped, creating the table in the test below reports the following warnings

scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age Int) STORED BY 'carbondata'")
2019-04-21 14:10:47 WARN  ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2019-04-21 14:10:47 WARN  ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2019-04-21 14:10:50 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
2019-04-21 14:10:52 AUDIT CarbonCreateTableCommand:207 - [bigdatamaster][root][Thread-1]Creating Table with Database name [default] and Table name [test_table]
2019-04-21 14:10:53 WARN  HiveExternalCatalog:66 - Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `default`.`test_table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
2019-04-21 14:10:54 AUDIT CarbonCreateTableCommand:207 - [bigdatamaster][root][Thread-1]Table created with Database name [default] and Table name [test_table]
res0: org.apache.spark.sql.DataFrame = []

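8) Repeat steps 1) through 6) on the other Spark nodes

Here I simply transfer everything with scp. Steps 1) through 6) touch the following directory and files under the Spark installation directory:

  • carbonlib ($SPARK_HOME/carbonlib)

  • spark-env.sh ($SPARK_HOME/conf/spark-env.sh)

  • spark-defaults.conf ($SPARK_HOME/conf/spark-defaults.conf)

  • carbon.properties ($SPARK_HOME/conf/carbon.properties)

Copying this one directory and three configuration files with scp to the same locations on the other nodes of the Spark cluster is enough: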

[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave1:~/data/spark-2.3.2/  
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave2:~/data/spark-2.3.2/  
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave3:~/data/spark-2.3.2/   
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave4:~/data/spark-2.3.2/

[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave1:~/data/spark-2.3.2/conf/  
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave2:~/data/spark-2.3.2/conf/  
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave3:~/data/spark-2.3.2/conf/  
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave4:~/data/spark-2.3.2/conf/
 
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave1:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave2:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave3:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave4:~/data/spark-2.3.2/conf/

[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave1:~/data/spark-2.3.2/conf/ 
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave2:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave3:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave4:~/data/spark-2.3.2/conf/
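The sixteen scp commands above can be collapsed into a loop over hosts and files. The sketch below is meant to be run from $SPARK_HOME on the master; it defaults to a dry run that only prints the commands, and the run and distribute helpers and the DRY_RUN switch are names invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: distribute carbonlib and the three config files to every worker.
HOSTS="bigdataslave1 bigdataslave2 bigdataslave3 bigdataslave4"
REMOTE_SPARK='~/data/spark-2.3.2'   # quoted so that ~ is resolved on the remote side
CONF_FILES="spark-env.sh spark-defaults.conf carbon.properties"

# run CMD...: print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# distribute: one directory copy plus three file copies per host.
distribute() {
  local host f
  for host in $HOSTS; do
    run scp -r carbonlib "root@${host}:${REMOTE_SPARK}/"
    for f in $CONF_FILES; do
      run scp "conf/${f}" "root@${host}:${REMOTE_SPARK}/conf/"
    done
  done
}

distribute
```

Setting DRY_RUN=0 before calling distribute performs the actual copies. The commands are grouped per host rather than per file, which changes only the order, not the result.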

 

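IV. Testing

1) Create the test data and upload it to HDFS (note: the test dataset and test steps below are taken from reference 9; thanks to its author)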

[root@bigdatamaster Desktop]# vim carbonTestData.csv
(add the following 4 lines)
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35

[root@bigdatamaster Desktop]# hadoop dfs -put carbonTestData.csv /
[root@bigdatamaster Desktop]# hadoop dfs -ls /
Found 6 items
-rw-r--r--   2 root supergroup         78 2019-04-21 14:06 /carbonTestData.csv
drwxr-xr-x   - root supergroup          0 2019-04-04 14:55 /hadoopdata
drwxr-xr-x   - root supergroup          0 2019-03-06 16:16 /hbase
drwxr-xr-x   - root supergroup          0 2019-03-06 16:04 /root
drwxrwxr-x   - root supergroup          0 2019-03-06 17:01 /tmp
drwxr-xr-x   - root supergroup          0 2019-03-06 16:27 /user
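The test file can also be written with a here-document instead of vim, and checked before it is uploaded. A minimal sketch, where write_test_csv and check_csv are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: create the test CSV non-interactively and sanity-check its shape.

# write_test_csv FILE: write the header plus the three data rows.
write_test_csv() {
  cat > "$1" <<'EOF'
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
}

# check_csv FILE: succeed only if every line has exactly four comma-separated fields.
check_csv() {
  awk -F, 'NF != 4 { bad = 1 } END { exit bad + 0 }' "$1"
}

# Example:
#   write_test_csv carbonTestData.csv && check_csv carbonTestData.csv
```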

2) In a terminal on bigdatamaster, enter the following command to start the Spark shell

spark-shell \
--master spark://bigdatamaster:7077 \
--jars /root/data/spark-2.3.2/carbonlib/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar \
--total-executor-cores 2 \
--executor-memory 2G

 

[root@bigdatamaster ~]# spark-shell \
> --master spark://bigdatamaster:7077 \
> --jars /root/data/spark-2.3.2/carbonlib/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar \
> --total-executor-cores 2 \
> --executor-memory 2G
2019-04-21 14:08:56 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://bigdatamaster:4040
Spark context available as 'sc' (master = spark://bigdatamaster:7077, app id = app-20190421140908-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.


scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession


scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._


scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://bigdatamaster:9000/carbon/Store")

2019-04-21 14:10:07 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The enable unsafe sort value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The enable off heap sort value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The custom block distribution value "null" is invalid. Using the default value "false
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The enable vector reader value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The carbon task distribution value "null" is invalid. Using the default value "block
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The enable auto handoff value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The specified value for property carbon.sort.storage.inmemory.size.inmbis invalid.
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The specified value for property 512is invalid.
2019-04-21 14:10:07 WARN  CarbonProperties:168 - main The specified value for property carbon.sort.storage.inmemory.size.inmbis invalid. Taking the default value.512
carbon: org.apache.spark.sql.SparkSession = org.apache.spark.sql.CarbonSession@53e166ad

(create the table schema)
scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age Int) STORED BY 'carbondata'")

2019-04-21 14:12:34 AUDIT CarbonCreateTableCommand:207 - [bigdatamaster][root][Thread-1]Creating Table with Database name [default] and Table name [test_table]
res1: org.apache.spark.sql.DataFrame = []

(load the data)
scala> carbon.sql("LOAD DATA INPATH 'hdfs://bigdatamaster:9000/carbonTestData.csv' INTO TABLE test_table")

2019-04-21 14:21:12 WARN  DeleteLoadFolders:168 - main Files are not found in segment hdfs://bigdatamaster:9000/carbon/Store/default/test_table/Fact/Part0/Segment_0 it seems, files are already being deleted
2019-04-21 14:21:12 AUDIT CarbonDataRDDFactory$:207 - [bigdatamaster][root][Thread-1]Data load request has been received for table default.test_table
2019-04-21 14:21:18 AUDIT CarbonDataRDDFactory$:207 - [bigdatamaster][root][Thread-1]Data load is successful for default.test_table
2019-04-21 14:21:18 AUDIT MergeIndexEventListener:207 - [bigdatamaster][root][Thread-1]Load post status event-listener called for merge index
res4: org.apache.spark.sql.DataFrame = []

(query the table data)
scala> carbon.sql("SELECT * FROM test_table").show()
+---+-----+--------+---+                                                        
| id| name|    city|age|
+---+-----+--------+---+
|  1|david|shenzhen| 31|
|  2|eason|shenzhen| 27|
|  3|jarry|   wuhan| 35|
+---+-----+--------+---+

scala> carbon.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show()
+--------+--------+--------+                                                    
|    city|avg(age)|sum(age)|
+--------+--------+--------+
|   wuhan|    35.0|      35|
|shenzhen|    29.0|      58|
+--------+--------+--------+
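The aggregate result can be cross-checked against the local CSV without Spark. The sketch below recomputes the per-city average and sum of age with awk; city_stats is a hypothetical helper name, and it assumes the four-column layout of carbonTestData.csv with a header row.

```shell
#!/usr/bin/env bash
# Sketch: recompute avg(age) and sum(age) per city directly from the CSV.

# city_stats FILE: print "city avg sum" per city, sorted by city, skipping the header.
city_stats() {
  awk -F, 'NR > 1 { sum[$3] += $4; n[$3]++ }
           END { for (c in sum) printf "%s %.1f %d\n", c, sum[c] / n[c], sum[c] }' "$1" | sort
}

# Example:
#   city_stats carbonTestData.csv
```

For the three rows used here this gives 29.0 and 58 for shenzhen and 35.0 and 35 for wuhan, matching the table above.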

 

V. References

  1. CarbonData usage example (Java): https://blog.csdn.net/u013181284/article/details/77574094

  2. Building and installing CarbonData and integrating it with Spark 2.2: https://blog.csdn.net/wuzhilon88/article/details/78864735

  3. Spark 2.1.0 + CarbonData 1.0.0: cluster-mode deployment and getting started: https://blog.csdn.net/coridc/article/details/61915801

  4. Apache CarbonData: a new Hadoop file format built for faster data analysis: https://blog.csdn.net/u011239443/article/details/52015680

  5. [Mind map] Comparison of three columnar storage formats: Parquet, ORC, and CarbonData: https://blog.csdn.net/lxhandlbb/article/details/80754252

  6. CarbonData installation guide: https://blog.csdn.net/u013181284/article/details/73331170

  7. Collection of Apache CarbonData learning resources: https://blog.csdn.net/xubo245/article/details/84336960

  8. Apache CarbonData documentation in Chinese: https://www.iteblog.com/archives/tag/carbondata/

  9. Building and deploying Apache CarbonData 1.0.0 on Mac OS: https://ask.hellobi.com/blog/marsj/6164

 

VI. Appendix

A few screenshots taken during the mvn source build; they are a real pleasure to look at.

 

Reposted from: https://my.oschina.net/xhhuang/blog/3039958
