Spark 2.3.2 + YARN + CarbonData Thrift Server: Configuring CarbonData 1.5

Latest recommended article published on 2024-06-19 11:18:27

nszkadrgg

Category: Spark

Article link: https://blog.csdn.net/nszkadrgg/article/details/88183896

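CarbonData Overview

Apache CarbonData is a new converged storage solution that improves query efficiency through advanced columnar storage, indexing, compression, and encoding techniques.

Apache CarbonData Chinese documentation: http://carbondata.iteblog.com

Apache CarbonData English documentation: http://carbondata.apache.org/

GitHub source repository: https://github.com/apache/carbondata/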

1. Integrating CarbonData on CDH

Install MySQL

https://blog.csdn.net/nszkadrgg/article/details/78666628 Installation from the tar package

https://blog.csdn.net/nszkadrgg/article/details/85052693 Installation from the rpm package

https://blog.csdn.net/a774630093/article/details/79270080 Installation with yum

Install CDH

https://blog.csdn.net/nszkadrgg/article/details/80022704 CDH 5.10 offline installation

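2. Building and installing CarbonData on CDH

https://github.com/apache/carbondata/tree/master/build CarbonData build documentation

Download Spark 2.3.2

https://archive.apache.org/dist/spark/spark-2.3.2/

Extract the downloaded Spark 2.3.2 package

Download Maven and configure its environment variables

Configure the environment variables with vim /etc/profile

Remember to run source /etc/profile so that the environment variables take effect

Verify that the Maven environment variables have taken effect

Configure the JDK 1.8 environment variables

Verify that JDK 1.8 is installed correctly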
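The environment-variable steps above can be sketched as the lines one might append to /etc/profile. This is only a minimal sketch: the JDK path is the one this article uses later in spark-env.sh, while the Maven path and version are assumptions to replace with your own.

```shell
# Lines one might append to /etc/profile (then run: source /etc/profile).
# JAVA_HOME matches the JDK path used later in spark-env.sh.
export JAVA_HOME=/usr/java/jdk1.8.0_45
# MAVEN_HOME is a hypothetical install location and version; adjust it.
export MAVEN_HOME=/opt/software/apache-maven-3.5.4
export PATH="$JAVA_HOME/bin:$MAVEN_HOME/bin:$PATH"

# Quick check that the executables exist where the variables point.
# Once they do, "java -version" and "mvn -v" are the usual verification.
for tool in "$JAVA_HOME/bin/java" "$MAVEN_HOME/bin/mvn"; do
  if [ -x "$tool" ]; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```

Putting both bin directories at the front of PATH makes sure these versions win over any system-wide java or mvn.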

https://blog.csdn.net/cjf_wei/article/details/78700321 Installing Thrift is essential. Follow the steps in that article, choose Thrift 0.9.3, and install the other components at the versions the article lists.

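Building CarbonData

Download CarbonData: select the branch-1.5 branch, then click Clone or download

Extract the CarbonData package and enter the directory

The build command failed with the errors reproduced below, so run mvn clean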

Then edit the pom.xml file in that directory

<repository>

<id>cloudera</id>

<name>cloudera Repository</name>

<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>

</repository>

Delete it, then save and exit with :wq

<presto.jdbc.url>localhost:8086</presto.jdbc.url>

<spark.hadoop.hive.metastore.uris>thrift://localhost:8086</spark.hadoop.hive.metastore.uris>

Then enter the carbondata directory again and run the build.

Command:

mvn clean package -DskipTests -Pspark-2.3 -Dspark.version=2.3.2 -Phadoop-2.8 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.6.0-cdh5.15.2 -Dhadoop.version=2.6.0-cdh5.15.2 package -Pbuild-with-format

A command that others have used to build CarbonData successfully with Spark 2.3.2 and CDH Hadoop 2.6:

mvn -DskipTests -Pspark-2.3 -Phadoop-2.8 -Pbuild-with-format -Pmv -Dspark.version=2.3.2 -Dhadoop.version=2.6.0-cdh5.15.0 clean package

[WARNING] The requested profile "hive" could not be activated because it does not exist.

[WARNING] The requested profile "hive-thriftserver" could not be activated because it does not exist.

[WARNING] The requested profile "yarn" could not be activated because it does not exist.

[ERROR] Failed to execute goal on project carbondata-examples: Could not resolve dependencies for project org.apache.carbondata:carbondata-examples:jar:1.5.3-SNAPSHOT: Failed to collect dependencies at org.alluxio:alluxio-core-client-hdfs:jar:1.8.1: Failed to read artifact descriptor for org.alluxio:alluxio-core-client-hdfs:jar:1.8.1: Could not transfer artifact org.alluxio:alluxio-core-client-hdfs:pom:1.8.1 from/to alimaven (http://maven.aliyun.com/nexus/content/groups/public/): Timeout while waiting for concurrent download of /opt/repo/org/alluxio/alluxio-core-client-hdfs/1.8.1/alluxio-core-client-hdfs-1.8.1.pom.part to progress -> [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions, please read the following articles:

[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

[ERROR]

[ERROR] After correcting the problems, you can resume the build with the command

[INFO] ------------------------------------------------------------------------

[INFO] BUILD FAILURE

[INFO] ------------------------------------------------------------------------

[INFO] Total time: 25:22 min

[INFO] Finished at: 2019-02-23T17:35:21+08:00

[INFO] ------------------------------------------------------------------------

[WARNING] The requested profile "hive" could not be activated because it does not exist.

[WARNING] The requested profile "hive-thriftserver" could not be activated because it does not exist.

[WARNING] The requested profile "yarn" could not be activated because it does not exist.

[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:compile (default) on project carbondata-examples: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions, please read the following articles:

[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

[ERROR]

[ERROR] After correcting the problems, you can resume the build with the command

[ERROR] mvn <goals> -rf :carbondata-examples
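The [WARNING] lines in the logs above mean that -Phive, -Phive-thriftserver and -Pyarn name profiles this pom.xml does not define, so Maven ignores them. Below is a small sketch for pulling such names out of a saved build log; the function name missing_profiles and the file name build.log are made up for illustration. The Maven Help Plugin can list the profiles that do exist with mvn help:all-profiles.

```shell
# missing_profiles: read a Maven build log on stdin and print, one per line,
# every profile name Maven reported as nonexistent (duplicates removed).
missing_profiles() {
  sed -n 's/^\[WARNING] The requested profile "\([^"]*\)" could not be activated because it does not exist\.$/\1/p' | sort -u
}

# Example use with a saved log (build.log is a hypothetical file name):
#   mvn clean package -DskipTests -Pspark-2.3 -Dspark.version=2.3.2 2>&1 | tee build.log
#   missing_profiles < build.log
```

Dropping the reported profiles from the command line should not change the build, since Maven was already skipping them.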

[root@cdh01 carbondata-branch-1.5]# mvn clean package -DskipTests -Pspark-2.3 -Dspark.version=2.3.2 -Phadoop-2.8 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.6.0-cdh5.15.2 -Dhadoop.version=2.6.0-cdh5.15.2 package -Pbuild-with-format

Build succeeded!

Then locate the assembled jar produced by the build

[root@cdh01 scala-2.11]# cd /opt/software/carbondata-branch-1.5/assembly/target/scala-2.11

Deploying CarbonData

First go to the Spark directory

[root@cdh01 software]# cd spark-2.3.2-bin-2.6.0-cdh5.15.2/

Create the carbonlib directory

[root@cdh01 spark-2.3.2-bin-2.6.0-cdh5.15.2]# mkdir carbonlib

Put the built CarbonData jar into the carbonlib directory

[root@cdh01 spark-2.3.2-bin-2.6.0-cdh5.15.2]# cd carbonlib/

[root@cdh01 carbonlib]# ll

total 91344

-rw-r--r--. 1 root root 93533271 Feb 26 09:15 apache-carbondata-1.5.3-SNAPSHOT-bin-spark2.3.2-hadoop2.6.0-cdh5.15.2.jar

Go to Spark's conf directory and modify the parameters

[root@cdh01 spark-2.3.2-bin-2.6.0-cdh5.15.2]# cd conf/

[root@cdh01 conf]# ll

total 56

-rw-r--r--. 1 root root 4094 Feb 26 09:22 carbon.properties

-rw-r--r--. 1 root root 4094 Feb 26 09:22 carbon.properties.template

-rw-rw-r--. 1 root root 996 Sep 16 20:13 docker.properties.template

-rw-rw-r--. 1 root root 1105 Sep 16 20:13 fairscheduler.xml.template

-rw-rw-r--. 1 root root 2025 Sep 16 20:13 log4j.properties.template

-rw-rw-r--. 1 root root 7801 Sep 16 20:13 metrics.properties.template

-rw-rw-r--. 1 root root 862 Feb 26 09:30 slaves.template

-rw-r--r--. 1 root root 1292 Feb 26 09:30 spark-defaults.conf

-rw-rw-r--. 1 root root 1292 Sep 16 20:13 spark-defaults.conf.template

-rwxr-xr-x. 1 root root 4298 Feb 26 09:21 spark-env.sh

-rwxrwxr-x. 1 root root 4221 Sep 16 20:13 spark-env.sh.template

Copy the template files

[root@cdh01 conf]# cp carbon.properties.template carbon.properties

[root@cdh01 conf]# cp spark-defaults.conf.template spark-defaults.conf

[root@cdh01 conf]# cp spark-env.sh.template spark-env.sh

Add the host name, then save and exit

[root@cdh01 conf]# vim slaves.template

# Licensed to the Apache Software Foundation (ASF) under one or more

# contributor license agreements. See the NOTICE file distributed with

# this work for additional information regarding copyright ownership.

# The ASF licenses this file to You under the Apache License, Version 2.0

# (the "License"); you may not use this file except in compliance with

# the License. You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# A Spark Worker will be started on each of the machines listed below.

cdh01

[root@cdh01 conf]# vim spark-defaults.conf

Add the following parameters

spark.master=yarn-client

spark.yarn.dist.files=/opt/spark-2.3.2-bin-2.6.0-cdh5.15.2/conf/carbon.properties

spark.yarn.dist.archives=/opt/spark-2.3.2-bin-2.6.0-cdh5.15.2/carbonlib/carbondata.tar.gz

spark.executor.extraJavaOptions="-Dcarbon.properties.filepath=carbon.properties"

spark.executor.extraClassPath=carbondata.tar.gz/carbonlib/*

spark.driver.extraClassPath=/opt/spark-2.3.2-bin-2.6.0-cdh5.15.2/carbonlib/*

spark.driver.extraJavaOptions="-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties"

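# If your CarbonData instance is used only for queries, you can set spark.speculation = true in the Spark configuration file

spark.speculation = true

# # This value can be set to 1 to 2 times the total number of executor cores. In one aggregation scenario, lowering it from 200 to 32 cut the query time from 17 seconds to 9 seconds.

# #spark.sql.shuffle.partitions=40

spark.sql.shuffle.partitions=32

# # Increase the amount of data each Spark task processes; this reduces the number of Spark tasks and the number of files. The next line is SQL SET syntax: run it as a statement in a SQL session, since spark-defaults.conf does not interpret it.

set mapred.min.split.size=1342177280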
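The spark.yarn.dist.archives entry above points at carbonlib/carbondata.tar.gz, and spark.executor.extraClassPath expects a carbonlib/ directory inside that archive, but no earlier step creates the file. Below is a minimal sketch of packaging it, assuming the Spark directory layout used in this article; the function name package_carbonlib is made up for illustration.

```shell
# package_carbonlib SPARK_HOME
# Archives SPARK_HOME/carbonlib so the tarball contains a top-level
# carbonlib/ directory, then stores the tarball inside carbonlib/.
package_carbonlib() {
  spark_home="$1"
  (
    cd "$spark_home" || exit 1
    # Drop any tarball from an earlier run so it is not archived into itself.
    rm -f carbonlib/carbondata.tar.gz
    tar -zcf carbondata.tar.gz carbonlib/ || exit 1
    mv carbondata.tar.gz carbonlib/
  )
}

# Example (path taken from the cd commands earlier in this article):
#   package_carbonlib /opt/software/spark-2.3.2-bin-2.6.0-cdh5.15.2
```

Running the work in a subshell keeps the caller's working directory unchanged.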

Edit carbon.properties and add the following parameters

[root@cdh01 conf]# vim carbon.properties

carbon.storelocation=hdfs://192.168.1.130:8020/user/hive/warehouse/carbon.store

carbon.task.distribution=merge_small_files

Hive metadata DB (very important)

Copy hive-site.xml out of the CDH configuration and place it in the conf directory of the Spark distribution (a step I originally left out): /opt/spark-2.3.2-bin-2.6.0-cdh5.15.2/conf
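That copy step might look like the sketch below. The source path /etc/hive/conf/hive-site.xml is an assumption (a common location for the client configuration on CDH nodes), and install_hive_site is a made-up helper name; check where your cluster actually keeps the file.

```shell
# install_hive_site SRC SPARK_CONF_DIR
# Copies hive-site.xml into Spark's conf directory and fails with a message
# if the source file is missing.
install_hive_site() {
  src="$1"
  conf_dir="$2"
  if [ ! -f "$src" ]; then
    echo "hive-site.xml not found at: $src" >&2
    return 1
  fi
  cp "$src" "$conf_dir/hive-site.xml"
}

# Example (source path is an assumption; destination is from this article):
#   install_hive_site /etc/hive/conf/hive-site.xml /opt/spark-2.3.2-bin-2.6.0-cdh5.15.2/conf
```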

Edit spark-env.sh and add the following environment variables (adjust the values to match your available resources)

export SPARK_MASTER_IP=cdh01

export SCALA_HOME=/opt/software/scala-2.11.8

export SPARK_WORKER_MEMORY=3g

export JAVA_HOME=/usr/java/jdk1.8.0_45

export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop

export HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/etc/hadoop

Add a startup script

[root@cdh01 hadoop-hdfs]# pwd

/var/lib/hadoop-hdfs

[root@cdh01 hadoop-hdfs]# vim startup.sh

sh /opt/software/spark-2.3.2-bin-2.6.0-cdh5.15.2/bin/spark-submit \

--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \

--num-executors 2 --driver-memory 3g --executor-memory 6g --executor-cores 2 \

/opt/software/spark-2.3.2-bin-2.6.0-cdh5.15.2/carbonlib/apache-carbondata-1.5.3-SNAPSHOT-bin-spark2.3.2-hadoop2.6.0-cdh5.15.2.jar \

hdfs://192.168.137.130:8020/user/hive/warehouse/carbon.store # location of the CarbonData metadata

After adding the content above, save the file

Run chmod +x startup.sh; the script is ready once ls shows its name in green (executable)

Then start it

[root@cdh01 hadoop-hdfs]# sh startup.sh

It reported the following error, so we need to adjust the resources of the CDH YARN service

java.lang.IllegalArgumentException: Required executor memory (6144+614 MB) is above the max threshold (1041 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:318)

at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:166)

at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)

at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)

at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)

at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)

at org.apache.spark.sql.CarbonSession$CarbonBuilder$$anonfun$2.apply(CarbonSession.scala:241)

at org.apache.spark.sql.CarbonSession$CarbonBuilder$$anonfun$2.apply(CarbonSession.scala:233)

at scala.Option.getOrElse(Option.scala:121)

at org.apache.spark.sql.CarbonSession$CarbonBuilder.getOrCreateCarbonSession(CarbonSession.scala:233)

at org.apache.spark.sql.CarbonSession$CarbonBuilder.getOrCreateCarbonSession(CarbonSession.scala:169)

at org.apache.carbondata.spark.thriftserver.CarbonThriftServer$.main(CarbonThriftServer.scala:74)

at org.apache.carbondata.spark.thriftserver.CarbonThriftServer.main(CarbonThriftServer.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)

at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2019-02-26 13:38:45 INFO AbstractConnector:318 - Stopped Spark@7d3c09ec{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

2019-02-26 13:38:45 INFO SparkUI:54 - Stopped Spark web UI at http://cdh01:4040

2019-02-26 13:38:45 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Attempted to request executors before the AM has registered!

2019-02-26 13:38:45 INFO YarnClientSchedulerBackend:54 - Stopped

2019-02-26 13:38:45 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!

2019-02-26 13:38:45 INFO MemoryStore:54 - MemoryStore cleared

2019-02-26 13:38:45 INFO BlockManager:54 - BlockManager stopped

2019-02-26 13:38:45 INFO BlockManagerMaster:54 - BlockManagerMaster stopped

2019-02-26 13:38:45 WARN MetricsSystem:66 - Stopping a MetricsSystem that is not running

2019-02-26 13:38:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!

2019-02-26 13:38:45 INFO SparkContext:54 - Successfully stopped SparkContext

Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (6144+614 MB) is above the max threshold (1041 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
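The numbers in this error can be reproduced by hand. On YARN, Spark requests the executor memory plus an overhead which, by default in Spark 2.3, is 10% of the executor memory with a floor of 384 MB. So 6144 MB + 614 MB = 6758 MB, far above the 1041 MB this cluster allows per container. Below is a small sketch of that arithmetic; required_container_mb is a made-up helper name, and it assumes spark.executor.memoryOverhead is left at its default.

```shell
# required_container_mb EXECUTOR_MEMORY_MB
# Prints executor memory plus Spark's default YARN memory overhead:
# max(384 MB, 10% of executor memory), using integer arithmetic.
required_container_mb() {
  mem_mb="$1"
  overhead_mb=$(( mem_mb * 10 / 100 ))
  if [ "$overhead_mb" -lt 384 ]; then
    overhead_mb=384
  fi
  echo $(( mem_mb + overhead_mb ))
}

required_container_mb 6144   # --executor-memory 6g from startup.sh; prints 6758
```

Either raise yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb in the CDH YARN configuration to at least that figure, or lower --executor-memory in startup.sh so the result fits under the existing limit.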