Environment
CDH version: cdh5.14.0
Hive version: 1.1.0-cdh5.13.0
Spark version: 2.3.1 (Scala 2.11)
MLSQL version: 1.2.0
POM dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>***</artifactId>
        <groupId>***</groupId>
        <version>***</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>mlsql-demo</artifactId>

    <properties>
        <scala.version>2.11.8</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
        <spark.version>2.3.1</spark.version>
        <mlsql.version>1.2.0</mlsql.version>
        <!-- Hive version of the cluster; referenced by the hive-jdbc dependency below -->
        <hive.version>1.1.0-cdh5.13.0</hive.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.spark</groupId>
                    <artifactId>spark-sql_${scala.binary.version}</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>tech.mlsql</groupId>
            <artifactId>streamingpro-common</artifactId>
            <version>${mlsql.version}</version>
        </dependency>
        <dependency>
            <groupId>tech.mlsql</groupId>
            <artifactId>streamingpro-api</artifactId>
            <version>${mlsql.version}</version>
        </dependency>
        <dependency>
            <groupId>tech.mlsql</groupId>
            <artifactId>streamingpro-dsl</artifactId>
            <version>${mlsql.version}</version>
        </dependency>
        <dependency>
            <groupId>tech.mlsql</groupId>
            <artifactId>streamingpro-spark-common</artifactId>
            <version>${mlsql.version}</version>
        </dependency>
        <dependency>
            <groupId>tech.mlsql</groupId>
            <artifactId>streamingpro-mlsql-spark_2.3</artifactId>
            <version>${mlsql.version}</version>
        </dependency>
        <dependency>
            <groupId>tech.mlsql</groupId>
            <artifactId>streamingpro-spark-2.3.0-adaptor</artifactId>
            <version>${mlsql.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
            </plugin>
            <plugin>
                <groupId>org.antlr</groupId>
                <artifactId>antlr4-maven-plugin</artifactId>
                <version>4.7.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>antlr4</goal>
                        </goals>
                        <!--<phase>generate-sources</phase>-->
                    </execution>
                </executions>
                <configuration>
                    <visitor>false</visitor>
                    <sourceDirectory>src/main/resources</sourceDirectory>
                    <outputDirectory>src/main/java/streaming/dsl/parser</outputDirectory>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Code implementation
// Import paths as of MLSQL 1.2.0 (they may differ in other versions)
import org.apache.spark.sql.SparkSession
import streaming.dsl.{MLSQLExecuteContext, ScriptSQLExec, ScriptSQLExecListener}
import tech.mlsql.job.{JobManager, MLSQLJobInfo, MLSQLJobType}

val spark = SparkSession.builder()
  .appName("MlsqlJob")
  //.master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Parse and execute one MLSQL script on the given SparkSession, then show
// the table produced by the script's final `... as <table>` clause.
def evaluateMLSQL(spark: SparkSession, mlsql: String): Unit = {
  JobManager.initForTest(spark)
  val listener = new ScriptSQLExecListener(spark, null, null)
  val groupId = "group"
  JobManager.addJobManually(MLSQLJobInfo("user", MLSQLJobType.SCRIPT, "jobName", "jobContent",
    groupId, System.currentTimeMillis(), -1))
  ScriptSQLExec.setContext(MLSQLExecuteContext(listener, "user", "/tmp/user", groupId, Map()))
  ScriptSQLExec.parse(mlsql, listener, skipInclude = true, skipAuth = true,
    skipPhysicalJob = false, skipGrammarValidate = false)
  // getLastSelectTable holds the name registered by the final `as` clause;
  // .get throws if the script did not end with a select
  val tName = listener.getLastSelectTable().get
  val table = spark.table(tName)
  table.show()
}
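For reference, a minimal sketch of how evaluateMLSQL can be wired into the main class submitted via spark-submit (the object name and the hard-coded script are illustrative; in practice the script would come from application arguments or a config file):

object MlsqlJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MlsqlJob")
      .enableHiveSupport()
      .getOrCreate()
    // Same query as in the test section below
    evaluateMLSQL(spark, "select * from default.test limit 10 as xj;")
    spark.stop()
  }
}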
Dependency conflict pitfalls
1. The antlr4 conflict
When the job is submitted with spark-submit, MLSQL's antlr4 jar conflicts with the one bundled in Spark. The cause: MLSQL's DSL is defined with its own ANTLR grammar, and Spark SQL's parser is generated with ANTLR as well, but the two depend on different antlr4-runtime versions. When submitting in yarn-cluster mode, Spark's antlr4 jar takes precedence on the classpath, so MLSQL statements consistently fail to parse. The solution is to make both the driver and the executors load the antlr4 jar declared in our own pom instead of Spark's, by adding the following options to the spark-submit script:
--conf spark.executor.extraClassPath=antlr4-runtime-4.7.1.jar
--conf spark.driver.extraClassPath=antlr4-runtime-4.7.1.jar
With that, the antlr4 conflict is resolved and most of the official MLSQL demos run normally.
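To confirm which antlr4-runtime actually got loaded at run time, a one-line check like the following can be dropped into the driver code (plain JVM reflection, nothing MLSQL-specific); after the extraClassPath fix it should print the path of antlr4-runtime-4.7.1.jar:

// Print the jar from which the ANTLR runtime classes were loaded
println(classOf[org.antlr.v4.runtime.Parser]
  .getProtectionDomain.getCodeSource.getLocation)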
2. Errors when reading Hive data over Hive JDBC
This problem has always existed when deploying the MLSQL engine itself, and I hit the same error when embedding MLSQL. The cause is a version conflict between the MLSQL jars and the jars shipped with Spark; after some digging, certain Hive-related jars bundled in the Spark 2.3 jars directory have to be removed. The two conflicting jars are:
hive-jdbc-1.2.1.spark2.jar
spark-hive-thriftserver_2.11-2.3.1.jar
After deleting these two jars, Hive data can be read over JDBC and processed with MLSQL as normal; the Hive client classes are then supplied by the hive-jdbc dependency declared in our own pom.
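A quick way to verify the cleanup independently of MLSQL is to open a plain Hive JDBC connection from the driver. A minimal sketch, using the same URL and credentials as the test section below (adjust for your cluster):

import java.sql.DriverManager

// Resolves against our own hive-jdbc dependency once the conflicting
// jars have been removed from $SPARK_HOME/jars
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://master01:10000/default", "hive", "hive")
val rs = conn.createStatement().executeQuery("select * from test limit 10")
while (rs.next()) {
  println(rs.getString(1)) // first column of each row
}
conn.close()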
Testing
Testing Spark SQL reads of Hive
MLSQL statement:
select * from default.test limit 10 as xj;
Result: the expected rows are returned (result screenshot omitted).
Testing JDBC reads of Hive
MLSQL statement:
set user="hive";
connect jdbc where
url="jdbc:hive2://master01:10000/default"
and driver="org.apache.hive.jdbc.HiveDriver"
and user="${user}"
and password="hive"
as db_1;
load jdbc.`db_1.test` as test;
select * from test limit 10 as xj;
Result: the expected rows are returned (result screenshot omitted).
Both the Spark SQL path and the JDBC path for reading Hive pass the tests; the JDBC case is also exercised by the sketch below.
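For completeness, the JDBC test can be driven through the same evaluateMLSQL entry point; the script is exactly the one above (note the plain triple-quoted string, so ${user} is resolved by MLSQL's set statement, not by Scala):

val jdbcScript =
  """set user="hive";
    |connect jdbc where
    |url="jdbc:hive2://master01:10000/default"
    |and driver="org.apache.hive.jdbc.HiveDriver"
    |and user="${user}"
    |and password="hive"
    |as db_1;
    |load jdbc.`db_1.test` as test;
    |select * from test limit 10 as xj;""".stripMargin
evaluateMLSQL(spark, jdbcScript)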
By embedding MLSQL into our own project, we can manage its resources ourselves rather than keeping a long-running mlsql-engine service on YARN that occupies resources indefinitely.
Complete launch script
$SPARK_HOME/bin/spark-submit --master yarn \
--deploy-mode cluster \
--name mlsql \
--executor-memory 1G \
--driver-memory 1G \
--executor-cores 1 \
--driver-cores 2 \
--files ${log4j},${hadoop_site},${config_file} \
--driver-java-options "${cp_ops}" \
--jars <jar paths your custom MLSQL project depends on> \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.sql.hive.metastorePartitionPruning=false \
--conf spark.sql.parquet.enableVectorizedReader=false \
--conf spark.executor.extraClassPath=antlr4-runtime-4.7.1.jar \
--conf spark.driver.extraClassPath=antlr4-runtime-4.7.1.jar \
--conf spark.executorEnv.JAVA_HOME=/opt/cloudera/parcels/jdk8 \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/cloudera/parcels/jdk8 \
--conf spark.shuffle.service.enabled=true \
--class <main class> \
<your custom MLSQL project jar>