文章目录
Github
地址:https://github.com/ithuhui/hui-base-java
模块:【hui-base-spark】
分支:master
位置:com.hui.base.spark.sparksql.job.MetroAnalysisJob
Note
hadoop完全分布式集群搭建刚写完。还是先写一下应用。这里写一下sparksql怎么应用起来。
Spark SQL是spark家族里面最常用的。在实际开发当中也比RDD常用。
题外话。想了解RDD的java函数式编程可以看
【Spark】SparkRDD开发手册(JavaAPI函数式编程) https://blog.csdn.net/HuHui_/article/details/83905308
-
我们先看一下wiki(来历):
Spark SQL在Spark核心上带出一种名为SchemaRDD的数据抽象化概念,提供结构化和半结构化数据相关的支持。Spark SQL提供了领域特定语言,可使用Scala、Java或Python来操纵SchemaRDDs。它还支持使用使用命令行界面和ODBC/JDBC服务器操作SQL语言。在Spark 1.3版本,SchemaRDD被重命名为DataFrame。
-
重点注意
以前项目经常能看到Hive解析MapReduce提交到集群上运行。后来出现了Spark,又有了Hive解析成SparkJob提交到集群上运行。这里解释一下
SparkSQL的前身是Shark,但又因为Shark对于Hive的太多依赖,2014年spark团队停止对Shark的开发,将所有资源放SparkSQL项目上,SparkSQL作为Spark生态的一员逐渐发展,而不再受限于Hive,只是兼容Hive;Hive on Spark是由Cloudera发起,由Intel、MapR等公司共同参与的开源项目,2014年spark团队停止对Shark的开发,将所有资源放SparkSQL项目上,也就是说,Hive将不再受限于一个引擎,可以采用Map-Reduce、Tez、Spark等引擎
-
应用场景
可以说大数据应用来说是最简单的,最方便应用的计算方式。通过对数据源的读取后,使用sql语言即可分析并解决大部分的大数据分析计算问题。一般在数据分层里面不管是源数据计算还是业务数据分析都十分常用。
-
小编提醒
我github的例子不需要你去安装spark环境。单机版的在本机local既可用。
以后更新完所有的基础用法。用一个项目把全部应用起来。就不能用单机版跑了。
但是学习过程并不需要被这些束缚。
最常用的是计算第一第二层数据,存hbase,然后计算第三层数据(业务相关),存到结果表(ES or Mysql…etc)。
Ready
- maven依赖 我这里使用spark2.1.1
- sql基础
- mysql(其他database也可以,用于计算结果存储)
Core Code
1. Mysql数据库建结果表
CREATE TABLE `hui_metro_test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`from_station` varchar(255) DEFAULT NULL,
`to_station` varchar(255) DEFAULT NULL,
`count` int(11) DEFAULT NULL,
`distance` double DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=201743 DEFAULT CHARSET=utf8;
2. DB配置文件
db.url=jdbc:mysql://127.0.0.1:3306/hui?characterEncoding=UTF-8
db.user=root
db.password=123456
db.driver=com.mysql.jdbc.Driver
3. 搞个文件
{ "from_station":"西朗","to_station":"坑口","count":"1","distance":"1.6" },
{ "from_station":"西朗","to_station":"花地湾","count":"2","distance":"2.5" },
{ "from_station":"西朗","to_station":"芳村","count":"3","distance":"3.8" },
{ "from_station":"西朗","to_station":"黄沙","count":"4","distance":"5.2" },
{ "from_station":"西朗","to_station":"长寿路","count":"5","distance":"6.1" },
{ "from_station":"西朗","to_station":"陈家祠","count":"6","distance":"7.3" },
{ "from_station":"西朗","to_station":"西门口","count":"7","distance":"8.3" },
{ "from_station":"西朗","to_station":"公园前","count":"8","distance":"9.1" },
{ "from_station":"西朗","to_station":"农讲所","count":"9","distance":"10.3" },
{ "from_station":"西朗","to_station":"烈士陵园","count":"10","distance":"11.3" }
4. 数据分层
我这里没有什么数据分层,根据实际需要。我是直接把分析结果文件存入mysql
5. SparkJob父类
/**
* <b><code>SparkJob</code></b>
* <p/>
* Description:
* <p/>
* <b>Creation Time:</b> 2018/11/11 17:39.
*
* @author Hu Weihui
*/
public class SparkJob {
/**
* The constant LOGGER.
*
* @since hui_project 1.0.0
*/
private static final Logger LOGGER = LoggerFactory.getLogger(SparkJob.class);
/**
* The constant serialVersionUID.
*
* @since hui_project 1.0.0
*/
private static final long serialVersionUID = 771902776566370732L;
/**
* Instantiates a new Spark job.
*/
protected SparkJob(){}
/**
* Execute.
*
* @param sparkContext the spark context
* @param args the args
* @since hui_project 1.0.0
*/
public void execute(JavaSparkContext sparkContext, String[] args) {
}
/**
* Execute.
*
* @param sparkContext the spark context
* @since hui_project 1.0.0
*/
public void execute(JavaSparkContext sparkContext) {
}
}
6. MetroAnalysisJob(具体业务sparkjob)
/**
* <b><code>MetroAnalysisJob</code></b>
* <p/>
* Description:
* <p/>
* <b>Creation Time:</b> 2018/11/11 17:32.
*
* @author Hu Weihui
*/
public class MetroAnalysisJob extends SparkJob {
private static Logger LOGGER = LoggerFactory.getLogger(MetroAnalysisJob.class);
private static final String INPUT_FILE_PATH
= MetroAnalysisJob.class.getClassLoader().getResource("test.json").toString();
private static final String OUTPUT_FILE_PATH
= "D:/test/test";
private static final String SQL = "select * from hui_metro_testjson";
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf()
.setAppName("test")
.setMaster("local[4]");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
MetroAnalysisJob metroAnalysisJob = new MetroAnalysisJob();
metroAnalysisJob.execute(sparkContext, args);
}
@Override
public void execute(JavaSparkContext sparkContext, String[] args) {
super.execute(sparkContext, args);
deal(sparkContext, INPUT_FILE_PATH, OUTPUT_FILE_PATH);
}
/**
* 数据逻辑处理
* @param sparkContext
* @param inPutPath
* @param outPutPath
*/
private void deal(JavaSparkContext sparkContext, String inPutPath, String outPutPath) {
SparkJobUtil.checkFileExists(inPutPath);
SQLContext sqlContext = new SQLContext(sparkContext);
// sqlContext.setConf("spark.sql.parquet.binaryAsString","true");
//创建快照临时表
Dataset<Row> dataset = sqlContext.read().json(inPutPath);
dataset.registerTempTable("hui_metro_testjson");
dataset.show(10);
Dataset<Row> resultFrame = sqlContext.sql(SQL);
if (resultFrame.count() > 0) {
resultFrame.repartition(3).write()
.mode(SaveMode.Append).json(outPutPath);
}
resultFrame.show(10);
//结果写入数据库
MySQLJdbcConfig jdbcConfig = new MySQLJdbcConfig();
jdbcConfig.init();
resultFrame.write().mode("append")
.jdbc(jdbcConfig.getUrl(), "hui_metro_test", jdbcConfig.getConnectionProperties());
}
}
7. SparkJob工具类
/**
* <b><code>SparkJobUtil</code></b>
* <p/>
* Description:
* <p/>
* <b>Creation Time:</b> 2018/11/11 17:48.
*
* @author Hu Weihui
*/
public class SparkJobUtil {
/**
* The constant LOGGER.
*
* @since hui_project 1.0.0
*/
private static final Logger LOGGER = LoggerFactory.getLogger(SparkJobUtil.class);
/**
* Close quietly.
*
* @param fileSystem the file system
* @since hui_project 1.0.0
*/
public static void closeQuietly(FileSystem fileSystem) {
if (fileSystem != null) {
try {
fileSystem.close();
} catch (IOException e) {
LOGGER.error("Fail to close FileSystem:" + fileSystem, e);
}
}
}
/**
* Check file exists.
*
* @param path the path
* @since hui_project 1.0.0
*/
public static void checkFileExists(String path) {
Configuration configuration = new Configuration();
FileSystem fileSystem = null;
try {
fileSystem = FileSystem.get(configuration);
if (!fileSystem.exists(new Path(path))) {
throw new FileNotFoundException(path);
}
} catch (IOException e) {
e.printStackTrace();
}finally {
closeQuietly(fileSystem);
}
}
}
8. MySQLjdbcConfig
主要是读取配置文件然后传入DataFrame去连接
/**
* <b><code>MySQLJdbcConfig</code></b>
* <p/>
* Description:
* <p/>
* <b>Creation Time:</b> 2018/11/11 17:32.
*
* @author Hu Weihui
*/
public class MySQLJdbcConfig {
private static final Logger LOGGER = LoggerFactory.getLogger(MySQLJdbcConfig.class);
private String table;
private String url;
private Properties connectionProperties;
public void init(){
Properties properties = new Properties();
InputStream resourceAsStream = this.getClass().getClassLoader().getResourceAsStream("jdbc.properties");
try {
properties.load(resourceAsStream);
setUrl(properties.getProperty("db.url"));
//考虑多数据源的情况,另外创建properties传入
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user",properties.getProperty("db.user"));
connectionProperties.setProperty("password",properties.getProperty("db.password"));
connectionProperties.setProperty("url",properties.getProperty("db.url"));
setConnectionProperties(connectionProperties);
} catch (IOException e) {
LOGGER.info("读取配置文件失败");
}
}
public String getTable() {
return table;
}
public void setTable(String table) {
this.table = table;
}
public String getUrl() {
return url;
}
public void setUrl(String url) {
this.url = url;
}
public Properties getConnectionProperties() {
return connectionProperties;
}
public void setConnectionProperties(Properties connectionProperties) {
this.connectionProperties = connectionProperties;
}
}
9. Running Result
总结
- Spark不难,但是原理需要我们去理解。以后再更新源码方面,更详细的东西
- 数据分层很重要
- 转载注明一下作者呗 。 感谢小哥哥小姐姐~
- 喜欢的话留下评论讨论问题,如果能帮到你们很开心。
Author
作者:HuHui
转载:欢迎一起讨论web和大数据问题,转载请注明作者和原文链接,感谢