Preface:
I took over a large batch of data-analysis requirements and had to put a framework together on short notice. I stepped into plenty of pits along the way, so I am recording them here; the next time I set up a framework and have forgotten the details, I can come back and look.
1. Spring Boot framework setup
1. pom.xml
A pom.xml that has been verified in practice. It is not exhaustive, but it works as-is. Note that the Spark dependencies pull in a lot of transitive artifacts, so the first download can take quite a while.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.xxx</groupId>
<artifactId>xxx</artifactId>
<version>1.0-SNAPSHOT</version>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>1.5.7.RELEASE</version>
<relativePath/>
</parent>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<java.version>1.8</java.version>
<scala.version>2.10.4</scala.version>
<spark.version>1.6.2</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-tomcat</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>
<!-- enable the Spring cache abstraction -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-cache</artifactId>
</dependency>
<!-- Ehcache as the cache implementation -->
<dependency>
<groupId>net.sf.ehcache</groupId>
<artifactId>ehcache</artifactId>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.6</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!--<dependency>-->
<!--<groupId>com.fasterxml.jackson.core</groupId>-->
<!--<artifactId>jackson-databind</artifactId>-->
<!--<version>2.4.4</version>-->
<!--</dependency>-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-launcher_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.ansj</groupId>
<artifactId>ansj_seg</artifactId>
<version>5.1.1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.23</version>
<!--<version>8.0.11</version>-->
</dependency>
<!--mybatis -->
<dependency>
<groupId>org.mybatis.spring.boot</groupId>
<artifactId>mybatis-spring-boot-starter</artifactId>
<version>1.2.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.0.11</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.28</version>
</dependency>
<dependency>
<groupId>io.jsonwebtoken</groupId>
<artifactId>jjwt</artifactId>
<version>0.9.1</version>
</dependency>
<dependency>
<groupId>io.springfox</groupId>
<artifactId>springfox-swagger-ui</artifactId>
<version>2.6.1</version>
</dependency>
<dependency>
<groupId>io.springfox</groupId>
<artifactId>springfox-swagger2</artifactId>
<version>2.6.1</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.16.10</version>
</dependency>
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
</dependency>
</dependencies>
<build>
<finalName>xxx</finalName>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<mainClass>com.xxx.xxx.Application</mainClass>
</configuration>
<executions>
<execution>
<goals>
<goal>repackage</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
<resources>
<resource>
<directory>src/main/resources</directory>
<includes>
<include>**/**</include>
</includes>
<filtering>false</filtering>
</resource>
</resources>
</build>
</project>
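The spring-boot-maven-plugin above points at com.xxx.xxx.Application as the main class, but the entry class itself is not shown anywhere. As a minimal sketch (the package name is a placeholder, and @EnableCaching is only there because spring-boot-starter-cache is on the classpath), it would look roughly like this:
package com.xxx.xxx;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cache.annotation.EnableCaching;

// Entry point referenced by the spring-boot-maven-plugin's <mainClass>.
// @EnableCaching activates the cache abstraction pulled in by spring-boot-starter-cache.
@SpringBootApplication
@EnableCaching
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}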
2. application configuration
2.1 The application.properties files
The configuration is split into two files so that switching between profile versions is easy.
application.properties
spring.profiles.active=dev
application-dev.properties
##Two data sources; they are wired up in the configuration classes below
localB.jdbc.ds.url=jdbc:mysql://localhost/xxxx?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC
localB.jdbc.ds.username=xxxx
localB.jdbc.ds.password=xxxx
localB.jdbc.ds.driver-class-name=com.mysql.jdbc.Driver
##These pool settings are merely more comfortable than the defaults; there is still room for tuning
localB.jdbc.ds.initialSize=10
localB.jdbc.ds.maxActive=200
localB.jdbc.ds.poolPreparedStatements=false
localB.jdbc.ds.maxOpenPreparedStatements=100
localB.jdbc.ds.maxPoolPreparedStatementPerConnectionSize=20
localA.jdbc.ds.url=jdbc:mysql://xxxxx/xxxx?characterEncoding=UTF-8&useSSL=false&useOldAliasMetadataBehavior=true&zeroDateTimeBehavior=convertToNull
localA.jdbc.ds.username=xxxx
localA.jdbc.ds.password=xxxxx
localA.jdbc.ds.driver-class-name=com.mysql.jdbc.Driver
localA.jdbc.ds.initialSize=10
localA.jdbc.ds.maxActive=200
localA.jdbc.ds.poolPreparedStatements=false
localA.jdbc.ds.maxOpenPreparedStatements=100
localA.jdbc.ds.maxPoolPreparedStatementPerConnectionSize=20
#Do not use the MyBatis defaults; this configuration is moved into the config classes below
#mybatis.type-aliases-package=com.xxxx.data.model.entity
#mybatis.mapper-locations=classpath:mapper/*.xml
server.port=8089
server.contextPath=/xxxx
server.tomcat.uri-encoding=utf-8
#http encoding
spring.http.encoding.charset=UTF-8
spring.http.encoding.enabled=true
spring.http.encoding.force=true
logging.file=../export/xxxxx.txt
logging.level.root=debug
logging.level.org.springframework.web=debug
logging.level.sample.mybatis.mapper=TRACE
spring.main.banner-mode=off
spring.http.multipart.maxFileSize=10000Mb
spring.http.multipart.maxRequestSize=10000Mb
#spring.freemarker.check-template-location=false
spark.spark-home=.
spark.appname=xxxxx
spark.master=local
3. Dual data source configuration
1. One local database, one production database. With both configured they can be operated on at the same time, which makes the daily synchronization and analysis of the log data on both sides convenient.
2. Write two configuration classes, each with its own mappers and XML files; calling a mapper that belongs to a data source routes the operation to that database.
3. One thing to watch: sessionFactory.setVfs(SpringBootVFS.class); is a Spring Boot-specific adaptation (without it, type-alias scanning fails inside an executable jar).
import com.alibaba.druid.pool.DruidDataSource;
import org.apache.ibatis.session.SqlSessionFactory;
import org.mybatis.spring.SqlSessionFactoryBean;
import org.mybatis.spring.annotation.MapperScan;
import org.mybatis.spring.boot.autoconfigure.SpringBootVFS;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import javax.sql.DataSource;
@Configuration
@MapperScan(basePackages ="com.onehearts.data.dao.localDB",sqlSessionFactoryRef = "localDBSessionFactory")
public class LocalDataSourceConfiguration {
//Mapper XML locations scanned for this data source
static final String MAPPER_LOCATION = "classpath:mapper/local/*.xml";
@Bean("localDataSource")
//NOTE: the prefix must match the property keys in application-dev.properties (localA/localB.jdbc.ds above)
@ConfigurationProperties(prefix = "local.jdbc.ds")
@Primary
public DruidDataSource localDataSource() {
DruidDataSource dds = new DruidDataSource();
return dds;
}
@Bean(name = "localTransactionManager")
@Primary
public DataSourceTransactionManager localTransactionManager() {
return new DataSourceTransactionManager(localDataSource());
}
@Bean(name = "localDBSessionFactory")
@Primary
public SqlSessionFactory localDBSessionFactory(@Qualifier("localDataSource") DataSource localDataSource)
throws Exception {
final SqlSessionFactoryBean sessionFactory = new SqlSessionFactoryBean();
sessionFactory.setDataSource(localDataSource);
sessionFactory.setMapperLocations(new PathMatchingResourcePatternResolver()
.getResources(LocalDataSourceConfiguration.MAPPER_LOCATION));
//Adapt to the Spring Boot executable-jar environment
sessionFactory.setVfs(SpringBootVFS.class);
//register entity type aliases
sessionFactory.setTypeAliasesPackage("com.xxx.data.model.entity");
return sessionFactory.getObject();
}
}
The second data source:
import com.alibaba.druid.pool.DruidDataSource;
import org.apache.ibatis.session.SqlSessionFactory;
import org.mybatis.spring.SqlSessionFactoryBean;
import org.mybatis.spring.annotation.MapperScan;
import org.mybatis.spring.boot.autoconfigure.SpringBootVFS;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;
import javax.sql.DataSource;
@Configuration
@MapperScan(basePackages ="com.xxx.data.dao.xxxDB" ,sqlSessionFactoryRef="xxxDBSessionFactory")
public class OneheartDataSourceConfiguration {
//Mapper XML locations scanned for this data source
static final String MAPPER_LOCATION = "classpath:mapper/xxx/*.xml";
//the secondary (non-primary) data source
@Bean("xxxDataSource")
@ConfigurationProperties(prefix = "xxx.jdbc.ds")
public DruidDataSource xxxDataSource() {
DruidDataSource dds = new DruidDataSource();
return dds;
}
@Bean(name = "xxxTransactionManager")
public DataSourceTransactionManager xxxTransactionManager() {
return new DataSourceTransactionManager(xxxDataSource());
}
@Bean(name = "xxxDBSessionFactory")
public SqlSessionFactory oneheartDBSessionFactory(@Qualifier("xxxDataSource") DataSource xxxDataSource)
throws Exception {
final SqlSessionFactoryBean sessionFactory = new SqlSessionFactoryBean();
sessionFactory.setDataSource(xxxDataSource);
sessionFactory.setMapperLocations(new PathMatchingResourcePatternResolver()
.getResources(OneheartDataSourceConfiguration.MAPPER_LOCATION));
sessionFactory.setVfs(SpringBootVFS.class);
sessionFactory.setTypeAliasesPackage("com.xxx.data.model.entity");
return sessionFactory.getObject();
}
}
Using the data sources:
@Autowired
private LocalADao daoA;
@Autowired
private LocalBDao daoB;
//any method will do; each DAO call is routed to its own database
public void queryBothDatabases() {
//query database A
daoA.findLocalA();
//query database B
daoB.findLocalB();
}
//the mapper interfaces are picked up by the @MapperScan base packages above
@Component("localADao")
public interface LocalADao {
List<WsEntity> findLocalA();
}
<mapper namespace="xxxxxx.LocalADao ">
<select id="findLocalA" resultType="WsEntity">
select
*
from
xxx
</select>
</mapper >
<mapper namespace="xxxxxx.LocalBDao ">
<select id="findLocalB" resultType="WsEntity">
select
*
from
xxx
</select>
</mapper >
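Since the whole point of the dual data sources is the daily synchronization and analysis of log data (section 3 above), a natural consumer is a scheduled job that reads from one database and writes to the other. A minimal sketch, assuming a hypothetical saveBatch mapper method on the B side and @EnableScheduling on the application class:
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class DailySyncJob {
    @Autowired
    private LocalADao daoA; // production database
    @Autowired
    private LocalBDao daoB; // local analysis database

    // Runs at 02:00 every day; each mapper call is routed to its own data source
    @Scheduled(cron = "0 0 2 * * ?")
    public void syncLogs() {
        List<WsEntity> logs = daoA.findLocalA(); // read the log rows from A
        daoB.saveBatch(logs);                    // saveBatch is a hypothetical batch-insert mapper on B
    }
}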
4. Spark configuration
1. The key class is SparkConf.
2. You need to download the hadoop-common-bin package (it bundles winutils.exe) and set the environment variable at the system level (on Windows a restart is needed for it to take effect); otherwise Spark cannot persist files to disk.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.springframework.boot.autoconfigure.condition.ConditionalOnMissingBean;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
@ConfigurationProperties(prefix = "spark")
public class SparkContextBean {
private String sparkHome = ".";
private String appName = "xxx";
private String master = "local";
// spark.executor.memory=3g
// spark.eventLog.enabled=true
// spark.driver.maxResultSize=1g
@Bean
@ConditionalOnMissingBean(SparkConf.class)
public SparkConf sparkConf() throws Exception {
//The environment variable must be set at the system level; if it is not, add this line in code:
// System.setProperty("hadoop.home.dir", "E:\\hadoop-common-bin-2.7.1");
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
// see https://www.cnblogs.com/yangcx666/p/8723826.html for details on these parameters
conf.set("spark.executor.memory","1g");
conf.set("spark.driver.cores","2");
conf.set("spark.driver.memory","512m");
conf.set("spark.ui.port","8003");
// conf.set("spark.eventLog.enabled","true");
// conf.set("spark.driver.maxResultSize","1g");
return conf;
}
@Bean
@ConditionalOnMissingBean(JavaSparkContext.class)
public JavaSparkContext javaSparkContext() throws Exception {
return new JavaSparkContext(sparkConf());
}
public String getSparkHome() {
return sparkHome;
}
public void setSparkHome(String sparkHome) {
this.sparkHome = sparkHome;
}
public String getAppName() {
return appName;
}
public void setAppName(String appName) {
this.appName = appName;
}
public String getMaster() {
return master;
}
public void setMaster(String master) {
this.master = master;
}
}
Concrete usage (an excerpt; sc is the injected JavaSparkContext, logger an SLF4J logger, and Lists/Joiner come from Guava):
public void sparkExerciseDemo() {
List<Integer> data = Lists.newArrayList(1, 2, 3, 4, 5, 6);
//build an RDD from a local collection; transformations are lazy, and each action
//on the RDD (such as saveAsTextFile below) triggers a job
JavaRDD<Integer> rdd01 = sc.parallelize(data);
rdd01.saveAsTextFile("E:\\spark-result\\" + System.currentTimeMillis());
//map applies a function to every element and returns a new RDD
rdd01 = rdd01.map(num -> (num * num));
rdd01.saveAsTextFile("E:\\spark-result\\" + System.currentTimeMillis());
//spark map :1,4,9,16,25,36
logger.debug("spark map :{}", Joiner.on(",").skipNulls().join(rdd01.collect()).toString());
//filter keeps only the elements that satisfy the predicate, returning a new RDD
rdd01 = rdd01.filter(x -> x < 6);
rdd01.saveAsTextFile("E:\\spark-result\\" + System.currentTimeMillis());
rdd01 = rdd01.filter(x -> x >2);
rdd01.saveAsTextFile("E:\\spark-result\\" + System.currentTimeMillis());
//spark filter :4 (x < 6 leaves 1,4; x > 2 then leaves just 4)
logger.debug("spark filter :{}", Joiner.on(",").skipNulls().join(rdd01.collect()).toString());
//flatMap maps each element to an iterable and flattens the results into a new RDD
//(commented out; in the Spark 1.6 Java API the flatMap function returns an Iterable)
// rdd01 = rdd01.flatMap(x -> Arrays.asList(x, x + 1, x + 2));
//distinct() would additionally remove duplicates
//flatMap :4 (the flatMap above is commented out, so rdd01 still holds just 4)
logger.info("flatMap :{}", Joiner.on(",").skipNulls().join(rdd01.collect()).toString());
JavaRDD<Integer> unionRdd = sc.parallelize(data);
//union concatenates the two RDDs (duplicates are kept)
rdd01 = rdd01.union(unionRdd);
rdd01.saveAsTextFile("E:\\spark-result\\" + System.currentTimeMillis());
/**
*
* //union :4,1,2,3,4,5,6
intersection: the elements common to both RDDs
cartesian: Cartesian product
subtract: remove the elements that also appear in the other RDD
collect: return all elements to the driver
count: return the number of elements
countByValue: return how often each element occurs
take(num): grab num elements
top(num): return the num largest elements
reduce(func): aggregate all elements of the RDD in parallel
foreach(func): apply func to every element
RDD.unpersist(): manually drop a cached RDD
persist: cache the RDD while keeping its lineage (the dependency chain)
checkpoint(): cut the RDD's lineage and materialize the data instead
transformations are replayed from the lineage to rebuild lost partitions
*
*/
logger.info("union :{}", Joiner.on(",").skipNulls().join(rdd01.collect()).toString());
List<Integer> result = Lists.newArrayList();
//reduce: pairwise aggregation of all elements, here a sum
result.add(rdd01.reduce((Integer v1, Integer v2) -> {
return v1 + v2;
}));
//reduce :25 (4 + 1 + 2 + 3 + 4 + 5 + 6)
logger.info("reduce :{}", Joiner.on(",").skipNulls().join(result).toString());
result.forEach(System.out::print);
JavaPairRDD<Integer, Iterable<Integer>> groupRdd = rdd01.groupBy(x -> {
logger.info("======grouby========:{}", x);
if (x > 10) return 0;
else return 1;
});
List<Tuple2<Integer, Iterable<Integer>>> resul = groupRdd.collect();
//group by key:1 value:4,1,2,3,4,5,6 (every element is <= 10, so all land under key 1)
resul.forEach(x -> {
logger.info("group by key:{} value:{}", x._1, Joiner.on(",").skipNulls().join(x._2).toString());
});
rdd01.saveAsTextFile("E:\\spark-result\\" + System.currentTimeMillis());
}
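To make the operator list in the long comment above concrete, here is a small sketch against the same Spark 1.6 Java API (sc is the JavaSparkContext from section 4; the extra imports needed are java.util.Map, java.util.List and org.apache.spark.storage.StorageLevel):
public void rddOperatorsDemo() {
    JavaRDD<Integer> nums = sc.parallelize(Lists.newArrayList(1, 2, 2, 3, 3, 3));
    //distinct removes duplicates -> 1,2,3 (order not guaranteed)
    JavaRDD<Integer> uniq = nums.distinct();
    //countByValue: occurrences of each element -> {1=1, 2=2, 3=3}
    Map<Integer, Long> counts = nums.countByValue();
    //take grabs the first num elements, top the num largest elements
    List<Integer> firstTwo = nums.take(2); // [1, 2]
    List<Integer> largest = nums.top(2);   // [3, 3]
    //persist keeps the RDD cached across actions; unpersist releases it manually
    uniq.persist(StorageLevel.MEMORY_ONLY());
    long n = uniq.count(); // 3
    uniq.unpersist();
    logger.info("counts:{} firstTwo:{} largest:{} n:{}", counts, firstTwo, largest, n);
}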
5. The Spark UI
RDD operations come in two kinds: transformations, which return a new RDD (the original RDD is never modified), and actions, which return something other than an RDD. Actions execute immediately, while transformations are lazy and deferred as long as possible; every job that executes is reflected in the Spark UI (port 8003 in the configuration above).
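A quick way to see this laziness in action: put a side effect inside a transformation and note that nothing happens until an action runs. A minimal sketch, using the same sc as above:
JavaRDD<Integer> lazy = sc.parallelize(Lists.newArrayList(1, 2, 3))
        .map(x -> {
            System.out.println("computing " + x); // not printed yet: map is a transformation
            return x * 10;
        });
// only now does a job execute (and show up in the UI)
List<Integer> out = lazy.collect(); // prints "computing 1/2/3", returns [10, 20, 30]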
More to follow…
6. Pitfalls to watch out for
When the data volume is large, keep the limits of each component in mind: MySQL's default limit on SQL/packet size, for example. With large volumes, avoid unnecessary indexes (insert efficiency suffers), think about caching and the memory parameters Spark uses, release the lineage between RDDs promptly to reduce resource consumption, and budget for disk capacity and read/write throughput.
Inside RDD operations, every variable the closure captures must be serializable. Our DAOs are not, so they cannot be used there directly; the following workaround solves it:
import java.io.Serializable;
import javax.annotation.PostConstruct;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
//Holds a static self-reference populated after construction, so Spark closures can
//reach the Spring-managed DAOs through the static field instead of capturing the bean
@Component
public class GetService implements Serializable{
@Autowired
public xxxLogDao xxxLogDao;
@Autowired
public SmsDao smsDao;
@Autowired
public WSDao wsDao;
@Autowired
public CommonDao commonDao;
@Autowired
public LogDao logDao;
@Autowired
public MailDao mailDao;
public static GetService getService;
@PostConstruct
public void init(){
getService = this;
getService.xxxLogDao = this.xxxLogDao;
getService.smsDao = this.smsDao;
getService.wsDao = this.wsDao;
getService.commonDao = this.commonDao;
getService.logDao = this.logDao;
getService.mailDao = this.mailDao;
}
}
Call the DAOs through the static reference like this, and the not-serializable error no longer occurs:
GetService.getService.xxxxLogDao
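For example, inside an RDD closure (a sketch; saveLog is a hypothetical mapper method, and this works because in local mode the driver and the executors share one JVM, so the static reference Spring initialized is visible):
rdd01.foreach(x -> {
    // nothing non-serializable is captured: the DAO is reached through the static field
    GetService.getService.logDao.saveLog(x); // saveLog is hypothetical
});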
Setting these parameters in MySQL's configuration file (my.ini on Windows, my.cnf on Linux) is quite important; they give a huge boost to insert performance:
bulk_insert_buffer_size=120M
max_allowed_packet=32M
net_buffer_length=8k
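Why max_allowed_packet matters becomes obvious once you batch inserts: the rewritten statement must fit in one packet. A sketch of a plain JDBC batch insert (table, column, and credentials are placeholders; rewriteBatchedStatements is a real Connector/J flag that collapses the batch into multi-row INSERTs):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertDemo {
    public static void main(String[] args) throws Exception {
        // rewriteBatchedStatements=true lets Connector/J rewrite the batch into
        // multi-row INSERTs; the resulting statement must fit in max_allowed_packet
        String url = "jdbc:mysql://localhost/xxxx?rewriteBatchedStatements=true&useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "xxxx", "xxxx");
             PreparedStatement ps = conn.prepareStatement("insert into xxx (val) values (?)")) {
            conn.setAutoCommit(false); // one commit per batch, not per row
            for (int i = 0; i < 10000; i++) {
                ps.setInt(1, i);
                ps.addBatch();
                if (i % 1000 == 0) {
                    ps.executeBatch(); // flush in chunks to bound the packet size
                }
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}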
7. Summary
Get a grip on the rough principles and the big picture first, then work through the details step by step, then tune. Summarize as you go, search as soon as you hit a pit, and reading the source code directly is often the fastest route to a solution.