Background
The test inserts 2000+ rows into a table with 120+ columns (a business requirement).
Execution-time statistics logging is enabled so the effect of batching can be compared before and after.
The workload includes business queries plus data processing (1 delete, 4 complex SQL queries, and 2000+ inserted rows in total).
The database is Oracle.
At the end there is a note on a saveAll pitfall that gets copied all over the web without ever being explained clearly.
Below are the batch-commit configuration and the execution-time statistics configuration:
spring:
  jpa:
    properties:
      hibernate:
        # print execution-time statistics
        generate_statistics: true
        jdbc:
          # enables batched inserts for Spring Data JPA's saveAll; the size can be changed at any time
          batch_size: 500
          batch_versioned_data: true
        order_inserts: true
        # batch updates are not used here, so this setting is optional
        order_updates: true
Without batching
Before batch inserts are configured, no JDBC batching happens and there is only a single flush:
4792100 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
84075100 nanoseconds spent preparing 2203 JDBC statements;
10701684100 nanoseconds spent executing 2203 JDBC statements;
0 nanoseconds spent executing 0 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
12018229400 nanoseconds spent executing 1 flushes (flushing a total of 2202 entities and 0 collections);
595400 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)
With batching
You can see below that there are now many batches, but not the expected 2000/500 split (638 batches instead of about 5). I puzzled over this for quite a while, convinced it was a database parameter issue.
For MySQL you can append rewriteBatchedStatements=true to the JDBC URL, but Oracle does not support that parameter and throws an error.
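For reference, the MySQL form looks like the following (hypothetical host and schema names); appending this parameter to an Oracle JDBC URL fails:

```text
jdbc:mysql://localhost:3306/testdb?rewriteBatchedStatements=true
```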
3592000 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
42867300 nanoseconds spent preparing 646 JDBC statements;
36625900 nanoseconds spent executing 8 JDBC statements;
3085078300 nanoseconds spent executing 638 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
6765787300 nanoseconds spent executing 1 flushes (flushing a total of 2202 entities and 0 collections);
6153600 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)
It then turned out the entity was annotated with @DynamicInsert. Because this builds the SQL dynamically and skips null columns, the generated insert statements differ from row to row. Batching groups consecutive identical statements into one batch, so when the column lists differ, the rows are split across many batches. (I use Hibernate's uuid generator here rather than a sequence, so generating the primary key does not require a database round trip.)
4480700 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
1386400 nanoseconds spent preparing 9 JDBC statements;
36561200 nanoseconds spent executing 8 JDBC statements;
413603400 nanoseconds spent executing 5 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
19469112800 nanoseconds spent executing 1 flushes (flushing a total of 2202 entities and 0 collections);
7884900 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)
With @DynamicInsert commented out, the batch count drops to 5, which is what you would expect. The time spent flushing, however, went up.
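Hibernate only batches consecutive statements whose SQL text is identical, which is why @DynamicInsert fragments the batches. A rough plain-Java model of that grouping (countBatches is a hypothetical helper, not Hibernate API, and the batch_size cap is ignored here):

```java
import java.util.List;

// Rough model of how Hibernate groups statements into JDBC batches: only
// consecutive statements with the same SQL text share a batch, so a new
// statement shape closes the current batch.
public class BatchGrouping {
    static int countBatches(List<String> sqlShapes) {
        int batches = 0;
        String current = null;
        for (String sql : sqlShapes) {
            if (!sql.equals(current)) { // shape changed: a new batch starts
                batches++;
                current = sql;
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        // Without @DynamicInsert every insert lists all columns -> one shape, one batch:
        System.out.println(countBatches(List.of("insert(a,b,c)", "insert(a,b,c)", "insert(a,b,c)"))); // 1
        // With @DynamicInsert null columns are skipped -> shapes alternate, many batches:
        System.out.println(countBatches(List.of("insert(a,b)", "insert(a,c)", "insert(a,b)"))); // 3
    }
}
```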
Flushing in chunks
The next step was to try flushing in multiple chunks.
The EntityManager is injected via @PersistenceContext, since EntityManager itself is not thread-safe (the annotation injects a thread-safe proxy):
@PersistenceContext
private EntityManager entityManager;
The saveAll call is replaced as follows.
entityManager.persist transitions the new entity into the managed (persistent) state; the detailed source analysis comes later.
public void save(List<XXXEntity> result){
    batchInsert(result);
    // if(!CollectionUtils.isEmpty(result)) {
    //     XXXRepository.saveAll(result);
    // }
}

/**
 * Batch insert.
 *
 * @param list entity list
 * @param <T>  entity type mapped to the table
 */
public <T> void batchInsert(List<T> list) {
    if (!CollectionUtils.isEmpty(list)) {
        for (int i = 0; i < list.size(); i++) {
            entityManager.persist(list.get(i));
            // chunk size (100 was also tested below); note this check also
            // fires at i == 0, flushing right after the first persist
            if (i % 500 == 0) {
                entityManager.flush();
                entityManager.clear();
            }
        }
        // flush whatever is left over
        entityManager.flush();
        entityManager.clear();
    }
}
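The chunking above can be simulated without a database to see how many entities each flush carries. A plain-Java sketch (flushSizes is a hypothetical helper): `i % 500 == 0` flushes right after the very first persist, while `(i + 1) % 500 == 0` would produce even chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Counts how many entities each flush carries, using plain counters instead of
// a real EntityManager.
public class FlushSimulation {
    static List<Integer> flushSizes(int total, int chunk, boolean flushAtIndexZero) {
        List<Integer> sizes = new ArrayList<>();
        int pending = 0;
        for (int i = 0; i < total; i++) {
            pending++; // entityManager.persist(list.get(i))
            boolean flush = flushAtIndexZero ? (i % chunk == 0) : ((i + 1) % chunk == 0);
            if (flush) { // entityManager.flush(); entityManager.clear();
                sizes.add(pending);
                pending = 0;
            }
        }
        if (pending > 0) { // the final flush after the loop
            sizes.add(pending);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // "i % 500 == 0" flushes right after the very first persist:
        System.out.println(flushSizes(2000, 500, true));  // [1, 500, 500, 500, 499]
        // "(i + 1) % 500 == 0" produces even chunks:
        System.out.println(flushSizes(2000, 500, false)); // [500, 500, 500, 500]
    }
}
```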
With i % 100:
3744200 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
3345300 nanoseconds spent preparing 31 JDBC statements;
48246500 nanoseconds spent executing 8 JDBC statements;
485294000 nanoseconds spent executing 23 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
17969493100 nanoseconds spent executing 23 flushes (flushing a total of 2202 entities and 0 collections);
813000 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)
With i % 500:
4044700 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
1293800 nanoseconds spent preparing 14 JDBC statements;
32300100 nanoseconds spent executing 8 JDBC statements;
329406500 nanoseconds spent executing 6 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
2756675900 nanoseconds spent executing 6 flushes (flushing a total of 2202 entities and 0 collections);
692600 nanoseconds spent executing 8 partial-flushes (flushing a total of 26 entities and 26 collections)
In summary: more flushes means more time spent in flushes, while fewer flushes means larger batches and more time spent executing batches. It is a trade-off.
I also read that the amount flushed at once may have an upper limit; that still needs verification.
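Converting the flush timings above from nanoseconds to milliseconds makes the trade-off easier to read (toMillis is a trivial hypothetical helper; the figures are taken from the statistics logs above):

```java
// Nanosecond-to-millisecond arithmetic on the measured flush times.
public class FlushCost {
    static long toMillis(long nanos) {
        return nanos / 1_000_000;
    }

    public static void main(String[] args) {
        long flushes23 = 17_969_493_100L; // 23 flushes with i % 100
        long flushes6  = 2_756_675_900L;  //  6 flushes with i % 500
        System.out.println(toMillis(flushes23) + " ms vs " + toMillis(flushes6) + " ms"); // 17969 ms vs 2756 ms
    }
}
```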
saveAll source-code analysis
Since this project uses Spring Data JPA, the source in question is naturally Spring Data JPA's.
org.springframework.data.jpa.repository.support.SimpleJpaRepository
/*
* (non-Javadoc)
* @see org.springframework.data.jpa.repository.JpaRepository#save(java.lang.Iterable)
*/
@Transactional
public <S extends T> List<S> saveAll(Iterable<S> entities) {
Assert.notNull(entities, "The given Iterable of entities not be null!");
List<S> result = new ArrayList<S>();
for (S entity : entities) {
result.add(save(entity));
}
return result;
}
/*
* (non-Javadoc)
* @see org.springframework.data.repository.CrudRepository#save(java.lang.Object)
*/
@Transactional
public <S extends T> S save(S entity) {
if (entityInformation.isNew(entity)) {
em.persist(entity);
return entity;
} else {
return em.merge(entity);
}
}
You can see that saveAll loops over the entities and calls entityInformation.isNew(entity) on each one.
Tracing into it, the implementation is as follows:
org.springframework.data.jpa.repository.support.JpaMetamodelEntityInformation
/*
* (non-Javadoc)
* @see org.springframework.data.repository.core.support.AbstractEntityInformation#isNew(java.lang.Object)
*/
@Override
public boolean isNew(T entity) {
if (!versionAttribute.isPresent()
|| versionAttribute.map(Attribute::getJavaType).map(Class::isPrimitive).orElse(false)) {
return super.isNew(entity);
}
BeanWrapper wrapper = new DirectFieldAccessFallbackBeanWrapper(entity);
return versionAttribute.map(it -> wrapper.getPropertyValue(it.getName()) == null).orElse(true);
}
org.springframework.data.repository.core.support.AbstractEntityInformation
/*
* (non-Javadoc)
* @see org.springframework.data.repository.core.EntityInformation#isNew(java.lang.Object)
*/
public boolean isNew(T entity) {
ID id = getId(entity);
Class<ID> idType = getIdType();
if (!idType.isPrimitive()) {
return id == null;
}
if (id instanceof Number) {
return ((Number) id).longValue() == 0L;
}
throw new IllegalArgumentException(String.format("Unsupported primitive id type %s!", idType));
}
It inspects the configured ID: when the ID type is not primitive (boolean, char, byte, short, int, long, float, double), the entity is considered new exactly when the field is null. The id gets assigned during em.persist(entity).
PS: what follows is where online material misled me the worst.
Everything online says saveAll runs a select before each insert, yet my logs never showed any select SQL. For a long time I blamed a version difference, until I happened to assign the primary key myself and the selects suddenly appeared before the inserts.
First, the yml to enable Hibernate's SQL logging:
spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true
        show_sql: true
I let Hibernate generate the uuid primary key:
@Id
@GenericGenerator(name = "system-uuid", strategy = "uuid")
@GeneratedValue(generator = "system-uuid")
private String xxxid;
The articles I found all claim that SimpleJpaRepository's saveAll runs a select for every record, and that you must write a separate method to let new records skip the select step.
//SimpleJpaRepository
/*
* (non-Javadoc)
* @see org.springframework.data.repository.CrudRepository#save(java.lang.Object)
*/
@Transactional
public <S extends T> S save(S entity) {
if (entityInformation.isNew(entity)) {
em.persist(entity);
return entity;
} else {
return em.merge(entity);
}
}
What I found is that the real cause is a pre-assigned primary key: SimpleJpaRepository sees a non-null id, concludes the entity is not new, and takes the else branch, where em.merge runs a select to check whether the row already exists. That is why "new" data ends up triggering queries.
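The routing boils down to a plain-Java sketch for this entity (non-primitive String id, no version attribute). route is a hypothetical helper; the string labels stand in for em.persist / em.merge:

```java
// Sketch of SimpleJpaRepository.save's routing: a pre-assigned id fails the
// isNew() null check and sends the entity down the merge branch, which issues
// the extra select.
public class SaveRouting {
    static String route(String xxxid) {
        // AbstractEntityInformation.isNew: non-primitive id type -> new iff id == null
        boolean isNew = (xxxid == null);
        return isNew ? "persist" : "merge (select first)";
    }

    public static void main(String[] args) {
        System.out.println(route(null));      // persist
        System.out.println(route("my-uuid")); // merge (select first)
    }
}
```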
Below is a small test with a handful of rows.
10 records with the primary key pre-assigned:
5309100 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
2280400 nanoseconds spent preparing 20 JDBC statements;
60949500 nanoseconds spent executing 10 JDBC statements;//here
38818200 nanoseconds spent executing 10 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
54961800 nanoseconds spent executing 1 flushes (flushing a total of 10 entities and 0 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
10 records without assigning the primary key:
4243600 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
653700 nanoseconds spent preparing 10 JDBC statements;
0 nanoseconds spent executing 0 JDBC statements;//here
37879700 nanoseconds spent executing 10 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
48910200 nanoseconds spent executing 1 flushes (flushing a total of 10 entities and 0 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
The extra "60949500 nanoseconds spent executing 10 JDBC statements" line is exactly those select statements.
So either call entityManager.persist(entity) per record, or leave the primary key empty; then inserts no longer trigger the redundant selects.
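A minimal sketch of the second option, assuming an entity shaped like the mapping above (XXXEntity and clearIds here are hypothetical stand-ins, not the project's code):

```java
import java.util.Arrays;
import java.util.List;

// With the ids set to null, isNew() returns true and saveAll calls em.persist
// with no extra select; Hibernate's uuid strategy fills the id during persist.
public class ClearIdsBeforeSaveAll {
    static class XXXEntity {
        String xxxid; // generated by the uuid strategy on persist
        XXXEntity(String id) { this.xxxid = id; }
    }

    static void clearIds(List<XXXEntity> entities) {
        for (XXXEntity e : entities) {
            e.xxxid = null; // let the generator assign it during persist
        }
    }

    public static void main(String[] args) {
        List<XXXEntity> batch = Arrays.asList(new XXXEntity("pre-assigned"), new XXXEntity(null));
        clearIds(batch); // then: repository.saveAll(batch)
        System.out.println(batch.get(0).xxxid == null && batch.get(1).xxxid == null); // true
    }
}
```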