At work I ran into a requirement to extract a subset of fields from a collection holding tens of millions of documents and build a new collection from them. Here is a summary of the approach:
I. Without multithreading
1. Create the query class ExtractInfo
package com.example.demo.bean;

import java.util.List;

public class ExtractInfo {
    private String tableName;
    private List<String> fields;

    public String getTableName() {
        return tableName;
    }

    public void setTableName(String tableName) {
        this.tableName = tableName;
    }

    public List<String> getFields() {
        return fields;
    }

    public void setFields(List<String> fields) {
        this.fields = fields;
    }

    @Override
    public String toString() {
        return "ExtractInfo{" +
                "tableName='" + tableName + '\'' +
                ", fields=" + fields +
                '}';
    }
}
Here the tableName field is the collection to operate on, and fields lists the fields to extract.
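As a quick illustration, a request body such as {"tableName": "user", "fields": ["name", "age"]} is deserialized into this bean; the collection name and field names here are hypothetical, and the bean is inlined so the sketch runs standalone:

```java
import java.util.Arrays;
import java.util.List;

public class ExtractInfoDemo {
    // minimal copy of the ExtractInfo bean, inlined so the sketch runs standalone
    static class ExtractInfo {
        private String tableName;
        private List<String> fields;
        public void setTableName(String tableName) { this.tableName = tableName; }
        public void setFields(List<String> fields) { this.fields = fields; }
        @Override
        public String toString() {
            return "ExtractInfo{tableName='" + tableName + "', fields=" + fields + "}";
        }
    }

    public static void main(String[] args) {
        // what Spring builds from a body like {"tableName":"user","fields":["name","age"]}
        ExtractInfo info = new ExtractInfo();
        info.setTableName("user");
        info.setFields(Arrays.asList("name", "age"));
        System.out.println(info); // ExtractInfo{tableName='user', fields=[name, age]}
    }
}
```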
2. Create the AsyncController class
package com.example.demo.controller;

import com.alibaba.fastjson.JSONObject;
import com.example.demo.bean.ExtractInfo;
import com.example.demo.service.AsyncServer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

import java.util.List;

@RestController
@RequestMapping("/async")
@CrossOrigin(value = "*", maxAge = 3600)
public class AsyncController {

    @Autowired
    private AsyncServer server;

    @RequestMapping(value = "/extract", method = RequestMethod.POST)
    public JSONObject extractFields(@RequestBody ExtractInfo extractInfo) {
        JSONObject output = new JSONObject();
        String tableName = extractInfo.getTableName();
        List<String> fields = extractInfo.getFields();
        long startTime = System.currentTimeMillis();
        server.extractFields(tableName, fields);
        long endTime = System.currentTimeMillis();
        output.put("status", 200);
        output.put("msg", "success");
        output.put("time", "elapsed: " + (endTime - startTime) / 1000 + "s");
        return output;
    }
}
3. Create the AsyncServer class
package com.example.demo.service;

import com.example.demo.dao.MongoDBTestImpl;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.util.List;

@Service
public class AsyncServer {

    @Resource
    private MongoDBTestImpl mongoDBTestImpl;

    public void extractFields(String tableName, List<String> fields) {
        String newTableName = tableName + "_new";
        long total = mongoDBTestImpl.countData(tableName);
        // number of 10,000-document batches, rounded up for a partial last batch
        long size = total / 10000;
        long number = total % 10000;
        if (number != 0) {
            size += 1;
        }
        for (int i = 0; i < size; i++) {
            mongoDBTestImpl.extractFields(i, tableName, newTableName, fields);
        }
    }
}
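The batch count is just a ceiling division. As a standalone sketch (the batch size of 10,000 matches the code above; the totals are illustrative):

```java
public class BatchCount {
    // number of fixed-size batches needed to cover `total` documents
    static long batches(long total, long batchSize) {
        long size = total / batchSize;
        if (total % batchSize != 0) {
            size += 1; // one extra partial batch for the remainder
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(batches(2_000_000, 10_000)); // 200 exact batches
        System.out.println(batches(20_001, 10_000));    // 3: two full batches plus one partial
    }
}
```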
4. Create the MongoDBTestImpl class
package com.example.demo.dao;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.stereotype.Repository;

import javax.annotation.Resource;
import java.util.List;

@Repository
public class MongoDBTestImpl {

    @Resource
    private MongoTemplate template;

    public long countData(String tableName) {
        Query query = new Query();
        return template.count(query, tableName);
    }

    public void extractFields(int i, String tableName, String newTableName, List<String> fields) {
        int num = i * 10000;
        Query query = new Query();
        // project only the requested fields
        for (String field : fields) {
            query.fields().include(field);
        }
        // read one batch of 10,000 documents and copy it into the new collection
        List<Object> objects = template.find(query.skip(num).limit(10000), Object.class, tableName);
        template.insert(objects, newTableName);
    }
}
Test run:
A collection of roughly two million documents takes about 3 minutes to process.
II. With multithreading
For how to enable multithreading (the @Async executor configuration), see the earlier article.
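That configuration is not reproduced here, but a minimal sketch of what it might look like follows; the bean name async matches the @Async("async") qualifier used in the DAO, while the pool sizes and queue capacity are assumptions inferred from the behavior described in this article:

```java
package com.example.demo.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.concurrent.Executor;

@Configuration
@EnableAsync
public class AsyncConfig {

    // Executor bean named "async", referenced by @Async("async").
    // 20 worker threads matches the "20 at a time" behavior described below;
    // the queue capacity is illustrative.
    @Bean("async")
    public Executor asyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(20);
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("extract-");
        executor.initialize();
        return executor;
    }
}
```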
1. Modify the AsyncController class
package com.example.demo.controller;

import com.alibaba.fastjson.JSONObject;
import com.example.demo.bean.ExtractInfo;
import com.example.demo.service.AsyncServer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

import java.util.List;

@RestController
@RequestMapping("/async")
@CrossOrigin(value = "*", maxAge = 3600)
public class AsyncController {

    @Autowired
    private AsyncServer server;

    @RequestMapping(value = "/extract", method = RequestMethod.POST)
    public JSONObject extractFields(@RequestBody ExtractInfo extractInfo) throws InterruptedException {
        JSONObject output = new JSONObject();
        String tableName = extractInfo.getTableName();
        List<String> fields = extractInfo.getFields();
        long startTime = System.currentTimeMillis();
        server.extractFields(tableName, fields);
        long endTime = System.currentTimeMillis();
        output.put("status", 200);
        output.put("msg", "success");
        output.put("time", "elapsed: " + (endTime - startTime) / 1000 + "s");
        return output;
    }
}
2. Modify the AsyncServer class
package com.example.demo.service;

import com.example.demo.dao.MongoDBTestImpl;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.util.List;
import java.util.concurrent.CountDownLatch;

@Service
public class AsyncServer {

    @Resource
    private MongoDBTestImpl mongoDBTestImpl;

    public void extractFields(String tableName, List<String> fields) throws InterruptedException {
        String newTableName = tableName + "_new";
        long total = mongoDBTestImpl.countData(tableName);
        // number of 10,000-document batches, rounded up for a partial last batch
        long size = total / 10000;
        long number = total % 10000;
        if (number != 0) {
            size += 1;
        }
        if (size > 500) {
            // dispatch at most 500 batches per round, then wait for the round to finish
            long page = size / 500;
            long left = size % 500;
            for (int j = 0; j < page; j++) {
                int startNum = 500 * j;
                int endNum = 500 * (j + 1);
                CountDownLatch countDownLatch = new CountDownLatch(500);
                for (int i = startNum; i < endNum; i++) {
                    mongoDBTestImpl.extractFields(i, fields, tableName, newTableName, countDownLatch);
                }
                countDownLatch.await();
            }
            if (left != 0) {
                // remaining batches after the full rounds
                CountDownLatch countDownLatch = new CountDownLatch((int) left);
                for (int i = (int) (page * 500); i < size; i++) {
                    mongoDBTestImpl.extractFields(i, fields, tableName, newTableName, countDownLatch);
                }
                countDownLatch.await();
            }
        } else {
            CountDownLatch countDownLatch = new CountDownLatch((int) size);
            for (int i = 0; i < size; i++) {
                mongoDBTestImpl.extractFields(i, fields, tableName, newTableName, countDownLatch);
            }
            countDownLatch.await();
        }
        System.out.println("complete");
    }
}
The logic of this code: the data is split into batches of 10,000 documents; at most 500 batches are dispatched into the wait queue per round, with 20 threads processing them concurrently.
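The dispatch-then-await pattern can be sketched without Spring, using a plain ExecutorService in place of the @Async proxy (pool size and batch count are shrunk for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // stands in for the async executor
        AtomicInteger processed = new AtomicInteger();
        int batches = 10;

        // one latch per round: dispatch every batch, then block until all complete
        CountDownLatch latch = new CountDownLatch(batches);
        for (int i = 0; i < batches; i++) {
            pool.submit(() -> {
                try {
                    processed.incrementAndGet(); // simulate copying one skip/limit batch
                } finally {
                    latch.countDown(); // always count down, even if a batch fails
                }
            });
        }
        latch.await(); // returns only once all 10 batches have counted down
        pool.shutdown();
        System.out.println(processed.get()); // prints 10
    }
}
```

Calling countDown() in a finally block matters: if a worker throws and never counts down, await() blocks forever.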
3. Modify the MongoDBTestImpl class
package com.example.demo.dao;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Repository;

import javax.annotation.Resource;
import java.util.List;
import java.util.concurrent.CountDownLatch;

@Repository
public class MongoDBTestImpl {

    @Resource
    private MongoTemplate template;

    public long countData(String tableName) {
        Query query = new Query();
        return template.count(query, tableName);
    }

    // runs on the "async" executor; each call copies one batch of 10,000 documents
    @Async("async")
    public void extractFields(int i, List<String> fields, String oldTableName, String newTableName, CountDownLatch countDownLatch) {
        try {
            int num = i * 10000;
            Query query = new Query();
            // project only the requested fields
            for (String field : fields) {
                query.fields().include(field);
            }
            List<Object> objects = template.find(query.skip(num).limit(10000), Object.class, oldTableName);
            template.insert(objects, newTableName);
        } finally {
            // count down even if the copy fails, otherwise await() blocks forever
            countDownLatch.countDown();
        }
    }
}
Test run:
The same collection of about two million documents now finishes in roughly 22 seconds, a dramatic improvement.