At work I ran into a requirement to extract a subset of fields from a collection holding tens of millions of documents and build a new collection from them. Here is a summary of the approach:
I. Without multithreading
1. Create the query class ExtractInfo
package com.example.demo.bean;

import java.util.List;

public class ExtractInfo {
    private String tableName;
    private List<String> fields;

    public String getTableName() {
        return tableName;
    }

    public void setTableName(String tableName) {
        this.tableName = tableName;
    }

    public List<String> getFields() {
        return fields;
    }

    public void setFields(List<String> fields) {
        this.fields = fields;
    }

    @Override
    public String toString() {
        return "ExtractInfo{" +
                "tableName='" + tableName + '\'' +
                ", fields=" + fields +
                '}';
    }
}
Here the tableName field is the collection to operate on, and fields lists the fields to extract.
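As a quick illustration, a request body such as {"tableName": "user", "fields": ["name", "age"]} is deserialized into this bean; the collection name and field names here are hypothetical, and the bean is inlined so the sketch runs standalone:

```java
import java.util.Arrays;
import java.util.List;

public class ExtractInfoDemo {
    // minimal copy of the ExtractInfo bean, inlined so the sketch runs standalone
    static class ExtractInfo {
        private String tableName;
        private List<String> fields;
        public void setTableName(String tableName) { this.tableName = tableName; }
        public void setFields(List<String> fields) { this.fields = fields; }
        @Override
        public String toString() {
            return "ExtractInfo{tableName='" + tableName + "', fields=" + fields + "}";
        }
    }

    public static void main(String[] args) {
        // what Spring builds from a body like {"tableName":"user","fields":["name","age"]}
        ExtractInfo info = new ExtractInfo();
        info.setTableName("user");
        info.setFields(Arrays.asList("name", "age"));
        System.out.println(info); // ExtractInfo{tableName='user', fields=[name, age]}
    }
}
```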
2. Create the AsyncController class
package com.example.demo.controller;

import com.alibaba.fastjson.JSONObject;
import com.example.demo.bean.ExtractInfo;
import com.example.demo.service.AsyncServer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

import java.util.List;

@RestController
@RequestMapping("/async")
@CrossOrigin(value = "*", maxAge = 3600)
public class AsyncController {

    @Autowired
    private AsyncServer server;

    @RequestMapping(value = "/extract", method = RequestMethod.POST)
    public JSONObject extractFields(@RequestBody ExtractInfo extractInfo) {
        JSONObject output = new JSONObject();
        String tableName = extractInfo.getTableName();
        List<String> fields = extractInfo.getFields();
        long startTime = System.currentTimeMillis();
        server.extractFields(tableName, fields);
        long endTime = System.currentTimeMillis();
        output.put("status", 200);
        output.put("msg", "success");
        output.put("time", "elapsed: " + (endTime - startTime) / 1000 + "s");
        return output;
    }
}
3. Create the AsyncServer class
package com.example.demo.service;

import com.example.demo.dao.MongoDBTestImpl;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.util.List;

@Service
public class AsyncServer {

    @Resource
    private MongoDBTestImpl mongoDBTestImpl;

    public void extractFields(String tableName, List<String> fields) {
        String newTableName = tableName + "_new";
        long total = mongoDBTestImpl.countData(tableName);
        // number of 10,000-document batches, rounded up for a partial last batch
        long size = total / 10000;
        long number = total % 10000;
        if (number != 0) {
            size += 1;
        }
        for (int i = 0; i < size; i++) {
            mongoDBTestImpl.extractFields(i, tableName, newTableName, fields);
        }
    }
}
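The batch count is just a ceiling division. As a standalone sketch (the batch size of 10,000 matches the code above; the totals are illustrative):

```java
public class BatchCount {
    // number of fixed-size batches needed to cover `total` documents
    static long batches(long total, long batchSize) {
        long size = total / batchSize;
        if (total % batchSize != 0) {
            size += 1; // one extra partial batch for the remainder
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(batches(2_000_000, 10_000)); // 200 exact batches
        System.out.println(batches(20_001, 10_000));    // 3: two full batches plus one partial
    }
}
```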
4. Create the MongoDBTestImpl class
package com.example.demo.dao;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.stereotype.Repository;

import javax.annotation.Resource;
import java.util.List;

@Repository
public class MongoDBTestImpl {

    @Resource
    private MongoTemplate template;

    public long countData(String tableName) {
        Query query = new Query();
        return template.count(query, tableName);
    }

    public void extractFields(int i, String tableName, String newTableName, List<String> fields) {
        int num = i * 10000;
        Query query = new Query();
        // project only the requested fields
        for (String field : fields) {
            query.fields().include(field);
        }
        // read one batch of 10,000 documents and copy it into the new collection
        List<Object> objects = template.find(query.skip(num).limit(10000), Object.class, tableName);
        template.insert(objects, newTableName);
    }
}
Test run:
A collection of roughly two million documents takes about 3 minutes to process.
II. With multithreading
For how to enable multithreading (the @Async executor configuration), see the earlier article.
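That configuration is not reproduced here, but a minimal sketch of what it might look like follows; the bean name async matches the @Async("async") qualifier used in the DAO, while the pool sizes and queue capacity are assumptions inferred from the behavior described in this article:

```java
package com.example.demo.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.concurrent.Executor;

@Configuration
@EnableAsync
public class AsyncConfig {

    // Executor bean named "async", referenced by @Async("async").
    // 20 worker threads matches the "20 at a time" behavior described below;
    // the queue capacity is illustrative.
    @Bean("async")
    public Executor asyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(20);
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("extract-");
        executor.initialize();
        return executor;
    }
}
```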
1. Modify the AsyncController class
package com.example.demo.controller;

import com.alibaba.fastjson.JSONObject;
import com.example.demo.bean.ExtractInfo;
import com.example.demo.service.AsyncServer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

import java.util.List;

@RestController
@RequestMapping("/async")
@CrossOrigin(value = "*", maxAge = 3600)
public class AsyncController {

    @Autowired
    private AsyncServer server;

    @RequestMapping(value = "/extract", method = RequestMethod.POST)
    public JSONObject extractFields(@RequestBody ExtractInfo extractInfo) throws InterruptedException {
        JSONObject output = new JSONObject();
        String tableName = extractInfo.getTableName();
        List<String> fields = extractInfo.getFields();
        long startTime = System.currentTimeMillis();
        server.extractFields(tableName, fields);
        long endTime = System.currentTimeMillis();
        output.put("status", 200);
        output.put("msg", "success");
        output.put("time", "elapsed: " + (endTime - startTime) / 1000 + "s");
        return output;
    }
}
2. Modify the AsyncServer class
package com.example.demo.service;

import com.example.demo.dao.MongoDBTestImpl;
import org.springframework.stereotype.Service;

import javax.annotation.Resource;
import java.util.List;
import java.util.concurrent.CountDownLatch;

@Service
public class AsyncServer {

    @Resource
    private MongoDBTestImpl mongoDBTestImpl;

    public void extractFields(String tableName, List<String> fields) throws InterruptedException {
        String newTableName = tableName + "_new";
        long total = mongoDBTestImpl.countData(tableName);
        // number of 10,000-document batches, rounded up for a partial last batch
        long size = total / 10000;
        long number = total % 10000;
        if (number != 0) {
            size += 1;
        }
        if (size > 500) {
            // dispatch at most 500 batches per round, then wait for the round to finish
            long page = size / 500;
            long left = size % 500;
            for (int j = 0; j < page; j++) {
                int startNum = 500 * j;
                int endNum = 500 * (j + 1);
                CountDownLatch countDownLatch = new CountDownLatch(500);
                for (int i = startNum; i < endNum; i++) {
                    mongoDBTestImpl.extractFields(i, fields, tableName, newTableName, countDownLatch);
                }
                countDownLatch.await();
            }
            if (left != 0) {
                // remaining batches after the full rounds
                CountDownLatch countDownLatch = new CountDownLatch((int) left);
                for (int i = (int) (page * 500); i < size; i++) {
                    mongoDBTestImpl.extractFields(i, fields, tableName, newTableName, countDownLatch);
                }
                countDownLatch.await();
            }
        } else {
            CountDownLatch countDownLatch = new CountDownLatch((int) size);
            for (int i = 0; i < size; i++) {
                mongoDBTestImpl.extractFields(i, fields, tableName, newTableName, countDownLatch);
            }
            countDownLatch.await();
        }
        System.out.println("complete");
    }
}
The logic of this code: the data is split into batches of 10,000 documents; at most 500 batches are dispatched into the wait queue per round, with 20 threads processing them concurrently.
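The dispatch-then-await pattern can be sketched without Spring, using a plain ExecutorService in place of the @Async proxy (pool size and batch count are shrunk for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // stands in for the async executor
        AtomicInteger processed = new AtomicInteger();
        int batches = 10;

        // one latch per round: dispatch every batch, then block until all complete
        CountDownLatch latch = new CountDownLatch(batches);
        for (int i = 0; i < batches; i++) {
            pool.submit(() -> {
                try {
                    processed.incrementAndGet(); // simulate copying one skip/limit batch
                } finally {
                    latch.countDown(); // always count down, even if a batch fails
                }
            });
        }
        latch.await(); // returns only once all 10 batches have counted down
        pool.shutdown();
        System.out.println(processed.get()); // prints 10
    }
}
```

Calling countDown() in a finally block matters: if a worker throws and never counts down, await() blocks forever.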
3. Modify the MongoDBTestImpl class
package com.example.demo.dao;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Repository;

import javax.annotation.Resource;
import java.util.List;
import java.util.concurrent.CountDownLatch;

@Repository
public class MongoDBTestImpl {

    @Resource
    private MongoTemplate template;

    public long countData(String tableName) {
        Query query = new Query();
        return template.count(query, tableName);
    }

    // runs on the "async" executor; each call copies one batch of 10,000 documents
    @Async("async")
    public void extractFields(int i, List<String> fields, String oldTableName, String newTableName, CountDownLatch countDownLatch) {
        try {
            int num = i * 10000;
            Query query = new Query();
            // project only the requested fields
            for (String field : fields) {
                query.fields().include(field);
            }
            List<Object> objects = template.find(query.skip(num).limit(10000), Object.class, oldTableName);
            template.insert(objects, newTableName);
        } finally {
            // count down even if the copy fails, otherwise await() blocks forever
            countDownLatch.countDown();
        }
    }
}
Test run:
The same collection of about two million documents now finishes in roughly 22 seconds, a dramatic improvement.