在做数据库作业时遇到的优化问题!
作业要求:
实现一个基于嵌套循环策略的两表连接算法
• 算法原理:
FOR EACH CHUNK c1 OF t1 {
IF c1 NOT IN MEMORY
READ CHUNK c1 INTO MEMORY
FOR EACH ROW r1 IN CHUNK c1 {
FOR EACH CHUNK c2 OF t2 {
IF c2 NOT IN MEMORY
READ CHUNK c2 INTO MEMORY
FOR EACH ROW r2 IN c2 {
IF r1,r2 SATISFIES JOIN CONDITIONS
JOIN r1,r2
}
}
}
}
• 数据集:TPC-H数据集
– 参与连接的两个数据表,数据量较大,无法同时一次性放进内存
» 生成5GB的数据时,orders表大小为837MB,lineitem表大小为3.62GB
» 在此作业中,将基于这两个表的数据来进行连接
» 数据生成器:https://pan.baidu.com/s/1s_4P-nVDo-L1qjWdbugE1w
• 实验环境
单核CPU,CentOS 6.4,4GB内存
数据生成方法见:http://www.docin.com/p-1297678266.html?qq-pf-to=pcqq.group
TPC-H数据集中order表和lineitem表结构如下:
生成的表是这样的8张表
生成表.tbl中的数据形式如下:
现在需要做的是读入orders.tbl和lineitem.tbl两表中的数据,然后使用orders.tbl表中的O_ORDERKEY和lineitem.tbl表中的L_ORDERKEY先比较,如果相等就连接成一条数据。
两表连接程序代码如下:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
public class T_Con{
public static void main(String[] args) {
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long startTime=System.currentTimeMillis();
System.out.println("startTime = "+df.format(startTime));
try {
table_connect();
} catch (IOException e) {
e.printStackTrace();
}
long endTime=System.currentTimeMillis();
System.out.println("endTime = "+df.format(endTime));
long total_time = endTime-startTime;
System.out.println("runtime is "+total_time/1000+"s");
System.out.println("runtime is "+(total_time/60000)+" minutes "+(total_time%60000)/1000+" seconds");
}
/**
* table connect
*/
public static void table_connect() throws IOException{
String or_filname = "orders.tbl";
String lin_filname = "lineitem.tbl";
String O_L_filename = "result.tbl";
int or_buf_byte = 128*1024*1024;
int lin_buf_byte = 256*1024*1024;
int write_bur_byte = 8*1024*1024;
int sOrders = -1;
int slineitem = -1;
String temp_lin = null;
String temp_or = null;
String temp_or_sub = null;
BufferedReader FOrdersInBR = read_Orders(or_filname,or_buf_byte);
BufferedReader FlineitemInBR = read_lineitem(lin_filname,lin_buf_byte);
BufferedWriter wtable = write_creat(O_L_filename,write_bur_byte);
while((temp_or = FOrdersInBR.readLine()) !=null){
sOrders = Integer.parseInt(temp_or.split("\\|")[0]);
temp_or_sub = temp_or.substring(temp_or.indexOf("|")+1);
if(sOrders < slineitem){
continue;
}
if(slineitem != -1 && slineitem == sOrders){
temp_lin += temp_or_sub;
temp_lin += "\n";
wtable.write(temp_lin);
}
while((temp_lin = FlineitemInBR.readLine()) != null){
slineitem = Integer.parseInt(temp_lin.split("\\|")[0]);
if(slineitem > sOrders){
break;
}
if(slineitem == sOrders){
temp_lin += temp_or_sub;
temp_lin += "\n";
wtable.write(temp_lin);
}
}
}
wtable.flush();
wtable.close();
FOrdersInBR.close();
FlineitemInBR.close();
}
/**
*read Orders table datas to stor memory
*/
public static BufferedReader read_Orders(String or_filname,int or_buf_byte){
BufferedReader FOrdersInBR = null;
try {
FOrdersInBR = new BufferedReader(new InputStreamReader(new FileInputStream(or_filname)),or_buf_byte);
} catch (FileNotFoundException e) {
System.out.println("Read orders file failed!");
e.printStackTrace();
}
return FOrdersInBR;
}
/**
*read lineitem table datas to stor memory
*/
public static BufferedReader read_lineitem(String lin_filname,int lin_buf_byte){
BufferedReader FlineitemInBR = null;
try {
FlineitemInBR = new BufferedReader(new InputStreamReader(new FileInputStream(lin_filname)),lin_buf_byte);
} catch (FileNotFoundException e) {
System.out.println("Read lineitem file failed!");
e.printStackTrace();
}
return FlineitemInBR;
}
/**
* creat write file
*
*/
public static BufferedWriter write_creat(String O_L_filename,int write_bur_byte){
File wfilename = new File(O_L_filename);
if(!wfilename.exists()){
try {
wfilename.createNewFile();
} catch (IOException e) {
System.out.println("create write file failed!");
e.printStackTrace();
}
}
BufferedWriter wtable = null;
try {
wtable = new BufferedWriter(new FileWriter(wfilename),write_bur_byte);
} catch (IOException e) {
System.out.println("open write file failed!");
e.printStackTrace();
}
return wtable;
}
}
把这个程序放在VMware虚拟机下面跑,时间会比较长,我跑出来的最短时间为141s。跑程序的时候发现不是缓冲越大越好,这是为什么,如果把orders.tbl一次全部读入内存,发现跑得更慢,现在想对该程序进行优化,使得更快!请各位大神看下!
(by:CHENYUNHE 2018-12-30 https://blog.csdn.net/CHENYUNHE/)