My first encounter with data this large

chenwen_201116040110

Category: Big Data. Tags: sqlserver, database, python, connecting to a database with python, reading files with python

Article link: https://blog.csdn.net/chenwen_201116040110/article/details/39249347

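Today I was given a task: move the data in a SQL Server database (a single table) into Postgres. Sounds simple, right? The catch is that the SQL Server data is 33 GB, and this is the first time I have dealt with data this large. I searched online for how to do the import, spent two hours dumping the data into a single SQL file, and then tried to load that file into Postgres with a tool. The tool crashed as soon as I started the import: the machine has only 4 GB of memory, so there was no way it could cope.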

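Then I thought: if I split the data into many files of only a few dozen MB each and write a script to load them, wouldn't that solve it? So I wrote a Java program that splits the data into roughly two thousand parts. The code is as follows: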

package export_csv;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Car
{
    public static void main( String[] argv )
    {
        // Sql.getSQL() is the author's own helper (not shown in this post); it returns an open Statement.
        Statement stmt = Sql.getSQL();
        int max_count = 1999;
        String path = "E:/all_csv_data/";
        // j starts at 127 as in the original listing; start from 0 to export every remainder.
        for( int j = 127; j < max_count; ++j )
        {
            String query = "select * from dbo.结果 where cast(REC_CARID AS BIGINT) % " + max_count + " = " + j;
            System.out.println( query );
            System.out.println( "start" );
            File csvFile = new File(path + j + ".csv");
            File parent = csvFile.getParentFile();
            if (parent != null && !parent.exists()) {
                parent.mkdirs();
            }
            // try-with-resources closes the result set and the writer even if an error occurs.
            try( ResultSet rs = stmt.executeQuery(query);
                 // GB2312 so that the "," separator is read back correctly
                 BufferedWriter csvFileOutputStream = new BufferedWriter(new OutputStreamWriter(
                         new FileOutputStream(csvFile), "GB2312"), 1024) )
            {
                while( rs.next() )
                {
                    // Write the first 10 columns of each row as one CSV line.
                    for( int i = 1; i < 11; ++i )
                    {
                        String temp = rs.getString( i );
                        // getString returns null for SQL NULL; write an empty field instead of failing.
                        csvFileOutputStream.write(temp == null ? "" : temp);
                        if( i != 10 ) csvFileOutputStream.write(",");
                        if( i == 10 ) csvFileOutputStream.newLine();
                    }
                }
                System.out.println( "end" );
                System.out.println( "Exported file " + j + " successfully" );
            }catch( SQLException | IOException e )
            {
                e.printStackTrace();
            }
            System.out.println();
        }
    }
}

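This code runs one query per remainder: it takes the id modulo 1999 and writes the rows for each remainder to its own CSV file, 1999 files in total. I naively assumed that would solve the problem. After a whole night it had exported only 120 files, which was unbearably slow. A likely reason is that the condition cast(REC_CARID AS BIGINT) % 1999 = j cannot use an index, so each of the 1999 queries has to scan the whole table.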
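
For comparison, the same remainder-based split can be done in a single pass over rows that have already been exported, rather than with 1999 separate queries. The sketch below is only an illustration under assumptions: `bucket_of` and `split_by_remainder` are hypothetical names, the id is assumed to be the first field of each row, and the rows are grouped in memory, which suits a demonstration but not the full 33 GB.

```python
from collections import defaultdict


def bucket_of(car_id, bucket_count=1999):
    """Return the remainder that decides which file a row belongs to."""
    # int() accepts ids with leading zeros, e.g. "000000002000" becomes 2000.
    return int(car_id) % bucket_count


def split_by_remainder(rows, bucket_count=1999):
    """Group rows by id % bucket_count in one pass over the input."""
    buckets = defaultdict(list)
    for row in rows:
        # Assumption: the id is the first field of the row.
        buckets[bucket_of(row[0], bucket_count)].append(row)
    return dict(buckets)


if __name__ == "__main__":
    sample = [["000000002000", "a"], ["000000000001", "b"], ["000000000005", "c"]]
    for remainder, group in sorted(split_by_remainder(sample).items()):
        print(remainder, group)
```

Each group could then be written to its own CSV file; the point is only that every row is read once.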

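So I had to change approach. I spent another two hours exporting the whole table directly into a single CSV file and wrote a Python script to load it. After one more night, all 33 GB of data had finally been imported into Postgres. The code is as follows: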

# coding: utf-8
'''
Created on 2014-09-12

@author: CW
'''
import psycopg2


def readerCsv(file_path, file_name):
    conn = psycopg2.connect("dbname=cardata user=postgres port=5433")
    cur = conn.cursor()
    count = 0

    # Read the file line by line so the whole CSV never has to fit in memory.
    with open("%s%s" % (file_path, file_name), "r") as cin:
        for reader in cin:
            line = reader.rstrip('\r\n').split(',')
            # Keep only rows with at least 10 fields and a 12-character first field.
            if len(line) >= 10 and len(line[0]) == 12:
                # Pass the values as parameters so psycopg2 handles the quoting.
                cur.execute("INSERT INTO car VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                            line[:10])

            count = count + 1
            # Commit every 100000 lines and print the progress.
            if count % 100000 == 0:
                print(count)
                conn.commit()
    conn.commit()
    cur.close()
    conn.close()


if __name__ == '__main__':
    readerCsv('/home/', 'car.csv')
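
The validation rule and the periodic commit used in the script can also be separated from the database calls, which makes them easy to check on a few sample lines. The following is a minimal sketch, not the author's code: `parse_line` and `batches` are hypothetical helper names, and the small batch size is used only for the demonstration.

```python
def parse_line(raw):
    """Return the first 10 fields of a valid line, or None if the line is rejected."""
    fields = raw.rstrip("\r\n").split(",")
    # Same rule as the script: at least 10 fields and a 12-character first field.
    if len(fields) >= 10 and len(fields[0]) == 12:
        return fields[:10]
    return None


def batches(lines, batch_size=100000):
    """Yield lists of valid rows, at most batch_size rows per list."""
    batch = []
    for raw in lines:
        row = parse_line(raw)
        if row is None:
            continue  # skip malformed lines
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # remaining rows that did not fill a whole batch


if __name__ == "__main__":
    good = "123456789012,a,b,c,d,e,f,g,h,i\r\n"
    bad = "short,a,b\n"
    for batch in batches([good, bad, good, good], batch_size=2):
        print(len(batch))
```

Each yielded batch could then be passed to psycopg2's `cursor.executemany` and followed by `conn.commit()`. Unlike the script, which counts every line read, this sketch counts only the rows that pass validation.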

And so, after two days of struggling, the task was done.