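Today I had a task: a file of about 120 MB, holding more than 6 million records with one record per line,
had to be split into files of 200,000 records each so that another program could import them.
Through my own poor choices my machine runs Windows, and it is too underpowered to deploy Hadoop on, so I had to do the job locally.
I implemented it twice, once with Java's built-in IO and once with Apache Commons IO, to compare the two.
When I get home I'll see whether the Hadoop API makes this any simpler to write.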
- import java.io.*;
- import java.util.*;
- import org.apache.commons.io.*;
- /**
- * Task: split a file holding roughly 6 million lines (about 120 MB) into several files of 200,000 lines each
- * @author RangE
- *
- */
- public class BasicFileSplitter {
- private BasicFileSplitter() {}
- /**
- * The most basic file-splitting method, using BufferedReader and BufferedWriter
- * about 2700000ms cost on the task on Wed 13 Nov 2013
- *
- * This is plain java.io, and its bottleneck is on the write side.
- * Output goes through a BufferedWriter (a character stream), but this implementation reopens the output file in append mode and flushes it for every single line, which defeats the buffering and slows writing down badly.
- *
- * @param inputPath path of the file to split
- * @param outputPath prefix of the output files; part N is written to outputPath + N + ".data"
- * @throws IOException if reading or writing fails
- */
- public static void splitFile(String inputPath, String outputPath) throws IOException {
- BufferedReader reader = null;
- BufferedWriter writer = null;
- try {
- reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(inputPath)), "UTF-8"));
- String temp;
- int countFiles = 0;
- int countLines = 0;
- System.out.println("Starting to split the file...");
- while ((temp = reader.readLine()) != null) {
- // Append this line to the current part file.
- writer = new BufferedWriter
- (new OutputStreamWriter
- (new FileOutputStream(outputPath + countFiles + ".data", true), "UTF-8"));
- writer.write(temp + "\n");
- // close() flushes and releases the file handle, so writers are not leaked line after line.
- writer.close();
- writer = null;
- countLines++;
- if (countLines == 200000) {
- System.out.println("Finished part: " + countFiles);
- countFiles++;
- countLines = 0;
- }
- }
- System.out.println("Splitting finished successfully.");
- } finally {
- if (reader != null)
- reader.close();
- // Only non-null here if a write failed before the writer could be closed above.
- if (writer != null)
- writer.close();
- }
- }
- /**
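- * Splitting with java.nio to speed up reading and writing
- * @param input
- * @param path
- */
- public static void splitFileByNewerIO(File input, String path) {
- // I looked this up online:
- // "If you need to process the file line by line, just stick with BufferedReader.readLine()...
- // I tried ByteBuffer, and handling the line breaks alone took a lot of time."
- // Original thread: http://bbs.csdn.net/topics/120096457
- // So this is left unimplemented.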
- }
- /**
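- * File splitting implemented with org.apache.commons.io
- * 17975ms cost on the task on Wed 13 Nov 2013
- *
- * The LineIterator also reads the file through a BufferedReader,
- * so the speed-up comes mainly from the write side: FileUtils.writeLines ends up calling IOUtils.writeLines and writes each batch of 200,000 lines through one BufferedOutputStream byte stream.
- *
- * Note: after splitting with this method, the parts add up to slightly more than the size of the original file (most likely because writeLines uses the system line separator, which is \r\n on Windows).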
- *
- * @param inputPath
- * @param outputPath
- * @throws IOException
- */
- public static void splitFileByCommonsIO(String inputPath, String outputPath)
- throws IOException {
- LineIterator it = null;
- try {
- it = FileUtils.lineIterator(new File(inputPath), "UTF-8");
- int lineCounter = 0;
- int fileCounter = 0;
- List<String> lineList = new ArrayList<String>();
- System.out.println("Starting..");
- while (it.hasNext()) {
- lineList.add(it.nextLine());
- lineCounter++;
- if (lineCounter == 200000) {
- FileUtils.writeLines(new File(outputPath + "_" + fileCounter), "UTF-8", lineList, true);
- lineList.clear();
- lineCounter = 0;
- fileCounter++;
- System.out.println("Complete file " + fileCounter);
- }
- }
- // Write whatever is left as the last part; fileCounter already points at the next unused index.
- if (!lineList.isEmpty()) {
- FileUtils.writeLines(new File(outputPath + "_" + fileCounter), "UTF-8", lineList, true);
- System.out.println("Complete the last file.");
- }
- } finally {
- if (it != null)
- it.close();
- System.out.println("Task completed successfully.");
- }
- }
- }
Later I also wanted to compare the read/write speed of java io, nio and commons io, so I wrote a file-copy demo with each of the three, shown below.
- import java.io.*;
- import java.nio.*;
- import java.nio.channels.FileChannel;
- import java.nio.charset.Charset;
- import java.nio.charset.CharsetDecoder;
- import java.nio.charset.CharsetEncoder;
- import org.apache.commons.io.FileUtils;
- import org.apache.commons.io.LineIterator;
- public class FileRWSpeedCompare {
- private FileRWSpeedCompare() {}
- /**
- * Compares the read/write speed of io, nio and commons-io by simply copying a file
- * @param args
- * @throws IOException
- */
- public static void main(String[] args) throws IOException {
- // TODO Auto-generated method stub
- String in = "files/hehe.data";
- String out = in + "_copy";
- long s = System.currentTimeMillis();
- //rw1(in, out);//Cost: 59797ms
- //rw2(in, out);//Cost: 4875ms
- //rw3(in, out);//Cost: 8796ms
- long e = System.currentTimeMillis();
- System.out.println("Cost: " + (e - s) + "ms");
- }
- /**
- * Basic IO
- * @param inputPath
- * @param outputPath
- * @throws IOException
- */
- public static void rw1(String inputPath, String outputPath) throws IOException {
- BufferedReader reader = null;
- BufferedWriter writer = null;
- try {
- reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(inputPath))));
- String temp = "";
- writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(outputPath))));
- while ((temp = reader.readLine()) != null) {
- writer.write(temp + "\n");
- writer.flush();
- }
- } finally {
- if (reader != null)
- reader.close();
- if (writer != null)
- writer.close();
- }
- }
- /**
- * NIO (http://www.cnblogs.com/focusj/archive/2011/11/03/2231583.html)
- * @param inputPath
- * @param outputPath
- * @throws IOException
- */
- public static void rw2(String inputPath, String outputPath) throws IOException {
- // Closing a channel also closes the stream it was obtained from.
- try (FileChannel inc = new FileInputStream(inputPath).getChannel();
- FileChannel outc = new FileOutputStream(outputPath).getChannel()) {
- ByteBuffer buffer = ByteBuffer.allocate(1024);
- // A plain copy moves raw bytes, so no charset decoding or encoding is needed.
- while (inc.read(buffer) != -1) {
- buffer.flip(); // switch the buffer from filling to draining
- while (buffer.hasRemaining()) {
- outc.write(buffer);
- }
- buffer.clear(); // make the buffer ready for the next read
- }
- }
- }
- /**
- * Uses FileUtils.copyFile, which is in fact implemented with nio's Buffer and Channel
- * @param inputPath
- * @param outputPath
- * @throws IOException
- */
- public static void rw3(String inputPath, String outputPath) throws IOException {
- FileUtils.copyFile(new File(inputPath), new File(outputPath));
- }
- }
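(The code in this post can be copied and pasted directly if you need it; it requires the org.apache.commons.io package.)
From these results, I think that for plain file reading and writing, or for files whose contents need little processing, nio really does beat the other two; copying a file is one such task.
But when a fairly large text file has to be read line by line and the individual lines processed, commons io may be the better choice, since its wrappers around byte streams are very convenient to use.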
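For the plain-copy case, here is a minimal sketch of an nio-only copy that lets the channel move the bytes itself through `FileChannel.transferTo`, with no intermediate `ByteBuffer`. It is not one of the three methods timed above and no timing was taken for it; the class name `ChannelCopySketch` and the method `copy` are made up for illustration, and it assumes Java 7 or later.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original demo classes.
final class ChannelCopySketch {
    private ChannelCopySketch() {}

    /** Copies source to target (overwriting it) and returns the number of bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until everything is copied.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ChannelCopySketch <source> <target>
        long copied = copy(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Copied " + copied + " bytes");
    }
}
```

To compare it with rw1 to rw3, it could be called from the same timing harness, for example `ChannelCopySketch.copy(Paths.get(in), Paths.get(out))`.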
The commons io package has many more methods that I still want to try out:
http://m.blog.csdn.net/blog/FansUnion/9844977
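[Please credit the source when reposting. Thanks.]
I just found http://www.oschina.net/code/snippet_54100_7938 , which reads and writes large files line by line with Java NIO. I'll test it and then update this post.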
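Separately from the snippet linked above (this is not that code), here is a rough sketch of the same splitting task using only the JDK's `java.nio.file.Files` helpers, keeping one writer open per output part instead of reopening the file for every line. The class name `NioLineSplitterSketch`, the method `split`, and the `prefix + N + ".data"` naming are assumptions chosen to mirror `splitFile`; it assumes Java 7 or later and UTF-8 input, and it has not been timed against the methods above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original BasicFileSplitter class.
final class NioLineSplitterSketch {
    private NioLineSplitterSketch() {}

    /**
     * Splits input into parts of at most linesPerPart lines each.
     * Part N is written to outputPrefix + N + ".data"; returns the number of parts written.
     */
    static int split(Path input, String outputPrefix, int linesPerPart) throws IOException {
        if (linesPerPart <= 0) {
            throw new IllegalArgumentException("linesPerPart must be positive");
        }
        int parts = 0;
        int linesInPart = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (writer == null) {
                    // Open the next part lazily, so no empty trailing file is created.
                    writer = Files.newBufferedWriter(
                            Paths.get(outputPrefix + parts + ".data"), StandardCharsets.UTF_8);
                    parts++;
                }
                writer.write(line);
                writer.write('\n');
                if (++linesInPart == linesPerPart) {
                    // The part is full: close() flushes the buffer once for the whole part.
                    writer.close();
                    writer = null;
                    linesInPart = 0;
                }
            }
        } finally {
            // Closes a partially filled last part, or a part left open after a failure.
            if (writer != null) {
                writer.close();
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NioLineSplitterSketch <input> <outputPrefix>
        int parts = split(Paths.get(args[0]), args[1], 200000);
        System.out.println("Wrote " + parts + " parts");
    }
}
```

Lines are streamed straight to the current part, so memory use stays flat regardless of the part size, unlike the commons io version, which collects 200,000 lines in a list before each write.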