External Sorting
If the file to be sorted does not fit in memory, it has to be sorted in chunks: sort each chunk, write it out to its own file, and finally merge the already-sorted chunk files.
Phase 1: split and sort
1. Split the original unsorted big file into sorted small files
1.1 Determine where to split the big file
1.2 The main thread reads the original file chunk by chunk and hands each chunk to a sort thread
1.3 The sort thread sorts its chunk
1.4 A writer thread writes the sorted data to a temporary file
Phase 2: merge
Merge the temporary files into the target result file.
For a Java program, note that the first phase is IO-bound, so give as much memory as possible to the old generation, while the second phase is CPU-bound, so watch the young generation and keep GC from eating CPU time.
Technical details worth studying beforehand:
1. Splitting a big file in Java
http://blog.itpub.net/29254281/viewspace-1161173/
2. Memory-mapped files in Java
http://blog.itpub.net/29254281/viewspace-1162157/
3. Java barriers (CyclicBarrier)
http://blog.itpub.net/29254281/viewspace-1164727/
4. The observer pattern and the producer/consumer pattern
5. JVM monitoring and GC
Experimental setup:
A dual-core CPU and a 1 GB Java heap, sorting 200 million random long values; a file of 200 million random longs is usually around 4 GB.
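The experiments need such an input file to exist, and the original post does not show how it was generated. The sketch below is a hypothetical helper (the file name t.txt and the one-decimal-number-per-line format are assumptions matching the listing that follows) that writes count random longs in that format:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

public class TestDataGenerator {
    // Writes `count` random non-negative longs, one decimal value per line.
    // Returns false if the file cannot be written.
    public static boolean generate(String path, long count) {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path), 1 << 20)) {
            for (long i = 0; i < count; i++) {
                out.write(Long.toString(ThreadLocalRandom.current().nextLong(Long.MAX_VALUE)));
                out.write('\n');
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Counts the lines in a generated file (-1 on error); used as a sanity check.
    public static long lineCount(String path) {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            return in.lines().count();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // count = 200_000_000 should reproduce roughly the ~4 GB input
        // described above (about 20 bytes per line on average)
        generate("t.txt", 1000);
    }
}
```

With count set to 200 million, the output size lines up with the chunk sizes seen in the run log further down (15 chunks of ~271 MB each).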
Overall design
The Main thread splits the original file: the blocking method, given the configured number of chunks, returns the chunk boundaries of the original file, chosen so that every chunk holds whole lines and no record is cut in half.
The raw, unsorted data read from a chunk is wrapped in a Sorter object and submitted to the sort thread pool.
The sorted data is wrapped in a Writer object and handed to the writer thread pool, which writes the sorted chunk to a temporary file.
After each chunk has been written, the writing thread waits at a CyclicBarrier; once every chunk is on disk, the merge is started.
Note in particular that all resources must be released before waiting at the barrier, so that the JVM GC can reclaim the memory; that is exactly what the cleanup statements at the end of the Writer object's write method (nulling the data array and closing the channel just before barrier.await()) are for.
The merge itself is very CPU-hungry.
During the merge, because every chunk is already sorted, a single thread is enough to pick the smallest value across all chunks and put it into a BlockingQueue, while another thread keeps draining the queue and writing the values, in order, to the target file.
That is the job of the Merger object. The process uses the observer pattern and the producer/consumer pattern.
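The heart of the merge, repeatedly taking the smallest head element across the sorted chunks, is worth seeing in isolation. The listing below re-sorts the whole worker list for every emitted value; a common alternative, shown here only as a sketch (in-memory long[][] arrays stand in for the memory-mapped chunk files, and this is not how the implementation below does it), is a binary heap, which costs O(log k) per value instead of O(k log k):

```java
import java.util.PriorityQueue;

public class HeapMerge {
    // k-way merge of already-sorted chunks into one sorted array.
    public static long[] merge(long[][] chunks) {
        // heap entries: { value, chunkIndex, positionInChunk },
        // ordered by the value at the head of each chunk
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        int total = 0;
        for (int i = 0; i < chunks.length; i++) {
            total += chunks[i].length;
            if (chunks[i].length > 0) {
                heap.add(new long[] { chunks[i][0], i, 0 });
            }
        }
        long[] result = new long[total];
        int n = 0;
        while (!heap.isEmpty()) {
            long[] top = heap.poll();          // smallest head across all chunks
            result[n++] = top[0];
            int chunk = (int) top[1];
            int pos = (int) top[2] + 1;
            if (pos < chunks[chunk].length) {  // advance that chunk's cursor
                heap.add(new long[] { chunks[chunk][pos], chunk, pos });
            }
        }
        return result;
    }
}
```

For example, merging {1,4,7}, {2,5} and {3,6} yields 1..7 in order.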
The implementation:
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
public class Controller {
public static void main(String[] args) throws IOException {
Controller c = new Controller(new File("/home/lihuilin/桌面/t.txt"), 15, "/home/lihuilin/桌面/");
}
// Sort thread pool
private final ExecutorService sortThread;
// Writer thread pool: writes the sorted chunks to temporary files
private final ExecutorService writerThread;
// Barrier: once every chunk has been written to a file, start the merge
private final CyclicBarrier barrier;
// The original unsorted big file
private final File file;
// Number of chunks to split the file into
private final int pieces;
// Output directory
private final String outDir;
// Temporary files that hold the sorted chunks
private final List<File> outFileList = new ArrayList<File>();
public Controller(File file, int pieces, final String outDir) throws IOException {
final long start = System.currentTimeMillis();
sortThread = Executors.newFixedThreadPool(1);
// The writer pool needs at least as many threads as there are chunks
// (plus one for the merger): CyclicBarrier.await() blocks without releasing
// its thread, so later chunks would otherwise have no thread to run on.
writerThread = Executors.newFixedThreadPool(pieces + 1);
this.file = file;
this.pieces = pieces;
this.outDir = outDir;
this.barrier = new CyclicBarrier(pieces, new Runnable() {
@Override
public void run() {
long end = System.currentTimeMillis();
System.out.println("Total time before merge: " + (end - start) / 1000 + "s");
// Merge the sorted temporary chunk files
Merger merger = new Merger(outFileList, outDir);
writerThread.submit(merger);
try {
merger.merge();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
writerThread.shutdown();
sortThread.shutdown();
end = System.currentTimeMillis();
System.out.println("Total time for external sort: " + (end - start) / 1000 + "s");
}
});
action();
}
private void action() throws IOException {
List<Point> list = blocking(file, pieces);
for (Point p : list) {
Spilter spilter = new MappedByteBufferSpilter(file, p);
long[] data = null;
data = spilter.spilt();
Sorter s = new Sorter(data, p, writerThread, barrier, outFileList);
sortThread.submit(s);
}
}
private List<Point> blocking(File file, int piece) throws IOException {
List<Point> result = new ArrayList<Point>();
List<Long> list = new ArrayList<Long>();
list.add(-1L);
long length = file.length();
long step = length / piece;
long index = 0;
for (int i = 0; i < piece; i++) {
BufferedInputStream in = new BufferedInputStream(new FileInputStream(file));
if (index + step < length) {
index = index + step;
in.skip(index);
int b;
// advance to the next newline, but also guard against hitting EOF
while ((b = in.read()) != 10 && b != -1) {
index = index + 1;
}
list.add(index);
index++;
}
in.close();
}
list.add(length - 1);
for (int i = 0; i < list.size() - 1; i++) {
long skipSize = list.get(i) + 1;
long l = list.get(i + 1) - list.get(i);
result.add(new Point(skipSize, l, outDir));
}
return result;
}
}
class Merger implements Runnable {
private final List<Worker> workerList = new ArrayList<Worker>();
private String outDir = null;
private BlockingQueue<Long> queue = new LinkedBlockingQueue<Long>(1000);
private volatile boolean finished = false;
public Merger(List<File> outFileList, String outDir) {
for (File file : outFileList) {
Worker worker = new Worker(file, workerList);
workerList.add(worker);
}
this.outDir = outDir;
}
@Override
public void run() {
try {
System.out.println("Reading from queue, writing to the result file");
BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(outDir + "result.txt"), 50 * 1024 * 1024);
while (!finished || !queue.isEmpty()) {
// Use poll with a timeout rather than take(): take() could block forever
// on an empty queue after the producer thread has already set `finished`
Long l = queue.poll(100, java.util.concurrent.TimeUnit.MILLISECONDS);
if (l == null) {
continue;
}
bos.write((l + "\n").getBytes());
}
bos.flush();
bos.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}
public void merge() throws IOException, InterruptedException {
while (workerList.size() != 0) {
Collections.sort(workerList);
Worker worker = workerList.get(0);
Long data = worker.poll();
if (data == null) {
workerList.remove(worker);
} else {
queue.put(data);
}
}
finished = true;
}
private class Worker implements Comparable<Worker> {
private long data;
private MappedByteBuffer buffer = null;
private List<Worker> workerList = null;
private boolean eof = false;
Worker(File file, List<Worker> workerList) {
try {
RandomAccessFile rFile = new RandomAccessFile(file, "r");
FileChannel channel = rFile.getChannel();
buffer = channel.map(MapMode.READ_ONLY, 0, channel.size());
channel.close();
rFile.close();
this.workerList = workerList;
data = buffer.getLong();
} catch (IOException e) {
e.printStackTrace();
}
}
public long peek() {
return data;
}
public Long poll() {
long result = data;
if (buffer.position() != buffer.limit()) {
data = buffer.getLong();
} else {
if (eof == false) {
eof = true;
} else {
return null;
}
}
return result;
}
@Override
public int compareTo(Worker o) {
if (this.peek() > o.peek()) {
return 1;
} else if (this.peek() < o.peek()) {
return -1;
} else {
return 0;
}
}
}
}
interface Spilter {
public long[] spilt();
}
class Sorter implements Runnable {
long[] data;
Point p;
ExecutorService writerThread;
List<File> outFileList;
CyclicBarrier barrier;
public Sorter(long[] data, Point p, ExecutorService writerThread, CyclicBarrier barrier, List<File> outFileList) {
this.data = data;
this.p = p;
this.outFileList = outFileList;
this.barrier = barrier;
this.writerThread = writerThread;
}
public long[] sort() {
System.out.println("\tStart sorting: " + p);
long start = System.currentTimeMillis();
Arrays.sort(this.data);
long end = System.currentTimeMillis();
System.out.println("\tFinished sorting: " + p + ", took: " + (end - start) / 1000);
return this.data;
}
@Override
public void run() {
Writer writer = new MappedByteBufferWriter(sort(), p, barrier, outFileList);
writerThread.submit(writer);
}
}
interface Writer extends Runnable {
public void write();
}
class MappedByteBufferWriter implements Writer {
// File-name counter for the chunk files; only ever incremented from the
// single sort thread, so plain int access is safe here
private static int FLAG = 1;
private CyclicBarrier barrier = null;
private File outfile = null;
private Point point = null;
private long[] data = null;
private List<File> outFileList = null;
public MappedByteBufferWriter(long[] data, Point point, CyclicBarrier barrier, List<File> outFileList) {
this.data = data;
this.point = point;
this.outfile = new File(point.getOutDir() + FLAG + ".txt");
this.barrier = barrier;
this.outFileList = outFileList;
FLAG++;
}
@Override
public void write() {
try {
System.out.println("\t\tStart writing: " + point);
long start = System.currentTimeMillis();
FileChannel channel = new RandomAccessFile(this.outfile, "rw").getChannel();
// 8 bytes per long; cast before multiplying so the size cannot overflow int
MappedByteBuffer buffer = channel.map(MapMode.READ_WRITE, 0, (long) this.data.length * 8);
for (int i = 0; i < data.length; i++) {
buffer.putLong(data[i]);
}
buffer.force();
long end = System.currentTimeMillis();
System.out.println("\t\tFinished writing: " + point + ", took: " + (end - start) / 1000);
synchronized (outFileList) {
outFileList.add(outfile);
}
this.data = null;
channel.close();
buffer = null;
barrier.await();
} catch (IOException ex) {
ex.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (BrokenBarrierException e) {
e.printStackTrace();
}
}
@Override
public void run() {
this.write();
}
}
class MappedByteBufferSpilter implements Spilter {
private File file;
private Point point;
public MappedByteBufferSpilter(File file, Point p) {
this.file = file;
this.point = p;
}
@Override
public long[] spilt() {
System.out.println("Start reading: " + point);
long start = System.currentTimeMillis();
long[] result = null;
try {
FileChannel in = new RandomAccessFile(file, "r").getChannel();
MappedByteBuffer inBuffer = in.map(MapMode.READ_ONLY, point.getSkipSize(), point.getLength());
byte[] data = new byte[inBuffer.limit()];
inBuffer.get(data);
result = new long[getObjectSize(data)];
int resultIndex = 0;
int index = 0;
int first = 0;
while (index < data.length) {
if (data[index] == 10) {
byte[] tmpData = Arrays.copyOfRange(data, first, index);
String str = new String(tmpData);
result[resultIndex] = Long.valueOf(str);
resultIndex++;
first = index + 1;
}
index++;
}
in.close();
} catch (IOException ex) {
ex.printStackTrace();
}
long end = System.currentTimeMillis();
System.out.println("Finished reading: " + point + ", took: " + (end - start) / 1000);
return result;
}
private int getObjectSize(byte[] data) {
int size = 0;
for (byte b : data) {
if (b == 10) {
size++;
}
}
return size;
}
}
class Point {
public Point(long skipSize, long length, String outDir) {
if (length > Integer.MAX_VALUE) {
throw new RuntimeException("length overflow");
}
this.skipSize = skipSize;
this.length = (int) length;
this.outDir = outDir;
}
@Override
public String toString() {
return "Point [skipSize=" + skipSize + ", length=" + length + "]";
}
private long skipSize;
private int length;
private String outDir;
public String getOutDir() {
return outDir;
}
public long getSkipSize() {
return skipSize;
}
public int getLength() {
return length;
}
}
Run the program:
[lihuilin@lihuilin 桌面]$ java Controller
Start reading: Point [skipSize=0, length=271726519]
Finished reading: Point [skipSize=0, length=271726519], took: 8
Start reading: Point [skipSize=271726519, length=271726515]
Start sorting: Point [skipSize=0, length=271726519]
Finished sorting: Point [skipSize=0, length=271726519], took: 3
Start writing: Point [skipSize=0, length=271726519]
Finished writing: Point [skipSize=0, length=271726519], took: 2
Finished reading: Point [skipSize=271726519, length=271726515], took: 9
Start sorting: Point [skipSize=271726519, length=271726515]
Start reading: Point [skipSize=543453034, length=271726511]
Finished sorting: Point [skipSize=271726519, length=271726515], took: 4
Start writing: Point [skipSize=271726519, length=271726515]
Finished writing: Point [skipSize=271726519, length=271726515], took: 3
Finished reading: Point [skipSize=543453034, length=271726511], took: 9
Start reading: Point [skipSize=815179545, length=271726515]
Start sorting: Point [skipSize=543453034, length=271726511]
Finished sorting: Point [skipSize=543453034, length=271726511], took: 3
Start writing: Point [skipSize=543453034, length=271726511]
Finished writing: Point [skipSize=543453034, length=271726511], took: 5
Finished reading: Point [skipSize=815179545, length=271726515], took: 13
Start reading: Point [skipSize=1086906060, length=271726524]
Start sorting: Point [skipSize=815179545, length=271726515]
Finished sorting: Point [skipSize=815179545, length=271726515], took: 3
Start writing: Point [skipSize=815179545, length=271726515]
Finished writing: Point [skipSize=815179545, length=271726515], took: 5
Finished reading: Point [skipSize=1086906060, length=271726524], took: 13
Start reading: Point [skipSize=1358632584, length=271726507]
Start sorting: Point [skipSize=1086906060, length=271726524]
Finished sorting: Point [skipSize=1086906060, length=271726524], took: 3
Start writing: Point [skipSize=1086906060, length=271726524]
Finished writing: Point [skipSize=1086906060, length=271726524], took: 5
Finished reading: Point [skipSize=1358632584, length=271726507], took: 12
Start reading: Point [skipSize=1630359091, length=271726523]
Start sorting: Point [skipSize=1358632584, length=271726507]
Finished sorting: Point [skipSize=1358632584, length=271726507], took: 3
Start writing: Point [skipSize=1358632584, length=271726507]
Finished writing: Point [skipSize=1358632584, length=271726507], took: 5
Finished reading: Point [skipSize=1630359091, length=271726523], took: 13
Start reading: Point [skipSize=1902085614, length=271726514]
Start sorting: Point [skipSize=1630359091, length=271726523]
Finished sorting: Point [skipSize=1630359091, length=271726523], took: 3
Start writing: Point [skipSize=1630359091, length=271726523]
Finished writing: Point [skipSize=1630359091, length=271726523], took: 5
Finished reading: Point [skipSize=1902085614, length=271726514], took: 13
Start reading: Point [skipSize=2173812128, length=271726519]
Start sorting: Point [skipSize=1902085614, length=271726514]
Finished sorting: Point [skipSize=1902085614, length=271726514], took: 3
Start writing: Point [skipSize=1902085614, length=271726514]
Finished writing: Point [skipSize=1902085614, length=271726514], took: 5
Finished reading: Point [skipSize=2173812128, length=271726519], took: 13
Start reading: Point [skipSize=2445538647, length=271726516]
Start sorting: Point [skipSize=2173812128, length=271726519]
Finished sorting: Point [skipSize=2173812128, length=271726519], took: 3
Start writing: Point [skipSize=2173812128, length=271726519]
Finished writing: Point [skipSize=2173812128, length=271726519], took: 5
Finished reading: Point [skipSize=2445538647, length=271726516], took: 12
Start reading: Point [skipSize=2717265163, length=271726517]
Start sorting: Point [skipSize=2445538647, length=271726516]
Finished sorting: Point [skipSize=2445538647, length=271726516], took: 3
Start writing: Point [skipSize=2445538647, length=271726516]
Finished writing: Point [skipSize=2445538647, length=271726516], took: 5
Finished reading: Point [skipSize=2717265163, length=271726517], took: 13
Start reading: Point [skipSize=2988991680, length=271726517]
Start sorting: Point [skipSize=2717265163, length=271726517]
Finished sorting: Point [skipSize=2717265163, length=271726517], took: 3
Start writing: Point [skipSize=2717265163, length=271726517]
Finished writing: Point [skipSize=2717265163, length=271726517], took: 5
Finished reading: Point [skipSize=2988991680, length=271726517], took: 12
Start reading: Point [skipSize=3260718197, length=271726516]
Start sorting: Point [skipSize=2988991680, length=271726517]
Finished sorting: Point [skipSize=2988991680, length=271726517], took: 3
Start writing: Point [skipSize=2988991680, length=271726517]
Finished writing: Point [skipSize=2988991680, length=271726517], took: 5
Finished reading: Point [skipSize=3260718197, length=271726516], took: 12
Start reading: Point [skipSize=3532444713, length=271726515]
Start sorting: Point [skipSize=3260718197, length=271726516]
Finished sorting: Point [skipSize=3260718197, length=271726516], took: 3
Start writing: Point [skipSize=3260718197, length=271726516]
Finished writing: Point [skipSize=3260718197, length=271726516], took: 5
Finished reading: Point [skipSize=3532444713, length=271726515], took: 12
Start reading: Point [skipSize=3804171228, length=271726376]
Start sorting: Point [skipSize=3532444713, length=271726515]
Finished sorting: Point [skipSize=3532444713, length=271726515], took: 3
Start writing: Point [skipSize=3532444713, length=271726515]
Finished writing: Point [skipSize=3532444713, length=271726515], took: 4
Finished reading: Point [skipSize=3804171228, length=271726376], took: 12
Start sorting: Point [skipSize=3804171228, length=271726376]
Finished sorting: Point [skipSize=3804171228, length=271726376], took: 2
Start writing: Point [skipSize=3804171228, length=271726376]
Finished writing: Point [skipSize=3804171228, length=271726376], took: 3
Total time before merge: 190s
Reading from queue, writing to the result file
Total time for external sort: 398s
JVM monitoring:
The monitoring shows that the first phase is IO-bound and demands a lot of memory, while in the second phase sorting and merging put heavy pressure on the CPU, so make sure GC threads do not eat too much CPU time, i.e. the young generation must not be too small.
In the monitoring chart (the screenshot is not reproduced here), everything after FGC 55 belongs to the second phase, where young GCs increase markedly.
Verification:
First, with test data of the numbers 1-100, the program sorts correctly.
After sorting the big file, the result can be checked with the Linux sort command.
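As a cross-check that does not depend on the Linux sort command, a single pass over the result file can confirm the order. This is a hypothetical helper, not part of the original program; the one-number-per-line format follows the listing above:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

public class SortCheck {
    // Streams the result file once and verifies the values never decrease.
    public static boolean isSorted(String path) {
        try (BufferedReader in = new BufferedReader(new FileReader(path), 1 << 20)) {
            return nonDecreasing(in.lines().iterator());
        } catch (IOException e) {
            return false;
        }
    }

    // Core check, separated out so it can be exercised without a file.
    static boolean nonDecreasing(Iterator<String> lines) {
        long prev = Long.MIN_VALUE;
        while (lines.hasNext()) {
            long v = Long.parseLong(lines.next().trim());
            if (v < prev) {
                return false;
            }
            prev = v;
        }
        return true;
    }
}
```

Calling SortCheck.isSorted on the merged result.txt should return true after a successful run.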
On performance
External sorting boils down to writing sorted small files and then merging the small files into a sorted target file, so the total time should be roughly twice the time it takes to copy the file.
But...
I recall that back at the 15th Institute, Mr. Wu, with hardware roughly comparable to mine, needed only about 230 s.
Two points matter for optimization:
1. Memory-mapped files
For copying a file, or writing chunk files and reading them back, memory-mapped files avoid the copy between kernel space and user space, and let off-heap operating-system memory act as the cache.
2. Keep GC off the CPU
When the files are read and written, every conversion between bytes and long currently goes through a String as an intermediate; converting bytes and longs into each other directly, without the String, would avoid that extra garbage and hence the extra GC.
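A sketch of what that optimization could look like (hypothetical helper methods, not part of the original program): parsing a decimal value straight out of the byte array and formatting it back, with no intermediate String allocation:

```java
public class AsciiLong {
    // Parses the decimal digits in data[from, to) as a non-negative long.
    public static long parse(byte[] data, int from, int to) {
        long value = 0;
        for (int i = from; i < to; i++) {
            value = value * 10 + (data[i] - '0');
        }
        return value;
    }

    // Writes `value` followed by '\n' into out starting at offset,
    // returning the offset just past the newline.
    public static int write(long value, byte[] out, int offset) {
        int start = offset;
        do {
            out[offset++] = (byte) ('0' + value % 10);
            value /= 10;
        } while (value > 0);
        // digits were produced least-significant first; reverse them in place
        for (int i = start, j = offset - 1; i < j; i++, j--) {
            byte t = out[i];
            out[i] = out[j];
            out[j] = t;
        }
        out[offset++] = '\n';
        return offset;
    }
}
```

These could replace the String round-trips in the spilt method (new String(tmpData) / Long.valueOf) and in the merge writer ((l + "\n").getBytes()).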
The first point is easy to think of; the second is what Mr. Wu did and I have not implemented, which is probably why my version is slower. I will fill in that detail when I find the time.
Because memory-mapped files are used, it is best to drop the OS page cache before every run (on Linux, echo 3 > /proc/sys/vm/drop_caches) to avoid skewed measurements.
Source: ITPUB blog, http://blog.itpub.net/29254281/viewspace-1167988/