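Reading a file's lines in reverse, detecting the file encoding, and multithreading
1. How to read a file in reverse
I read quite a few blog posts on reading a file's lines in reverse, and most of them read the file with RandomAccessFile. Since a ready-made open-source solution exists, I went with the ReversedLinesFileReader class first, which ships in Apache's commons-io dependency.
Hmm, because of the encoding question I recommend version 2.7: judging from the source code, version 2.4 does not support GBK.
Dependency: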
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.7</version>
</dependency>
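For comparison, the RandomAccessFile approach mentioned above can be sketched roughly as follows. This is only a minimal illustration, not the code from those blog posts: the class name RandomAccessReverseReader and the method readLastLines are made up for this example, and it assumes a UTF-8 file whose lines end in \n or \r\n. It seeks backwards one byte at a time, which is easy to follow but slow; a real implementation would read in blocks, as ReversedLinesFileReader does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; assumes UTF-8 and \n or \r\n line endings.
final class RandomAccessReverseReader {

    /** Returns up to maxLines lines of the file, last line first. */
    static List<String> readLastLines(String filePath, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(filePath, "r")) {
            ByteArrayOutputStream reversed = new ByteArrayOutputStream();
            long pos = raf.length() - 1;
            boolean atFileEnd = true;
            while (pos >= 0 && lines.size() < maxLines) {
                raf.seek(pos--);
                int b = raf.read();
                if (b == '\n') {
                    // A newline as the very last byte only terminates the last line
                    if (!atFileEnd) {
                        lines.add(decode(reversed));
                    }
                    reversed.reset();
                } else if (b != '\r') {
                    // In UTF-8 the byte 0x0A never occurs inside a multi-byte character
                    reversed.write(b);
                }
                atFileEnd = false;
            }
            // The start of the file closes the first line
            if (pos < 0 && raf.length() > 0 && lines.size() < maxLines) {
                lines.add(decode(reversed));
            }
        }
        return lines;
    }

    /** The bytes were collected back to front, so flip them before decoding. */
    private static String decode(ByteArrayOutputStream reversed) {
        byte[] bytes = reversed.toByteArray();
        for (int i = 0, j = bytes.length - 1; i < j; i++, j--) {
            byte tmp = bytes[i];
            bytes[i] = bytes[j];
            bytes[j] = tmp;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Calling RandomAccessReverseReader.readLastLines(path, 10) would return at most the last ten lines, newest first.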
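2. The reading code
// Required imports:
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReversedLinesFileReader;

public void reverseReadFileContent(String filePath) {
    File file = new File(filePath);
    // Block size is the file length in KB, with a minimum of 2
    int blockSize = (int) Math.max(2, file.length() / 1024);
    // try-with-resources closes the reader automatically
    try (ReversedLinesFileReader reader = new ReversedLinesFileReader(file, blockSize,
            StandardCharsets.UTF_8)) {
        String line;
        // readLine() returns null once the start of the file has been reached
        while ((line = reader.readLine()) != null) {
            System.out.println("Line content: " + line);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}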
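3. Detecting the file's encoding
As the code above shows, it only reads UTF-8 files, which does not work when integrating with the other modules of the existing project. So before reading, the file's encoding has to be determined first. A round of searching on Baidu turned up approaches that inspect the first few bytes of the file. Once again, though, there is a mature module that supports what we need, so as usual I went with the mature option.
Dependency: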
<dependency>
<groupId>net.sourceforge.cpdetector</groupId>
<artifactId>cpdetector</artifactId>
<version>1.0.4</version>
</dependency>
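To show what the "first few bytes" approach looks like and why it is not enough on its own, here is a small sketch that only checks for a byte order mark (BOM). The class BomSniffer and the method detectByBom are hypothetical names for this example and have nothing to do with cpdetector. Files without a BOM, which includes GBK files and most UTF-8 files, simply yield null, and that is where a content-based detector such as cpdetector comes in. The sketch also ignores UTF-32, whose little-endian BOM starts with the same two bytes as UTF-16LE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; it recognises only UTF-8 and UTF-16 BOMs.
final class BomSniffer {

    /** Returns the charset indicated by a byte order mark, or null if there is none. */
    static Charset detectByBom(String filePath) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(filePath))) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        int b2 = head[2] & 0xFF;
        if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (n >= 2 && b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (n >= 2 && b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding has to be guessed from the content
    }
}
```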
4. Code for detecting the file's encoding
/**
 * Detects the encoding of a file. The method is synchronized because calling it
 * without a lock caused concurrency problems between threads.
 * @param filePath path of the file to inspect
 * @return the detected charset name, or "UTF-8" if detection fails
 */
public static synchronized String getFileEncode(String filePath) {
    // Fallback when nothing is detected or an exception occurs
    String charsetName = "UTF-8";
    try {
        File file = new File(filePath);
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        // Register three detectors (a fourth is left commented out)
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());
        // detector.add(UnicodeDetector.getInstance());
        java.nio.charset.Charset charset = detector.detectCodepage(file.toURI().toURL());
        if (charset != null) {
            charsetName = charset.name();
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return charsetName;
}
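My reads run in multiple threads, each scanning a different file for content, so concurrency problems come up. Adding the lock resolves them.
5. GBK support in ReversedLinesFileReader
The constructor below shows that this version of the reader supports GBK-encoded files. Our Chinese files, however, may be detected as GB2312 or Big5, and passing either of those charsets in directly makes the constructor throw an exception. GBK is a superset of GB2312, so a GB2312 file can safely be read as GBK. Big5 is a separate Traditional Chinese encoding and is not a subset of GBK; its supported counterpart in the constructor is x-windows-950 (Windows code page 950). The detected name is therefore checked against a small collection and replaced with a charset the reader supports before the file is read.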
/**
* Creates a ReversedLinesFileReader with the given block size and encoding.
*
* @param file
* the file to be read
* @param blockSize
* size of the internal buffer (for ideal performance this should
* match with the block size of the underlying file system).
* @param encoding
* the encoding of the file
* @throws IOException if an I/O error occurs
* @since 2.7
*/
public ReversedLinesFileReader(final Path file, final int blockSize, final Charset encoding) throws IOException {
this.blockSize = blockSize;
this.encoding = encoding;
// --- check & prepare encoding ---
final Charset charset = Charsets.toCharset(encoding);
final CharsetEncoder charsetEncoder = charset.newEncoder();
final float maxBytesPerChar = charsetEncoder.maxBytesPerChar();
if (maxBytesPerChar == 1f) {
// all one byte encodings are no problem
byteDecrement = 1;
} else if (charset == StandardCharsets.UTF_8) {
// UTF-8 works fine out of the box, for multibyte sequences a second UTF-8 byte can never be a newline byte
// http://en.wikipedia.org/wiki/UTF-8
byteDecrement = 1;
} else if(charset == Charset.forName("Shift_JIS") || // Same as for UTF-8
// http://www.herongyang.com/Unicode/JIS-Shift-JIS-Encoding.html
charset == Charset.forName("windows-31j") || // Windows code page 932 (Japanese)
charset == Charset.forName("x-windows-949") || // Windows code page 949 (Korean)
charset == Charset.forName("gbk") || // Windows code page 936 (Simplified Chinese)
charset == Charset.forName("x-windows-950")) { // Windows code page 950 (Traditional Chinese)
byteDecrement = 1;
} else if (charset == StandardCharsets.UTF_16BE || charset == StandardCharsets.UTF_16LE) {
// UTF-16 new line sequences are not allowed as second tuple of four byte sequences,
// however byte order has to be specified
byteDecrement = 2;
} else if (charset == StandardCharsets.UTF_16) {
throw new UnsupportedEncodingException("For UTF-16, you need to specify the byte order (use UTF-16BE or " +
"UTF-16LE)");
} else {
throw new UnsupportedEncodingException("Encoding " + encoding + " is not supported yet (feel free to " +
"submit a patch)");
}
// NOTE: The new line sequences are matched in the order given, so it is important that \r\n is BEFORE \n
newLineSequences = new byte[][] { "\r\n".getBytes(encoding), "\n".getBytes(encoding), "\r".getBytes(encoding) };
avoidNewlineSplitBufferSize = newLineSequences[0].length;
// Open file
channel = Files.newByteChannel(file, StandardOpenOption.READ);
totalByteLength = channel.size();
int lastBlockLength = (int) (totalByteLength % blockSize);
if (lastBlockLength > 0) {
totalBlockCount = totalByteLength / blockSize + 1;
} else {
totalBlockCount = totalByteLength / blockSize;
if (totalByteLength > 0) {
lastBlockLength = blockSize;
}
}
currentFilePart = new FilePart(totalBlockCount, lastBlockLength, null);
}
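// Maps a detected charset name to a charset that ReversedLinesFileReader supports
// (requires java.util.Map and java.util.HashMap)
private static final Map<String, String> supportedCharsets = new HashMap<>();
static {
    // GBK is a superset of GB2312 (Simplified Chinese)
    supportedCharsets.put("GB2312", "gbk");
    // Big5 (Traditional Chinese) is not covered by GBK; code page 950 is Microsoft's Big5 variant
    supportedCharsets.put("BIG5", "x-windows-950");
}
/**
 * Takes the detected charset name and returns a charset the reader supports,
 * e.g. GBK for GB2312 (Simplified Chinese) and x-windows-950 for BIG5 (Traditional Chinese).
 * @param fileCharset the detected charset name, in upper case
 * @return the charset name to pass to the reader; unchanged if no mapping exists
 */
private String getFileCharset(String fileCharset) {
    return supportedCharsets.getOrDefault(fileCharset, fileCharset);
}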
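A quick way to check whether substituting one charset for another is safe is to encode some text with the first charset and decode the bytes with the second. The sketch below does this with the JDK's own charsets. The class GbkCompatibilityDemo and the method transcode are names invented for this example, and it assumes a JDK that ships the extended charsets (GB2312, GBK, Big5, x-windows-950), which standard JDK distributions normally do.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; assumes the JDK's extended charsets are available.
final class GbkCompatibilityDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String transcode(String text, String writeCharset, String readCharset) {
        byte[] bytes = text.getBytes(Charset.forName(writeCharset));
        return new String(bytes, Charset.forName(readCharset));
    }

    public static void main(String[] args) {
        String simplified = "\u4e2d\u6587";  // "中文" in Simplified Chinese
        String traditional = "\u7e41\u9ad4"; // "繁體" in Traditional Chinese

        // GB2312 bytes survive being read as GBK
        System.out.println(transcode(simplified, "GB2312", "GBK").equals(simplified));
        // Big5 bytes do not survive being read as GBK
        System.out.println(transcode(traditional, "Big5", "GBK").equals(traditional));
        // Big5 bytes survive being read as code page 950
        System.out.println(transcode(traditional, "Big5", "x-windows-950").equals(traditional));
    }
}
```

The first and third checks round-trip cleanly, while the second one does not, which is why a Big5 file should not be handed to the reader as GBK.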
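The file-reading code from earlier can then be changed as follows:
public void reverseReadFileContent(String filePath) {
    File file = new File(filePath);
    // Block size is the file length in KB, with a minimum of 2
    int blockSize = (int) Math.max(2, file.length() / 1024);
    // Detect the encoding, then map it to a charset the reader supports
    String charset = getFileEncode(filePath);
    charset = getFileCharset(charset.toUpperCase());
    // try-with-resources closes the reader automatically
    try (ReversedLinesFileReader reader = new ReversedLinesFileReader(file, blockSize,
            Charset.forName(charset))) {
        String line;
        // readLine() returns null once the start of the file has been reached
        while ((line = reader.readLine()) != null) {
            System.out.println("Line content: " + line);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}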
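6. Finally, here is my multithreaded reading code. I need the return value of each thread, so I use a thread pool together with Callable.
// paths: the files to read; maxPoolSize: a default pool size that can be set in a configuration file
// log: the class's logger (for example an SLF4J Logger)
public List<String> readMultiFileContent(List<String> paths, int maxPoolSize) {
    List<String> result = new ArrayList<>();
    if (paths.isEmpty()) {
        return result; // newFixedThreadPool rejects a pool size of 0
    }
    ExecutorService executor = Executors.newFixedThreadPool(Math.min(paths.size(), maxPoolSize));
    ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executor);
    try {
        for (String path : paths) {
            completionService.submit(() -> {
                reverseReadFileContent(path); // the result of the read can be obtained here
                return "the method's return value";
            });
        }
        for (int i = 0; i < paths.size(); i++) {
            // take() blocks until the next task has finished
            Future<String> future = completionService.take();
            result.add(future.get());
        }
    } catch (Exception e) {
        log.error("Thread error", e);
    } finally {
        executor.shutdown();
    }
    return result;
}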
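Those are all the steps it took to meet the requirement. Next I will try other approaches to see whether the read can be made faster. What I need is to read in reverse the latest n lines (n being a constant such as 10, 30, or 50) that contain the input string, so in the extreme case every line of the file gets traversed. The commons-io based reverse read becomes a problem in that situation, so I am looking for a more efficient way to read. Feel free to point out shortcomings and to suggest new ideas or solutions.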
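To make the requirement concrete, the search can stop as soon as n matching lines have been collected, so the whole file is only traversed when fewer than n lines match. The following is a minimal sketch of that loop, not my production code: LastMatches, LineSource, and find are hypothetical names, and LineSource is just a small interface standing in for any reader that returns lines newest first and null at the end.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
final class LastMatches {

    /** Any source of lines that returns null when no lines are left. */
    interface LineSource {
        String readLine() throws IOException;
    }

    /** Collects up to n lines containing keyword, in the order the source yields them. */
    static List<String> find(LineSource source, String keyword, int n) throws IOException {
        List<String> matches = new ArrayList<>();
        String line;
        // Checking the size first means no further line is read once n matches are found
        while (matches.size() < n && (line = source.readLine()) != null) {
            if (line.contains(keyword)) {
                matches.add(line);
            }
        }
        return matches;
    }
}
```

With a ReversedLinesFileReader named reader, the call could look like LastMatches.find(reader::readLine, "keyword", 10).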