Parsing strings, grouping them, and writing the groups to multiple files

【Question】

I have one large file that contains weather data. I have to write each line of the large file to its corresponding state file, so there will be a total of 50 new state files, each with its own data.

The large file contains ~1 million lines of records like this:

COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553

The station name varies, though, and can contain a different number of words.

Currently I am using a regex to find the pattern and write to a file, and the output must be grouped by state. If I read in the entire file without any modifications, it takes about 46 seconds. With the code to find the state abbreviation, create the file, and write to that file, it takes over 10 minutes.

This is what I have right now:

package climate;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This program will read in a large file containing many stations and states,
 * and output in order the stations to their corresponding state file.
 *
 * Note: This take a long time depending on processor. It also appends data to
 * the files so you must remove all the state files in the current directory
 * before running for accuracy.
 *
 * @author Marcus
 *
 */

public class ClimateCleanStates {
    public static void main(String[] args) throws IOException {
        Scanner in = new Scanner(System.in);
        System.out
                .println("Note: This program can take a long time depending on processor.");
        System.out
                .println("It is also not necessary to run as state files are in this directory.");
        System.out
                .println("But if you would like to see how it works, you may continue.");
        System.out.println("Please remove state files before running.");
        System.out.println("\nIs the States directory empty?");
        String answer = in.nextLine();
        if (answer.equals("N")) {
            System.exit(0);
            in.close();
        }
        System.out.println("Would you like to run the program?");
        String answer2 = in.nextLine();
        if (answer2.equals("N")) {
            System.exit(0);
            in.close();
        }
        String[] statesSpaced = new String[51];
        File statefile, dir, infile;
        // Create files for each states
        dir = new File("States");
        dir.mkdir();
        infile = new File("climatedata.csv");
        FileReader fr = new FileReader(infile);
        BufferedReader br = new BufferedReader(fr);
        String line;
        line = br.readLine();
        System.out.println();
        // Read in climatedata.csv
        final long start = System.currentTimeMillis();
        while ((line = br.readLine()) != null) {
            // Remove instances of -9999
            if (!line.contains("-9999")) {
                String stateFileName = null;
                Pattern p = Pattern.compile(".* ([A-Z][A-Z]) US");
                Matcher m = p.matcher(line);
                if (m.find()) {
                    stateFileName = m.group(1);
                    stateFileName = "States/" + stateFileName + ".csv";
                    statefile = new File(stateFileName);
                    FileWriter stateWriter = new FileWriter(statefile, true);
                    stateWriter.write(line + "\n");
                    // Progress reporting
                    //System.out.printf("Writing [%s] to file [%s]\n", line,
                    // statefile);
                    stateWriter.flush();
                    stateWriter.close();
                }
            }
        }
        System.out.println("Elapsed " + (System.currentTimeMillis() - start) + " ms");
        br.close();
        fr.close();
        in.close();
    }
}

 

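【Answer】

Parsing strings with regular expressions is slow, and so is writing one record at a time; the writes should be batched. Implementing this computation in esProc is simple, and it finishes in a few seconds.

The code is as follows: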

     A
1    =file("data.csv").import@is()
2    =A1.group(mid(~,pos(~,"US'")-3,2):state;~:data)
3    =A2.run(file("d:\\temp\\"+state+".csv").export(data))
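
For readers unfamiliar with SPL, the three cells read every line without splitting it into fields, group the lines by the two-letter code in front of US', and export each group with a single write. A rough Java analogue is sketched below. It assumes the whole file fits in memory and is UTF-8 encoded; the names `GroupAndExport`, `stateKey`, `groupByState`, and the `states` output directory are made up for illustration, and the sketch says nothing about how esProc works internally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupAndExport {

    // Hypothetical helper: the two characters in front of " US'", or null if absent.
    public static String stateKey(String line) {
        int us = line.indexOf(" US'");
        return us >= 2 ? line.substring(us - 2, us) : null;
    }

    // Analogue of cell A2: group whole lines by state code, skipping lines without one.
    public static Map<String, List<String>> groupByState(List<String> lines) {
        return lines.stream()
                .filter(l -> stateKey(l) != null)
                .collect(Collectors.groupingBy(
                        GroupAndExport::stateKey, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) throws IOException {
        // Analogue of cell A1: load the file as a list of unsplit lines (assumes UTF-8).
        List<String> lines = Files.readAllLines(Paths.get("data.csv"), StandardCharsets.UTF_8);
        Path outDir = Files.createDirectories(Paths.get("states"));
        // Analogue of cell A3: one bulk write per state instead of one write per record.
        for (Map.Entry<String, List<String>> e : groupByState(lines).entrySet()) {
            Files.write(outDir.resolve(e.getKey() + ".csv"), e.getValue(), StandardCharsets.UTF_8);
        }
    }
}
```

Because each state file is written once, the cost of opening and closing files no longer grows with the number of records.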

esProc provides a JDBC interface and can be used like a database; see "How Java calls an SPL script".
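
If adding esProc is not an option, the two points made at the start of the answer can also be applied in plain Java: find the state code with `lastIndexOf` rather than a regex, and keep one buffered writer per state open until the end instead of reopening the file for every record. The following is a minimal sketch, not the original poster's code. The class name `StateSplitter` and the helper `stateOf` are hypothetical, and it assumes every station name ends with a two-letter code followed by US' and that the input is UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class StateSplitter {

    // Hypothetical helper: returns the two uppercase letters in front of " US'",
    // or null when the line does not follow that layout.
    public static String stateOf(String line) {
        int us = line.lastIndexOf(" US'");
        if (us < 3 || line.charAt(us - 3) != ' ') {
            return null;
        }
        char a = line.charAt(us - 2);
        char b = line.charAt(us - 1);
        boolean upper = a >= 'A' && a <= 'Z' && b >= 'A' && b <= 'Z';
        return upper ? line.substring(us - 2, us) : null;
    }

    public static void split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line = reader.readLine(); // the question's code also discards the first line
            while ((line = reader.readLine()) != null) {
                if (line.contains("-9999")) {
                    continue; // same filter as in the question
                }
                String state = stateOf(line);
                if (state == null) {
                    continue;
                }
                BufferedWriter w = writers.get(state);
                if (w == null) {
                    // Opened once per state; unlike the question's code this
                    // truncates an existing file rather than appending to it.
                    w = Files.newBufferedWriter(outDir.resolve(state + ".csv"), StandardCharsets.UTF_8);
                    writers.put(state, w);
                }
                w.write(line);
                w.write('\n');
            }
        } finally {
            for (BufferedWriter w : writers.values()) {
                w.close(); // flushes whatever is still buffered
            }
        }
    }

    public static void main(String[] args) throws IOException {
        split(Paths.get("climatedata.csv"), Paths.get("States"));
    }
}
```

With about 50 states, at most about 50 files are open at once, and each `BufferedWriter` collects many records before touching the disk.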

 
