How to handle oversized input in Java - OutOfMemoryError: GC overhead limit exceeded when processing a large text file, can't figure out how to improve performance...

Note: I browsed all the topics on this problem and I understand that it often comes down to JVM settings and efficient coding, but I don't know how to improve things further.

I am processing a large text file (1GB) of CAIDA network topologies; it is basically a dump of the entire Internet IPv4 topology. Each line has the format "node continent region country city latitude longitude", and I need to filter out all duplicate nodes (i.e., nodes with the same latitude/longitude).

I assign a unique name to all nodes with the same geo location and maintain a hashmap of each geo location -> unique name already encountered. I also maintain a hashmap of each old name -> unique name, because in a later step I must process another file where these old names have to be mapped to the new unique name per location.

I wrote this in Java because that is where all my other processing happens, but I'm getting the "GC overhead limit exceeded" error. Below is the code being executed and the error log:

Scanner sc = new Scanner(new File(geo));
String line = null;
HashMap<String, String> nodeGeoMapper = new HashMap<String, String>(); // maps each coordinate pair to a unique node name
HashMap<String, String> nodeMapper = new HashMap<String, String>(); // maps each original node name to a filtered node name (one name per geo coordinate)
PrintWriter output = new PrintWriter(geoFiltered);
output.println("#node.geo Name\tcontinent\tCountry\tregion\tcity\tlatitude\tlongitude");

int nodeCounter = 0; // counter used to generate the unique names
int frenchCounter = 0;

// declare all variables used in the loop to avoid creating thousands of tiny objects
String[] fields = null;
String name = null;
String continent = null;
String country = null;
String region = null;
String city = null;
double latitude = 0.0;
double longitude = 0.0;
String key = null;
boolean seenBefore = true;
String newname = null;
String nodename = null;

while (sc.hasNextLine()) {
    line = sc.nextLine();
    if (line.startsWith("node.geo")) {
        // process a line and retrieve the fields
        fields = line.split("\t"); // split the line into fields on the tab separator
        name = fields[0];
        name = name.trim().split(" ")[1]; // fields[0] is "node.geo N..."; keep only the node name
        continent = ""; // is empty and gets skipped
        country = fields[2];
        region = fields[3];
        city = fields[4];
        latitude = Double.parseDouble(fields[5]);
        longitude = Double.parseDouble(fields[6]);

        // we only want one node for each coordinate pair, so we map to a unique name
        key = makeGeoKey(latitude, longitude);

        // check if we have seen a node with these coordinates before
        seenBefore = true;
        if (!nodeGeoMapper.containsKey(key)) {
            newname = "N" + nodeCounter;
            nodeCounter++;
            nodeGeoMapper.put(key, newname);
            seenBefore = false;
            if (country.equals("FR"))
                frenchCounter++;
        }
        nodename = nodeGeoMapper.get(key); // retrieve the unique name assigned to these geo coordinates
        nodeMapper.put(name, nodename); // keep a reference from old name to new name so we can map later
        if (!seenBefore) {
            output.println("node.geo " + nodename + "\t" + continent + "\t" + country + "\t" + region + "\t" + city + "\t" + latitude + "\t" + longitude);
        }
    }
}
sc.close();
output.close();
nodeGeoMapper = null;
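
For context, the makeGeoKey helper is not shown in the snippet above; a plausible minimal reconstruction (purely hypothetical, the original may round or format the coordinates differently) is:

private static String makeGeoKey(double latitude, double longitude) {
    // Hypothetical: concatenate the coordinates into a single map key.
    return latitude + "," + longitude;
}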

Error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Matcher.<init>(Unknown Source)
    at java.util.regex.Matcher.toMatchResult(Unknown Source)
    at java.util.Scanner.match(Unknown Source)
    at java.util.Scanner.hasNextLine(Unknown Source)
    at DataProcessing.filterGeoNodes(DataProcessing.java:236)
    at DataProcessing.main(DataProcessing.java:114)

During execution my Java process was constantly running at 80% CPU with a total of roughly 1,000,000K (about 1GB) of memory (the laptop has 4GB total). The output file got to 59,987 unique nodes, so that is the number of keys in the GeoLocation -> Name hashmap. I don't know the size of the oldName -> newName hashmap, but it should be less than Integer.MAX_VALUE because there are not that many lines in my text file.

My two questions are:

How can I improve my code to use less memory or avoid so much GC work? (Edit: please keep it Java 7 compatible)

(solved) I've read threads on JVM settings like -Xmx1024m, but I don't know where in the Eclipse IDE I can change these settings. Can someone please show me where to set them and which settings I may want to try?
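
For reference, in Eclipse these flags go into the launch configuration rather than the code: Run > Run Configurations... > select your application > Arguments tab > "VM arguments" box. For example (the values are illustrative; size -Xmx to your machine):

    -Xmx1024m -XX:-UseGCOverheadLimit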

Thank you

SOLVED: for people with a similar problem, the issue was the nodeMapper hashmap, which had to store 34 million String objects and therefore required over 4GB of memory. I was able to run my program by first disabling the GC overhead check with -XX:-UseGCOverheadLimit and then allocating 4GB of RAM to my Java process with -Xmx4g. It took a long time to process, but it did work; it was slow because once Java reaches 3-4GB of RAM it spends a lot of time collecting garbage rather than processing the file. A stronger system would not have had any problems. Thanks for all the help!
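
A rough back-of-the-envelope estimate (assuming a 64-bit HotSpot JVM) makes the 4GB figure plausible: each mapping costs a HashMap.Entry (~32-48 bytes) plus a short String key with its backing char[] (~50-60 bytes), so 34 million entries at roughly 100 bytes each is on the order of 3.4GB before any GC headroom.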

Solution

You can also try adding this option when running:

-XX:-UseGCOverheadLimit
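
Beyond the JVM flags, two code-level changes help if the heap stays tight. The stack trace shows the OOM being hit inside Scanner's regex machinery, so a plain BufferedReader.readLine() loop reads the file with far less garbage; and since the values stored in nodeMapper can be the very String objects already held by nodeGeoMapper, only the 34 million old-name keys remain expensive. A minimal Java 7-compatible sketch along these lines (the file names and the inlined geo key are illustrative assumptions, not the original code):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;

public class GeoFilterSketch {
    public static void main(String[] args) throws IOException {
        String geo = "nodes.geo";                  // hypothetical input path
        String geoFiltered = "nodes-filtered.geo"; // hypothetical output path

        HashMap<String, String> nodeGeoMapper = new HashMap<String, String>();
        HashMap<String, String> nodeMapper = new HashMap<String, String>();
        int nodeCounter = 0;

        // BufferedReader.readLine() avoids the per-line regex matching that
        // Scanner.hasNextLine()/nextLine() performs (visible in the stack trace).
        try (BufferedReader reader = new BufferedReader(new FileReader(geo));
             PrintWriter output = new PrintWriter(new BufferedWriter(new FileWriter(geoFiltered)))) {
            output.println("#node.geo Name\tcontinent\tCountry\tregion\tcity\tlatitude\tlongitude");
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.startsWith("node.geo")) {
                    continue;
                }
                String[] fields = line.split("\t");
                String name = fields[0].trim().split(" ")[1];
                // Use the raw latitude/longitude text as the key: no
                // parse-to-double round trip, and one map lookup below
                // instead of containsKey() followed by get().
                String key = fields[5] + ',' + fields[6];
                String nodename = nodeGeoMapper.get(key);
                if (nodename == null) {
                    nodename = "N" + nodeCounter++;
                    nodeGeoMapper.put(key, nodename);
                    output.println("node.geo " + nodename + "\t\t" + fields[2] + "\t"
                            + fields[3] + "\t" + fields[4] + "\t" + fields[5] + "\t" + fields[6]);
                }
                // The value is the same String object stored in nodeGeoMapper,
                // so only the millions of distinct keys cost memory here. If
                // that is still too much, stream "name TAB nodename" pairs to
                // a file and join against the second file instead of keeping
                // this map in memory.
                nodeMapper.put(name, nodename);
            }
        }
        System.out.println("unique nodes: " + nodeGeoMapper.size());
    }
}

The frenchCounter bookkeeping from the original is omitted for brevity; the structure is otherwise the same.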
