How to handle oversized input in Java - OutOfMemoryError: GC overhead limit exceeded when processing a large text file, can't figure out how to improve performance...

Note: I browsed all the topics on this problem and I understand that it often comes down to JVM settings and efficient coding, but I don't know how to improve things further.

I am processing a large text file (1GB) of CAIDA network topologies; it is basically a dump of the entire Internet IPv4 topology. Each line has the format "node continent region country city latitude longitude", and I need to filter out all duplicate nodes (i.e., nodes with the same latitude/longitude).

I assign a unique name to all nodes with the same geo location and maintain a hashmap of each geo location -> unique name already encountered. I also maintain a hashmap of each old name -> unique name, because in a later step I must process another file where these old names have to be mapped to the new unique name per location.

I wrote this in Java because that is where all my other processing happens, but I'm getting the "GC overhead limit exceeded" error. Below is the code being executed and the error log:

Scanner sc = new Scanner(new File(geo));
String line = null;
HashMap<String, String> nodeGeoMapper = new HashMap<String, String>(); // maps each coordinate pair to a unique node name
HashMap<String, String> nodeMapper = new HashMap<String, String>(); // maps each original node name to a filtered node name (one name per geo coordinate)
PrintWriter output = new PrintWriter(geoFiltered);
output.println("#node.geo Name\tcontinent\tCountry\tregion\tcity\tlatitude\tlongitude");

int nodeCounter = 0; // counter used to generate the unique names
int frenchCounter = 0;

// declare all variables used in the loop to avoid creating thousands of tiny objects
String[] fields = null;
String name = null;
String continent = null;
String country = null;
String region = null;
String city = null;
double latitude = 0.0;
double longitude = 0.0;
String key = null;
boolean seenBefore = true;
String newname = null;
String nodename = null;

while (sc.hasNextLine()) {
    line = sc.nextLine();
    if (line.startsWith("node.geo")) {
        // process a line and retrieve the fields
        fields = line.split("\t"); // split the line into fields on the tab separator
        name = fields[0];
        name = name.trim().split(" ")[1]; // fields[0] is "node.geo N..."; keep only the node name
        continent = ""; // is empty and gets skipped
        country = fields[2];
        region = fields[3];
        city = fields[4];
        latitude = Double.parseDouble(fields[5]);
        longitude = Double.parseDouble(fields[6]);

        // we only want one node for each coordinate pair, so we map to a unique name
        key = makeGeoKey(latitude, longitude);

        // check if we have seen a node with these coordinates before
        seenBefore = true;
        if (!nodeGeoMapper.containsKey(key)) {
            newname = "N" + nodeCounter;
            nodeCounter++;
            nodeGeoMapper.put(key, newname);
            seenBefore = false;
            if (country.equals("FR"))
                frenchCounter++;
        }
        nodename = nodeGeoMapper.get(key); // retrieve the unique name assigned to these geo coordinates
        nodeMapper.put(name, nodename); // keep a reference from old name to new name so we can map later
        if (!seenBefore) {
            output.println("node.geo " + nodename + "\t" + continent + "\t" + country + "\t" + region + "\t" + city + "\t" + latitude + "\t" + longitude);
        }
    }
}
sc.close();
output.close();
nodeGeoMapper = null;
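
For context, the makeGeoKey helper is not shown in the snippet above; a plausible minimal reconstruction (purely hypothetical, the original may round or format the coordinates differently) is:

private static String makeGeoKey(double latitude, double longitude) {
    // Hypothetical: concatenate the coordinates into a single map key.
    return latitude + "," + longitude;
}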

Error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Matcher.<init>(Unknown Source)
    at java.util.regex.Matcher.toMatchResult(Unknown Source)
    at java.util.Scanner.match(Unknown Source)
    at java.util.Scanner.hasNextLine(Unknown Source)
    at DataProcessing.filterGeoNodes(DataProcessing.java:236)
    at DataProcessing.main(DataProcessing.java:114)

During execution my Java process was constantly running at 80% CPU with a total of roughly 1,000,000K (about 1GB) of memory (the laptop has 4GB total). The output file got to 59,987 unique nodes, so that is the number of keys in the GeoLocation -> Name hashmap. I don't know the size of the oldName -> newName hashmap, but it should be less than Integer.MAX_VALUE because there are not that many lines in my text file.

My two questions are:

How can I improve my code to use less memory or avoid so much GC work? (Edit: please keep it Java 7 compatible)

(solved) I've read threads on JVM settings like -Xmx1024m, but I don't know where in the Eclipse IDE I can change these settings. Can someone please show me where to set them and which settings I may want to try?
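
For reference, in Eclipse these flags go into the launch configuration rather than the code: Run > Run Configurations... > select your application > Arguments tab > "VM arguments" box. For example (the values are illustrative; size -Xmx to your machine):

    -Xmx1024m -XX:-UseGCOverheadLimit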

Thank you

SOLVED: for people with a similar problem, the issue was the nodeMapper hashmap, which had to store 34 million String objects and therefore required over 4GB of memory. I was able to run my program by first disabling the GC overhead check with -XX:-UseGCOverheadLimit and then allocating 4GB of RAM to my Java process with -Xmx4g. It took a long time to process, but it did work; it was slow because once Java reaches 3-4GB of RAM it spends a lot of time collecting garbage rather than processing the file. A stronger system would not have had any problems. Thanks for all the help!
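
A rough back-of-the-envelope estimate (assuming a 64-bit HotSpot JVM) makes the 4GB figure plausible: each mapping costs a HashMap.Entry (~32-48 bytes) plus a short String key with its backing char[] (~50-60 bytes), so 34 million entries at roughly 100 bytes each is on the order of 3.4GB before any GC headroom.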

Solution

You can also try adding this option when running:

-XX:-UseGCOverheadLimit
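
Beyond the JVM flags, two code-level changes help if the heap stays tight. The stack trace shows the OOM being hit inside Scanner's regex machinery, so a plain BufferedReader.readLine() loop reads the file with far less garbage; and since the values stored in nodeMapper can be the very String objects already held by nodeGeoMapper, only the 34 million old-name keys remain expensive. A minimal Java 7-compatible sketch along these lines (the file names and the inlined geo key are illustrative assumptions, not the original code):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;

public class GeoFilterSketch {
    public static void main(String[] args) throws IOException {
        String geo = "nodes.geo";                  // hypothetical input path
        String geoFiltered = "nodes-filtered.geo"; // hypothetical output path

        HashMap<String, String> nodeGeoMapper = new HashMap<String, String>();
        HashMap<String, String> nodeMapper = new HashMap<String, String>();
        int nodeCounter = 0;

        // BufferedReader.readLine() avoids the per-line regex matching that
        // Scanner.hasNextLine()/nextLine() performs (visible in the stack trace).
        try (BufferedReader reader = new BufferedReader(new FileReader(geo));
             PrintWriter output = new PrintWriter(new BufferedWriter(new FileWriter(geoFiltered)))) {
            output.println("#node.geo Name\tcontinent\tCountry\tregion\tcity\tlatitude\tlongitude");
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.startsWith("node.geo")) {
                    continue;
                }
                String[] fields = line.split("\t");
                String name = fields[0].trim().split(" ")[1];
                // Use the raw latitude/longitude text as the key: no
                // parse-to-double round trip, and one map lookup below
                // instead of containsKey() followed by get().
                String key = fields[5] + ',' + fields[6];
                String nodename = nodeGeoMapper.get(key);
                if (nodename == null) {
                    nodename = "N" + nodeCounter++;
                    nodeGeoMapper.put(key, nodename);
                    output.println("node.geo " + nodename + "\t\t" + fields[2] + "\t"
                            + fields[3] + "\t" + fields[4] + "\t" + fields[5] + "\t" + fields[6]);
                }
                // The value is the same String object stored in nodeGeoMapper,
                // so only the millions of distinct keys cost memory here. If
                // that is still too much, stream "name TAB nodename" pairs to
                // a file and join against the second file instead of keeping
                // this map in memory.
                nodeMapper.put(name, nodename);
            }
        }
        System.out.println("unique nodes: " + nodeGeoMapper.size());
    }
}

The frenchCounter bookkeeping from the original is omitted for brevity; the structure is otherwise the same.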
