Removing identical characters in Java: the String Deduplication feature of Java 8

Strings in Java (as in other languages) consume a lot of memory because each character takes two bytes. Java 8 introduced a new feature called String Deduplication, which takes advantage of the fact that the char arrays are internal to Strings and final, so the JVM can mess around with them.

I have read this example so far, but since I am not a pro Java coder, I am having a hard time grasping the concept.

Here is what it says,

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My first question:

There is still a lack of resources on this topic, since it was only recently added in Java 8 update 20. Could anyone here share some practical examples of how it helps reduce the memory consumed by Strings in Java?

Edit:

The above link says,

As soon as it finds another String which has the same hash code it compares them char by char

My second question:

If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once it is found that the two Strings have the same hash code?

Solution

Imagine you have a phone book containing people, each with a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file, those strings are not interned, so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory used by those arrays from 2 MB to 20 bytes. The String objects themselves remain; only the arrays behind them are shared.
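To make the phone book scenario concrete, here is a minimal sketch. The class and method names (PhoneBookDemo, loadFirstNames, countDistinctObjects) are made up for illustration, and the loop only imitates reading names from a database or file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical demo class: imitates a phone book full of "John" entries.
public class PhoneBookDemo {

    // Builds `count` separate String objects with the same content,
    // the way names read from a database or file would arrive (not interned).
    static List<String> loadFirstNames(int count) {
        List<String> names = new ArrayList<>(count);
        char[] raw = {'J', 'o', 'h', 'n'};
        for (int i = 0; i < count; i++) {
            names.add(new String(raw)); // copies the chars into a fresh String
        }
        return names;
    }

    // Counts distinct String objects by identity (==), not by equals().
    static int countDistinctObjects(List<String> names) {
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        identities.addAll(names);
        return identities.size();
    }

    public static void main(String[] args) {
        List<String> names = loadFirstNames(100_000);
        System.out.println("equal content:    " + names.get(0).equals(names.get(1)));
        System.out.println("same object:      " + (names.get(0) == names.get(1)));
        System.out.println("distinct objects: " + countDistinctObjects(names));
    }
}
```

On Java 8u20 or later, the feature is switched on with JVM flags and only works with the G1 collector, for example: java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics PhoneBookDemo. Deduplication happens inside the garbage collector, so nothing changes from the program's point of view: the Strings stay distinct objects and == still returns false. A short-lived program like this one may also exit before the collector gets around to deduplicating anything, so treat it as an illustration of the situation rather than a benchmark.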

You can find a more detailed explanation in the JEP. In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.
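As for the second question: equal hash codes do not imply equal Strings. A hash code has only a limited number of possible values while there are unlimited possible Strings, so different Strings can end up with the same hash. That is why the char-by-char comparison is still needed. The sketch below uses String.hashCode() as a stand-in; the hash the garbage collector computes internally may well be a different function, but the same reasoning applies. The class name HashCollisionDemo and the helper isCollision are made up for illustration.

```java
// Hypothetical demo class: two different Strings with the same hash code.
public class HashCollisionDemo {

    // True when two strings share a hash code but differ in content.
    static boolean isCollision(String a, String b) {
        return a.hashCode() == b.hashCode() && !a.equals(b);
    }

    public static void main(String[] args) {
        // String.hashCode() for a two-char string is c0 * 31 + c1:
        // "Aa" -> 65 * 31 + 97 = 2112, "BB" -> 66 * 31 + 66 = 2112
        System.out.println("Aa".hashCode());         // 2112
        System.out.println("BB".hashCode());         // 2112
        System.out.println(isCollision("Aa", "BB")); // true
    }
}
```

If the JVM trusted the hash alone, "Aa" and "BB" could end up sharing one char array, and one of them would silently change its content.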