Removing identical characters in Java: Java 8's String Deduplication feature

Strings in Java (as in other languages) consume a lot of memory because each character takes two bytes. Java 8 introduced a new feature called String Deduplication, which takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can safely manipulate them.

I have read this example so far, but since I am not a professional Java coder, I am having a hard time grasping the concept.

Here is what it says:

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My first question:

There is still a lack of resources on this topic, since it was only recently added in Java 8 update 20. Could anyone share some practical examples of how it helps reduce the memory consumed by Strings in Java?

Edit:

The above link says,

As soon as it finds another String which has the same hash code it compares them char by char

My second question:

If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once the two Strings are found to have the same hash code?

Solution

Imagine you have a phone book containing people, each of whom has a String firstName and a String lastName. Suppose that 100,000 people in your phone book have the same firstName = "John".

Because you get the data from a database or a file, those strings are not interned, so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100,000 times, once per John string. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory used by those arrays from 2 MB to 20 bytes.
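To make the phone book scenario concrete, here is a minimal sketch. The class names PhoneBookDemo and Person and the helper readName are hypothetical, invented for this illustration; readName only simulates data arriving from a file or database by building a new String on every call.

```java
import java.util.ArrayList;
import java.util.List;

public class PhoneBookDemo {
    // Hypothetical entry type for this illustration.
    static final class Person {
        final String firstName;
        final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Simulates reading a name from a file or database: every call
    // creates a new String with its own backing array (not interned).
    static String readName(char[] raw) {
        return new String(raw);
    }

    // Builds a phone book in which every person is called "John".
    static List<Person> load(int count) {
        char[] raw = {'J', 'o', 'h', 'n'};
        List<Person> people = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            people.add(new Person(readName(raw), "Doe" + i));
        }
        return people;
    }

    public static void main(String[] args) {
        List<Person> book = load(100_000);
        String a = book.get(0).firstName;
        String b = book.get(1).firstName;
        System.out.println(a.equals(b)); // true: same contents
        System.out.println(a == b);      // false: two separate String objects
    }
}
```

On Java 8u20 or later, deduplication can be switched on for a run like this with `java -XX:+UseG1GC -XX:+UseStringDeduplication PhoneBookDemo`; it only works with the G1 collector. The program's output does not change: the String objects stay distinct, and only their internal arrays end up shared, so the saving shows up in a heap profile rather than in the code.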

You can find a more detailed explanation in the JEP (JEP 192: String Deduplication in G1). In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.
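The approach quoted in the question (look up by hash, then compare char by char) can be sketched in plain Java. This is only an illustration of the idea, not the JVM's actual implementation, which lives inside the garbage collector; the class DedupSketch and its deduplicate method are hypothetical names. It also shows why the char-by-char comparison is needed: different strings can share a hash code, as "Aa" and "BB" do.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // Content hash -> weak references to arrays already seen with that hash.
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    // Returns a previously seen array with identical contents if one is
    // still alive, otherwise remembers the candidate and returns it.
    public char[] deduplicate(char[] candidate) {
        int hash = Arrays.hashCode(candidate);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>());
        // Drop entries whose arrays have already been garbage collected.
        bucket.removeIf(ref -> ref.get() == null);
        for (WeakReference<char[]> ref : bucket) {
            char[] seen = ref.get();
            // Equal hashes do not imply equal contents, so compare the chars.
            if (seen != null && Arrays.equals(seen, candidate)) {
                return seen;
            }
        }
        bucket.add(new WeakReference<>(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        char[] john1 = "John".toCharArray();
        char[] john2 = "John".toCharArray();
        System.out.println(dedup.deduplicate(john1) == john1); // true: first sighting
        System.out.println(dedup.deduplicate(john2) == john1); // true: duplicate replaced

        // Two different strings with the same hash code.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        char[] aa = "Aa".toCharArray();
        char[] bb = "BB".toCharArray();
        dedup.deduplicate(aa);
        System.out.println(dedup.deduplicate(bb) == bb); // true: collision, not a duplicate
    }
}
```

Without the comparison step, "BB" would wrongly be pointed at the array of "Aa". The weak references mean the table itself never keeps an otherwise unused array alive.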
