1 Problem description
The Java code contains a snippet like the one below. It concatenates several strings and puts the result into a map as the key.
public void hashCode(List<String> values) {
    long start2 = System.currentTimeMillis();
    // The map must outlive the loop, and the put must happen once per key.
    Map<String, Object> map = new HashMap<>();
    for (int i = 0; i + 1 < values.size(); i += 2) {
        // Concatenate one pair of strings and use the result as the key.
        StringBuilder builder = new StringBuilder();
        builder.append(values.get(i));
        builder.append(values.get(i + 1));
        map.put(builder.toString(), new Object());
    }
    long end2 = System.currentTimeMillis();
    System.out.println("string hash cost :" + (end2 - start2));
}
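A single run does not reveal the cost of this code, but at tens of millions of calls it takes a long time.
On my laptop (i7 HQ, 8 GB RAM), ten million iterations take 2-3 s. In theory, the time goes into the string concatenation and the hashCode calculation. To confirm this, we first look through the source code for likely problems.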
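2 Source code analysis
2.1 How StringBuilder builds a string
First comes the construction of the StringBuilder object. By default, StringBuilder allocates a char array of capacity 16. This is only an initial allocation and should not take much time.
On append, if the current array is too small, a larger one is allocated. In the JDK version quoted here (the char[]-based implementation), the new capacity is (old capacity * 2 + 2), or the required length if that is still larger. StringBuilder copies all existing data into the new array, and the old array becomes garbage.
If every append lands exactly on the current boundary, the capacity grows in the sequence [16, 17*2=34, 35*2=70, 71*2=142, ...]. Each expansion costs an allocation and a copy, and the discarded arrays may add GC pressure.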
public final class StringBuilder
    extends AbstractStringBuilder
    implements java.io.Serializable, CharSequence
{
    public StringBuilder() {
        super(16);
    }
}
abstract class AbstractStringBuilder implements Appendable, CharSequence {
    AbstractStringBuilder(int capacity) {
        value = new char[capacity];
    }
}
// AbstractStringBuilder
public AbstractStringBuilder append(String str) {
    if (str == null)
        return appendNull();
    int len = str.length();
    ensureCapacityInternal(count + len);
    str.getChars(0, len, value, count);
    count += len;
    return this;
}
// Arrays
public static char[] copyOf(char[] original, int newLength) {
    char[] copy = new char[newLength];
    System.arraycopy(original, 0, copy, 0,
                     Math.min(original.length, newLength));
    return copy;
}
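The growth sequence described above can be observed from the outside through `StringBuilder.capacity()`. The following is a minimal sketch, not part of the original code: the class name `CapacityGrowthDemo` and the helper `observedCapacities` are made up for illustration, and the exact growth policy is a JDK implementation detail that may differ between versions.

```java
import java.util.ArrayList;
import java.util.List;

class CapacityGrowthDemo {
    // Appends one char at a time and records every distinct capacity that shows up.
    static List<Integer> observedCapacities(int appends) {
        StringBuilder builder = new StringBuilder(); // default capacity is 16
        List<Integer> capacities = new ArrayList<>();
        capacities.add(builder.capacity());
        for (int i = 0; i < appends; i++) {
            builder.append('x');
            int capacity = builder.capacity();
            if (capacity != capacities.get(capacities.size() - 1)) {
                capacities.add(capacity); // the backing array was reallocated
            }
        }
        return capacities;
    }

    public static void main(String[] args) {
        // With the "old * 2 + 2" policy this lists 16, 34, 70, 142.
        System.out.println(observedCapacities(100));
    }
}
```

Every new entry in the list corresponds to one allocation plus one full copy of the data appended so far.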
Besides growing the array, StringBuilder also has to copy the chars of every appended object into its own array.
// String
public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {
    if (srcBegin < 0) {
        throw new StringIndexOutOfBoundsException(srcBegin);
    }
    if (srcEnd > value.length) {
        throw new StringIndexOutOfBoundsException(srcEnd);
    }
    if (srcBegin > srcEnd) {
        throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
    }
    System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
}
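From the data-loading point of view, the points to watch are: 1. when the data outgrows the allocated array, the array has to be expanded; 2. every append copies the appended data; 3. returning a String object copies the data once more, into the new String.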
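Of the three points, the first can be avoided when the total length is known in advance. The sketch below is only an illustration under that assumption: `PresizedConcat` and `concat` are hypothetical names, and the parts are assumed to be non-null.

```java
class PresizedConcat {
    // Joins the parts with a builder sized up front, so append never has to grow the array.
    static String concat(String... parts) {
        int total = 0;
        for (String part : parts) {
            total += part.length();
        }
        StringBuilder builder = new StringBuilder(total);
        for (String part : parts) {
            builder.append(part); // still one copy per part (point 2)
        }
        return builder.toString(); // still one final copy into the String (point 3)
    }

    public static void main(String[] args) {
        System.out.println(concat("abc", "def", "gh"));
    }
}
```

Points 2 and 3 remain: each part is copied into the builder, and `toString()` copies the result again.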
2.2 hash
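HashMap uses the key's hashCode to locate the storage position. If that position already holds data, a list is hung off it and the colliding entries are stored one after another.
A position conflict falls into one of several cases: 1. same hashCode, different value, same position; 2. same hashCode, same value, same position; 3. different hashCode, different value, same position.
In cases 1 and 3 the entries are placed in the list in turn. In case 2 the existing entry is overwritten.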
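The three cases can be reproduced with a few well-known values. This is a small sketch rather than anything from the original code: `CollisionCasesDemo` and `bucketIndex` are hypothetical names, and `bucketIndex` assumes the JDK 8 style of spreading the hashCode and a table length of 16.

```java
import java.util.HashMap;
import java.util.Map;

class CollisionCasesDemo {
    // Assumed to mirror how HashMap maps a hashCode to a slot in a table of length n (a power of two).
    static int bucketIndex(int hashCode, int n) {
        int spread = hashCode ^ (hashCode >>> 16);
        return (n - 1) & spread;
    }

    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2); // case 1: same hashCode as "Aa", different key, both entries are kept
        map.put("Aa", 3); // case 2: same hashCode, equal key, the old value is overwritten
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());
        System.out.println(build());
        // case 3: different hashCodes (1 and 17) that share a slot when n = 16
        System.out.println(bucketIndex(1, 16) + " " + bucketIndex(17, 16));
    }
}
```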
On put, HashMap first obtains the key's hashCode. When the hashCodes are equal, it compares the keys by reference equality and by the equals method.
The comparison logic in HashMap:
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        .
        .
        .
}
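As the code above shows, a put makes HashMap compute the key's hashCode immediately and use it for addressing. If the address collides, the hashCode is the first criterion for equality. If the hashCodes are equal, the keys are considered equal when they are the same reference or when equals returns true.
So overall, two functions deserve attention: hashCode and equals.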
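To see how often the two functions are actually invoked, a key class can count its own calls. This is a sketch for illustration only: `CountingKeyDemo`, `Key` and `run` are hypothetical names, and the counts reflect the put logic quoted above.

```java
import java.util.HashMap;
import java.util.Map;

class CountingKeyDemo {
    static int hashCalls = 0;
    static int equalsCalls = 0;

    static final class Key {
        private final String text;
        Key(String text) { this.text = text; }
        @Override public int hashCode() { hashCalls++; return text.hashCode(); }
        @Override public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Key && text.equals(((Key) other).text);
        }
    }

    // Returns {hashCalls, equalsCalls} after the first put, then after the second put, then the map size.
    static int[] run() {
        hashCalls = 0;
        equalsCalls = 0;
        Map<Key, Integer> map = new HashMap<>();
        map.put(new Key("k"), 1);            // empty slot: hashCode only
        int h1 = hashCalls, e1 = equalsCalls;
        map.put(new Key("k"), 2);            // occupied slot, different instance: hashCode, then equals
        return new int[] { h1, e1, hashCalls, equalsCalls, map.size() };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run()));
    }
}
```

The first put into an empty slot needs no equals call at all; equals only runs once an entry with the same hash is already in place.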
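String's hashCode algorithm is shown below. It walks every element of the char array, multiplying the running value by 31 and adding the next element. Some online sources say this algorithm has a relatively high collision probability, but in practice it makes little difference.
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}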
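As a worked example of the formula, "abc" hashes to 97*31*31 + 98*31 + 99 = 96354. The sketch below recomputes the hash by hand and compares it with `String.hashCode()`; `PolyHashDemo` and `polyHash` are hypothetical names used only for this illustration.

```java
class PolyHashDemo {
    // Recomputes the polynomial hash: h = 31 * h + c for every char c.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("abc") + " " + "abc".hashCode());
    }
}
```

Unlike the JDK method, this sketch does not cache the result, so it rescans the string on every call.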
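String's equals algorithm walks the char array of this string and of the target, comparing them char by char. One detail looks odd at first: the while loop is controlled by the variable n, while the array elements are read through the variable i. n simply counts down the remaining chars and i is the ascending index, so the loop body runs exactly value.length times.
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}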
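3 HashKey implementation
Based on the analysis above, I wrote a new class to serve as the map key.
The optimization is mainly on memory copying: the data is copied only once.
For the hash it uses the FNVHash algorithm, following an implementation found online.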
package org.yunzhong.test.stream;

import java.util.Arrays;

public class HashKey {
    private static final int HASH_PARAM = 16777619;
    private static final int HASH_INIT = (int) 2166136261L;
    private int hashCode;
    private char[] values;
    private int count;

    public HashKey() {
        values = new char[64];
        count = 0;
    }

    public void append(String value) {
        int minLength = 0;
        if ((minLength = value.length() + count) > values.length) {
            values = Arrays.copyOf(values, minLength * 2);
        }
        // Copy the chars straight into the key's own array: the only copy made.
        value.getChars(0, value.length(), values, count);
        count += value.length();
        // The content changed, so drop any cached hash.
        hashCode = 0;
    }

    // Same polynomial hash as String.hashCode, kept for comparison.
    public void hash1() {
        for (int i = 0; i < count; ++i) {
            hashCode = 31 * hashCode + values[i];
        }
    }

    // FNV-style hash over the chars, followed by extra mixing steps.
    public void hash() {
        // Start from the FNV offset basis, which was declared but unused before.
        hashCode = HASH_INIT;
        for (int i = 0; i < count; ++i) {
            hashCode = (hashCode ^ values[i]) * HASH_PARAM;
        }
        hashCode += hashCode << 13;
        hashCode ^= hashCode >> 7;
        hashCode += hashCode << 3;
        hashCode ^= hashCode >> 17;
        hashCode += hashCode << 5;
    }

    @Override
    public int hashCode() {
        // 0 doubles as the "not computed yet" marker.
        if (this.hashCode == 0) {
            hash();
        }
        return hashCode;
    }

    public int getHashCode() {
        return hashCode;
    }

    public void setHashCode(int hashCode) {
        this.hashCode = hashCode;
    }

    public char[] getValues() {
        return values;
    }

    public void setValues(char[] values) {
        this.values = values;
    }

    public int getEnd() {
        return count;
    }

    public void setEnd(int end) {
        this.count = end;
    }

    @Override
    public boolean equals(Object target) {
        if (this == target) {
            return true;
        }
        // Reject null and foreign types instead of failing on the cast.
        if (!(target instanceof HashKey)) {
            return false;
        }
        HashKey key = (HashKey) target;
        int length = this.count;
        if (length == key.count) {
            int i = 0;
            char[] v1 = this.values;
            char[] v2 = key.values;
            while (length-- != 0) {
                if (v1[i] != v2[i]) {
                    return false;
                }
                i++;
            }
            return true;
        }
        return false;
    }

    @Override
    public String toString() {
        return String.copyValueOf(this.values, 0, count);
    }
}
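For comparison with the hash() method of HashKey above, here is a minimal sketch of textbook 32-bit FNV-1a over bytes. It is not the author's code: `Fnv1a32` is a hypothetical class name, and HashKey differs from it by hashing chars instead of bytes and by adding shift-and-xor mixing steps at the end.

```java
import java.nio.charset.StandardCharsets;

class Fnv1a32 {
    static final int OFFSET_BASIS = (int) 2166136261L; // 0x811c9dc5
    static final int PRIME = 16777619;                 // 0x01000193

    // FNV-1a: xor each byte into the hash, then multiply by the prime.
    static int hash(byte[] data) {
        int h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= PRIME;
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] input = "a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(hash(input)));
    }
}
```

The multiplication relies on Java's int arithmetic wrapping around at 32 bits, which is what the algorithm expects.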
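4 Performance comparison
Tested with 4 million entries. My laptop: i7 HQ, 8 GB RAM.
Overall the average time goes down, but it never reaches a multiple-fold improvement. With my limited knowledge, this is as far as I got.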
StringBuilder test case
@Test
public void testHashPut() {
    // "g" replaces a duplicated "j" in the original alphabet.
    String[] characters = new String[] { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
            "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
    Random random = new Random();
    List<String> values = Lists.newArrayList();
    // Prepare 4 million random 10-char strings (not timed).
    for (int i = 0; i < 4000000; i++) {
        StringBuilder builder = new StringBuilder();
        for (int j = 0; j < 10; j++) {
            // Use the array length so the last element can be picked too.
            int nextInt = random.nextInt(characters.length);
            builder.append(characters[nextInt]);
        }
        values.add(builder.toString());
    }
    long start = System.currentTimeMillis();
    Map<String, Object> map = new HashMap<String, Object>();
    for (int i = 3; i < values.size(); i++) {
        // Key = four consecutive strings concatenated.
        StringBuilder builder = new StringBuilder();
        builder.append(values.get(i - 3));
        builder.append(values.get(i - 2));
        builder.append(values.get(i - 1));
        builder.append(values.get(i));
        map.put(builder.toString(), new Object());
    }
    System.out.println("hash init cost:" + (System.currentTimeMillis() - start));
}
HashKey test case
@Test
public void testHashPutOnceCopy() {
    // "g" replaces a duplicated "j" in the original alphabet.
    String[] characters = new String[] { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
            "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
    Random random = new Random();
    List<String> values = Lists.newArrayList();
    // Prepare 4 million random 10-char strings (not timed).
    for (int i = 0; i < 4000000; i++) {
        StringBuilder builder = new StringBuilder();
        for (int j = 0; j < 10; j++) {
            // Use the array length so the last element can be picked too.
            int nextInt = random.nextInt(characters.length);
            builder.append(characters[nextInt]);
        }
        values.add(builder.toString());
    }
    long start = System.currentTimeMillis();
    Map<HashKey, Object> map = new HashMap<HashKey, Object>(1000000);
    for (int i = 3; i < values.size(); i++) {
        // Key = four consecutive strings copied once into a HashKey.
        HashKey key = new HashKey();
        key.append(values.get(i - 3));
        key.append(values.get(i - 2));
        key.append(values.get(i - 1));
        key.append(values.get(i));
        map.put(key, new Object());
    }
    System.out.println("once hash init cost:" + (System.currentTimeMillis() - start));
}
HashKey, 2 attributes (times in ms)
once hash init cost:7437
once hash init cost:3588
once hash init cost:3593
once hash init cost:1599
once hash init cost:4285
once hash init cost:1597
once hash init cost:1763
once hash init cost:1607
once hash init cost:1526
once hash init cost:1519
StringBuilder, 2 attributes (times in ms)
hash init cost:4588
hash init cost:2890
hash init cost:3226
hash init cost:2963
hash init cost:1743
hash init cost:1695
hash init cost:1729
hash init cost:1748
hash init cost:1641
hash init cost:1859
HashKey, 4 attributes (times in ms)
once hash init cost:7561
once hash init cost:4270
once hash init cost:3726
once hash init cost:4334
once hash init cost:4330
once hash init cost:1936
once hash init cost:1914
once hash init cost:2025
once hash init cost:1926
once hash init cost:2068
StringBuilder, 4 attributes (times in ms)
hash init cost:6841
hash init cost:3479
hash init cost:3590
hash init cost:3897
hash init cost:3676
hash init cost:4806
hash init cost:3460
hash init cost:3661
hash init cost:3512
hash init cost:3466
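In most of the series above, the first few runs are several times slower than the later ones. One possible explanation, which the measurements themselves do not prove, is JIT warm-up and GC activity. A common way to reduce that noise is to run the task a few times untimed before measuring. The sketch below shows the idea; `WarmupTimer` and `measure` are hypothetical names, and a dedicated harness such as JMH would be more reliable than this hand-rolled loop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class WarmupTimer {
    // Runs the task `warmups` times untimed, then returns the elapsed milliseconds of each measured run.
    static long[] measure(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Illustrative workload: 100,000 concatenated keys put into a fresh map.
        Runnable task = () -> {
            Map<String, Object> map = new HashMap<>();
            for (int i = 0; i < 100_000; i++) {
                map.put("key" + i + "-" + (i + 1), Boolean.TRUE);
            }
        };
        System.out.println(Arrays.toString(measure(task, 5, 10)));
    }
}
```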
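5 Multithreading
I did not really want to use multithreading here. Multithreading means coordination between threads and competition for CPU resources, and when the system is under heavy load it does not improve performance much.
Also, initializing a map is a very small feature, and starting multiple threads for it feels like using a sledgehammer to crack a nut.
Finally, initializing millions of entries is a rare case. Whether it runs in 1 s or in 10 s matters little to overall performance.
Still, it is one possible approach, so I tested it on my own machine as well. With 4 million entries and three strings concatenated, the test code and data are as follows: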
private ExecutorService threadPool = Executors.newFixedThreadPool(8, new ThreadFactory() {
    private int threadNum;

    public Thread newThread(Runnable r) {
        Thread th = new Thread(r);
        th.setName("hashThread" + threadNum++);
        return th;
    }
});

@Test
public void testHashPutOnceCopyMultiThread() throws InterruptedException, ExecutionException {
    // "g" replaces a duplicated "j" in the original alphabet.
    String[] characters = new String[] { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
            "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
    int batch = 100000;
    Random random = new Random();
    final List<String> values = Lists.newArrayList();
    // Prepare 4 million random 10-char strings (not timed).
    for (int i = 0; i < 4000000; i++) {
        StringBuilder builder = new StringBuilder();
        for (int j = 0; j < 10; j++) {
            // Use the array length so the last element can be picked too.
            int nextInt = random.nextInt(characters.length);
            builder.append(characters[nextInt]);
        }
        values.add(builder.toString());
    }
    final Map<HashKey, Object> map = new ConcurrentHashMap<HashKey, Object>(1000000);
    long start = System.currentTimeMillis();
    List<Future<Object>> futures = Lists.newArrayList();
    // Split the index range into batches and submit one task per batch.
    for (int j = 3; j < values.size(); j += batch) {
        final int bottom = j;
        final int top = values.size() > j + batch ? (j + batch) : values.size();
        Future<Object> future = threadPool.submit(new Callable<Object>() {
            public Object call() throws Exception {
                for (int i = bottom; i < top; i++) {
                    HashKey key = new HashKey();
                    key.append(values.get(i - 3));
                    key.append(values.get(i - 2));
                    key.append(values.get(i - 1));
                    key.append(values.get(i));
                    map.put(key, new Object());
                }
                return null;
            }
        });
        futures.add(future);
    }
    // Wait for every batch before stopping the clock.
    for (Future<Object> future : futures) {
        future.get();
    }
    System.out.println("once hash init cost:" + (System.currentTimeMillis() - start));
}
Test data (times in ms)
once hash init cost:7832
once hash init cost:3056
once hash init cost:2762
once hash init cost:3482
once hash init cost:3611
once hash init cost:3804
once hash init cost:1185
once hash init cost:1211
once hash init cost:1189
once hash init cost:1146
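As a variation on the multithreaded listing above, the same batching idea can be written so that it is self-contained and shuts the pool down when it is done. This is only a sketch under simplifying assumptions: it uses plain String keys built from two consecutive values instead of HashKey, and `BatchedPutDemo` and `fill` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchedPutDemo {
    // Splits the index range into batches, fills a shared concurrent map, and always stops the pool.
    static Map<String, Object> fill(List<String> values, int batch, int threads) {
        Map<String, Object> map = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int from = 1; from < values.size(); from += batch) {
                final int bottom = from;
                final int top = Math.min(from + batch, values.size());
                futures.add(pool.submit(() -> {
                    for (int i = bottom; i < top; i++) {
                        map.put(values.get(i - 1) + values.get(i), Boolean.TRUE);
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for the batch and surface any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add("v" + i);
        }
        System.out.println(fill(values, 100, 4).size());
    }
}
```

The values list is only read by the worker threads, so sharing it is safe as long as nothing modifies it while the batches run.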