Java String Usage: Performance Optimization in Practice

weixin_34349320

Original link: https://my.oschina.net/u/1469495/blog/3058649

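When writing Java programs, you do not have to allocate and free memory by hand as in C; the JVM manages it all, which improves development efficiency. But if the code is careless about details, it can waste memory and perform poorly. The rest of this article uses strings as the example, because String is the most heavily used data type and because Java strings are immutable: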

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    /** The value is used for character storage. */
    private final char value[];
    ... ...
}

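The benefit of an immutable type is that it is inherently thread-safe in multithreaded code. It also has a cost: operations such as concatenation and substring extraction cannot share the underlying char array, so they produce additional String instances. More instances mean more memory use and more work for the JVM garbage collector. The sections below use JMH benchmarks to compare the performance of various string operations.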
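
To make the point about extra instances concrete, here is a minimal sketch. The class name, the derive helper, and the sample value are made up for illustration; only substring and concat come from the JDK.

```java
final class ImmutabilityDemo {

    // Returns {head, joined}: two Strings derived from `original`.
    static String[] derive(String original) {
        String head = original.substring(0, 5); // a new String holding the first five chars
        String joined = head.concat("!");       // yet another new String
        return new String[] {head, joined};
    }

    public static void main(String[] args) {
        String original = "hello.world";              // sample value, chosen for illustration
        String[] derived = derive(original);
        System.out.println(original);                  // the original is unchanged
        System.out.println(derived[0] != original);    // head is a distinct object
        System.out.println(derived[1] != derived[0]);  // joined is a distinct object too
    }
}
```

Every derived value is a separate object that the garbage collector eventually has to reclaim, which is the overhead the benchmarks below try to measure.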

I. String concatenation

Test code:

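import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.Throughput)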
@Warmup(iterations = 3)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Threads(8)
@Fork(2)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class StringBuilderBenchmark {
	
    @Benchmark
    public void testStringAdd() {
        String a = "";
        for (int i = 0; i < 10; i++) {
            a += i;
        }
        print(a);
    }
	
    @Benchmark
    public void testStringBuilderAdd() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append(i);
        }
        print(sb.toString());
    }
	
    private void print(String a) {
    }
	
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(StringBuilderBenchmark.class.getSimpleName())
                .output("./StringBuilderBenchmark.log")
                .build();
        new Runner(options).run();
    }
}

Test results:

Benchmark                                     Mode  Cnt      Score      Error   Units
StringBuilderBenchmark.testStringAdd         thrpt   20  22163.429 ±  537.729  ops/ms
StringBuilderBenchmark.testStringBuilderAdd  thrpt   20  43400.877 ± 2447.492  ops/ms

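The results show that StringBuilder does outperform direct string concatenation here, with roughly twice the throughput (about 43,401 vs. 22,163 ops/ms).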
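
One way to see where the difference comes from is to write out what the += loop does. On JDK 8 and earlier, javac compiles a += i into roughly the expandedByHand form below; since JDK 9 it emits an invokedynamic call instead, but each iteration still produces a new String. The class and method names are hypothetical, and this is a sketch of the idea rather than the exact bytecode.

```java
final class ConcatExpansion {

    // The loop from testStringAdd.
    static String withPlus() {
        String a = "";
        for (int i = 0; i < 10; i++) {
            a += i;
        }
        return a;
    }

    // Approximately what the += above turns into on older compilers:
    // a fresh StringBuilder and a fresh String on every iteration.
    static String expandedByHand() {
        String a = "";
        for (int i = 0; i < 10; i++) {
            a = new StringBuilder().append(a).append(i).toString();
        }
        return a;
    }

    // The loop from testStringBuilderAdd: one builder, one final String.
    static String withOneBuilder() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append(i);
        }
        return sb.toString();
    }
}
```

All three produce the same text; they differ only in how many temporary objects are created along the way.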

II. Splitting strings

Test code:

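import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.Throughput)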
@Warmup(iterations = 3)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Threads(8)
@Fork(2)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class StringSplitBenchmark {
	
    private static final String regex = "\\.";
	
    private static final char CHAR = '.';
    
    private static final Pattern pattern = Pattern.compile(regex);
	
    private String[] strings;
	
    @Setup
    public void prepare() {
        strings = new String[20];
        for(int i=0;i<strings.length;i++) {
            strings[i] = System.currentTimeMillis() + ".aaa.bbb.ccc.ddd" + Math.random();
        }
    }
	
    @Benchmark
    public void testStringSplit() {
        for(int i=0;i<strings.length;i++) {
            strings[i].split(regex);
        }
    }
	
    @Benchmark
    public void testPatternSplit() {
        for(int i=0;i<strings.length;i++) {
            pattern.split(strings[i]);
        }
    }
	
    @Benchmark
    public void testCharSplit() {
        for(int i=0;i<strings.length;i++) {
            split(strings[i], CHAR, 6);
        }
	
    }
	
    public static List<String> split(final String str, final char separatorChar, int expectParts) {
        if (null == str) {
            return null;
        }
        final int len = str.length();
        if (len == 0) {
            return Collections.emptyList();
        }
        final List<String> list = new ArrayList<String>(expectParts);
        int i = 0;
        int start = 0;
        boolean match = false;
        while (i < len) {
            if (str.charAt(i) == separatorChar) {
                if (match) {
                    list.add(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            match = true;
            i++;
        }
        if (match) {
            list.add(str.substring(start, i));
        }
        return list;
    }
	
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(StringSplitBenchmark.class.getSimpleName())
                .output("./StringSplitBenchmark.log")
                .build();
        new Runner(options).run();
    }
}

Test results:

Benchmark                               Mode  Cnt    Score     Error   Units
StringSplitBenchmark.testCharSplit     thrpt   20  872.048 ±  63.872  ops/ms
StringSplitBenchmark.testPatternSplit  thrpt   20  534.371 ±  28.275  ops/ms
StringSplitBenchmark.testStringSplit   thrpt   20  814.661 ± 115.653  ops/ms

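The results show that testCharSplit and testStringSplit perform about the same, which is not what we expected. String.split takes a regular expression, and a precompiled regular expression is normally faster than one compiled on every call, yet here the precompiled Pattern (testPatternSplit) is the slowest of the three. To find out why, let's look at the implementation of String.split: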

    public String[] split(String regex) {
        return split(regex, 0);
    }
    public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if ((
           (regex.value.length == 1 && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
           (regex.length() == 2 && regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

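It turns out String.split is already optimized: it does not go through the regex engine in every case. For a single character that is not a regex metacharacter, or for a backslash followed by a non-alphanumeric character such as "\\.", it takes a fast path built on indexOf and substring. That explains why testCharSplit and testStringSplit perform about the same.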
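
To check which separators qualify, the fast-path condition can be lifted out of the quoted source into a small predicate. This is a sketch that mirrors the JDK 8 condition shown above; the class and method names are made up, and other JDK versions may use a different condition.

```java
final class SplitFastPath {

    // Mirrors the condition at the top of String.split(String, int) as quoted above.
    static boolean usesFastPath(String regex) {
        char ch = 0;
        return ((regex.length() == 1
                    && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1)
                || (regex.length() == 2 && regex.charAt(0) == '\\'
                    && (((ch = regex.charAt(1)) - '0') | ('9' - ch)) < 0   // not a digit
                    && ((ch - 'a') | ('z' - ch)) < 0                       // not a lowercase letter
                    && ((ch - 'A') | ('Z' - ch)) < 0))                     // not an uppercase letter
               && (ch < Character.MIN_HIGH_SURROGATE
                    || ch > Character.MAX_LOW_SURROGATE);                  // not a surrogate
    }

    public static void main(String[] args) {
        System.out.println(usesFastPath("\\.")); // escaped dot, the separator used in the benchmark
        System.out.println(usesFastPath("."));   // bare dot is a metacharacter
        System.out.println(usesFastPath(","));   // plain single character
        System.out.println(usesFastPath("\\d")); // character class, needs the regex engine
    }
}
```

The benchmark's separator "\\." satisfies the second branch, so strings[i].split(regex) never compiles a Pattern at all.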

III. String replacement

Test code:

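import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.Throughput)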
@Warmup(iterations = 3)
@Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Threads(8)
@Fork(2)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class StringReplaceAllBenchmark {
	
    private static final String EMPTY = "";
	
    private static final String regex = "\\.";
	
    private static final String CHAR = ".";
    private static final Pattern pattern = Pattern.compile(regex);
	
    private String[] strings;
	
    @Setup
    public void prepare() {
        strings = new String[20];
        for (int i = 0; i < strings.length; i++) {
            strings[i] = System.currentTimeMillis() + ".aaa.bbb.ccc.ddd." + Math.random();
        }
    }
	
    @Benchmark
    public void testStringReplaceAll() {
        for (int i = 0; i < strings.length; i++) {
            strings[i].replaceAll(regex, EMPTY);
        }
    }
	
    @Benchmark
    public void testPatternReplaceAll() {
        for (int i = 0; i < strings.length; i++) {
            pattern.matcher(strings[i]).replaceAll(EMPTY);
        }
    }
	
    @Benchmark
    public void testCustomReplaceAll() {
        for (int i = 0; i < strings.length; i++) {
            replaceAll(strings[i], CHAR, EMPTY);
        }
	
    }
	
	
    public static String replaceAll(final String str, final String remove, final String replacement) {
        if (null == str) {
            return null;
        }
        final int len = str.length();
        if (len == 0) {
            return str;
        }
        final StringBuilder res = new StringBuilder(len);
        int offset = 0;
        int index;
        while (true) {
            index = str.indexOf(remove, offset);
            if (index == -1) {
                break;
            }
            res.append(str, offset, index);
            if(null != replacement && replacement.length() >0) {
                res.append(replacement);
            }
            offset = index + remove.length();
        }
        if(offset < len) {
            res.append(str, offset, len);
        }
        return res.toString();
    }
	
    public static void main(String[] args) throws RunnerException {
        String str = System.currentTimeMillis() + ".aaa.bbb.ccc.ddd." + Math.random();
        String str1 = str.replaceAll(regex, EMPTY);
        String str2 = pattern.matcher(str).replaceAll(EMPTY);
        String str3 = replaceAll(str, CHAR, EMPTY);
	
        System.out.println(str1);
        System.out.println(str2);
        System.out.println(str3);
        Options options = new OptionsBuilder()
                .include(StringReplaceAllBenchmark.class.getSimpleName())
                .output("./StringReplaceAllBenchmark.log")
                .build();
        new Runner(options).run();
    }
}

Test results:

Benchmark                                         Mode  Cnt     Score    Error   Units
StringReplaceAllBenchmark.testCustomReplaceAll   thrpt   20  1167.891 ± 39.699  ops/ms
StringReplaceAllBenchmark.testPatternReplaceAll  thrpt   20   438.079 ±  1.859  ops/ms
StringReplaceAllBenchmark.testStringReplaceAll   thrpt   20   353.060 ± 11.177  ops/ms

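testPatternReplaceAll and testStringReplaceAll both replace through the regex engine, so their performance is similar (the precompiled Pattern is somewhat faster because it skips recompiling the pattern on each call), while the hand-written replaceAll is about three times faster. Regular expressions are very convenient for complex cases, but from a performance standpoint they are best avoided when a simpler approach will do.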
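
When the target is a literal rather than a pattern, the JDK also offers String.replace(CharSequence, CharSequence), which needs no regex escaping. The sketch below only shows that it gives the same result as the regex call for this case; its speed relative to the hand-written loop depends on the JDK version and was not measured in this article. The class name and sample input are made up.

```java
final class LiteralReplace {

    // Regex-based: the dot must be escaped because "." is a metacharacter.
    static String stripDotsWithRegex(String s) {
        return s.replaceAll("\\.", "");
    }

    // Literal-based: the target is taken as-is, no escaping needed.
    static String stripDotsLiteral(String s) {
        return s.replace(".", "");
    }

    public static void main(String[] args) {
        String sample = "1557.aaa.bbb.0.42"; // illustrative input
        System.out.println(stripDotsWithRegex(sample));
        System.out.println(stripDotsLiteral(sample));
    }
}
```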

IV. Optimization in practice: a data-masking utility class

The code below is the version before optimization:

import org.apache.commons.lang3.StringUtils; // assumes Apache Commons Lang 3

public class DesensitizeUtils {
	
    /**
     * Masks a value according to its length.
     * @param value the value to mask
     * @return the masked value
     */
    public static String desensitizeByLengthOld(String value) {
        if (value.length() == 2) {
            value = value.substring(0, 1) + "*";
        } else if (value.length() == 3) {
            value = value.substring(0, 1) + "*" + value.substring(value.length() - 1);
        } else if (value.length() > 3 && value.length() <= 5) {
            value = value.substring(0, 1) + "**" + value.substring(value.length() - 2);
        } else if (value.length() > 5 && value.length() <= 7) {
            value = value.substring(0, 2) + "***" + value.substring(value.length() - 2);
        } else if (value.length() > 7) {
            String str = "";
            for (int i = 0; i < value.length() - 6; i++) {
                str += "*";
            }
            value = value.substring(0, 3) + str + value.substring(value.length() - 3);
        }
        return value;
    }
	
	
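    /**
     * Masking rules for Chinese names:
     * 0. One character or fewer: return as-is
     * 1. Two characters: hide the family name
     * 2. Three or more characters: keep only the first and last, replace the rest with asterisks
     * @param fullName the full name to mask
     * @return the masked name
     */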
    public static String desensitizeChineseNameOld(final String fullName) {
        if (StringUtils.isBlank(fullName)) {
            return "";
        }
        if (fullName.length() <= 1) {
            return fullName;
        } else if (fullName.length() == 2) {
            final String name = StringUtils.right(fullName, 1);
            return StringUtils.leftPad(name, StringUtils.length(fullName), "*");
        } else {
            return StringUtils.left(fullName, 1).concat(StringUtils.removeStart(StringUtils.leftPad(StringUtils.right(fullName, 1), StringUtils.length(fullName), "*"), "*"));
        }
    }
	
}

Next, let's optimize the code above.

1. Prefer constants, but keep their number small

1). Where the code above uses "*", "**", and "***", use a single '*' char constant instead.

public class DesensitizeUtils {
	private static final char DESENSITIZE_CODE = '*';
}

2). Likewise, replace the return ""; in desensitizeChineseNameOld with return StringUtils.EMPTY;, a class constant of StringUtils.

if (StringUtils.isBlank(fullName)) {
   return StringUtils.EMPTY;
}

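Using constants avoids repeatedly creating strings under high concurrency and improves overall performance. Note that string literals are already interned by the JVM, so a literal such as "" is not re-allocated on each call; the practical gain comes from replacing concatenation of string literals with append(char) calls on a single StringBuilder, and from giving the value one named definition.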
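
It helps to be precise about what a constant does and does not save. Per the Java Language Specification, identical string literals refer to the same interned instance, whereas new String(...) always allocates. A minimal sketch, with made-up class and method names:

```java
final class LiteralInterning {

    static String star() {
        return "*";             // literal: resolved from the constant pool
    }

    static String starAgain() {
        return "*";             // same literal, same interned instance
    }

    static String freshCopy() {
        return new String("*"); // explicit allocation: a distinct object each call
    }

    public static void main(String[] args) {
        System.out.println(star() == starAgain());          // same instance
        System.out.println(star() == freshCopy());          // different instances
        System.out.println(star() == freshCopy().intern()); // intern() maps back to the pooled one
    }
}
```

So a named constant mainly buys readability and a single definition; the allocations to watch for are the ones made by concatenation and substring at run time.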

2. Use local variables to reduce method calls

Hoist the length lookup so it is not repeated:

if (value.length() == 2) { 
	
} else if (value.length() == 3) {
  
} else if (value.length() > 3 && value.length() <= 5) {
   
} else if (value.length() > 5 && value.length() <= 7) {
   
} else if (value.length() > 7) {
   
}

After optimization:

int length = value.length(); 
if (length == 2) {
           
} else if (length == 3) {
   
} else if (length > 3 && length <= 5) {
   
} else if (length > 5 && length <= 7) {
    
} else if (length > 7) { 
  
}

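The optimized code is more concise. String.length() itself is cheap, but if the repeated call were an expensive operation, its cost would be multiplied by the number of calls.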

3. Pay close attention to third-party libraries

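To reuse code and save cost, we all rely on libraries written by others to some degree. Before using one, though, it is worth understanding how it works and choosing the approach that fits your situation, to avoid pitfalls.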

1). The substring method

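String.substring is a convenient way to extract part of a string, but because strings are immutable it returns a new string on every call. The line below therefore creates several String instances: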

value = value.substring(0, 2) + "***" + value.substring(length - 2);

This can be optimized with StringBuilder's append(CharSequence s, int start, int end) method:

public AbstractStringBuilder append(CharSequence s, int start, int end) {
    if (s == null)
        s = "null";
    if ((start < 0) || (start > end) || (end > s.length()))
        throw new IndexOutOfBoundsException(
            "start " + start + ", end " + end + ", s.length() "
            + s.length());
    int len = end - start;
    ensureCapacityInternal(count + len);
    for (int i = start, j = count; i < end; i++, j++)
        value[j] = s.charAt(i);
    count += len;
    return this;
}

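This method copies the characters with a for loop, which is still not ideal. It would be better if the JDK optimized it further, for example with a String-specific overload like this: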

public AbstractStringBuilder append(String str, int start, int end) {
    if (str == null)
        str = "null";
    if ((start < 0) || (start > end) || (end > str.length()))
        throw new IndexOutOfBoundsException(
            "start " + start + ", end " + end + ", str.length() "
            + str.length());
    int len = end - start;
    ensureCapacityInternal(count + len);
    str.getChars(start, end, value, count); // replaces the for loop above
    count += len;
    return this;
}

After optimization:

StringBuilder str = new StringBuilder(length);
str.append(value, 0, 2).append(DESENSITIZE_CODE).append(DESENSITIZE_CODE).append(DESENSITIZE_CODE).append(value, length - 2, length);
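
As a quick sanity check, the two forms can be compared side by side. The sketch below wraps each in a method (the class name, method names, and sample inputs are made up) and assumes an input of length 6 or 7, the branch this line comes from.

```java
final class AppendRangeDemo {

    private static final char DESENSITIZE_CODE = '*';

    // Concatenation version: two substring() results plus the final String.
    static String withSubstring(String value) {
        int length = value.length();
        return value.substring(0, 2) + "***" + value.substring(length - 2);
    }

    // Builder version: characters are copied straight from `value` into one buffer.
    static String withAppendRange(String value) {
        int length = value.length();
        StringBuilder str = new StringBuilder(7); // 2 kept + 3 masked + 2 kept
        str.append(value, 0, 2)
           .append(DESENSITIZE_CODE).append(DESENSITIZE_CODE).append(DESENSITIZE_CODE)
           .append(value, length - 2, length);
        return str.toString();
    }

    public static void main(String[] args) {
        System.out.println(withSubstring("abcdefg"));
        System.out.println(withAppendRange("abcdefg"));
    }
}
```

Both return the same masked text; the builder version skips the two intermediate substring objects.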

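2). The leftPad method used in the code above is another example. It delegates to another leftPad overload and also uses substring and concat, which create extra String instances, so it is not recommended here: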

public static String leftPad(final String str, final int size, String padStr) {
        if (str == null) {
            return null;
        }
        if (isEmpty(padStr)) {
            padStr = SPACE;
        }
        final int padLen = padStr.length();
        final int strLen = str.length();
        final int pads = size - strLen;
        if (pads <= 0) {
            return str; // returns original String when possible
        }
        if (padLen == 1 && pads <= PAD_LIMIT) {
            return leftPad(str, size, padStr.charAt(0));
        }

        if (pads == padLen) {
            return padStr.concat(str);
        } else if (pads < padLen) {
            return padStr.substring(0, pads).concat(str);
        } else {
            final char[] padding = new char[pads];
            final char[] padChars = padStr.toCharArray();
            for (int i = 0; i < pads; i++) {
                padding[i] = padChars[i % padLen];
            }
            return new String(padding).concat(str);
        }
    }
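
For a single pad character, the same result can be built in one buffer. The helper below is a hypothetical replacement sketched for this article, not part of Commons Lang, and it only covers the char-padding case.

```java
final class PadSketch {

    // Left-pads `str` with `padChar` up to `size` characters using one StringBuilder.
    static String leftPad(String str, int size, char padChar) {
        if (str == null) {
            return null;
        }
        int pads = size - str.length();
        if (pads <= 0) {
            return str;                         // nothing to add, return the original
        }
        StringBuilder sb = new StringBuilder(size);
        for (int i = 0; i < pads; i++) {
            sb.append(padChar);                 // padding goes first
        }
        return sb.append(str).toString();       // then the original text
    }

    public static void main(String[] args) {
        System.out.println(leftPad("x", 2, '*'));
        System.out.println(leftPad("abc", 2, '*'));
    }
}
```

Only the final toString() allocates a String here, whereas the library path builds the padding as one String and then concatenates it with the input.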

4. Using StringBuilder

1). As the benchmark above shows, prefer StringBuilder over "+" for string concatenation; this is not repeated here.

2). Set the StringBuilder capacity whenever possible

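When the final length is known in advance, pass it to the StringBuilder constructor. If the string is shorter than the default capacity, this allocates less memory; if it is longer, this avoids the cost of growing the internal char array.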

3). StringBuilder has many append overloads, and it pays to know what each one is for, such as using append(CharSequence s, int start, int end) in place of substring, as described above.
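
The capacity behavior from point 2) can be observed through StringBuilder.capacity(). A minimal sketch, with made-up class and method names; the default of 16 is the value documented for the no-argument constructor.

```java
final class CapacityDemo {

    // Capacity of a builder created without arguments.
    static int defaultCapacity() {
        return new StringBuilder().capacity();
    }

    // Capacity of a builder created with an explicit size.
    static int presized(int n) {
        return new StringBuilder(n).capacity();
    }

    // Capacity after appending one character more than the default holds:
    // the internal array has been reallocated and its contents copied.
    static int afterOverflow() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 17; i++) {
            sb.append('*');
        }
        return sb.capacity();
    }

    public static void main(String[] args) {
        System.out.println(defaultCapacity());
        System.out.println(presized(40));
        System.out.println(afterOverflow());
    }
}
```

Passing the expected length up front means the buffer is sized once; otherwise each overflow triggers a larger allocation and a copy.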

5. The optimized code:


import org.apache.commons.lang3.StringUtils; // assumes Apache Commons Lang 3

public class DesensitizeUtils {

    private static final char DESENSITIZE_CODE = '*';

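    /**
     * Masks a value according to its length.
     *
     * @param value the value to mask
     * @return the masked value, or an empty string for blank input; same length as
     *         the input, except that lengths 4 and 6 come out one character longer
     */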
    public static String desensitizeByLength(String value) {
        if (StringUtils.isBlank(value)) {
            return StringUtils.EMPTY;
        }
        int length = value.length();
        if (length == 1) {
            return value;
        }
        StringBuilder str = new StringBuilder(length);
        switch (length) {
            case 2:
                str.append(value, 0, 1).append(DESENSITIZE_CODE);
                break;
            case 3:
                str.append(value, 0, 1).append(DESENSITIZE_CODE).append(value, length - 1, length);
                break;
            case 4:
            case 5:
                str.append(value, 0, 1).append(DESENSITIZE_CODE).append(DESENSITIZE_CODE).append(value, length - 2, length);
                break;
            case 6:
            case 7:
                str.append(value, 0, 2).append(DESENSITIZE_CODE).append(DESENSITIZE_CODE).append(DESENSITIZE_CODE).append(value, length - 2, length);
                break;
            default:
                str.append(value, 0, 3);
                for (int i = 0; i < length - 6; i++) {
                    str.append(DESENSITIZE_CODE);
                }
                str.append(value, length - 3, length);
                break;
        }
        return str.toString();
    }


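    /**
     * Masking rules for Chinese names:
     * 0. One character or fewer: return as-is
     * 1. Two characters: hide the family name
     * 2. Three or more characters: keep only the first and last, replace the rest with asterisks
     *
     * @param fullName the full name to mask
     * @return the masked name
     */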
    public static String desensitizeChineseName(final String fullName) {
        if (StringUtils.isBlank(fullName)) {
            return StringUtils.EMPTY;
        }
        int length = fullName.length();
        switch (length) {
            case 1:
                return fullName;
            case 2:
                return new StringBuilder(2).append(DESENSITIZE_CODE).append(fullName, length - 1, length).toString();
            default:
                StringBuilder str = new StringBuilder(length);
                str.append(fullName, 0, 1);
                for (int i = 0; i < length - 2; i++) {
                    str.append(DESENSITIZE_CODE);
                }
                str.append(fullName, length - 1, length);
                return str.toString();
        }
    }
}

6. Performance comparison:

Test code:

    private static final String testString = "akkadmmajkkakkajjk";

    @Benchmark
    public void testDesensitizeByLengthOld() {
        desensitizeByLengthOld(testString);
    }

    @Benchmark
    public void testDesensitizeChineseNameOld() {
        desensitizeChineseNameOld(testString);
    }

    @Benchmark
    public void testDesensitizeByLength() {
        desensitizeByLength(testString);
    }

    @Benchmark
    public void testDesensitizeChineseName() {
        desensitizeChineseName(testString);
    }


    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(DesensitizeUtilsBenchmark.class.getSimpleName())
                .output("./DesensitizeUtilsBenchmark.log")
                .build();
        new Runner(options).run();
    }

Test results:

Benchmark                                                 Mode  Cnt       Score      Error   Units
DesensitizeUtilsBenchmark.testDesensitizeByLength        thrpt   20   61460.601 ± 7262.830  ops/ms
DesensitizeUtilsBenchmark.testDesensitizeByLengthOld     thrpt   20   11700.417 ± 1402.169  ops/ms
DesensitizeUtilsBenchmark.testDesensitizeChineseName     thrpt   20  117560.449 ±  731.851  ops/ms
DesensitizeUtilsBenchmark.testDesensitizeChineseNameOld  thrpt   20   39682.513 ±  463.306  ops/ms

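The test cases above are few and do not cover every situation, and these benchmark runs do not show how the optimization affects GC (JMH's GC profiler, enabled with -prof gc, was not used). They are offered only as ideas for reference.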

Reposted from: https://my.oschina.net/u/1469495/blog/3058649