JDK 1.8 Source Notes (2): String

Author: 大吉大利，今晚AC

Article link: https://blog.csdn.net/lalala_HFUT/article/details/98952048

String

Preface

* Strings are constant; their values cannot be changed after they
* are created. String buffers support mutable strings.
* Because String objects are immutable they can be shared.
The most important property of the String class: it is immutable.

* The class {@code String} includes methods for examining
* individual characters of the sequence, for comparing strings, for
* searching strings, for extracting substrings, and for creating a
* copy of a string with all characters translated to uppercase or to
* lowercase.
This lists the kinds of operations the String class provides.

* The Java language provides special support for the string
* concatenation operator ( + ), and for conversion of
* other objects to strings. String concatenation is implemented
* through the {@code StringBuilder}(or {@code StringBuffer})
* class and its {@code append} method.
* String conversions are implemented through the method
* {@code toString}, defined by {@code Object} and
* inherited by all classes in Java.
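The Java language has special support for concatenating strings with +. In effect, the compiler turns + into calls to the append method of StringBuilder (or StringBuffer), and the final result is obtained by calling toString on the builder.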
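To make the translation concrete, here is a minimal sketch of roughly what a JDK 8 javac does with +. The class and method names (ConcatDemo, withPlus, withBuilder) are hypothetical, and the exact bytecode is the compiler's choice; this only shows that the two forms produce equal strings.

```java
final class ConcatDemo {
    // What the programmer writes.
    static String withPlus(String name, int n) {
        return "id=" + name + n;
    }

    // Approximately what the compiler generates for the method above on JDK 8.
    static String withBuilder(String name, int n) {
        return new StringBuilder().append("id=").append(name).append(n).toString();
    }

    public static void main(String[] args) {
        System.out.println(withPlus("x", 7));                            // id=x7
        System.out.println(withPlus("x", 7).equals(withBuilder("x", 7))); // true
        // A null operand is converted to the text "null" instead of throwing.
        System.out.println(withPlus(null, 1));                           // id=null1
    }
}
```

Note that string conversion of null is one of the cases where the "null argument throws NullPointerException" rule does not apply, because append handles null itself.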

* Unless otherwise noted, passing a <tt>null</tt> argument to a constructor
* or method in this class will cause a {@link NullPointerException} to be
* thrown.
Unless stated otherwise, passing null to a constructor or method of String throws a NullPointerException.

* A {@code String} represents a string in the UTF-16 format
* in which supplementary characters are represented by surrogate
* pairs (see the section <a href="Character.html#unicode">Unicode
* Character Representations</a> in the {@code Character} class for
* more information).
* Index values refer to {@code char} code units, so a supplementary
* character uses two positions in a {@code String}.
* The {@code String} class provides methods for dealing with
* Unicode code points (i.e., characters), in addition to those for
* dealing with Unicode code units (i.e., {@code char} values).
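This paragraph is about character encoding. A String stores UTF-16 code units; a supplementary character (a code point above U+FFFF) is stored as a surrogate pair and therefore occupies two char positions. As a result, indexes and length() count char units, not code points.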
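A small sketch of the "two positions" point, using only standard String and Character methods. The class name SurrogateDemo and the choice of U+1F600 as the sample code point are mine.

```java
final class SurrogateDemo {
    // "a" followed by U+1F600, which lies outside the Basic Multilingual Plane
    // and therefore needs a surrogate pair in UTF-16.
    static final String S = "a" + new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        System.out.println(S.length());                              // 3 char code units
        System.out.println(S.codePointCount(0, S.length()));         // 2 code points
        System.out.println(Character.isHighSurrogate(S.charAt(1)));  // true
        System.out.println(Integer.toHexString(S.codePointAt(1)));   // 1f600
    }
}
```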

Superclass & Implemented Interfaces

String implements the java.io.Serializable, Comparable<String>, and CharSequence interfaces.

For java.io.Serializable, see other articles.

For Comparable<T>, see other articles.

CharSequence declares some of the methods commonly used on String, for example:
int length();
char charAt(int index);
Throws IndexOutOfBoundsException.
CharSequence subSequence(int start, int end);
Throws IndexOutOfBoundsException, including when start > end.
public String toString();

Constructors

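public String() {
    this.value = "".value;
}
The no-argument constructor initializes the string to the empty string. Only the reference to the array is assigned, so this is effectively a shallow copy.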

public String(String original) {
    this.value = original.value;
    this.hash = original.hash;
}
This is the only constructor that assigns hash.

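Note the distinction between the following:
String s1 = "你好";
String s2 = new String(s1);
String s3 = s1;
System.out.println(s2 == s1);
System.out.println(s3 == s1);
The output is false and true.
However, the value field of all three strings refers to the same array in memory.

s3 holds the same reference as s1, so it is no surprise that the two share the same value.
s2 is a new object on the heap, but that object merely stores a reference to the same value array. So although the objects s2 and s1 have different addresses, they are backed by the same character data.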

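public String(char value[]) {
    this.value = Arrays.copyOf(value, value.length);
}
There are other ways to initialize a String as well. Note that, to preserve immutability, the constructor copies the array and stores a reference to the copy. Otherwise the caller could break the String's immutability by modifying the original array afterwards.

So my guess is that a String created this way bypasses the constant pool: it is a new object rather than the pooled literal.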
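The guess above can be checked with a short experiment. This is a sketch with hypothetical names (DefensiveCopyDemo, fromChars); it relies only on the char[] constructor, == and intern().

```java
final class DefensiveCopyDemo {
    // Builds a String from a char[] and then mutates the array afterwards.
    static String fromChars() {
        char[] chars = {'a', 'b', 'c'};
        String s = new String(chars);
        chars[0] = 'z';   // does not affect s, because the constructor copied the array
        return s;
    }

    public static void main(String[] args) {
        String s = fromChars();
        System.out.println(s);                     // abc
        System.out.println(s == "abc");            // false: a new object, not the pooled literal
        System.out.println(s.equals("abc"));       // true: same content
        System.out.println(s.intern() == "abc");   // true: intern() returns the pooled instance
    }
}
```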

public String(char value[], int offset, int count) {
    ......
}
This works similarly; the implementation is omitted here.

public String(StringBuffer buffer) {
    synchronized(buffer) {
        this.value = Arrays.copyOf(buffer.getValue(), buffer.length());
    }
}

public String(StringBuilder builder) {
    this.value = Arrays.copyOf(builder.getValue(), builder.length());
}
The StringBuffer version synchronizes on the buffer so that another thread cannot modify it while it is being copied, which avoids dirty reads.

Static Variables & Static Methods

private static final long serialVersionUID = -6849794470754667710L;
At first I did not know what this is for; the comment says it is for interoperability.

Note that String implements the Serializable interface. If the receiver has loaded a class for the object that has a different serialVersionUID than that of the corresponding sender's class, then deserialization will result in an InvalidClassException.
In other words, if the receiver has loaded a class with the same name but a different serialVersionUID, deserialization fails.
https://stackoverflow.com/questions/285793/what-is-a-serialversionuid-and-why-should-i-use-it

private static void checkBounds(byte[] bytes, int offset, int length) {
    if (length < 0)
        throw new StringIndexOutOfBoundsException(length);
    if (offset < 0)
        throw new StringIndexOutOfBoundsException(offset);
    if (offset > bytes.length - length)
        throw new StringIndexOutOfBoundsException(offset + length);
}
A private static method that checks whether the requested range falls outside the byte array.

In addition, String declares a static nested class that compares two strings while ignoring case. It implements the compare method of the Comparator interface, probably because the case-sensitive comparison already occupies the Comparable interface, so the case-insensitive one has to be provided this way.
The check inside is interesting: the characters are first converted to uppercase and then to lowercase. It looks redundant, but it covers scripts whose case conversion does not round-trip; the comment in the JDK's regionMatches cites the Georgian alphabet as the reason for the extra check.
public static final Comparator<String> CASE_INSENSITIVE_ORDER
                                     = new CaseInsensitiveComparator();
private static class CaseInsensitiveComparator
        implements Comparator<String>, java.io.Serializable {
    // use serialVersionUID from JDK 1.2.2 for interoperability
    private static final long serialVersionUID = 8575799808933029326L;

    public int compare(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) {
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) {
                        // No overflow because of numeric promotion
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }

    /** Replaces the de-serialized object. */
    private Object readResolve() { return CASE_INSENSITIVE_ORDER; }
}
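As a usage sketch for the comparator above, the following sorts an array with String.CASE_INSENSITIVE_ORDER and contrasts it with the natural, case-sensitive order. The class name CaseInsensitiveDemo and the helper sorted are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

final class CaseInsensitiveDemo {
    // Returns a sorted copy so the caller's array is left untouched.
    static String[] sorted(Comparator<String> order, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "apple", "Cherry"};
        // Natural order compares char values, so uppercase letters sort first.
        System.out.println(Arrays.toString(sorted(Comparator.<String>naturalOrder(), words)));
        // [Cherry, apple, banana]
        System.out.println(Arrays.toString(sorted(String.CASE_INSENSITIVE_ORDER, words)));
        // [apple, banana, Cherry]
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("HELLO", "hello")); // 0
        System.out.println("HELLO".compareTo("hello"));                              // -32
    }
}
```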

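Member Variables & Member Methods

private final char value[];
The value is used for character storage.
The final modifier on the String class only prevents subclassing; by itself it does not make String immutable. The final modifier on value is what fixes the array reference.
String is essentially a wrapper around a char array. Because the field is final and has no initializer here, it must be assigned in every constructor.

Note, however, that the contents of a final array can still change; only the reference held in value cannot be reassigned. Because value is private and accessible only inside String, immutability holds as long as String itself never modifies the array.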
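To illustrate that final protects only the reference, here is a sketch contrasting String with a hypothetical LeakyText class that keeps the caller's array without copying it. LeakyText and FinalArrayDemo are made up for this example and are not part of the JDK.

```java
final class FinalArrayDemo {
    // Hypothetical holder that, unlike String, stores the caller's array directly.
    static final class LeakyText {
        private final char[] value;

        LeakyText(char[] value) {
            this.value = value;   // no defensive copy: the caller keeps a handle
        }

        @Override
        public String toString() {
            return new String(value);
        }
    }

    // Returns {leaky text, real String} after the source array has been modified.
    static String[] demo() {
        char[] chars = {'a', 'b', 'c'};
        LeakyText leaky = new LeakyText(chars);
        String safe = new String(chars);   // String copies the array
        chars[0] = 'z';                    // legal: final does not freeze the elements
        return new String[] {leaky.toString(), safe};
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println(result[0]);   // zbc
        System.out.println(result[1]);   // abc
    }
}
```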

private int hash; // Default to 0
Cache the hash code for the string
That is, this field caches the string's hash code so it does not have to be recomputed.

public int length() {
    return value.length;
}
This implements the method declared in CharSequence; as expected, it is a thin wrapper around value.

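public char charAt(int index) {
    if ((index < 0) || (index >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return value[index];
}
This also implements a method declared in CharSequence. It checks the index against the array bounds first, which feels a bit like the proxy pattern.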

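public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
Three checks: 1. if both references point to the same object, return true immediately (the hash code is not consulted); 2. the argument must be a String (String is final, so it has no subclasses);
3. the lengths and then every character must match. The characters really are compared one by one, because equal content does not imply the same pooled object: the constant pool can be bypassed.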

public int compareTo(String anotherString) {
    int len1 = value.length;
    int len2 = anotherString.value.length;
    int lim = Math.min(len1, len2);
    char v1[] = value;
    char v2[] = anotherString.value;

    int k = 0;
    while (k < lim) {
        char c1 = v1[k];
        char c2 = v2[k];
        if (c1 != c2) {
            return c1 - c2;
        }
        k++;
    }
    return len1 - len2;
}
The difference from equals is that equals can only test equality, whereas compareTo also reports ordering through a negative, zero, or positive result.
The method does not check for null; passing null throws a NullPointerException directly.
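A few calls make the return values of compareTo concrete: the difference of the first mismatching chars, or the difference in length when one string is a prefix of the other. The class name CompareToDemo and the helper throwsOnNull are hypothetical.

```java
final class CompareToDemo {
    // Returns true if compareTo(null) throws, as the note above says it should.
    static boolean throwsOnNull(String s) {
        try {
            s.compareTo(null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("apple".compareTo("apricot")); // -2, i.e. 'p' - 'r'
        System.out.println("abc".compareTo("abcd"));      // -1, i.e. 3 - 4 (length difference)
        System.out.println("abc".compareTo("abc"));       // 0
        System.out.println(throwsOnNull("abc"));          // true
    }
}
```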

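public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
* Returns a hash code for this string. The hash code for a
* {@code String} object is computed as
* <blockquote><pre>
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* </pre></blockquote>
* using {@code int} arithmetic, where {@code s[i]} is the
* ith character of the string, {@code n} is the length of
* the string, and {@code ^} indicates exponentiation.
* (The hash value of the empty string is zero.)
From this we can see that String computes hashCode differently from Object. Object.hashCode is declared native and is typically derived from the object's identity rather than its content. String's hashCode is computed from the characters, so two strings with equal content always have the same hashCode.
In the formula, ^ means exponentiation, not XOR. The hash of the empty string is 0, and once the value has been computed it is cached in the hash field.

Can't the int overflow when the string is long?
It can, and it does: int arithmetic simply wraps around. But hashCode was never required to be unique. Equal strings must have equal hash codes, while different strings are allowed to collide.
Collisions are therefore possible, but for the set of strings a program actually uses the risk is small.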
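The documented formula is easy to re-implement and check against the real method. The sketch below (class HashDemo and method manualHash are hypothetical names) also shows a well-known collision: "Aa" and "BB" both hash to 2112, since 65*31 + 97 = 66*31 + 66.

```java
final class HashDemo {
    // Evaluates s[0]*31^(n-1) + ... + s[n-1] with Horner's rule, in int arithmetic.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // overflow wraps around silently
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("hello") == "hello".hashCode()); // true
        System.out.println("".hashCode());                             // 0
        System.out.println("Aa".hashCode());                           // 2112
        System.out.println("BB".hashCode());                           // 2112
        System.out.println("Aa".equals("BB"));                         // false
    }
}
```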

public String substring(int beginIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    int subLen = value.length - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    return (beginIndex == 0) ? this : new String(value, beginIndex, subLen);
}

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > value.length) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    int subLen = endIndex - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    return ((beginIndex == 0) && (endIndex == value.length)) ? this
            : new String(value, beginIndex, subLen);
}

These two substring methods are used very frequently. They return a newly created String, except when the requested range covers the whole string, in which case this is returned.
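The "returns this for the whole range" shortcut can be observed with ==. This is a sketch against the JDK 8 source quoted above; the identity result is an implementation detail rather than a documented guarantee, and the names SubstringDemo, wholeRangeReturnsSameObject and throwsWhenOutOfRange are hypothetical.

```java
final class SubstringDemo {
    // True if both whole-range calls hand back the very same object.
    static boolean wholeRangeReturnsSameObject(String s) {
        return s.substring(0) == s && s.substring(0, s.length()) == s;
    }

    // True if a begin index past the end is rejected.
    static boolean throwsWhenOutOfRange(String s) {
        try {
            s.substring(s.length() + 1);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(s.substring(1));                 // ello
        System.out.println(s.substring(1, 3));              // el
        System.out.println(s.substring(s.length()));        // empty string, no exception
        System.out.println(wholeRangeReturnsSameObject(s));
        System.out.println(throwsWhenOutOfRange(s));
    }
}
```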

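public String concat(String str) {
    int otherLen = str.length();
    if (otherLen == 0) {
        return this;
    }
    int len = value.length;
    char buf[] = Arrays.copyOf(value, len + otherLen);
    str.getChars(buf, len);
    return new String(buf, true);
}
The name concat is reminiscent of the cat command on Linux.
The method concatenates strings: it allocates a new array and copies both the original characters and the appended characters into it.
This makes me curious about how a char array actually maps to memory.

As mentioned earlier, the compiler translates string + into append calls on StringBuffer/StringBuilder, not into this concat method; do not confuse the two.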
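One visible consequence of the difference between concat and + is null handling: concat dereferences its argument, while + converts null to the text "null". A minimal sketch, with hypothetical names ConcatMethodDemo and concatThrowsOnNull:

```java
final class ConcatMethodDemo {
    // True if concat(null) throws, which follows from str.length() in the source above.
    static boolean concatThrowsOnNull(String s) {
        try {
            s.concat(null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "abc";
        String nothing = null;
        System.out.println(s.concat("def"));        // abcdef
        System.out.println(s.concat("") == s);      // true: empty argument returns this
        System.out.println(s + nothing);            // abcnull
        System.out.println(concatThrowsOnNull(s));  // true
    }
}
```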

public String replace(char oldChar, char newChar) {
    if (oldChar != newChar) {
        int len = value.length;
        int i = -1;
        char[] val = value; /* avoid getfield opcode */

        while (++i < len) {
            if (val[i] == oldChar) {
                break;
            }
        }
        if (i < len) {
            char buf[] = new char[len];
            for (int j = 0; j < i; j++) {
                buf[j] = val[j];
            }
            while (i < len) {
                char c = val[i];
                buf[i] = (c == oldChar) ? newChar : c;
                i++;
            }
            return new String(buf, true);
        }
    }
    return this;
}
The method could have modified value in place, but it copies into a new array instead. The reason is that other strings may reference the same array, and modifying it would change their values too.

char[] val = value; /* avoid getfield opcode */
We also notice that even for a pure read of value, the code goes through another char[] variable, with the comment "avoid getfield opcode".
getfield is a bytecode instruction that fetches the value of an instance field.
This has nothing to do with immutability; it is a performance consideration. value is an instance field and val is a local variable, so copying the field into a local reduces the number of getfield instructions at the bytecode level.
https://www.cnblogs.com/think-in-java/p/6130917.html

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}
Looking at this method: the delimiter is passed in as a regular expression. If the regex is a single character that is not a regex metacharacter, or two characters where the first is a backslash and the second is not an ASCII digit or letter (and the character is not a surrogate), the split is done inside String itself; this is the fast path. Otherwise Pattern handles the regular expression.

The second parameter is described at length:
* The {@code limit} parameter controls the number of times the
* pattern is applied and therefore affects the length of the resulting
* array. If the limit n is greater than zero then the pattern
* will be applied at most n - 1 times, the array's
* length will be no greater than n, and the array's last entry
* will contain all input beyond the last matched delimiter. If n
* is non-positive then the pattern will be applied as many times as
* possible and the array can have any length. If n is zero then
* the pattern will be applied as many times as possible, the array can
* have any length, and trailing empty strings will be discarded.
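In other words, limit caps the length of the returned array. If limit exceeds the number of pieces the string can be split into, only the actual pieces are returned; with a positive limit, trailing empty strings are kept. If limit is 0, the string is split as many times as possible and trailing empty strings are discarded. If limit is negative, the string is also split as many times as possible, but trailing empty strings are kept.
public String[] split(String regex) {
    return split(regex, 0);
}
The split method without a limit parameter therefore returns a result with trailing empty strings removed. Empty strings in the middle of the result are kept.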
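The three cases of limit are easiest to see side by side. A small sketch follows; the class name SplitLimitDemo, the helper pieces and the sample input are my own choices.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Sample input with an empty piece in the middle and two trailing empty pieces.
    static final String INPUT = "a,b,,c,,";

    // Number of pieces produced for a given limit.
    static int pieces(int limit) {
        return INPUT.split(",", limit).length;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(INPUT.split(",", 2)));  // [a, b,,c,,]
        System.out.println(Arrays.toString(INPUT.split(",", 0)));  // [a, b, , c]
        System.out.println(Arrays.toString(INPUT.split(",", -1))); // [a, b, , c, , ]
        System.out.println(Arrays.toString(INPUT.split(",")));     // [a, b, , c]
    }
}
```

Since "," is a single non-metacharacter, all of these calls take the fast path described above.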

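public int indexOf(String str) {
    return indexOf(str, 0);
}
This method is quite handy: it tells whether a string contains another string, returning the index of the first occurrence or -1 if there is none.

public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}
Regular expression matching; the whole string must match the pattern.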

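public String trim() {
    int len = value.length;
    int st = 0;
    char[] val = value;    /* avoid getfield opcode */

    while ((st < len) && (val[st] <= ' ')) {
        st++;
    }
    while ((st < len) && (val[len - 1] <= ' ')) {
        len--;
    }
    return ((st > 0) || (len < value.length)) ? substring(st, len) : this;
}
Same pattern as before: if anything is trimmed a new string is returned; otherwise the method returns this. Note that every char less than or equal to ' ' (U+0020) is treated as whitespace here, which includes control characters.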
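A short sketch of both behaviours of trim, based on the source quoted above. The names TrimDemo and unchangedIsSameObject are hypothetical.

```java
final class TrimDemo {
    // True if trimming a string with nothing to trim returns the same object.
    static boolean unchangedIsSameObject(String s) {
        return s.trim() == s;
    }

    public static void main(String[] args) {
        System.out.println("[" + "  hi \n".trim() + "]");      // [hi]
        // U+0001 is a control character, not a space, but it is <= ' ' and is trimmed too.
        System.out.println("[" + "\u0001x\t".trim() + "]");    // [x]
        System.out.println(unchangedIsSameObject("hi"));       // true
    }
}
```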

/**
* This object (which is already a string!) is itself returned.
*
* @return the string itself.
*/
public String toString() {
    return this;
}

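Most of the valueOf overloads essentially call toString, call a String constructor, or check the argument and return a literal directly (for example "true", "false", or "null").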

public native String intern();

Returns a canonical representation for the string object.

* When the intern method is invoked, if the pool already contains a
* string equal to this {@code String} object as determined by
* the {@link #equals(Object)} method, then the string from the pool is
* returned. Otherwise, this {@code String} object is added to the
* pool and a reference to this {@code String} object is returned.

* It follows that for any two strings {@code s} and {@code t},
* {@code s.intern() == t.intern()} is {@code true}
* if and only if {@code s.equals(t)} is {@code true}.

* @return a string that has the same contents as this string, but is
* guaranteed to be from a pool of unique strings.

Section 3.10.5 of the Java Language Specification on Oracle's site
gives these two examples:
String hello = "Hello", lo = "lo";
System.out.print((hello == ("Hel"+lo)) + " ");
// false: Strings computed by concatenation at run time are newly created and therefore distinct.
System.out.println(hello == ("Hel"+lo).intern());
// true: The result of explicitly interning a computed string is the same string as any pre-existing literal string with the same contents.
https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1
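The property s.intern() == t.intern() if and only if s.equals(t) can also be checked with two strings built at run time. This is a minimal sketch; InternDemo and build are hypothetical names, and StringBuilder is used only to make sure the strings are not compile-time constants.

```java
final class InternDemo {
    // Builds a new String object at run time, so it is not a compile-time constant.
    static String build(String a, String b) {
        return new StringBuilder(a).append(b).toString();
    }

    public static void main(String[] args) {
        String x = build("Hel", "lo");
        String y = build("Hel", "lo");
        System.out.println(x == y);                    // false: two distinct objects
        System.out.println(x.equals(y));               // true: same content
        System.out.println(x.intern() == y.intern());  // true: one canonical instance
        System.out.println(x.intern() == "Hello");     // true: literals come from the same pool
    }
}
```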