String
Preface
* Strings are constant; their values cannot be changed after they
* are created. String buffers support mutable strings.
* Because String objects are immutable they can be shared.
The most important property of the String class is that it is immutable.
* The class {@code String} includes methods for examining
* individual characters of the sequence, for comparing strings, for
* searching strings, for extracting substrings, and for creating a
* copy of a string with all characters translated to uppercase or to
* lowercase.
These are the kinds of operations the String class provides.
* The Java language provides special support for the string
* concatenation operator ( + ), and for conversion of
* other objects to strings. String concatenation is implemented
* through the {@code StringBuilder}(or {@code StringBuffer})
* class and its {@code append} method.
* String conversions are implemented through the method
* {@code toString}, defined by {@code Object} and
* inherited by all classes in Java.
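The Java language has special support for string concatenation with +. In essence, the + operation is turned into calls to the append method of StringBuilder/StringBuffer, and toString is then called on the result to produce the final string.
* <p> Unless otherwise noted, passing a <tt>null</tt> argument to a constructor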
* or method in this class will cause a {@link NullPointerException} to be
* thrown.
Unless stated otherwise, passing null as an argument to a String constructor or method throws a NullPointerException.
* <p>A {@code String} represents a string in the UTF-16 format
* in which <em>supplementary characters</em> are represented by <em>surrogate
* pairs</em> (see the section <a href="Character.html#unicode">Unicode
* Character Representations</a> in the {@code Character} class for
* more information).
* Index values refer to {@code char} code units, so a supplementary
* character uses two positions in a {@code String}.
* <p>The {@code String} class provides methods for dealing with
* Unicode code points (i.e., characters), in addition to those for
* dealing with Unicode code units (i.e., {@code char} values).
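This passage describes the character encoding. I did not follow all of it, but the key point is that a String is a sequence of UTF-16 char code units and indexes count those units, so a supplementary character (stored as a surrogate pair) occupies two positions.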
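To make the "two positions" point concrete, here is a small sketch. The class name SurrogatePairDemo and the choice of U+1F600 are my own, not from the JDK source. It compares the number of char code units with the number of code points.

```java
class SurrogatePairDemo {
    // U+1F600 lies outside the Basic Multilingual Plane,
    // so UTF-16 has to encode it as a surrogate pair (two char values).
    static final String GRIN = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        String s = "a" + GRIN + "b";
        System.out.println(s.length());                             // 4 code units
        System.out.println(s.codePointCount(0, s.length()));        // 3 code points
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f600
    }
}
```

So charAt(1) alone returns only half of the character, which is why the code point methods exist.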
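Superclass & Implemented Interfaces
String implements the java.io.Serializable, Comparable<String>, and CharSequence interfaces.
java.io.Serializable: see the separate article.
Comparable<T>: see the separate article.
CharSequence declares several methods that are commonly used on a String, for example:
int length();
char charAt(int index);
Throws IndexOutOfBoundsException.
CharSequence subSequence(int start, int end);
Throws IndexOutOfBoundsException, including when start > end.
public String toString();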
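Constructors
public String() {
    this.value = "".value;
}
As you can see, the no-argument constructor initializes the string to the empty string. Only the reference to the array is assigned, so this is effectively a shallow copy.
public String(String original) {
    this.value = original.value;
    this.hash = original.hash;
}
This is the only constructor that assigns hash. Note the following distinction: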
String s1 = "你好";
String s2 = new String(s1);
String s3 = s1;
System.out.println(s2 == s1);
System.out.println(s3 == s1);
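The output is false and true.
However, the value fields of all three strings refer to the same memory. s3 holds the same reference as s1, so it is no surprise that those two share a value.
s2 does occupy a new address, but that new heap object merely records the address of the same value array. The objects s2 and s1 live at different addresses, yet underneath they refer to the same character data.
public String(char value[]) {
    this.value = Arrays.copyOf(value, value.length);
}
There are other ways to initialize a String as well. Note that, to preserve immutability, this constructor copies the array and keeps a reference to the copy. Otherwise the caller could break the String's immutability by modifying the array afterwards. My guess is that a String built this way bypasses the constant pool: it is a new object rather than the pooled literal.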
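The defensive copy can be observed from the outside. The sketch below (DefensiveCopyDemo and build are names I made up for illustration) mutates the source array after constructing the String and also checks the guess about the constant pool.

```java
class DefensiveCopyDemo {
    // Builds a String from a char array; the constructor copies the contents.
    static String build(char[] chars) {
        return new String(chars);
    }

    public static void main(String[] args) {
        char[] chars = {'a', 'b', 'c'};
        String s = build(chars);
        chars[0] = 'X';                          // mutate the caller's array afterwards
        System.out.println(s);                   // abc (the String kept its own copy)
        System.out.println(new String(chars));   // Xbc
        System.out.println(s == "abc");          // false: a new object, not the pooled literal
        System.out.println(s.intern() == "abc"); // true: intern() returns the pooled instance
    }
}
```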
public String(char value[], int offset, int count) {
    ......
}
Similar in function; the implementation is omitted here.
public String(StringBuffer buffer) {
    synchronized(buffer) {
        this.value = Arrays.copyOf(buffer.getValue(), buffer.length());
    }
}
public String(StringBuilder builder) {
    this.value = Arrays.copyOf(builder.getValue(), builder.length());
}
The StringBuffer version has to synchronize on the buffer to avoid a dirty read while another thread is modifying it.
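Static Fields & Static Methods
private static final long serialVersionUID = -6849794470754667710L;
At first I did not know what this is for; the source comment says it exists for interoperability. Note that String implements the Serializable interface, whose documentation says: If the receiver has loaded a class for the object that has a different serialVersionUID than that of the corresponding sender's class, then deserialization will result in an InvalidClassException.
In other words, if the receiver has loaded a class of the same name whose serialVersionUID differs from this one, deserialization fails with an error.
https://stackoverflow.com/questions/285793/what-is-a-serialversionuid-and-why-should-i-use-it
private static void checkBounds(byte[] bytes, int offset, int length) {
    if (length < 0)
        throw new StringIndexOutOfBoundsException(length);
    if (offset < 0)
        throw new StringIndexOutOfBoundsException(offset);
    if (offset > bytes.length - length)
        throw new StringIndexOutOfBoundsException(offset + length);
}
A private static method that checks whether an offset and length stay within the bounds of a byte array.
String also declares a static nested class for comparing two strings while ignoring case. The class implements the compare method of the Comparator interface, probably because the case-sensitive comparison already occupies the Comparable interface, so the case-insensitive one had to be provided this way.
The checks inside are interesting too: both characters are first converted to uppercase and then to lowercase. That looks redundant, but it presumably covers characters in other scripts whose case mappings do not line up after a single conversion.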
public static final Comparator<String> CASE_INSENSITIVE_ORDER
                                     = new CaseInsensitiveComparator();
private static class CaseInsensitiveComparator
        implements Comparator<String>, java.io.Serializable {
    // use serialVersionUID from JDK 1.2.2 for interoperability
    private static final long serialVersionUID = 8575799808933029326L;

    public int compare(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) {
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) {
                        // No overflow because of numeric promotion
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }

    /** Replaces the de-serialized object. */
    private Object readResolve() { return CASE_INSENSITIVE_ORDER; }
}
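A short usage sketch of this comparator follows. CaseInsensitiveDemo and sortedIgnoringCase are hypothetical names of mine. The last line uses U+0130 (capital I with dot above) as one example, as far as I can tell, of a character pair that only matches at the lowercase step.

```java
import java.util.Arrays;

class CaseInsensitiveDemo {
    // Returns a sorted copy, ordering the words without regard to case.
    static String[] sortedIgnoringCase(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("Hello", "hELLO")); // 0

        String[] words = {"banana", "apple", "Cherry"};
        String[] natural = words.clone();
        Arrays.sort(natural);
        System.out.println(Arrays.toString(natural));                   // [Cherry, apple, banana]
        System.out.println(Arrays.toString(sortedIgnoringCase(words))); // [apple, banana, Cherry]

        // toUpperCase leaves U+0130 unchanged while 'i' becomes 'I', so the
        // uppercase step still sees a difference; after toLowerCase both are 'i'.
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("\u0130", "i"));
    }
}
```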
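Instance Fields & Instance Methods
private final char value[];
The value is used for character storage.
The final modifier on the String class is not what makes a String immutable (it only prevents subclassing); the final on value is what does the real work.
String is essentially a wrapper around a char array. The array field is final and is not assigned at its declaration, so it must be assigned in every constructor. Note, however, that final does not make the contents of the array unchangeable; it only fixes which array value refers to. Because value is also private, it can only be accessed inside String, so as long as String itself never modifies value, immutability is preserved.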
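The difference between a final reference and immutable contents is easy to demonstrate. In the sketch below, Leaky is a hypothetical class of my own that deliberately skips both protections String relies on (private access and a defensive copy), so its final array can still be changed.

```java
class FinalArrayDemo {
    // Hypothetical holder: the field is final, but neither private nor copied.
    static final class Leaky {
        final char[] value;

        Leaky(char[] value) {
            this.value = value; // keeps the caller's array, no defensive copy
        }

        @Override
        public String toString() {
            return new String(value);
        }
    }

    public static void main(String[] args) {
        char[] chars = {'a', 'b'};
        Leaky leaky = new Leaky(chars);
        chars[0] = 'X';             // change through the caller's reference
        leaky.value[1] = 'Y';       // change through the exposed field
        System.out.println(leaky);  // XY
        // leaky.value = new char[2]; // would not compile: value is final
    }
}
```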
private int hash; // Default to 0
Cache the hash code for the string
public int length() {
    return value.length;
}
This implements the CharSequence method; as expected, it is a thin wrapper around value.
public char charAt(int index) {
    if ((index < 0) || (index >= value.length)) {
        throw new StringIndexOutOfBoundsException(index);
    }
    return value[index];
}
This also implements a CharSequence method. It checks the index against the array bounds first, which feels a little like the proxy pattern.
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
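Three checks: 1. whether both references point to the same object (this == anObject); 2. whether the argument is a String or a subclass of it (String is final, so there are no subclasses);
3. whether the contents are exactly the same. The lengths must match and the characters are then compared one by one, so two distinct String objects outside the constant pool can still be equal.
public int compareTo(String anotherString) {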
    int len1 = value.length;
    int len2 = anotherString.value.length;
    int lim = Math.min(len1, len2);
    char v1[] = value;
    char v2[] = anotherString.value;

    int k = 0;
    while (k < lim) {
        char c1 = v1[k];
        char c2 = v2[k];
        if (c1 != c2) {
            return c1 - c2;
        }
        k++;
    }
    return len1 - len2;
}
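The difference from equals is that equals can only test equality, whereas compareTo also reports ordering through a negative, zero, or positive result.
The method performs no null check, so passing null throws a NullPointerException straight away.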
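The return values follow directly from the two return statements in compareTo: a character difference at the first mismatch, otherwise a length difference. A small sketch (CompareDemo and compareToNullThrows are illustrative names of mine):

```java
class CompareDemo {
    // Returns true if compareTo(null) throws, as the missing null check implies.
    static boolean compareToNullThrows(String s) {
        try {
            s.compareTo(null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("apple".compareTo("apricot")); // -2: 'p' - 'r' at index 2
        System.out.println("abc".compareTo("abcde"));     // -2: length 3 - length 5
        System.out.println("abc".compareTo("abc"));       // 0
        System.out.println("abc".equals(null));           // false, no exception
        System.out.println(compareToNullThrows("abc"));   // true
    }
}
```

Note that equals tolerates null because of the instanceof check, while compareTo does not.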
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
* Returns a hash code for this string. The hash code for a
* {@code String} object is computed as
* <blockquote><pre>
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* </pre></blockquote>
* using {@code int} arithmetic, where {@code s[i]} is the
* <i>i</i>th character of the string, {@code n} is the length of
* the string, and {@code ^} indicates exponentiation.
* (The hash value of the empty string is zero.)
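From this we can see that String computes its hashCode differently from Object. Object.hashCode is a native method whose result is tied to the identity of the object (it is often described as derived from the object's memory address, though that is an implementation detail). String's hashCode is computed from the contents of the string, so two strings with the same contents always have the same hashCode.
In the formula, ^ means exponentiation, not XOR. The hashCode of the empty string is 0, and once the value has been computed it is cached in the hash field. But can an int not overflow when the string is long?
It can, and it does: int arithmetic simply wraps around. So how is a unique hashCode guaranteed? It is not; a hash code was never meant to be unique.
In practice you rarely work with that many distinct strings at once, so collisions are possible but not common, and hash-based collections fall back to equals when they do occur.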
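The formula and the collision question can both be checked with a few lines. This is a sketch of my own (HashDemo and manualHash are made-up names); manualHash just repeats the loop from the JDK method above.

```java
class HashDemo {
    // Re-implements h = 31 * h + c over all characters, using int arithmetic.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // wraps around silently on overflow
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("abc"));  // 96354 = 97*31^2 + 98*31 + 99
        System.out.println("abc".hashCode());   // 96354
        System.out.println("".hashCode());      // 0
        // A collision that needs no overflow at all:
        System.out.println("Aa".hashCode());    // 2112 = 65*31 + 97
        System.out.println("BB".hashCode());    // 2112 = 66*31 + 66
        System.out.println("Aa".equals("BB"));  // false
    }
}
```

The "Aa"/"BB" pair shows that collisions are not only a consequence of overflow; equal hash codes never imply equal strings.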
public String substring(int beginIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    int subLen = value.length - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    return (beginIndex == 0) ? this : new String(value, beginIndex, subLen);
}
public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > value.length) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    int subLen = endIndex - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    return ((beginIndex == 0) && (endIndex == value.length)) ? this
            : new String(value, beginIndex, subLen);
}
These two substring methods are used very frequently. As you can see, they return a newly created string, except when the requested range covers the whole string, in which case this is returned.
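A quick sketch of how the two overloads behave at the boundaries (SubstringDemo and rejects are names I chose). The identity check in the last line reflects the return statement shown above and should be treated as an implementation detail, not a documented guarantee.

```java
class SubstringDemo {
    // Returns true if the given range is rejected with an index exception.
    static boolean rejects(String s, int begin, int end) {
        try {
            s.substring(begin, end);
            return false;
        } catch (IndexOutOfBoundsException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "hello world";
        System.out.println(s.substring(6));               // world
        System.out.println(s.substring(0, 5));            // hello
        System.out.println(s.substring(11).isEmpty());    // true: beginIndex == length is allowed
        System.out.println(rejects(s, 5, 2));             // true: endIndex < beginIndex
        System.out.println(rejects(s, 0, 12));            // true: endIndex > length
        System.out.println(s.substring(0) == s);          // whole range: the same object in this implementation
    }
}
```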
public String concat(String str) {
    int otherLen = str.length();
    if (otherLen == 0) {
        return this;
    }
    int len = value.length;
    char buf[] = Arrays.copyOf(value, len + otherLen);
    str.getChars(buf, len);
    return new String(buf, true);
}
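The name concat readily brings to mind the cat command in Linux.
This method concatenates strings. It too allocates new storage and copies both the original string and the appended string into it.
I am actually curious about how a char array maps onto memory. Also, as mentioned earlier, the compiler turns string + into append calls on StringBuffer/StringBuilder, not into this concat method, so do not confuse the two.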
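One visible consequence of the two code paths is how they treat null. The sketch below uses names of my own (ConcatDemo, viaBuilder, concatNullThrows); viaBuilder spells out by hand the StringBuilder chain described above.

```java
class ConcatDemo {
    // Hand-written equivalent of the StringBuilder chain used for a + b.
    static String viaBuilder(String a, String b) {
        return new StringBuilder().append(a).append(b).toString();
    }

    // Returns true if concat(null) throws; concat calls str.length() first.
    static boolean concatNullThrows(String s) {
        try {
            s.concat(null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        String nothing = null;
        System.out.println("a" + nothing);             // anull
        System.out.println(viaBuilder("a", nothing));  // anull: append(null) appends "null"
        System.out.println(concatNullThrows("a"));     // true
        System.out.println("ab".concat("cd"));         // abcd
    }
}
```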
public String replace(char oldChar, char newChar) {
    if (oldChar != newChar) {
        int len = value.length;
        int i = -1;
        char[] val = value; /* avoid getfield opcode */

        while (++i < len) {
            if (val[i] == oldChar) {
                break;
            }
        }
        if (i < len) {
            char buf[] = new char[len];
            for (int j = 0; j < i; j++) {
                buf[j] = val[j];
            }
            while (i < len) {
                char c = val[i];
                buf[i] = (c == oldChar) ? newChar : c;
                i++;
            }
            return new String(buf, true);
        }
    }
    return this;
}
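In principle the method could modify value in place. It copies into a new array instead, because other strings may refer to the same memory, and changing it would change their values as well.
char[] val = value; /* avoid getfield opcode */
We also notice that even when value is only read, the code refers to it through another char[] variable, with the comment avoid getfield opcode.
getfield is a bytecode instruction that fetches the value of an instance field.
This has nothing to do with immutability; it is a performance consideration. value is an instance field and val is a local variable, so copying the field into a local reduces the number of getfield instructions at the bytecode level.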
https://www.cnblogs.com/think-in-java/p/6130917.html
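The same pattern can be written in ordinary code. The class below is a hypothetical example of mine, not JDK code. Both methods return the same result; the second reads the field once into a local. Whether this still matters after JIT compilation is something I have not measured, so treat it purely as an illustration of the idiom.

```java
class LocalCopyDemo {
    private final char[] value;

    LocalCopyDemo(String s) {
        this.value = s.toCharArray();
    }

    // Every use of `value` here is a field access on `this`.
    int countViaField(char target) {
        int count = 0;
        for (int i = 0; i < value.length; i++) {
            if (value[i] == target) {
                count++;
            }
        }
        return count;
    }

    // The field is read once; the loop then works on the local variable.
    int countViaLocal(char target) {
        char[] val = value; // single field read, as in String.replace
        int count = 0;
        for (int i = 0; i < val.length; i++) {
            if (val[i] == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        LocalCopyDemo demo = new LocalCopyDemo("banana");
        System.out.println(demo.countViaField('a')); // 3
        System.out.println(demo.countViaLocal('a')); // 3
    }
}
```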
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}
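Looking at this method: the delimiter is passed in as a regular expression. If the regex is a single character that is not a regex metacharacter, or two characters made up of a backslash followed by a character that is not an ASCII letter or digit, the split is carried out inside String itself; this is the fast path. Otherwise Pattern takes care of the regular expression. The second parameter is documented at length: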
* <p> The {@code limit} parameter controls the number of times the
* pattern is applied and therefore affects the length of the resulting
* array. If the limit <i>n</i> is greater than zero then the pattern
* will be applied at most <i>n</i> - 1 times, the array's
* length will be no greater than <i>n</i>, and the array's last entry
* will contain all input beyond the last matched delimiter. If <i>n</i>
* is non-positive then the pattern will be applied as many times as
* possible and the array can have any length. If <i>n</i> is zero then
* the pattern will be applied as many times as possible, the array can
* have any length, and trailing empty strings will be discarded.
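In other words, a positive limit caps the length of the returned array. If the limit exceeds the number of pieces the string can actually be split into, only that many pieces are returned, and trailing empty strings are kept. If limit is 0, the string is split as many times as possible and trailing empty strings are discarded. If limit is negative, the string is also split as many times as possible, but trailing empty strings are kept.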
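The three cases are easiest to see on one input. This is a sketch with names of my own (SplitLimitDemo, pieces); the input has one empty piece in the middle and two at the end.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Number of pieces produced when splitting on "," with the given limit.
    static int pieces(String input, int limit) {
        return input.split(",", limit).length;
    }

    public static void main(String[] args) {
        String input = "a,b,,c,,";
        System.out.println(Arrays.toString(input.split(",", 2))); // [a, b,,c,,]
        System.out.println(pieces(input, 2));    // 2: pattern applied once
        System.out.println(pieces(input, 0));    // 4: a, b, "", c (trailing empties dropped)
        System.out.println(pieces(input, -1));   // 6: trailing empties kept
        System.out.println(pieces(input, 100));  // 6: limit larger than the possible pieces
    }
}
```

Only trailing empty strings are affected by limit 0; the empty piece between "b" and "c" survives in every case.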
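public String[] split(String regex) {
    return split(regex, 0);
}
The split method without a limit parameter therefore returns the result with trailing empty strings removed; empty strings in the middle are kept.
public int indexOf(String str) {
    return indexOf(str, 0);
}
This method is quite handy: it tells you whether a string contains another string (it returns -1 if not) and where the first occurrence starts.
public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}
Regular expression matching; the entire string has to match the pattern.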
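A brief sketch of both methods (SearchDemo and looksLikeDate are illustrative names of mine, and the date pattern is only an example). The point worth noticing is that matches is not a "contains" test.

```java
class SearchDemo {
    // Example pattern: four digits, dash, two digits, dash, two digits.
    static boolean looksLikeDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    public static void main(String[] args) {
        String s = "hello world";
        System.out.println(s.indexOf("world"));    // 6
        System.out.println(s.indexOf("xyz"));      // -1
        System.out.println(s.matches("world"));    // false: the whole string must match
        System.out.println(s.matches(".*world"));  // true
        System.out.println(looksLikeDate("2024-01-15"));        // true
        System.out.println(looksLikeDate("on 2024-01-15"));     // false
    }
}
```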
public String trim() {
    int len = value.length;
    int st = 0;
    char[] val = value;    /* avoid getfield opcode */

    while ((st < len) && (val[st] <= ' ')) {
        st++;
    }
    while ((st < len) && (val[len - 1] <= ' ')) {
        len--;
    }
    return ((st > 0) || (len < value.length)) ? substring(st, len) : this;
}
Again, if anything is trimmed a new string is returned; otherwise the method returns a reference to the string itself.
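The comparison val[st] <= ' ' has a consequence that is easy to miss: trim removes every character up to U+0020, including control characters, but not other Unicode spaces. A small sketch (TrimDemo and trimmedLength are names I made up):

```java
class TrimDemo {
    // Length of the string after trim(), to make invisible characters countable.
    static int trimmedLength(String s) {
        return s.trim().length();
    }

    public static void main(String[] args) {
        System.out.println("[" + " \t hi \n".trim() + "]");  // [hi]
        System.out.println(trimmedLength("\u0001hi\u0000")); // 2: control characters are <= ' '
        // U+3000 (ideographic space) is greater than ' ', so trim() leaves it alone.
        System.out.println(trimmedLength("\u3000hi\u3000")); // 4
    }
}
```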
/**
* This object (which is already a string!) is itself returned.
*
* @return the string itself.
*/
public String toString() {
    return this;
}
Many of the valueOf overloads simply call toString, call a String constructor, or do a small check and return a string literal directly.
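A few calls show the three flavours. This is a sketch; ValueOfDemo and describe are my own names. The cast to Object matters: a bare String.valueOf(null) would pick the char[] overload and throw.

```java
class ValueOfDemo {
    // Goes through valueOf(Object), which handles null before calling toString().
    static String describe(Object o) {
        return String.valueOf(o);
    }

    public static void main(String[] args) {
        System.out.println(describe(null));                          // null (the four-character string)
        System.out.println(describe(new StringBuilder("sb")));       // sb, via toString()
        System.out.println(String.valueOf(true));                    // true, a literal
        System.out.println(String.valueOf(new char[] {'a', 'b'}));   // ab, via a constructor
        System.out.println(String.valueOf(42));                      // 42
    }
}
```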
public native String intern();
Returns a canonical representation for the string object.
* When the intern method is invoked, if the pool already contains a
* string equal to this {@code String} object as determined by
* the {@link #equals(Object)} method, then the string from the pool is
* returned. Otherwise, this {@code String} object is added to the
* pool and a reference to this {@code String} object is returned.
* It follows that for any two strings {@code s} and {@code t},
* {@code s.intern() == t.intern()} is {@code true}
* if and only if {@code s.equals(t)} is {@code true}.
* @return a string that has the same contents as this string, but is
* guaranteed to be from a pool of unique strings.
In section 3.10.5 of the Java Language Specification on Oracle's website we find these two examples:
String hello = "Hello", lo = "lo";
System.out.print((hello == ("Hel"+lo)) + " ");
//false Strings computed by concatenation at run time are newly created and therefore distinct.
System.out.println(hello == ("Hel"+lo).intern());
//true The result of explicitly interning a computed string is the same string as any pre-existing literal string with the same contents.
https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1
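The two JLS lines can be wrapped in a runnable class together with two contrasting cases. The class name InternDemo and the helper sameAfterIntern are mine; the first two comparisons are the ones quoted above.

```java
class InternDemo {
    // Restates the Javadoc rule: s.intern() == t.intern() iff s.equals(t).
    static boolean sameAfterIntern(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        String hello = "Hello", lo = "lo";
        System.out.println(hello == ("Hel" + lo));           // false: computed at run time
        System.out.println(hello == ("Hel" + lo).intern());  // true: interned explicitly
        System.out.println(hello == ("Hel" + "lo"));         // true: compile-time constant expression
        System.out.println(new String("Hello") == hello);    // false: new always creates an object
        System.out.println(sameAfterIntern(new String("Hello"), hello)); // true
    }
}
```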