I've tried to understand when Python strings are identical (aka sharing the same memory location). However during my tests, there seems to be no obvious explanation when two string variables that are equal share the same memory:
import sys
print(sys.version) # 3.4.3
# Example 1
s1 = "Hello"
s2 = "Hello"
print(id(s1) == id(s2)) # True
# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True
# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False
# Example 4
s1 = "HelloHelloHelloHelloHello"
s2 = "HelloHelloHelloHelloHello"
print(id(s1) == id(s2)) # True
# Example 5
s1 = "Hello" * 5
s2 = "Hello" * 5
print(id(s1) == id(s2)) # False
Strings are immutable, and as far as I know Python tries to re-use existing immutable objects, by having other variables point to them instead of creating new objects in memory with the same value.
With this in mind, it seems obvious that Example 1 returns True.
It's still obvious (to me) that Example 2 returns True.
It's not obvious to me, that Example 3 returns False - am I not doing the same thing as in Example 2?!?
and read through http://guilload.com/python-string-interning/ (though I probably didn't understand it all) and thougt - okay, maybe "interned" strings depend on the length, so I used HelloHelloHelloHelloHello in Example 4. The result was True.
And what the puzzled me, was doing the same as in Example 2 just with a bigger number (but it would effectively return the same string as Example 4) - however this time the result was False?!?
I have really no idea how Python decides whether or not to use the same memory object, or when to create a new one.
Are the any official sources that can explain this behavior?
解决方案Avoiding large .pyc files
So why does 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa' not evaluate to True? Do you remember the .pyc files you encounter in all your packages? Well, Python bytecode is stored in these files. What would happen if someone wrote something like this ['foo!'] * 10**9? The resulting .pyc file would be huge! In order to avoid this phenomena, sequences generated through peephole optimization are discarded if their length is superior to 20.
If you have the string "HelloHelloHelloHelloHello", Python will necessarily have to store it as it is (asking the interpreter to detect repeating patterns in a string to save space might be too much). However, when it comes to string values that can be computed at parsing time, such as "Hello" * 5, Python evaluate those as part of this so-called "peephole optimization", which can decide whether it is worth it or not to precompute the string. Since len("Hello" * 5) > 20, the interpreter leaves it as it is to avoid storing too many long strings.
EDIT:
As indicated in this question, you can check this on the source code in peephole.c, function fold_binops_on_constants, near the end you will see:
// ...
} else if (size > 20) {
Py_DECREF(newconst);
return -1;
}
EDIT 2:
Actually, that optimization code has recently been moved to the AST optimizer for Python 3.7, so now you would have to look into ast_opt.c, function fold_binop, which calls now function safe_multiply, which checks that the string is no longer than MAX_STR_SIZE, newly defined as 4096. So it seems that the limit has been significantly bumped up for the next releases.