Comparing whether strings are the same in Python: how does Python determine whether two strings are identical?


I've tried to understand when Python strings are identical (that is, when they share the same memory location). However, in my tests there seems to be no obvious rule for when two equal string variables share the same memory:

import sys
print(sys.version) # 3.4.3

# Example 1
s1 = "Hello"
s2 = "Hello"
print(id(s1) == id(s2)) # True

# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True

# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False

# Example 4
s1 = "HelloHelloHelloHelloHello"
s2 = "HelloHelloHelloHelloHello"
print(id(s1) == id(s2)) # True

# Example 5
s1 = "Hello" * 5
s2 = "Hello" * 5
print(id(s1) == id(s2)) # False

Strings are immutable, and as far as I know Python tries to re-use existing immutable objects, by having other variables point to them instead of creating new objects in memory with the same value.

With this in mind, it seems obvious that Example 1 returns True.

It's still obvious (to me) that Example 2 returns True.

It's not obvious to me that Example 3 returns False. Am I not doing the same thing as in Example 2?

I then read through http://guilload.com/python-string-interning/ (though I probably didn't understand it all) and thought: okay, maybe "interned" strings depend on the length, so I used HelloHelloHelloHelloHello in Example 4. The result was True.

What then puzzled me was doing the same as in Example 2, just with a bigger number (which effectively produces the same string as Example 4). This time, however, the result was False.

I really have no idea how Python decides whether to reuse the same object in memory or to create a new one.

Are there any official sources that explain this behavior?

Solution

Avoiding large .pyc files

So why does 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa' not evaluate to True? Do you remember the .pyc files you encounter in all your packages? Well, Python bytecode is stored in these files. What would happen if someone wrote something like ['foo!'] * 10**9? The resulting .pyc file would be huge! To avoid this, sequences generated through peephole optimization are discarded if their length is greater than 20.

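If you have the string "HelloHelloHelloHelloHello", Python necessarily has to store it as it is (asking the interpreter to detect repeating patterns in a string to save space might be too much). However, when it comes to string values that can be computed at compile time, such as "Hello" * 5, Python evaluates them as part of this so-called "peephole optimization", which decides whether precomputing the string is worth it. Since len("Hello" * 5) > 20, the interpreter leaves the expression as it is to avoid storing too many long strings, so each assignment builds a new string at run time. "Hello" * 3 has length 15, so it is precomputed and both variables refer to the same stored constant. The same reasoning covers Example 3: "Hello" * i involves a variable, so it cannot be precomputed at all, and each multiplication produces a new string object at run time.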
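One way to observe this folding without relying on id() is to compile an expression and inspect the constants stored in the resulting code object. The following is only a minimal sketch, not part of the original answer; the helper name string_constants is made up for illustration, and whether "Hello" * 5 shows up as a precomputed constant depends on the CPython version in use.

```python
def string_constants(expression):
    """Return the string constants stored in the compiled code object."""
    code = compile(expression, "<example>", "eval")
    return [c for c in code.co_consts if isinstance(c, str)]

# Both operands are literals, so the compiler can fold the product.
print(string_constants('"Hello" * 3'))

# i is a name, not a constant, so the multiplication is left for run time.
print(string_constants('"Hello" * i'))

# Whether this one is folded depends on the interpreter's size limit.
print(string_constants('"Hello" * 5'))
```

If the folded string appears in the list, every use of that expression in the same code object refers to that one stored constant, which is why the id() comparison succeeds in those cases.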

EDIT:

As indicated in this question, you can check this in the source code: in peephole.c, near the end of the function fold_binops_on_constants, you will see:

// ...
} else if (size > 20) {
    Py_DECREF(newconst);
    return -1;
}

EDIT 2:

Actually, that optimization code was moved to the AST optimizer in Python 3.7, so now you would have to look at the function fold_binop in ast_opt.c. It now calls safe_multiply, which checks that the resulting string is no longer than MAX_STR_SIZE, newly defined as 4096. So the limit has been raised significantly from 3.7 onwards.
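Because these limits are implementation details that change between releases, code should not depend on two equal strings being the same object. As a rough sketch (not from the original answer): compare values with ==, and use sys.intern if a shared object is really needed. The variable names below are only illustrative.

```python
import sys

n = 5
a = "Hello" * n  # built at run time, so no compile-time folding applies
b = "Hello" * n

# Value equality is what you normally want, and it is always reliable.
print(a == b)

# Identity of equal strings is an implementation detail; on CPython these
# two run-time results are typically distinct objects.
print(a is b)

# sys.intern returns the single canonical object for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

The last comparison holds by definition of sys.intern, regardless of how long the string is or how it was built.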
