Consider the following code:
arr = []
for (str, id, flag) in some_data:
    arr.append((str, id, flag))
Imagine the input strings are 2 characters long on average and 5 characters at most, and some_data has 1 million elements.
What will the memory requirement of such a structure be?
Could it be that a lot of memory is wasted on the strings? If so, how can I avoid that?
Solution
In this case, because the strings are so short and there are so many of them, you stand to save a fair bit of memory by calling intern on the strings (a built-in in Python 2; sys.intern in Python 3). Assuming the strings contain only lowercase letters, there are just 26 * 26 = 676 possible two-character strings, so with a million elements averaging two characters the list must contain a lot of repetitions. intern ensures that those repetitions don't each become a separate object but all refer to the same shared object.
Python may already intern some short strings, but judging from a number of different sources, this is highly implementation-dependent. So calling intern explicitly is probably the way to go; YMMV.
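A minimal sketch of the interning approach. The contents of some_data here are a made-up stand-in (short strings built at run time, with many repeats), and the names s and ident are used only to avoid shadowing the built-ins str and id:

```python
try:
    from sys import intern  # Python 3: intern lives in the sys module
except ImportError:
    pass  # Python 2: intern is already a built-in

# Hypothetical stand-in for some_data: 1000 records whose strings are
# built at run time and take only 26 distinct values ("aa" .. "az").
some_data = [("".join(["a", chr(ord("a") + i % 26)]), i, i % 2 == 0)
             for i in range(1000)]

arr = []
for s, ident, flag in some_data:
    # intern() returns one canonical object per distinct string value,
    # so equal strings end up sharing a single object.
    arr.append((intern(s), ident, flag))

distinct_values = len(set(s for s, _, _ in arr))
distinct_objects = len(set(id(s) for s, _, _ in arr))
print(distinct_values, distinct_objects)
```

If interning works as described, the number of distinct string objects held by arr equals the number of distinct string values, rather than the number of records.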
As an elaboration on why this is very likely to save memory, consider the following:
>>> import sys
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43
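Each additional character adds only one byte to the size of the string, but every string object carries a fixed overhead of 40 bytes on its own. The exact figures depend on the Python version and platform; these are the ones from the session above.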
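As a rough back-of-the-envelope estimate, the figures above can be plugged into the numbers from the question. This sketch counts only the string objects (not the tuples, the list, or the id and flag values), ignores allocator rounding, and assumes the worst case of no sharing versus a hypothetical best case of 676 distinct two-character strings:

```python
# Figures from the getsizeof session above; they vary by Python
# version and platform, so treat them as assumptions.
STR_OVERHEAD = 40     # bytes for an empty string
AVG_LEN = 2           # average string length from the question
N = 1000000           # number of elements in some_data

per_string = STR_OVERHEAD + AVG_LEN

# Worst case: every record holds its own string object.
unshared_bytes = N * per_string

# Best case (assumed): only 26 * 26 distinct strings, each stored once.
distinct = 26 * 26
shared_bytes = distinct * per_string

print(unshared_bytes)  # 42000000
print(shared_bytes)    # 28392
```

So under these assumptions the strings alone could take about 42 MB without sharing and under 30 KB with it; nearly all of the difference is per-object overhead, not character data.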