I'm reading each line of a file into both a list and a dict,
with open("../data/title/pruned2_titleonly.txt", 'rb') as f_titles:
    titles_lst = f_titles.read().split('\n')
assert titles_lst[-1] == ''
titles_lst.pop() # remove the last element, an empty string
titles_dict = {}
with open("../data/title/pruned2_titleonly.txt", 'rb') as f_titles:
    for i,line in enumerate(f_titles):
        titles_dict[i] = line
and I'm testing the performance by accessing each item in the list/dict in random order:
import numpy as np
n = len(titles_lst)
a = np.random.permutation(n)  # every index exactly once, in random order
%%time
for i in xrange(10):
    t = []
    for b in a:
        t.append(titles_lst[b])
    del t
>>> CPU times: user 18.2 s, sys: 60 ms, total: 18.2 s
>>> Wall time: 18.1 s
%%time
for i in xrange(10):
    t = []
    for b in a:
        t.append(titles_dict[b])
    del t
>>> CPU times: user 41 s, sys: 208 ms, total: 41.2 s
>>> Wall time: 40.9 s
The above result seems to imply that dictionaries are not as efficient as lists for lookup tables, even though list lookups are O(n) while dict lookups are O(1). I've tested the following to see if the O(n)/O(1) performance was true... turns out it isn't...
%timeit titles_lst[n/2]
>>> 10000000 loops, best of 3: 81 ns per loop
%timeit titles_dict[n/2]
>>> 10000000 loops, best of 3: 120 ns per loop
What is the deal? If it's important to note, I am using Python 2.7.6 Anaconda distribution under Ubuntu 12.04, and I built NumPy under Intel MKL.
Solution
"The above result seems to imply that dictionaries are not as efficient as lists for lookup tables, even though list lookups are O(n) while dict lookups are O(1). I've tested the following to see if the O(n)/O(1) performance was true... turns out it isn't..."
It's not true that list lookups are O(N), in the sense of "getting an item", which is the sense your code seems to test. Determining where (if at all) an element exists can be O(N): somelist.index(someval_not_in_the_list) and someval_not_in_the_list in somelist both have to scan over every element. Try comparing x in somelist with x in somedict to see a major difference.
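As a rough illustration of that comparison, here is a minimal sketch using the standard timeit module. The container size n, the value missing, and the repeat count are arbitrary choices made for this example, not anything from the question, and the absolute timings will differ from machine to machine. It is written to run under both Python 2.7 and Python 3.

```python
import timeit

n = 100000  # hypothetical size, chosen only for illustration
somelist = list(range(n))
somedict = dict.fromkeys(somelist)
missing = -1  # in neither container, so the list has to be scanned to the end

# Each call returns the total seconds taken by 200 membership tests.
list_time = timeit.timeit(lambda: missing in somelist, number=200)
dict_time = timeit.timeit(lambda: missing in somedict, number=200)

print("list membership: %.6f s" % list_time)
print("dict membership: %.6f s" % dict_time)
```

The list figure should grow roughly in proportion to n, while the dict figure should stay about the same.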
But simply accessing somelist[index] is O(1) (see the Time Complexity page on the Python wiki). The constant factor is also likely to be smaller than for a dictionary lookup, which is likewise O(1) on average, because indexing a list does not have to hash the key.
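One way to check this yourself is to time the same access at several container sizes. The sketch below does that with timeit; the helper time_middle_access and the sizes are made up for this example, and the numbers it prints depend on the machine and Python build. If both operations are O(1), neither column should grow noticeably as n grows, even if the dict column is somewhat larger.

```python
import timeit

def time_middle_access(n, number=200000):
    """Return (list_seconds, dict_seconds) for fetching the middle item."""
    lst = list(range(n))
    dct = dict(enumerate(lst))  # same values, keyed by position
    mid = n // 2
    assert lst[mid] == dct[mid]
    list_time = timeit.timeit(lambda: lst[mid], number=number)
    dict_time = timeit.timeit(lambda: dct[mid], number=number)
    return list_time, dict_time

results = {}
for n in (1000, 100000, 1000000):
    results[n] = time_middle_access(n)
    print("n=%d  list: %.4f s  dict: %.4f s" % ((n,) + results[n]))
```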