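Six years after I answered, someone pointed out to me that I had misread the question. While my original answer (below) counts how many times each key appears in the input sequence, you actually have a different count-distinct problem: you want to count the distinct values per key.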
To count the unique values per key, exactly, you first have to collect those values into sets:

    values_per_key = {}
    for d in iterable_of_dicts:
        for k, v in d.items():
            values_per_key.setdefault(k, set()).add(v)
    counts = {k: len(v) for k, v in values_per_key.items()}

For your input, this produces:

    >>> values_per_key = {}
    >>> for d in iterable_of_dicts:
    ...     for k, v in d.items():
    ...         values_per_key.setdefault(k, set()).add(v)
    ...
    >>> counts = {k: len(v) for k, v in values_per_key.items()}
    >>> counts
    {'abc': 3, 'xyz': 1, 'pqr': 4}

If you want to make use of the additional functionality that class offers, you can still wrap the resulting dictionary in a Counter() instance:

    >>> from collections import Counter
    >>> Counter(counts)
    Counter({'pqr': 4, 'abc': 3, 'xyz': 1})

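The set-collecting loop and the Counter() wrapper can be folded into a single helper. This is only a minimal sketch: the function name `count_distinct_values` and the `sample` list (copied from the demo data used elsewhere in this answer) are illustrative and not part of the original code. It assumes every value is hashable, since values are stored in sets.

```python
from collections import Counter, defaultdict


def count_distinct_values(dicts):
    """Return a Counter mapping each key to its number of distinct values."""
    values_per_key = defaultdict(set)
    for d in dicts:
        for key, value in d.items():
            # Values must be hashable to be stored in a set.
            values_per_key[key].add(value)
    return Counter({key: len(values) for key, values in values_per_key.items()})


sample = [
    {"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
    {"xyz": "music"},
    {"pqr": "music"}, {"pqr": "movies"}, {"pqr": "sports"},
    {"pqr": "news"}, {"pqr": "sports"},
]
distinct = count_distinct_values(sample)
print(distinct.most_common(1))  # [('pqr', 4)]
```

Using `defaultdict(set)` instead of `setdefault()` avoids creating a throwaway empty set on every iteration. The result is otherwise the same.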
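The downside is that if your input iterable is very large, the approach above can require a lot of memory, because every distinct value is kept in a set. If you don't need exact counts, for example when an order of magnitude is enough, there are other approaches, such as a HyperLogLog structure or other algorithms that "sketch out" a count for the stream.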
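Using such a structure requires you to install a third-party library. For example, the datasketch project offers both HyperLogLog and MinHash. Here is an HLL example (using the HyperLogLogPlusPlus class, which is a recent improvement on the HLL approach):

    from collections import defaultdict
    from datasketch import HyperLogLogPlusPlus

    counts = defaultdict(HyperLogLogPlusPlus)
    for d in iterable_of_dicts:
        for k, v in d.items():
            # update() expects bytes; this assumes the values are strings
            counts[k].update(v.encode('utf8'))

    # each sketch reports an approximate cardinality via count()
    estimates = {k: hll.count() for k, hll in counts.items()}
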
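If installing a library is not an option, the idea of "sketching" a distinct count can be illustrated with the standard library alone. The block below is a rough sketch of a K-minimum-values (KMV) estimator. This is a different and simpler technique than HyperLogLog, and `KMVSketch` is a hypothetical class written only for this illustration. It assumes string values and keeps at most `k` hash values per key, so memory stays bounded however many values are seen.

```python
import hashlib
import heapq
from collections import defaultdict


class KMVSketch:
    """Approximate distinct counter that remembers only the k smallest hashes."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept, to skip duplicates

    @staticmethod
    def _hash(value):
        # Map a string to a float that is roughly uniform in [0, 1).
        digest = hashlib.sha1(value.encode('utf8')).digest()
        return int.from_bytes(digest[:8], 'big') / 2 ** 64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: exact
        return (self.k - 1) / -self._heap[0]


sketches = defaultdict(KMVSketch)
for d in [{"abc": "movies"}, {"abc": "sports"}, {"abc": "movies"}]:
    for key, value in d.items():
        sketches[key].add(value)
print({key: s.estimate() for key, s in sketches.items()})  # {'abc': 2}
```

With `k=256` the relative error is typically a few percent once a key has many more than `k` distinct values. A maintained library implementation is the better choice for real workloads.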
My original answer:

    from collections import Counter
    from itertools import chain

    counts = Counter(chain.from_iterable(e.keys() for e in d))

This ensures that dictionaries with more than one key in your input list are counted correctly.
Demo:

    >>> from collections import Counter
    >>> from itertools import chain
    >>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
    >>> Counter(chain.from_iterable(e.keys() for e in d))
    Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

Or with multiple keys in the input dictionaries:

    >>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
    >>> Counter(chain.from_iterable(e.keys() for e in d))
    Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

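To make the difference between the two readings of the question concrete, here is a small side-by-side sketch on the multi-key demo data. The names `occurrences` and `distinct` are illustrative. The set comprehension rescans the list once per key, which is fine for a demonstration but not for large inputs.

```python
from collections import Counter
from itertools import chain

d = [
    {"abc": "movies", "xyz": "music", "pqr": "music"},
    {"abc": "sports", "pqr": "movies"},
    {"abc": "music", "pqr": "sports"},
    {"pqr": "news"},
    {"pqr": "sports"},
]

# How many dictionaries contain each key (the original answer).
occurrences = Counter(chain.from_iterable(e.keys() for e in d))

# How many different values each key maps to (the count-distinct reading).
distinct = Counter({
    key: len({e[key] for e in d if key in e})
    for key in occurrences
})

print(occurrences["pqr"], distinct["pqr"])  # 5 4
```

The two counts differ only for `pqr`, because `"sports"` appears twice among its values.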
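Counter() has additional, helpful functionality, such as the Counter.most_common() method, which lists elements and their counts from the most common to the least:

    for key, count in counts.most_common():
        print('{}: {}'.format(key, count))

    # prints
    # pqr: 5
    # abc: 3
    # xyz: 1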
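A few more Counter() features that tend to be useful with counts like these are sketched below. The `more` counter is made-up data, used only to show addition.

```python
from collections import Counter

counts = Counter({"pqr": 5, "abc": 3, "xyz": 1})

print(counts.most_common(2))  # [('pqr', 5), ('abc', 3)]
print(counts["missing"])      # 0, missing keys do not raise KeyError
print(sum(counts.values()))   # 9

# Counters can be added together; counts for shared keys are summed.
more = Counter({"abc": 2, "def": 1})
combined = counts + more
print(combined["abc"])        # 5
```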