Python random grouping methods: comparing several ways to do grouped counting in Python

weixin_39933484

Published 2020-12-08 15:20:35


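Grouped counting comes up constantly during data cleaning, and once the data set is large, running speed starts to matter. Python offers several ways to count by group. This article collects a few of them and measures how fast each one runs.

First, generate a list of 10 million random single-letter strings, and import the time module to measure the running time: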

import random
import string
import time

n = 10000000
# Pre-allocate the list, then fill it with random ASCII letters (a-z, A-Z).
ran_str = [None]*n
for i in range(n):
    ran_str[i] = random.choice(string.ascii_letters)
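Every method below repeats the same tic/toc pattern. As a minimal sketch, that pattern could be wrapped in a small helper. The name `time_ms` is hypothetical and not part of the original article. Note that `time.process_time()` reports CPU time of the current process (time spent sleeping is excluded), which is what the article's measurements use.

```python
import time

def time_ms(func, *args):
    """Run func(*args) once and return (result, elapsed CPU time in ms)."""
    tic = time.process_time()
    result = func(*args)
    toc = time.process_time()
    return result, (toc - tic) * 1000

# Example: time the built-in sum on a small range.
total, elapsed = time_ms(sum, range(1000))
print(total, elapsed)
```

A single run is noisy; for more stable numbers, the standard library's `timeit` module can repeat the measurement.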

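Method 1: a plain Python dict

tic = time.process_time()
d = {}
for i in ran_str:
    # First occurrence creates the key; later occurrences increment it.
    if i not in d:
        d[i] = 1
    else:
        d[i] += 1
toc = time.process_time()
print(str((toc-tic)*1000))  # elapsed CPU time in ms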

Method 1 uses each element being grouped as a dictionary key and increments its count inside a loop. The running time was about 1800 ms.
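The if/else branch in the loop can also be written with `dict.get`, which returns a default when the key is missing. The sketch below is a variant for illustration only; the function name `count_with_get` is hypothetical, and the article did not time this version, so no speed claim is made for it.

```python
def count_with_get(items):
    """Count occurrences using dict.get with a default of 0."""
    counts = {}
    for item in items:
        # get() returns 0 for a key that has not been seen yet.
        counts[item] = counts.get(item, 0) + 1
    return counts

sample = ["a", "b", "a", "c", "a"]
result = count_with_get(sample)
print(result)  # {'a': 3, 'b': 1, 'c': 1}
```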

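Method 2: defaultdict from the standard library's collections module

from collections import defaultdict  # imported outside the timed region
tic = time.process_time()
dd = defaultdict(int)  # missing keys start at int() == 0
for i in ran_str:
    dd[i] += 1
toc = time.process_time()
print(str((toc-tic)*1000))  # elapsed CPU time in ms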

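With a plain dict, incrementing a key that does not exist yet (d[i] += 1) raises a KeyError, because the current value has to be read first. That is why Method 1 needs the if check. A defaultdict(int) creates missing keys with a default value of 0, so they can be incremented directly. The running time was about 1400 ms, somewhat faster.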
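To make the difference concrete, here is a small sketch on toy data (the variable names are illustrative). Plain assignment to a new key works on an ordinary dict; it is the read-then-write of `+=` that fails.

```python
from collections import defaultdict

plain = {}
plain["x"] = 1           # plain assignment to a new key is fine

missing = None
try:
    plain["y"] += 1      # must read plain["y"] first, so this raises KeyError
except KeyError as exc:
    missing = exc.args[0]

dd = defaultdict(int)
dd["y"] += 1             # int() supplies 0, which is then incremented to 1

print(missing, dict(dd))  # y {'y': 1}
```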

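Method 3: Counter from the standard library's collections module

from collections import Counter  # imported outside the timed region
tic = time.process_time()
# Counter consumes the whole iterable and tallies each element.
word_counts = Counter(ran_str)
toc = time.process_time()
print(str((toc-tic)*1000))  # elapsed CPU time in ms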

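Counter is a counting class in collections. It is a dict subclass, so the result behaves like a dictionary, and its keys and values can be read with word_counts.items(). This method ran in about 500 ms, a clear speedup.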
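A short sketch of what the resulting `Counter` offers beyond `items()`, using a toy list rather than the 10-million-element data:

```python
from collections import Counter

sample = ["a", "b", "a", "c", "a", "b"]
counts = Counter(sample)

# most_common(n) returns the n most frequent (element, count) pairs.
top_two = counts.most_common(2)
print(top_two)             # [('a', 3), ('b', 2)]

# Looking up an unseen key returns 0 and does not insert the key.
print(counts["z"])         # 0

# A Counter is a dict subclass, so the usual dict methods work.
for letter, count in counts.items():
    print(letter, count)
```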

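Method 4: groupby from the standard library's itertools module

from itertools import groupby  # imported outside the timed region
tic = time.process_time()
ran_str.sort()  # sorts in place; groupby only merges adjacent equal items
# Generator expression: the per-group counting runs only when groups is iterated.
groups = ((k,len(list(g))) for k,g in groupby(ran_str))
toc = time.process_time()
print(str((toc-tic)*1000))  # elapsed CPU time in ms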

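itertools.groupby only groups consecutive equal elements, so the sequence has to be sorted first. The first run took about 2700 ms, but later runs took only about 150 ms, because ran_str.sort() sorts in place and the list was already sorted by then. Note that groups is a generator expression, so these timings cover the sort and the creation of the generator; the per-group counting is deferred until groups is iterated and is not included in the figures. On these measurements groupby is the fastest option when the sequence is sorted in advance, although the comparison with the first three methods is not like for like. In the author's tests, the running time of the first three methods was not affected by sorting. Because groups is a generator, it can be consumed once, for example with for i,j in groups: print(i,j).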
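The two points above, that `groupby` needs sorted input and that a generator expression does no work until it is iterated, can be checked on a toy list. This is a minimal sketch with illustrative variable names; it uses `sorted()` to leave the input untouched, whereas the article sorts in place.

```python
from itertools import groupby

sample = ["b", "a", "b", "a", "a"]

# Without sorting, groupby starts a new group at every change of value.
unsorted_counts = [(k, len(list(g))) for k, g in groupby(sample)]
print(unsorted_counts)  # [('b', 1), ('a', 1), ('b', 1), ('a', 2)]

# After sorting, equal items are adjacent and each key appears once.
sorted_counts = [(k, len(list(g))) for k, g in groupby(sorted(sample))]
print(sorted_counts)    # [('a', 3), ('b', 2)]

# A generator expression is lazy and can be consumed only once.
lazy = ((k, len(list(g))) for k, g in groupby(sorted(sample)))
first_pass = list(lazy)   # the counting happens here
second_pass = list(lazy)  # already exhausted
print(first_pass, second_pass)
```

To time the counting itself, the result would need to be materialized (for example with a list comprehension, as in `sorted_counts`) inside the timed region.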
