Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
I want to quickly count how many times each value occurs across the entire DataFrame, over all columns combined.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
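To make the intended result concrete, here is a minimal sketch of the same approach on a tiny, made-up frame. The names `small` and `counts` and the sample values are assumptions for illustration only, and `dtype=object` is set so that the integer sentinel can be filled into the string columns.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Tiny, made-up frame (hypothetical; not the data from the question).
# dtype=object keeps the cells as plain Python objects.
small = pd.DataFrame(
    {"A": ["x", "y", np.nan], "B": ["x", "x", "z"]},
    dtype=object,
)

# Same idea as above: replace NaN with a sentinel, then count every cell
# across all columns in a single Counter.
counts = Counter(v for col in small for v in small[col].fillna(-999))

# "x" appears 3 times; "y", "z" and the -999 sentinel appear once each.
print(counts)
```

The result is one count per distinct value, regardless of which column it came from.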
But it is slow:
%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop
I thought this could be sped up by using some pandas horsepower:
def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for