I have a list of tuples [(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need to get measures of central tendency (mean, median) and measures of dispersion (variance, standard deviation) for the above data. I would also like to plot a boxplot of the values.
I see that numpy has direct functions for getting the mean, median, and standard deviation (or variance) from a list of values.
Does numpy (or any other well-known library) have a direct means to operate on such a frequency distribution table?
Also, what is the best way to programmatically expand the above list of tuples into one list? (E.g. if the frequency distribution is [(1, 3), (50, 2)], what is the best way to get the list [1, 1, 1, 50, 50] so that I can use np.mean([1, 1, 1, 50, 50])?)
I see a custom function here, but I would like to use a standard implementation if possible.
Solution
First, I'd change that messy list into two numpy arrays like @user8153 did:
import numpy as np
val, freq = np.array(list_tuples).T
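One caveat with the transpose trick: if the values are floats, `np.array(list_tuples)` promotes the whole table to float, so the frequencies become floats too. A minimal sketch of an alternative unpacking that keeps the two columns separate, using a made-up table (`list_tuples` here is hypothetical sample data):

```python
import numpy as np

list_tuples = [(1.5, 3), (50.0, 2)]  # hypothetical table with float values

# Transposing a mixed table promotes every entry to float.
promoted_freq = np.array(list_tuples).T[1]

# Unzipping first lets each column keep its own dtype.
vals, freqs = zip(*list_tuples)
val = np.array(vals)    # floating-point values
freq = np.array(freqs)  # integer frequencies

print(promoted_freq.dtype, val.dtype, freq.dtype)
```

Integer frequencies matter for functions such as np.repeat, which expects integer repeat counts.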
Then you can reconstruct the full array (np.repeat avoids an explicit Python loop):
data = np.repeat(val, freq)
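Putting the two steps together for the example table from the question, a minimal end-to-end sketch might look like the following. The integer cast and the quartile computation are additions for illustration, not part of the original answer:

```python
import numpy as np

list_tuples = [(1, 3), (50, 2)]  # the example table from the question

# Split the (value, frequency) pairs into two parallel arrays.
val, freq = np.array(list_tuples).T

# np.repeat expects integer repeat counts; the cast is a no-op for this table.
data = np.repeat(val, freq.astype(int))

print(data.tolist())         # [1, 1, 1, 50, 50]
print(np.mean(data))         # 20.6
print(np.median(data))       # 1.0
print(np.var(data, ddof=1))  # sample variance
print(np.std(data, ddof=1))  # sample standard deviation

# Quartiles, i.e. the numbers a box plot is drawn from.
q1, med, q3 = np.percentile(data, [25, 50, 75])
```

Once `data` exists, it can also be handed to a plotting library, for example matplotlib's `plt.boxplot(data)`, to draw the box plot asked about in the question.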
If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:
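def mean_(val, freq):
    return np.average(val, weights=freq)

def median_(val, freq):
    order = np.argsort(val)  # renamed from `ord` to avoid shadowing the built-in
    cdf = np.cumsum(freq[order])
    # side='right' selects the element at sorted position N // 2
    # (the upper of the two middle values when N is even)
    return val[order][np.searchsorted(cdf, cdf[-1] // 2, side='right')]

def mode_(val, freq):  # in the strictest sense, assuming a unique mode
    return val[np.argmax(freq)]

def var_(val, freq):  # sample variance (N - 1 in the denominator)
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))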
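As a sanity check, frequency-based results can be compared against the expanded array. The sketch below is not part of the original answer: `median_from_freq` and `var_from_freq` are hypothetical helper names, the sample table is made up, and the median variant averages the two middle values when the total count is even so that it agrees with np.median.

```python
import numpy as np

list_tuples = [(1, 3), (50, 2), (7, 4), (20, 1)]  # hypothetical sample table
val, freq = np.array(list_tuples).T
data = np.repeat(val, freq)  # expanded reference data

def median_from_freq(val, freq):
    """Median of a frequency table, averaging the two middle values for even N."""
    order = np.argsort(val)
    sorted_val = np.asarray(val)[order]
    cdf = np.cumsum(np.asarray(freq)[order])
    n = cdf[-1]
    # side="right" returns the group that contains a given 0-based sorted position.
    lo = np.searchsorted(cdf, (n - 1) // 2, side="right")  # lower middle position
    hi = np.searchsorted(cdf, n // 2, side="right")        # upper middle position
    return (sorted_val[lo] + sorted_val[hi]) / 2

def var_from_freq(val, freq, ddof=1):
    """Weighted variance; ddof=1 gives the sample variance, ddof=0 the population one."""
    avg = np.average(val, weights=freq)
    return (freq * (val - avg) ** 2).sum() / (freq.sum() - ddof)

mean_ok = np.isclose(np.average(val, weights=freq), data.mean())
median_ok = np.isclose(median_from_freq(val, freq), np.median(data))
var_ok = np.isclose(var_from_freq(val, freq), data.var(ddof=1))
print(mean_ok, median_ok, var_ok)
```

Note that np.var and np.std default to `ddof=0`, so `ddof=1` has to be passed explicitly when comparing against a sample variance with N - 1 in the denominator.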