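The previous post covered the Box-Cox transform, which converts markedly skewed data into approximately normally distributed data.
How is the best λ chosen? scipy uses maximum likelihood estimation: the transformed y is assumed to be approximately normal, so the best λ is the one that maximizes the log-likelihood.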
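To make the idea concrete, here is a minimal sketch that searches for the λ maximizing the log-likelihood and compares it with the value returned by `stats.boxcox`. The sample data, the random seed, and the search interval (-2, 2) are assumptions chosen for illustration, and `minimize_scalar` stands in for whatever optimizer scipy uses internally.

```python
import numpy as np
from scipy import optimize, stats

# Illustrative data (an assumption): strictly positive and right-skewed.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# Maximize the log-likelihood by minimizing its negative over a bounded interval.
result = optimize.minimize_scalar(
    lambda lmb: -stats.boxcox_llf(lmb, data),
    bounds=(-2.0, 2.0),
    method="bounded",
)
lmb_manual = result.x

# With lmbda left unset, stats.boxcox returns the transformed data
# together with the lambda it estimated by maximum likelihood.
_, lmb_scipy = stats.boxcox(data)

print(lmb_manual, lmb_scipy)
```

The two estimates should agree closely, since both maximize the same log-likelihood function.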
In the scipy source, the relevant code lives in the file _morestats.py. Simplified, it looks like this:
import numpy as np
from scipy import stats, optimize


def boxcox_llf(lmb, data):
    data = np.asarray(data)
    N = data.shape[0]
    if N == 0:
        return np.nan
    logdata = np.log(data)
    # Compute the variance of the transformed data.
    if lmb == 0:
        variance = np.var(logdata, axis=0)
    else:
        # Transform without the constant offset 1/lmb. The offset does
        # not affect the variance, and the subtraction of the offset can
        # lead to loss of precision.
        variance = np.var(data**lmb / lmb, axis=0)
    return (lmb - 1) * np.sum(lo