[Updating] sklearn (17): Density Estimation

Sarah ฅʕ•̫͡•ʔฅ

Last modified 2022-05-24 18:06:02

First published 2018-10-04 23:08:07

Article link: https://blog.csdn.net/u014765410/article/details/82940722

Density estimation with histograms

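To estimate density with a histogram, fix the bin size, then count the samples that fall into each bin. The count, normalized by the total number of samples and the bin width, serves as the density estimate.
Histogram density estimation has an obvious weakness: different bins can produce very different density plots. In addition, the resulting density is not continuous. Kernel density estimation handles both problems well. The figure below shows histogram estimates and kernel density estimates. The upper panels are histogram estimates; the left and right panels use different bins, and the resulting density plots clearly differ (one is bimodal, the other unimodal).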

As the figure above shows, the density curve from kernel density estimation is smoother and more continuous.
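The sensitivity to binning described above can be reproduced with a small sketch. The data here is a hypothetical two-component sample invented for illustration, and `histogram_density` is just a thin helper around `numpy.histogram`; it is not taken from the original figure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D sample: a mixture of two normal distributions.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

def histogram_density(data, bins):
    """Return (density, edges); the density integrates to 1 over the bins."""
    density, edges = np.histogram(data, bins=bins, density=True)
    return density, edges

coarse, coarse_edges = histogram_density(x, bins=4)
fine, fine_edges = histogram_density(x, bins=30)

# Both are valid density estimates (total area 1), yet their shapes differ.
coarse_area = float(np.sum(coarse * np.diff(coarse_edges)))
fine_area = float(np.sum(fine * np.diff(fine_edges)))
print("areas:", coarse_area, fine_area)
print("coarse bins:", np.round(coarse, 3))
print("fine bins:  ", np.round(fine, 3))
```

Both estimates are piecewise constant, so they jump at every bin edge no matter how many bins are used.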

kernel density estimation

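Kernel density estimation can be used to compute the probability density at each point of the data. When estimating the density at a point, it does not assume that the data follow any particular distributional form. Instead, it uses the formula p = k / (n * V_n), where k is the number of samples inside a region of volume V_n centered on the point, n is the total number of samples, and V_n is the volume of the region (tending to 0). The hard count k can be generalized with a kernel: the kernel assigns a weight to each sample in the region, and the sum of these weights replaces k as the estimate of the number of points. (This is the Parzen window method.)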
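The formula above can be written out by hand in one dimension, where the region around a point x is the interval [x - h, x + h] and so V_n = 2h. The following is a minimal NumPy sketch, not sklearn's implementation; the function names, the sample, and the window half-width `h` are all assumptions made for illustration.

```python
import numpy as np

def parzen_tophat(x, data, h):
    """p(x) = k / (n * V) with V = 2h: count the samples within distance h of x."""
    k = np.sum(np.abs(data - x) <= h)
    return k / (len(data) * 2.0 * h)

def parzen_gaussian(x, data, h):
    """Replace the hard count with Gaussian weights that decay with distance."""
    u = (data - x) / h
    weights = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return np.sum(weights) / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 1000)  # hypothetical sample from N(0, 1)

grid = np.linspace(-6.0, 6.0, 1201)
step = grid[1] - grid[0]
p_tophat = np.array([parzen_tophat(g, data, h=0.5) for g in grid])
p_gauss = np.array([parzen_gaussian(g, data, h=0.5) for g in grid])

# Both estimates should integrate to roughly 1 over a wide enough grid.
area_tophat = float(np.sum(p_tophat) * step)
area_gauss = float(np.sum(p_gauss) * step)
print(area_tophat, area_gauss)
```

The top-hat version is the literal count k, so it still has jumps; the Gaussian version gives every sample a smoothly decaying weight, which is why the resulting curve is smooth.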

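#neighbor-based approaches
sklearn.neighbors.KernelDensity(bandwidth=1.0, algorithm='auto', kernel='gaussian', metric='euclidean', atol=0, rtol=0, breadth_first=True, leaf_size=40, metric_params=None)
#bandwidth: the width parameter of the kernel. A larger bandwidth means more bias and a tendency to underfit (oversmooth); a smaller bandwidth means more variance and a tendency to overfit.
#algorithm: {'kd_tree', 'ball_tree', 'auto'}. Evaluating the density requires finding the points near a query point, so a KD-tree or ball tree is built to speed up that search.
#breadth_first: search strategy for the tree. If True (default), use a breadth-first approach; otherwise use a depth-first approach.
#kernel: {'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'}. Default is 'gaussian'.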
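A short usage sketch of `KernelDensity` follows, showing how `bandwidth` trades bias against variance. The sample, the grid, the three bandwidth values, and the `count_peaks` helper are assumptions chosen for illustration; counting local maxima is only a rough proxy for how wiggly the estimate is.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical bimodal sample; sklearn expects shape (n_samples, n_features).
X = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)
grid = np.linspace(-12.0, 14.0, 1301).reshape(-1, 1)
step = grid[1, 0] - grid[0, 0]

def count_peaks(d):
    """Count strict local maxima of a 1-D curve."""
    return int(np.sum((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])))

areas, peaks = {}, {}
for bandwidth in (0.1, 0.5, 2.0):
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
    # score_samples returns the log-density, so exponentiate it.
    density = np.exp(kde.score_samples(grid))
    areas[bandwidth] = float(np.sum(density) * step)
    peaks[bandwidth] = count_peaks(density)

print(areas)
print(peaks)
```

A small bandwidth follows individual samples and produces many spurious peaks, while a large one smooths the two modes toward a single broad bump.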

Kernel Density Estimation (KDE)

GaussianMixture

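sklearn.mixture.GaussianMixture(n_components=1, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)
#n_components: number of Gaussian distributions (mixture components)
#covariance_type: {'full': each component has its own general covariance matrix, 'tied': all components share the same general covariance matrix, 'diag': each component has its own diagonal covariance matrix, 'spherical': each component has its own single variance}
#tol: convergence threshold. EM iterations stop when the gain in the lower bound falls below tol.
#reg_covar: non-negative regularization added to the diagonal of the covariance matrices. It keeps the covariance matrices positive definite, which avoids numerical problems when a component collapses onto very few points.
#max_iter: maximum number of EM iterations
#n_init: number of initializations to run; the best result is kept
#init_params: method used to initialize the component weights, means, and precisions. {'kmeans', 'random'}
#weights_init: initial weights of the components
#means_init: initial means of the components
#precisions_init: initial precisions of the components; a precision matrix is the inverse of a covariance matrix
#random_state: seed for the random number generator used for initialization and for sampling
#warm_start: if True, the result of the previous fit is used as the initialization for the next call to fit()
#verbose: Enable verbose output.
#verbose_interval: Number of iterations done before the next print.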
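To make the `covariance_type` options concrete, the sketch below fits the same data with each option and records the shape of the fitted `covariances_`. The two-cluster data is hypothetical and only serves as an illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D data: two well-separated clusters.
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([5.0, 5.0], 1.0, size=(200, 2)),
])

shapes = {}
for cov_type in ("full", "tied", "diag", "spherical"):
    model = GaussianMixture(n_components=2, covariance_type=cov_type,
                            random_state=0).fit(X)
    shapes[cov_type] = model.covariances_.shape

# With n_components=2 and n_features=2:
#   full -> (2, 2, 2), tied -> (2, 2), diag -> (2, 2), spherical -> (2,)
print(shapes)

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
print(gmm.converged_, gmm.n_iter_)
print(np.round(gmm.weights_, 2))
print(np.round(gmm.means_, 2))
```

The fewer free parameters a covariance type has, the cheaper the fit and the lower the risk of overfitting, at the cost of less flexible cluster shapes.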

#attributes
.weights_   #weight of each component
.means_   #mean of each component
.covariances_   #covariance of each component
.precisions_  #precision matrix of each component (inverse of the covariance)
.precisions_cholesky_  #The Cholesky decomposition of the precision matrices of each mixture component.
.converged_ #bool: whether the fit converged
.n_iter_  #number of iterations used
.lower_bound_  #Lower bound value on the log-likelihood of the best fit of EM.

#method
aic(X)	#Akaike information criterion for the current model on the input X.
bic(X)	#Bayesian information criterion for the current model on the input X.
fit(X[, y])	#Estimate model parameters with the EM algorithm.
fit_predict(X[, y])	#Estimate model parameters using X and predict the labels for X.
get_params([deep])	#Get parameters for this estimator.
predict(X)	#Predict the labels for the data samples in X using trained model.
predict_proba(X)	#Predict posterior probability of each component given the data.
sample([n_samples])	#Generate random samples from the fitted Gaussian distribution.
score(X[, y])	#Compute the per-sample average log-likelihood of the given data X.
score_samples(X)	#Compute the weighted log probabilities for each sample.
set_params(**params)	  #Set the parameters of this estimator.
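The methods listed above can be combined as in the following sketch. It picks `n_components` by the lowest BIC, which is one common heuristic rather than the only option, and then calls the prediction, scoring, and sampling methods. The data and the candidate range 1 to 5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D data: two well-separated clusters.
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([5.0, 5.0], 1.0, size=(200, 2)),
])

# Fit one model per candidate number of components and record its BIC.
bic = {}
for k in range(1, 6):
    model = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bic[k] = model.bic(X)
best_k = min(bic, key=bic.get)  # lower BIC is better

gmm = GaussianMixture(n_components=best_k, n_init=3, random_state=0).fit(X)
labels = gmm.predict(X)                # hard component assignment
proba = gmm.predict_proba(X)           # posterior probability per component
log_density = gmm.score_samples(X)     # log-likelihood of each sample
mean_log_density = gmm.score(X)        # average of the values above
X_new, components = gmm.sample(5)      # draw new points from the fitted model

print(best_k, np.round(mean_log_density, 3))
print(X_new.shape, components)
```

Because `score_samples` returns log-densities, a fitted `GaussianMixture` can be used as a density estimator in the same way as `KernelDensity`: exponentiate the result to get the density at each point.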