Python probability density function estimation: how do I estimate a density function and compute its peaks?

I have started to use Python for analysis. I would like to do the following:

Get the distribution of the dataset

Get the peaks in this distribution

I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? My data changes over time, so if it follows one distribution now (e.g. Gaussian), it could follow a different one later. Does gaussian_kde have any drawbacks in that scenario? It was suggested in this question to try fitting the data to every candidate distribution in order to find the one that matches, so what is the difference between doing that and using gaussian_kde? Is gaussian_kde a good way to estimate the pdf if the data will change over time? I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get the peaks of the estimated density? I used the code below.

import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats

df = pd.read_csv(r'D:\dataset.csv')        # raw string so the backslash survives
data = df.values.ravel()                   # flatten the single column to a 1-D array

pdf = scipy.stats.gaussian_kde(data)       # scipy.stats.kde.gaussian_kde is the old path
x = np.linspace(data.min() - 1, data.max() + 1, len(data))
y = pdf(x)

pl.plot(x, y, color='r')
pl.hist(data, density=True)                # 'normed' was removed from matplotlib
pl.show(block=True)
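For reference, gaussian_kde picks its bandwidth by Scott's rule of thumb by default, but the bw_method argument lets you override it. A minimal sketch of the options (the windowed re-fit at the end is only one possible way to track data that drifts over time, not anything built into gaussian_kde):

import numpy as np
import scipy.stats as stats

data = np.random.randn(1000)

kde_scott = stats.gaussian_kde(data)                        # default: Scott's rule
kde_silverman = stats.gaussian_kde(data, bw_method='silverman')
kde_narrow = stats.gaussian_kde(data, bw_method=0.1)        # explicit scalar factor

# If the distribution drifts over time, re-fitting on a recent window
# (rather than the full history) keeps the estimate current.
kde_recent = stats.gaussian_kde(data[-200:])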

Solution

I think you need to distinguish non-parametric density estimation (what scipy.stats.gaussian_kde implements) from parametric density estimation (what the StackOverflow question you mention uses). To illustrate the difference between the two, try the following code.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf
nparam_kde = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_kde(x)

# parametric fit: assume the data come from a single normal distribution
loc_param, scale_param = stats.norm.fit(df.values.ravel())
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)   # 'normed' was removed from matplotlib
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
plt.show()

From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x = x0 is represented by a bar: all of its probability mass sits on the single point x = x0 and nowhere else. In non-parametric density estimation, the same observation is instead represented by a bell-shaped curve (the Gaussian kernel) that spreads its mass over a neighbourhood, and averaging these curves gives a smooth density estimate. This internal Gaussian kernel has nothing to do with any distributional assumption about the underlying data x; its sole purpose is smoothing.
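To make the smoothing interpretation concrete, here is a minimal sketch (assuming a 1-D sample; the sample and grid sizes are arbitrary) that rebuilds gaussian_kde by hand as an average of Gaussian bumps, one per observation:

import numpy as np
import scipy.stats as stats

np.random.seed(0)
data = np.random.randn(50)
x = np.linspace(-4, 4, 400)

kde = stats.gaussian_kde(data)
h = kde.factor * data.std(ddof=1)   # the kernel width gaussian_kde actually uses

# Place a normal pdf of width h on every observation and average them:
manual = np.mean([stats.norm.pdf(x, loc=xi, scale=h) for xi in data], axis=0)

print(np.allclose(manual, kde(x)))  # True, up to floating-point error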

To get the mode of a non-parametric density, we need an exhaustive search, because the density is not guaranteed to be unimodal. As the example above shows, if a quasi-Newton optimization algorithm starts somewhere in [5, 10], it is very likely to end up at a local optimum rather than the global one.

# get mode: exhaustive search over the evaluated grid
x[np.argmax(nparam_density)]
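Since the density here is bimodal, a grid search can also report every local maximum at once. A minimal sketch using scipy.signal.find_peaks (available in SciPy 1.1+) on the x and nparam_density arrays computed above:

from scipy.signal import find_peaks

peak_idx, _ = find_peaks(nparam_density)
print(x[peak_idx])   # grid locations of all local maxima, here near -6 and 4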
