我的理解是,你的二维矩阵是个 histogram. 里面都是 int 对吧?
这样的话,方法也许不唯一。但我提供一个从 histogram 里 recover data 的思路。为了方便理解,我先模拟一遍你观测到的数据,然后再拟合。以下代码是数据生产的模拟过程。
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal
# simulate the data with some fake mean and cov
mean = np.array([1., 3.])
cov = np.array([[3., 1.],
[1., 3.]])
# this is the true MVN we want to fit from data
mvn = multivariate_normal(mean=mean, cov=cov)
# the observed data
data = mvn.rvs(size=200)
# the histogram `hist` is your (9, 12) matrix
hist, xe, ye = np.histogram2d(rvs[:,0], rvs[:,1], bins=[9, 12])
# show the histogram and true density
xx, yy = np.meshgrid(xe, ye, indexing='ij')
pos = np.dstack([xx, yy])
density = mvn.pdf(pos)
_, (ax1, ax2) = plt.subplots(ncols=2)
ax1.imshow(hist)
ax2.imshow(density)
ax1.set_title("histogram")
ax2.set_title("true density")
好,现在假设我得到的这个 histogram 就是你的那个矩阵。我该怎么你和右边的 true density 呢?我这里用的方式是先从 histogram 里还原出那200个二维随机矢量,然后对 MVN 做极大似然估计就行了。看下面这个函数就行。
def hist_to_data_array(hist, xedges, yedges):
"""
recover data from histogram.
Parameters
----------
hist: 2D histogram, shape (9, 12)
xedges: bin edges in x direction
yedges: bin edges in y direction
Returns
-------
data: shape (N, 2), where n = hist.sum()
"""
n = int(hist.sum())
data_array = np.empty(shape=(n, 2), dtype=np.float)
nx = len(xedges) - 1
ny = len(yedges) - 1
k = 0
for i in range(nx):
for j in range(ny):
m = int(hist[i,j])
data_array[k:k+m] = [(xedges[i] + xedges[i+1])/2., (yedges[j] + yedges[j+1])/2.]
k += m
return data_array
有了这个函数之后,剩下的就简单了。
# notice the hist, xe and ye are from np.histogram2d in the first code block
data_recovered = hist_to_data_array(hist, xe, ye)
# estimate the mean vector and cov matrix
mean_fit = np.mean(data_recovered , axis=0)
cov_fit = np.cov(data_recovered , rowvar=0)
# show the result
mvn_fit = multivariate_normal(mean=mean_fit, cov=cov_fit)
density_fit = mvn_fit.pdf(pos)
plt.imshow(density_fit)
plt.title("fitted density")
这样基本就OK了。注意我在模拟数据的过程中,histogram 这一步是有 xe, ye 的。 如果你手上的数据只是一个 histogram, 没有这些 bin edge 的信息,那么在调用函数的时候把xe设为range(0, 10), ye设置成range(0, 13)就行了。就像下面这样。注意10=9+1, 13=12+1因为它们是edge,比bins多1. 当然之后你的坐标系也要相应调整。
xe = np.arange(0, 10)
ye = np.arange(0, 13)
data_recovered = hist_to_data_array(hist, xe , ye)