Binning data in NumPy and computing the mean within each bin
import numpy as np
data = np.arange(100)
bins = np.linspace(0, 50, 10)
bins = np.append(bins, np.inf)  # the last bin extends to infinity
digitized = np.digitize(data, bins)  # index of the bin that each value in data belongs to
# Method 1: mean within each bin via a boolean mask
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
# Method 2: sum per bin divided by count per bin
bin_means1 = (np.histogram(data, bins, weights=data)[0] /
              np.histogram(data, bins)[0])
# https://stackoverflow.com/questions/6163334/binning-data-in-python-with-scipy-numpy
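As a quick sanity check, the sketch below runs both methods on the same data and compares the results. The helper name bin_means_two_ways is hypothetical and used only for illustration. One caveat: np.histogram treats its last bin as closed on the right, while np.digitize uses half-open intervals throughout. Here the last edge is np.inf, so the two agree.

```python
import numpy as np

def bin_means_two_ways(data, bins):
    """Return per-bin means computed via digitize and via histogram."""
    digitized = np.digitize(data, bins)
    # Method 1: select the values of each bin with a boolean mask
    m1 = np.array([data[digitized == i].mean() for i in range(1, len(bins))])
    # Method 2: sum of values per bin divided by number of values per bin
    sums = np.histogram(data, bins, weights=data)[0]
    counts = np.histogram(data, bins)[0]
    m2 = sums / counts
    return m1, m2

data = np.arange(100)
bins = np.append(np.linspace(0, 50, 10), np.inf)
m1, m2 = bin_means_two_ways(data, bins)
print(np.allclose(m1, m2))
```

Both methods assume every bin contains at least one value. An empty bin would produce nan (with a runtime warning) in either approach.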
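If values in data fall outside the edges of bins, numpy.digitize(data, bins) behaves as if an extra bin were added beyond each edge: values below bins[0] get index 0, and values at or above bins[-1] get index len(bins). For example: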
data = np.array([-1, 0.5, 1.5, 2.5, 3.5, 4.5, 5, 6])
bins = np.linspace(0, 5, 6)
print(bins)
di = np.digitize(data, bins)
dt = np.c_[data, di]
print(dt)
'''
[0. 1. 2. 3. 4. 5.]
[[-1.   0. ]
 [ 0.5  1. ]
 [ 1.5  2. ]
 [ 2.5  3. ]
 [ 3.5  4. ]
 [ 4.5  5. ]
 [ 5.   6. ]
 [ 6.   6. ]]
'''
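If the out-of-range indices are unwanted, one possible approach (a minimal sketch, not the only option) is to mask them out. The right parameter of np.digitize switches the intervals from [a, b) to (a, b], which moves the value 5 into the last regular bin. The variable names in_range and idx_right are illustrative.

```python
import numpy as np

data = np.array([-1, 0.5, 1.5, 2.5, 3.5, 4.5, 5, 6])
bins = np.linspace(0, 5, 6)          # edges 0, 1, 2, 3, 4, 5
idx = np.digitize(data, bins)

# Index 0 means "below bins[0]"; index len(bins) means "at or above bins[-1]".
in_range = (idx > 0) & (idx < len(bins))
print(data[in_range])                # values that fall inside [0, 5)

# With right=True the intervals become (a, b], so 5 lands in the last regular bin.
idx_right = np.digitize(data, bins, right=True)
print(idx_right)
```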
An explanation of Method 2:
numpy.histogram(a, bins=10, range=None, density=None, weights=None)
- Returns
– hist: array
The values of the histogram. See density and weights for a description of the possible semantics.
– bin_edges: array of dtype float
Return the bin edges (length(hist)+1).
- Parameters
– weights: array_like, optional
An array of weights, of the same shape as a. Each value in a only contributes its associated weight towards the bin count (instead of 1).
As an example of how the mean is computed here, suppose one bin contains [1, 2, 3, 4]. Then:
numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0] = (1*1 + 2*1 + 3*1 + 4*1) / 4 = 2.5
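The arithmetic above can be reproduced directly. The sketch below assumes a single bin with edges [1, 5], chosen only so that all four values fall into it.

```python
import numpy as np

data = np.array([1, 2, 3, 4])
bins = np.array([1, 5])              # a single bin covering all four values

# With weights=data, each value adds itself to the bin: 1 + 2 + 3 + 4 = 10
sums = np.histogram(data, bins, weights=data)[0]
# Without weights, each value adds 1 to the bin: 4 values
counts = np.histogram(data, bins)[0]
print(sums, counts, sums / counts)
```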
Binning with pandas
import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.rand(10, 1), columns=['A'])
a['A_cat'] = pd.cut(a['A'], bins=np.linspace(0, 1, 5), labels=[1, 2, 3, 4])
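Note that labels must have one fewer element than bins: n bin edges define n-1 intervals, so the 5 edges above take 4 labels.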
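To connect this back to the per-bin mean, the following sketch groups by the category produced by pd.cut. The fixed seed and the variable name edges are assumptions made for reproducibility and readability. It also shows that pandas rejects a label list of the wrong length.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # fixed seed, chosen only for reproducibility
a = pd.DataFrame(rng.random((10, 1)), columns=['A'])

edges = np.linspace(0, 1, 5)         # 5 edges -> 4 intervals
a['A_cat'] = pd.cut(a['A'], bins=edges, labels=[1, 2, 3, 4])

# Mean of A within each bin; observed=True skips categories with no rows.
print(a.groupby('A_cat', observed=True)['A'].mean())

# A label list whose length does not match the number of intervals is rejected.
try:
    pd.cut(a['A'], bins=edges, labels=[1, 2, 3, 4, 5])
except ValueError as exc:
    print('ValueError:', exc)
```

By default pd.cut uses intervals that are open on the left and closed on the right, (a, b], which is the opposite of the default in np.digitize.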
References:
- Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow
- https://stackoverflow.com/questions/6163334/binning-data-in-python-with-scipy-numpy