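Given the real-world example you mentioned, I suggest treating the input as a price range rather than a single price. Features can then be grouped so that each group corresponds to a specific price range.
So you can start by clustering the dataset on house price. The Mean-Shift clustering algorithm will also suggest how many clusters can be formed in the data.
Then determine the minimum and maximum house price of each cluster. Next, take the mean of the numerical data and the most frequent value of the categorical data (the features used to predict house price), and state that these values correspond to that price range.
Once the mapping is done, we can see which cluster's price range an input falls into and then retrieve the aggregated parameters described above.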
Code:
import pandas as pd
df = pd.read_csv('housing.csv')
df.drop(['longitude','latitude'], axis=1, inplace=True)
X_train = df['median_house_value']
X_train.head()
import numpy as np
X_train = np.array(X_train)
X_train = np.reshape(X_train,(-1,1))
from sklearn.cluster import MeanShift, estimate_bandwidth
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X_train)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
df['cluster'] = labels
# Split by cluster label; this assumes Mean-Shift found two clusters (labels 0 and 1)
df1 = df[df['cluster'] == 1]
df2 = df[df['cluster'] == 0]
ranges = []
ranges.append([min(df1['median_house_value']),max(df1['median_house_value'])])
ranges.append([min(df2['median_house_value']),max(df2['median_house_value'])])
df1_categorical = 'ocean_proximity'
df1_categorical_set = df1[df1_categorical]
df1 = df1.drop(df1_categorical, axis=1)
df2_categorical_set = df2[df1_categorical]
df2 = df2.drop(df1_categorical, axis=1)
df1_feature = []
for i in df1.columns:
    df1_feature.append(np.mean(df1[i]))
df2_feature = []
for i in df1.columns:
    df2_feature.append(np.mean(df2[i]))
print ("Range : ",ranges[0],"\nFeatures : ",df1_feature,'\n',"Range : ",ranges[1],"\nFeatures : ", df2_feature)
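If you now print df1_feature and df2_feature, you get the mean feature values for the two clusters' price ranges (the ranges are stored in the list ranges, which you can also print). Any house whose price falls in the first range therefore has df1_feature as its ideal feature set, and likewise df2_feature for the second range.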
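The per-cluster summary and the lookup step could also be written so that they work for any number of clusters and include the most frequent categorical value. What follows is only a minimal sketch: the tiny DataFrame is made-up data standing in for housing.csv, and the helper names summarize_clusters and lookup are hypothetical, not part of pandas or of the code above.

```python
import pandas as pd

# Tiny made-up sample standing in for housing.csv (values are illustrative only).
df = pd.DataFrame({
    "median_house_value": [90000, 110000, 120000, 400000, 450000, 480000],
    "median_income": [2.0, 2.5, 3.0, 7.0, 8.0, 9.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY",
                        "NEAR BAY", "NEAR OCEAN", "NEAR OCEAN"],
    "cluster": [0, 0, 0, 1, 1, 1],  # labels as a clustering step would assign them
})

def summarize_clusters(df, price_col="median_house_value", label_col="cluster"):
    """Return one row per cluster: price range, numeric means, categorical modes."""
    feature_cols = [c for c in df.columns if c not in (price_col, label_col)]
    numeric_cols = df[feature_cols].select_dtypes(include="number").columns
    categorical_cols = [c for c in feature_cols if c not in numeric_cols]

    grouped = df.groupby(label_col)
    # Lowest and highest price observed in each cluster.
    summary = grouped[price_col].agg(price_min="min", price_max="max")
    # Mean of every numeric feature within each cluster.
    summary = summary.join(grouped[list(numeric_cols)].mean())
    # Most frequent value of every categorical feature within each cluster.
    for col in categorical_cols:
        summary[col] = grouped[col].agg(lambda s: s.mode().iloc[0])
    return summary

def lookup(summary, price):
    """Return the summary row whose price range contains `price`, or None."""
    match = summary[(summary["price_min"] <= price) & (summary["price_max"] >= price)]
    return match.iloc[0] if not match.empty else None

summary = summarize_clusters(df)
print(summary)
print(lookup(summary, 100000))
```

Note that a price falling in the gap between two observed ranges returns None here; assigning it to the nearest cluster center instead would be one possible alternative.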
If you want more price ranges, you can use k-means, which lets you specify the number of clusters.
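As a rough sketch of that k-means variant, assuming scikit-learn's KMeans and synthetic prices in place of df['median_house_value'] (the value of n_ranges is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic prices standing in for df['median_house_value'] (illustrative only).
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(100_000, 5_000, 50),
    rng.normal(250_000, 5_000, 50),
    rng.normal(450_000, 5_000, 50),
]).reshape(-1, 1)  # scikit-learn expects a 2-D array

n_ranges = 3  # hypothetical choice: the number of price ranges you want
km = KMeans(n_clusters=n_ranges, n_init=10, random_state=0)
labels = km.fit_predict(prices)

# Min and max price inside each cluster, sorted from cheapest to most expensive.
ranges = sorted(
    [float(prices[labels == k].min()), float(prices[labels == k].max())]
    for k in range(n_ranges)
)
for low, high in ranges:
    print(f"Range: {low:,.0f} to {high:,.0f}")
```

Unlike Mean-Shift, k-means will not suggest the number of clusters, so n_ranges has to be picked by you.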