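I have a large amount of data on which I want to run k-means clustering. The dataset is too large for me to load the files into memory.
My idea is to fit the clustering on part of the dataset (as a training set) and then apply the resulting classification to the rest of the dataset piece by piece.
import pandas as pd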
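import pickle
from sklearn.cluster import KMeans
# Load every input file and stack the frames into one DataFrame
frames = [pd.read_hdf(fin) for fin in ifiles]
data = pd.concat(frames, ignore_index=True, axis=0)
data.dropna(inplace=True)
k = 12
# pd.concat expects a list of objects as its first argument
x = pd.concat([data['A'], data['B'], data['C']], axis=1, keys=['A','B','C'])
# n_jobs is only accepted by older scikit-learn releases (it was removed in 1.0)
model = KMeans(n_clusters=k, random_state=0, n_jobs=-2)
model.fit(x)
# Save the fitted model so it can be reused on the remaining data
with open(filename, 'wb') as f:
    pickle.dump(model, f)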
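The second half of the idea, applying the fitted model to the remaining data piece by piece, might look roughly like the sketch below. It is only an illustration under stated assumptions: `predict_in_chunks` is a hypothetical helper, and the synthetic arrays stand in for the real columns A, B and C.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans


def predict_in_chunks(model, chunks):
    """Yield one array of cluster labels per chunk (hypothetical helper).

    Only one chunk has to be held in memory at a time.
    """
    for chunk in chunks:
        yield model.predict(chunk)


# Synthetic stand-in for the real data: three columns, like A, B and C.
rng = np.random.default_rng(0)
train = rng.normal(size=(600, 3))
rest = [rng.normal(size=(200, 3)) for _ in range(4)]  # e.g. one chunk per file

# Fit on the training subset only, then round-trip through pickle
# to mimic saving the model and loading it again in a later run.
model = KMeans(n_clusters=12, random_state=0, n_init=10).fit(train)
restored = pickle.loads(pickle.dumps(model))

# Label the remaining data chunk by chunk.
labels = list(predict_in_chunks(restored, rest))
```

With the real files, each chunk could presumably be produced lazily, for example with a generator that calls `pd.read_hdf(fin)` for one file at a time and selects the columns A, B and C, so that no more than one file is loaded at once.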
x (the frame passed to model.fit) looks like this:
[sample of x not shown]
The model looks like this:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=12, n_init=10, n_jobs=-2, precompute_distances='auto',