import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Load the data
df = pd.read_csv('https://query.data.world/s/ei6k5toqscwnarxr2ttfavrx2zodxc')
df.head(5)
| | number | density | sugercontent |
|---|---|---|---|
| 0 | 1 | 0.697 | 0.460 |
| 1 | 2 | 0.774 | 0.376 |
| 2 | 3 | 0.634 | 0.264 |
| 3 | 4 | 0.608 | 0.318 |
| 4 | 5 | 0.556 | 0.215 |
df.plot.scatter(x='density', y='sugercontent')
[Figure: scatter plot of density (x-axis) against sugercontent (y-axis)]
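K-Means Algorithm
Given a sample set, K-Means partitions it into K clusters according to the distances between samples, so that the points within each cluster lie as close together as possible while the clusters themselves are as far apart as possible.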
$$E=\sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_{i}}\left\|\boldsymbol{x}-\boldsymbol{\mu}_{i}\right\|_{2}^{2}$$
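To make the objective concrete, here is a minimal NumPy sketch. It reads $C_i$ as the i-th cluster and $\boldsymbol{\mu}_i$ as the mean of the points in it, and uses the common alternating procedure (assign each point to its nearest center, then recompute each center as its cluster mean) to reduce E. The function names `squared_error` and `kmeans`, the random initialization from data points, and the fixed seed are assumptions made for illustration, not necessarily what this notebook goes on to use. The sample points are the five rows shown by `df.head(5)`.

```python
import numpy as np

def squared_error(X, labels, centers):
    """E = sum over clusters i, over x in C_i, of ||x - mu_i||_2^2."""
    diffs = X - centers[labels]          # each point minus its own cluster center
    return float(np.sum(diffs ** 2))

def assign(X, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative K-Means: random data points as initial centers (assumption)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Recompute each center as its cluster mean; keep the old center
        # if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

# The five (density, sugercontent) rows displayed by df.head(5)
X = np.array([
    [0.697, 0.460],
    [0.774, 0.376],
    [0.634, 0.264],
    [0.608, 0.318],
    [0.556, 0.215],
])

labels, centers = kmeans(X, k=2)
print("labels:", labels)
print("centers:", centers)
print("E:", squared_error(X, labels, centers))
```

With the full data set, the same call would be `kmeans(df[['density', 'sugercontent']].to_numpy(), k)`. The result depends on the initial centers, so different seeds can give different clusterings and different values of E.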