k-means作为一种聚类算法,是2006年评出的10大数据挖掘算法之一,
应用广泛。k-means这个名字恰到好处的说明了算法的工作原理。
该算法的目的就是要在n个对象中归纳出k种分类,而这是通过这样
一种聚类迭代过程实现的。
(1)输入n个对象,并指定其中的k个为初始聚类中心;
(2)分别计算n个对象和k个中心的相似度;
(3)任意对象都有一个和其相似度最大的中心,把该对象
规为该中心。属于同一个中心点的对象,聚为一类;
(4)分别计算出每一类对象的均值(means),并把这个值
作为新的聚类中心;
(5)满足2个条件之一认为算法结束:1、分别计算每个类的
新中心和旧中心的相似度,当相似度超过给定值时,
认为聚类完成。2、迭代次数超过给定值。否则执行(2);
下面的代码演示了k-means算法,相似度最初想采用欧几里得
距离,但是效果不好。改用了文章最后参考中提到的相似度算法。
kmeans_cluster类有两个工作模式:学习模式和匹配模式。
学习模式对样本空间进行聚类,以便得到k个聚类中心的坐标,半径,密度;
聚类中心根据密度进行排序,密度越大越容易命中。
匹配模式根据学习模式中得到的聚类中心对未知数据进行分类。
main函数是一个简单的测试。
kmeans.h:
=====================================================
// 2012年 04月 18日 星期三 09:02:20 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#ifndef KMEANS_H
#define KMEANS_H
#include <vector>
using std::vector;
struct cluster_vec;
struct cluster_point;
struct cluster_center;
class kmeans_cluster {
public:
enum work_mode { LEARN, MATCH };
kmeans_cluster(int, work_mode);
~kmeans_cluster();
public:
// for match
int push_center(const double *, double, int); // push center according to dense order
int match(const double *); //can't match return -1
int load_centers(const char *); // from csv file
// for learn
int push_point(const double *, bool = false);
int load_points(const char *);
int cluster();
// for match and learn
// to csv file
int dump_center(const char *); // after cluster()
int dump_point(const char *);
private:
void center_sort();
int vec_2_str(const struct cluster_vec *, char *, int);
int str_2_vec(char *, struct cluster_vec *, char **);
int ct_2_str(const struct cluster_center *, char *, int);
struct cluster_center *str_2_center(char *);
void dump_info(FILE *fp);
int load_info(FILE *fp);
//double get_max_distance();
void recal_clusters();
void cluster_iterate();
bool recal_centers();
//void recal_radius(struct cluster_center *);
void get_center(const vector<cluster_point> &, double *);
void rm_centers();
void rm_points(vector<cluster_point> &pts);
//for test
void print_center(struct cluster_center *);
private:
int dimension;
//bool sorted;
work_mode mode;
//double hit_radius;
vector<cluster_center *> centers;
vector<cluster_point> points;
};
#endif
==========================================================
kmeans.cpp:
==========================================================
// 2012年 04月 18日 星期三 09:22:37 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#include <iostream>
#include <algorithm>
#include <limits>
// <cfloat>
// <climits>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <cmath>
#include "correlation.h"
#include "kmeans.h"
using namespace std;
#define BUF_SIZE 1024
#define MAX_ITERATE 10000
struct cluster_vec {
int reference;
//int dimension;
double vec[0];
};
// center point can't involve each other
struct cluster_point {
//int reference;
//bool is_center;
struct cluster_vec *vec;
};
struct cluster_center {
// XXX use points.size();
int dense;
double radius;
struct cluster_vec *vec;
vector<cluster_point> points;
};
bool center_dense_cmp(struct cluster_center *c1,
struct cluster_center *c2)
{
return c1->dense > c2->dense;
}
static inline struct cluster_vec *alloc_vector(int,
const double *);
static inline struct cluster_vec *get_vector(
struct cluster_vec *);
static inline struct cluster_vec *put_vector(
struct cluster_vec *);
static inline void vector_addto(double *, const double *, int);
/
static inline struct cluster_center *alloc_center(
struct cluster_vec *, double, int);
static inline void free_center(
struct cluster_center *);
static inline void rm_points(vector<cluster_point> &);
// radius = 1 - correlation
static inline void recal_radius(struct cluster_center *, int);
//
kmeans_cluster::kmeans_cluster(int d, work_mode wm)
:dimension(d), mode(wm)
{
}
kmeans_cluster::~kmeans_cluster()
{
rm_centers();
rm_points(points);
}
// XXX sort by dense
int kmeans_cluster::push_center(const double *p, double r, int d)
{
if(mode == MATCH) {
struct cluster_vec *v = alloc_vector(dimension, p);
struct cluster_center *c = alloc_center(v, r, d);
centers.push_back(c);
return centers.size();
}
return -1;
}
int kmeans_cluster::push_point(const double *p, bool is_c)
{
if(mode == LEARN) {
struct cluster_point pt = { alloc_vector(dimension, p) };
points.push_back(pt);
if(is_c) {
get_vector(pt.vec);
struct cluster_center *c
= alloc_center(pt.vec, 0, dimension);
centers.push_back(c);
}
return points.size();
}
return -1;
}
int kmeans_cluster::dump_point(const char *fn)
{
FILE *fp;
if(!(fp = fopen(fn, "a"))) {
perror(fn);
return -1;
}
dump_info(fp);
char buf[BUF_SIZE];
for(vector<cluster_point>::iterator pt = points.begin();
pt != points.end(); ++pt) {
vec_2_str(pt->vec, buf, BUF_SIZE);
fputs(buf, fp);
}
fclose(fp);
return 0;
}
int kmeans_cluster::vec_2_str(const struct cluster_vec *v,
char *buf, int s)
{
int len;
char *p = buf;
for(int i = 0; s > 0 && i < dimension; ++i) {
len = snprintf(p, s, "%.20lf,", v->vec[i]);
p += len;
s -= len;
}
--p;
if(s >= 2) {
snprintf(p, s, "\n");
return 0;
}
return -1;
}
void kmeans_cluster::dump_info(FILE *fp)
{
fprintf(fp, "%d\n", dimension);
}
int kmeans_cluster::dump_center(const char *fn)
{
FILE *fp;
if(!(fp = fopen(fn, "a"))) {
perror(fn);
return -1;
}
dump_info(fp);
char buf[BUF_SIZE];
for(vector<cluster_center *>::iterator p = centers.begin();
p != centers.end(); ++p) {
ct_2_str((*p), buf, BUF_SIZE);
fputs(buf, fp);
}
fclose(fp);
return 0;
}
int kmeans_cluster::ct_2_str(
const struct cluster_center *ct, char *buf, int s)
{
char *p = buf;
int len = snprintf(p, s, "%.20lf,", ct->radius);
p += len;
s -= len;
len = snprintf(p, s, "%d,", ct->dense);
p += len;
s -= len;
if(s > 0) return vec_2_str(ct->vec, p, s);
return -1;
}
// XXX sort by dense
int kmeans_cluster::load_centers(const char *fn)
{
if(mode == MATCH) {
FILE *fp;
if(!(fp = fopen(fn, "r"))) {
perror(fn);
return -1;
}
if(load_info(fp)) {
fclose(fp);
return -1;
}
char buf[BUF_SIZE];
while(fgets(buf, BUF_SIZE, fp)) {
struct cluster_center *p;
if((p = str_2_center(buf)))
centers.push_back(p);
}
fclose(fp);
return centers.size();
}
return -1;
}
struct cluster_center *kmeans_cluster::str_2_center(char *buf)
{
struct cluster_vec *v = alloc_vector(dimension, 0);
struct cluster_center *c = alloc_center(v, 0, 0);
char *p;
char *sp = 0;
if(!(p = strtok_r(buf, ",", &sp)))
goto transform_error;
c->radius = strtod(p, 0);
if(!(p = strtok_r(0, ",", &sp)))
goto transform_error;
c->dense = atoi(p);
// XXX
if(str_2_vec(0, v, &sp) < 0) {
fprintf(stderr, "format error\n");
goto transform_error;
}
/*for(int i = 0; i < dimension; ++i) {
if(!(p = strtok(0, ",")))
goto transform_error;
v->vec[i] = strtod(p, 0);
}*/
return c;
transform_error:
//put_vector(v);
free_center(c);
return 0;
}
int kmeans_cluster::load_info(FILE *fp)
{
char buf[128];
if(fgets(buf, sizeof(buf), fp)) {
dimension = atoi(buf);
return 0;
}
return -1;
}
int kmeans_cluster::str_2_vec(char *buf, struct cluster_vec *vec, char **sp)
{
char *p;
if(!(p = strtok_r(buf, ",", sp)))
return -1;
int i = 0;
do {
vec->vec[i++] = strtod(p, 0);
} while((p = strtok_r(0, ",", sp)));
if(i != dimension)
return -1;
return 0;
}
int kmeans_cluster::load_points(const char *fn)
{
FILE *fp;
if(!(fp = fopen(fn, "r"))) {
perror(fn);
return -1;
}
char buf[2048];
struct cluster_point pt = { alloc_vector(dimension, 0) };
while(fgets(buf, sizeof(buf), fp)) {
char *p;
int c = strtol(buf, &p, 10);
char *sp = 0;
if(str_2_vec(++p, pt.vec, &sp) < 0) {
fprintf(stderr, "format error\n");
continue;
}
push_point(pt.vec->vec, c);
}
fclose(fp);
put_vector(pt.vec);
return 0;
}
int kmeans_cluster::match(const double *v)
{
// sort according to dense;
double dis = numeric_limits<double>::max();
int idx = -1;
const int len = (int)centers.size();
for(int i = 0; i < len; ++i) {
/*double tmp = 1 - cal_correlation(
centers.at(i)->vec->vec, v, dimension);*/
/*double tmp
= cal_distance(centers.at(i)->vec->vec, v, dimension);*/
double tmp = 1 - cal_correlation(
centers.at(i)->vec->vec, v, dimension);
if(tmp <= (centers.at(i)->radius + DOUBLE_ZERO)
&& tmp < dis) {
idx = i;
dis = tmp;
}
}
return idx;
}
int kmeans_cluster::cluster()
{
if(mode == LEARN && points.size() > centers.size()) {
cluster_iterate();
// according to dense;
sort(centers.begin(), centers.end(), center_dense_cmp);
return 0;
}
return -1;
}
/*double kmeans_cluster::get_max_distance()
{
vector<cluster_point>::iterator m1 = points.end(),
m2 = points.end();
double min = 1;
for(vector<cluster_point>::iterator p1 = points.begin();
p1 != points.end(); ++p1) {
for(vector<cluster_point>::iterator p2 = p1 + 1;
p2 != points.end(); ++p2) {
double tmp
= cal_correlation(p1->vec->vec, p2->vec->vec, dimension);
if(tmp < min) {
min = tmp;
m1 = p1;
m2 = p2;
}
}
}
if(min != 1) {
struct cluster_center *c1 =
alloc_center(get_vector(m1->vec), 0, dimension);
struct cluster_center *c2 =
alloc_center(get_vector(m2->vec), 0, dimension);
centers.push_back(c1);
centers.push_back(c2);
}
return min;
}*/
void kmeans_cluster::recal_clusters()
{
vector<cluster_center *>::iterator minp = centers.end();
for(vector<cluster_point>::iterator p
= points.begin();
p != points.end();
++p) {
//double max = -1;
double max = numeric_limits<double>::max();
const double *vec1 = p->vec->vec;
for(vector<cluster_center *>::iterator cp
= centers.begin(); cp != centers.end();
++cp) {
/*double tmp = cal_correlation((*cp)->vec->vec,
vec1, dimension);*/
/*double tmp = cal_distance((*cp)->vec->vec,
vec1, dimension);*/
double tmp = 1 - cal_correlation((*cp)->vec->vec,
vec1, dimension);
/*if(tmp > max) {
max = tmp;
minp = cp;
}*/
if(tmp < max) {
max = tmp;
minp = cp;
}
}
get_vector(p->vec);
(*minp)->points.push_back(*p);
}
}
void kmeans_cluster::cluster_iterate()
{
int i = 0;
do {
recal_clusters();
} while(++i < MAX_ITERATE && recal_centers());
}
bool kmeans_cluster::recal_centers()
{
double *tmp = new double[dimension];
assert(tmp);
bool ch = false;
for(vector<cluster_center *>::iterator cp
= centers.begin(); cp != centers.end();
++cp) {
struct cluster_center *ct = *cp;
get_center(ct->points, tmp);
/*if(cal_correlation(ct->vec->vec, tmp, dimension)
< 1 - DOUBLE_ZERO) {*/
//if(cal_distance(ct->vec->vec, tmp, dimension)
if(1 - cal_correlation(ct->vec->vec, tmp, dimension)
> DOUBLE_ZERO) {
ch = true;
//FIXME
put_vector(ct->vec);
ct->vec = alloc_vector(dimension, tmp);
recal_radius(ct, dimension);
ct->dense = ct->points.size();
print_center(ct);
}
rm_points(ct->points);
}
delete [] tmp;
return ch;
}
void kmeans_cluster::get_center(
const vector<cluster_point> &pts, double *vec)
{
memset(vec, 0, sizeof(double) * dimension);
for(vector<cluster_point>::const_iterator p = pts.begin();
p != pts.end(); ++p)
vector_addto(vec, p->vec->vec, dimension);
const int s = pts.size();
assert(s);
//if(s) {
for(int i = 0; i < dimension; ++i)
vec[i] /= s;
//}
}
void kmeans_cluster::rm_centers()
{
for(vector<cluster_center *>::iterator p = centers.begin();
p != centers.end(); ++p)
free_center(*p);
centers.clear();
}
void kmeans_cluster::rm_points(vector<cluster_point> &pts)
{
::rm_points(pts);
}
// for test
void kmeans_cluster::print_center(struct cluster_center *ct)
{
cout << "XXX:\n";
cout << "center corrd:\n";
for(int i = 0; i < dimension; ++i)
cout << ct->vec->vec[i] << '\t';
cout << endl;
cout << "center have " << ct->points.size()
<< " points." << endl;
cout << "center radius is: " << ct->radius << endl;
cout << "points is:\n";
vector<cluster_point> &pts = ct->points;
for(vector<cluster_point>::iterator pt
= pts.begin(); pt != pts.end(); ++pt) {
for(int i = 0; i < dimension; ++i)
cout << pt->vec->vec[i] << '\t';
cout << endl;
}
cout << "++++++++++++++++++++++++++++++++++++++++++++\n";
}
/
static inline struct cluster_vec *alloc_vector(
int d, const double *v)
{
struct cluster_vec *p = (struct cluster_vec *)malloc(
sizeof(struct cluster_vec) + sizeof(double) * d);
assert(p);
p->reference = 0;
if(v) {
for(int i = 0; i < d; ++i)
p->vec[i] = v[i];
}
return get_vector(p);
}
static inline struct cluster_vec *get_vector(
struct cluster_vec *p)
{
return (++p->reference, p);
}
static inline struct cluster_vec *put_vector(
struct cluster_vec *p)
{
if(!--p->reference) {
free(p);
return 0;
}
return p;
}
static inline void vector_addto(double *sum, const double *add, int d)
{
for(int i = 0; i < d; ++i)
sum[i] += add[i];
}
//
static inline struct cluster_center *alloc_center(
struct cluster_vec *v, double r, int d)
{
struct cluster_center *c = new struct cluster_center;
assert(c);
c->vec = v;
c->radius = r;
c->dense = d;
return c;
}
static inline void free_center(struct cluster_center *c)
{
put_vector(c->vec);
/*vector<cluster_point> &pts = c->points;
for(vector<cluster_point>::iterator p
= pts.begin(); p != pts.end(); ++p)
put_vector(p->vec);
pts.clear();*/
rm_points(c->points);
delete c;
}
static inline void rm_points(vector<cluster_point> &pts)
{
for(vector<cluster_point>::iterator p = pts.begin();
p != pts.end(); ++p)
put_vector(p->vec);
pts.clear();
}
static void recal_radius(struct cluster_center *ct, int d)
{
double min = -1;
vector<cluster_point> &pts = ct->points;
double *ct_v = ct->vec->vec;
for(vector<cluster_point>::iterator p
= pts.begin(); p != pts.end(); ++p) {
/*double tmp = cal_correlation(p->vec->vec, ct_v, d);
if(tmp < min)
min = tmp;*/
//double tmp = cal_distance(p->vec->vec, ct_v, d);
double tmp = 1 - cal_correlation(p->vec->vec, ct_v, d);
if(tmp > min)
min = tmp;
}
ct->radius = min;
}
=============================================================
main.cpp:
=============================================================
#include <iostream>
#include <cstdlib>
#include <unistd.h>
//#include "fft.h"
#include "correlation.h"
#include "kmeans.h"
using namespace std;
void tst_cluster();
int main()
{
//tst_fft();
//tst_correlation();
tst_cluster();
return 0;
}
void tst_cluster()
{
kmeans_cluster kmeans(8, kmeans_cluster::LEARN);
double d[8] = {1, 2, 3, 4, 5, 6, 7, 20};
kmeans.push_point(d, true);
d[7] = 8;
for(int i = 0; i < 10; ++i) {
++d[7];
kmeans.push_point(d);
}
///
for(int i = 0, j = 0; j < 8; i += 10, ++j)
d[j] = i;
kmeans.push_point(d);
for(int i = 0; i < 7; ++i) {
d[i] += 7;
kmeans.push_point(d);
d[i] -= 7;
}
d[7] += 100;
kmeans.push_point(d, true);
for(int i = 0, j = 0; j < 8; i += 10, ++j)
d[j] = 100 - i;
d[3] = 7;
kmeans.push_point(d, true);
d[3] = 30;
for(int i = 0; i < 8; ++i) {
d[i] -= 7;
kmeans.push_point(d);
d[i] += 7;
}
kmeans.cluster();
unlink("center.txt");
unlink("point.txt");
kmeans.dump_center("center.txt");
kmeans.dump_point("point.txt");
kmeans_cluster match(0, kmeans_cluster::MATCH);
match.load_centers("center.txt");
/*kmeans_cluster match(8, kmeans_cluster::MATCH);
double c1[] = {
1.00000000000000000000,2.00000000000000000000,3.00000000000000000000,4.00000000000000000000,5.00000000000000000000,6.00000000000000000000,7.00000000000000000000,13.00000000000000000000};
double c2[] = {
0.77777777777777779011,10.77777777777777856727,20.77777777777777856727,30.77777777777777856727,40.77777777777777856727,50.77777777777777856727,60.77777777777777856727,70.77777777777777146184};
double c3[] = {
99.22222222222222853816,89.22222222222222853816,79.22222222222222853816,69.22222222222222853816,59.22222222222222143273,49.22222222222222143273,39.22222222222222143273,29.22222222222222143273};
match.push_center(c1, 5.0, 11);
match.push_center(c2, 6.55367204580384132839, 9);
match.push_center(c3, 6.55367204580383777568, 9);*/
for(;;) {
int i = 0;
for(; i < 8 && cin >> d[i]; ++i);
// must be normal return, can't exit(0) for valgrind
if(i != 8) break;
int ret;
ret = match.match(d);
cout << "result is: " << ret << endl;
}
}
=============================================================
correlation.h:
=============================================================
// 2012年 04月 16日 星期一 09:29:30 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#ifndef CORRELATION_H
#define CORRELATION_H
#define DOUBLE_ZERO 1e-15
double cal_correlation(const double *, const double *, int);
double cal_distance(const double *, const double *, int);
#endif
=============================================================
correlation.cpp:
=============================================================
// 2012年 04月 16日 星期一 09:32:09 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#include <cmath>
#include "correlation.h"
static void cal_average(const double *a, const double *b,
int, double *, double *);
static double cal_solution(double, double, double, double);
//#include <iostream>
//using namespace std;
double cal_correlation(const double *a, const double *b, int d)
{
double av_a;
double av_b;
cal_average(a, b, d, &av_a, &av_b);
if(av_a * av_b == 0.0) {
if(av_a == 0.0 && av_b == 0.0)
return 1;
else
return 0;
}
double num = 0;
double den1 = 0, den2 = 0;
double t1, t2;
for(int i = 0; i < d; ++i) {
//t1 = a[i] + (i + 1) - av_a;
//t2 = b[i] + (i + 1) - av_b;
t1 = a[i] - av_a;
t2 = b[i] - av_b;
num += t1 * t2;
den1 += t1 * t1;
den2 += t2 * t2;
}
return cal_solution(num, sqrt(den1 * den2), av_a, av_b);
}
static void cal_average(const double *a, const double *b,
int d, double *av_a, double *av_b)
{
*av_a = *av_b = 0;
for(int i = 0; i < d; ++i) {
// *av_a += a[i] + (i + 1);
// *av_b += b[i] + (i + 1);
*av_a += a[i];
*av_b += b[i];
}
*av_a /= d;
*av_b /= d;
if(*av_a < DOUBLE_ZERO) *av_a = 0;
if(*av_b < DOUBLE_ZERO) *av_b = 0;
}
static double cal_solution(double num, double den, double av_a, double av_b)
{
double ret;
num = fabs(num);
if(den < DOUBLE_ZERO || num == den) {
if(av_a != av_b)
ret = 0;
else
ret = 1;
} else {
ret = num / den;
}
if(ret > 1) ret = 1;
else if(ret < DOUBLE_ZERO) ret = 0;
return ret;
}
double cal_distance(const double *a, const double *b, int d)
{
double sum = 0;
for(int i = 0; i < d; ++i) {
double tmp = *a - *b;
sum += tmp * tmp;
++a;
++b;
}
return sqrt(sum);
}
=========================================================
编译:
g++ -g -W -Wall -Wextra -o mytest correlation.cpp main.cpp kmeans.cpp
执行:
./mytest
参考:
http://en.wikipedia.org/wiki/K-means_algorithm
http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/correlate/
应用广泛。k-means这个名字恰到好处的说明了算法的工作原理。
该算法的目的就是要在n个对象中归纳出k种分类,而这是通过这样
一种聚类迭代过程实现的。
(1)输入n个对象,并指定其中的k个为初始聚类中心;
(2)分别计算n个对象和k个中心的相似度;
(3)任意对象都有一个和其相似度最大的中心,把该对象
规为该中心。属于同一个中心点的对象,聚为一类;
(4)分别计算出每一类对象的均值(means),并把这个值
作为新的聚类中心;
(5)满足2个条件之一认为算法结束:1、分别计算每个类的
新中心和旧中心的相似度,当相似度超过给定值时,
认为聚类完成。2、迭代次数超过给定值。否则执行(2);
下面的代码演示了k-means算法,相似度最初想采用欧几里得
距离,但是效果不好。改用了文章最后参考中提到的相似度算法。
kmeans_cluster类有两个工作模式:学习模式和匹配模式。
学习模式对样本空间进行聚类,以便得到k个聚类中心的坐标,半径,密度;
聚类中心根据密度进行排序,密度越大越容易命中。
匹配模式根据学习模式中得到的聚类中心对未知数据进行分类。
main函数是一个简单的测试。
kmeans.h:
=====================================================
// 2012年 04月 18日 星期三 09:02:20 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#ifndef KMEANS_H
#define KMEANS_H
#include <vector>
using std::vector;
struct cluster_vec;
struct cluster_point;
struct cluster_center;
class kmeans_cluster {
public:
enum work_mode { LEARN, MATCH };
kmeans_cluster(int, work_mode);
~kmeans_cluster();
public:
// for match
int push_center(const double *, double, int); // push center according to dense order
int match(const double *); //can't match return -1
int load_centers(const char *); // from csv file
// for learn
int push_point(const double *, bool = false);
int load_points(const char *);
int cluster();
// for match and learn
// to csv file
int dump_center(const char *); // after cluster()
int dump_point(const char *);
private:
void center_sort();
int vec_2_str(const struct cluster_vec *, char *, int);
int str_2_vec(char *, struct cluster_vec *, char **);
int ct_2_str(const struct cluster_center *, char *, int);
struct cluster_center *str_2_center(char *);
void dump_info(FILE *fp);
int load_info(FILE *fp);
//double get_max_distance();
void recal_clusters();
void cluster_iterate();
bool recal_centers();
//void recal_radius(struct cluster_center *);
void get_center(const vector<cluster_point> &, double *);
void rm_centers();
void rm_points(vector<cluster_point> &pts);
//for test
void print_center(struct cluster_center *);
private:
int dimension;
//bool sorted;
work_mode mode;
//double hit_radius;
vector<cluster_center *> centers;
vector<cluster_point> points;
};
#endif
==========================================================
kmeans.cpp:
==========================================================
// 2012年 04月 18日 星期三 09:22:37 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#include <iostream>
#include <algorithm>
#include <limits>
// <cfloat>
// <climits>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <cmath>
#include "correlation.h"
#include "kmeans.h"
using namespace std;
#define BUF_SIZE 1024
#define MAX_ITERATE 10000
struct cluster_vec {
int reference;
//int dimension;
double vec[0];
};
// center point can't involve each other
struct cluster_point {
//int reference;
//bool is_center;
struct cluster_vec *vec;
};
struct cluster_center {
// XXX use points.size();
int dense;
double radius;
struct cluster_vec *vec;
vector<cluster_point> points;
};
bool center_dense_cmp(struct cluster_center *c1,
struct cluster_center *c2)
{
return c1->dense > c2->dense;
}
static inline struct cluster_vec *alloc_vector(int,
const double *);
static inline struct cluster_vec *get_vector(
struct cluster_vec *);
static inline struct cluster_vec *put_vector(
struct cluster_vec *);
static inline void vector_addto(double *, const double *, int);
/
static inline struct cluster_center *alloc_center(
struct cluster_vec *, double, int);
static inline void free_center(
struct cluster_center *);
static inline void rm_points(vector<cluster_point> &);
// radius = 1 - correlation
static inline void recal_radius(struct cluster_center *, int);
//
kmeans_cluster::kmeans_cluster(int d, work_mode wm)
:dimension(d), mode(wm)
{
}
kmeans_cluster::~kmeans_cluster()
{
rm_centers();
rm_points(points);
}
// XXX sort by dense
int kmeans_cluster::push_center(const double *p, double r, int d)
{
if(mode == MATCH) {
struct cluster_vec *v = alloc_vector(dimension, p);
struct cluster_center *c = alloc_center(v, r, d);
centers.push_back(c);
return centers.size();
}
return -1;
}
int kmeans_cluster::push_point(const double *p, bool is_c)
{
if(mode == LEARN) {
struct cluster_point pt = { alloc_vector(dimension, p) };
points.push_back(pt);
if(is_c) {
get_vector(pt.vec);
struct cluster_center *c
= alloc_center(pt.vec, 0, dimension);
centers.push_back(c);
}
return points.size();
}
return -1;
}
int kmeans_cluster::dump_point(const char *fn)
{
FILE *fp;
if(!(fp = fopen(fn, "a"))) {
perror(fn);
return -1;
}
dump_info(fp);
char buf[BUF_SIZE];
for(vector<cluster_point>::iterator pt = points.begin();
pt != points.end(); ++pt) {
vec_2_str(pt->vec, buf, BUF_SIZE);
fputs(buf, fp);
}
fclose(fp);
return 0;
}
int kmeans_cluster::vec_2_str(const struct cluster_vec *v,
char *buf, int s)
{
int len;
char *p = buf;
for(int i = 0; s > 0 && i < dimension; ++i) {
len = snprintf(p, s, "%.20lf,", v->vec[i]);
p += len;
s -= len;
}
--p;
if(s >= 2) {
snprintf(p, s, "\n");
return 0;
}
return -1;
}
void kmeans_cluster::dump_info(FILE *fp)
{
fprintf(fp, "%d\n", dimension);
}
int kmeans_cluster::dump_center(const char *fn)
{
FILE *fp;
if(!(fp = fopen(fn, "a"))) {
perror(fn);
return -1;
}
dump_info(fp);
char buf[BUF_SIZE];
for(vector<cluster_center *>::iterator p = centers.begin();
p != centers.end(); ++p) {
ct_2_str((*p), buf, BUF_SIZE);
fputs(buf, fp);
}
fclose(fp);
return 0;
}
int kmeans_cluster::ct_2_str(
const struct cluster_center *ct, char *buf, int s)
{
char *p = buf;
int len = snprintf(p, s, "%.20lf,", ct->radius);
p += len;
s -= len;
len = snprintf(p, s, "%d,", ct->dense);
p += len;
s -= len;
if(s > 0) return vec_2_str(ct->vec, p, s);
return -1;
}
// XXX sort by dense
int kmeans_cluster::load_centers(const char *fn)
{
if(mode == MATCH) {
FILE *fp;
if(!(fp = fopen(fn, "r"))) {
perror(fn);
return -1;
}
if(load_info(fp)) {
fclose(fp);
return -1;
}
char buf[BUF_SIZE];
while(fgets(buf, BUF_SIZE, fp)) {
struct cluster_center *p;
if((p = str_2_center(buf)))
centers.push_back(p);
}
fclose(fp);
return centers.size();
}
return -1;
}
struct cluster_center *kmeans_cluster::str_2_center(char *buf)
{
struct cluster_vec *v = alloc_vector(dimension, 0);
struct cluster_center *c = alloc_center(v, 0, 0);
char *p;
char *sp = 0;
if(!(p = strtok_r(buf, ",", &sp)))
goto transform_error;
c->radius = strtod(p, 0);
if(!(p = strtok_r(0, ",", &sp)))
goto transform_error;
c->dense = atoi(p);
// XXX
if(str_2_vec(0, v, &sp) < 0) {
fprintf(stderr, "format error\n");
goto transform_error;
}
/*for(int i = 0; i < dimension; ++i) {
if(!(p = strtok(0, ",")))
goto transform_error;
v->vec[i] = strtod(p, 0);
}*/
return c;
transform_error:
//put_vector(v);
free_center(c);
return 0;
}
int kmeans_cluster::load_info(FILE *fp)
{
char buf[128];
if(fgets(buf, sizeof(buf), fp)) {
dimension = atoi(buf);
return 0;
}
return -1;
}
int kmeans_cluster::str_2_vec(char *buf, struct cluster_vec *vec, char **sp)
{
char *p;
if(!(p = strtok_r(buf, ",", sp)))
return -1;
int i = 0;
do {
vec->vec[i++] = strtod(p, 0);
} while((p = strtok_r(0, ",", sp)));
if(i != dimension)
return -1;
return 0;
}
int kmeans_cluster::load_points(const char *fn)
{
FILE *fp;
if(!(fp = fopen(fn, "r"))) {
perror(fn);
return -1;
}
char buf[2048];
struct cluster_point pt = { alloc_vector(dimension, 0) };
while(fgets(buf, sizeof(buf), fp)) {
char *p;
int c = strtol(buf, &p, 10);
char *sp = 0;
if(str_2_vec(++p, pt.vec, &sp) < 0) {
fprintf(stderr, "format error\n");
continue;
}
push_point(pt.vec->vec, c);
}
fclose(fp);
put_vector(pt.vec);
return 0;
}
int kmeans_cluster::match(const double *v)
{
// sort according to dense;
double dis = numeric_limits<double>::max();
int idx = -1;
const int len = (int)centers.size();
for(int i = 0; i < len; ++i) {
/*double tmp = 1 - cal_correlation(
centers.at(i)->vec->vec, v, dimension);*/
/*double tmp
= cal_distance(centers.at(i)->vec->vec, v, dimension);*/
double tmp = 1 - cal_correlation(
centers.at(i)->vec->vec, v, dimension);
if(tmp <= (centers.at(i)->radius + DOUBLE_ZERO)
&& tmp < dis) {
idx = i;
dis = tmp;
}
}
return idx;
}
int kmeans_cluster::cluster()
{
if(mode == LEARN && points.size() > centers.size()) {
cluster_iterate();
// according to dense;
sort(centers.begin(), centers.end(), center_dense_cmp);
return 0;
}
return -1;
}
/*double kmeans_cluster::get_max_distance()
{
vector<cluster_point>::iterator m1 = points.end(),
m2 = points.end();
double min = 1;
for(vector<cluster_point>::iterator p1 = points.begin();
p1 != points.end(); ++p1) {
for(vector<cluster_point>::iterator p2 = p1 + 1;
p2 != points.end(); ++p2) {
double tmp
= cal_correlation(p1->vec->vec, p2->vec->vec, dimension);
if(tmp < min) {
min = tmp;
m1 = p1;
m2 = p2;
}
}
}
if(min != 1) {
struct cluster_center *c1 =
alloc_center(get_vector(m1->vec), 0, dimension);
struct cluster_center *c2 =
alloc_center(get_vector(m2->vec), 0, dimension);
centers.push_back(c1);
centers.push_back(c2);
}
return min;
}*/
void kmeans_cluster::recal_clusters()
{
vector<cluster_center *>::iterator minp = centers.end();
for(vector<cluster_point>::iterator p
= points.begin();
p != points.end();
++p) {
//double max = -1;
double max = numeric_limits<double>::max();
const double *vec1 = p->vec->vec;
for(vector<cluster_center *>::iterator cp
= centers.begin(); cp != centers.end();
++cp) {
/*double tmp = cal_correlation((*cp)->vec->vec,
vec1, dimension);*/
/*double tmp = cal_distance((*cp)->vec->vec,
vec1, dimension);*/
double tmp = 1 - cal_correlation((*cp)->vec->vec,
vec1, dimension);
/*if(tmp > max) {
max = tmp;
minp = cp;
}*/
if(tmp < max) {
max = tmp;
minp = cp;
}
}
get_vector(p->vec);
(*minp)->points.push_back(*p);
}
}
void kmeans_cluster::cluster_iterate()
{
int i = 0;
do {
recal_clusters();
} while(++i < MAX_ITERATE && recal_centers());
}
bool kmeans_cluster::recal_centers()
{
double *tmp = new double[dimension];
assert(tmp);
bool ch = false;
for(vector<cluster_center *>::iterator cp
= centers.begin(); cp != centers.end();
++cp) {
struct cluster_center *ct = *cp;
get_center(ct->points, tmp);
/*if(cal_correlation(ct->vec->vec, tmp, dimension)
< 1 - DOUBLE_ZERO) {*/
//if(cal_distance(ct->vec->vec, tmp, dimension)
if(1 - cal_correlation(ct->vec->vec, tmp, dimension)
> DOUBLE_ZERO) {
ch = true;
//FIXME
put_vector(ct->vec);
ct->vec = alloc_vector(dimension, tmp);
recal_radius(ct, dimension);
ct->dense = ct->points.size();
print_center(ct);
}
rm_points(ct->points);
}
delete [] tmp;
return ch;
}
void kmeans_cluster::get_center(
const vector<cluster_point> &pts, double *vec)
{
memset(vec, 0, sizeof(double) * dimension);
for(vector<cluster_point>::const_iterator p = pts.begin();
p != pts.end(); ++p)
vector_addto(vec, p->vec->vec, dimension);
const int s = pts.size();
assert(s);
//if(s) {
for(int i = 0; i < dimension; ++i)
vec[i] /= s;
//}
}
void kmeans_cluster::rm_centers()
{
for(vector<cluster_center *>::iterator p = centers.begin();
p != centers.end(); ++p)
free_center(*p);
centers.clear();
}
void kmeans_cluster::rm_points(vector<cluster_point> &pts)
{
::rm_points(pts);
}
// for test
void kmeans_cluster::print_center(struct cluster_center *ct)
{
cout << "XXX:\n";
cout << "center corrd:\n";
for(int i = 0; i < dimension; ++i)
cout << ct->vec->vec[i] << '\t';
cout << endl;
cout << "center have " << ct->points.size()
<< " points." << endl;
cout << "center radius is: " << ct->radius << endl;
cout << "points is:\n";
vector<cluster_point> &pts = ct->points;
for(vector<cluster_point>::iterator pt
= pts.begin(); pt != pts.end(); ++pt) {
for(int i = 0; i < dimension; ++i)
cout << pt->vec->vec[i] << '\t';
cout << endl;
}
cout << "++++++++++++++++++++++++++++++++++++++++++++\n";
}
/
static inline struct cluster_vec *alloc_vector(
int d, const double *v)
{
struct cluster_vec *p = (struct cluster_vec *)malloc(
sizeof(struct cluster_vec) + sizeof(double) * d);
assert(p);
p->reference = 0;
if(v) {
for(int i = 0; i < d; ++i)
p->vec[i] = v[i];
}
return get_vector(p);
}
static inline struct cluster_vec *get_vector(
struct cluster_vec *p)
{
return (++p->reference, p);
}
static inline struct cluster_vec *put_vector(
struct cluster_vec *p)
{
if(!--p->reference) {
free(p);
return 0;
}
return p;
}
static inline void vector_addto(double *sum, const double *add, int d)
{
for(int i = 0; i < d; ++i)
sum[i] += add[i];
}
//
static inline struct cluster_center *alloc_center(
struct cluster_vec *v, double r, int d)
{
struct cluster_center *c = new struct cluster_center;
assert(c);
c->vec = v;
c->radius = r;
c->dense = d;
return c;
}
static inline void free_center(struct cluster_center *c)
{
put_vector(c->vec);
/*vector<cluster_point> &pts = c->points;
for(vector<cluster_point>::iterator p
= pts.begin(); p != pts.end(); ++p)
put_vector(p->vec);
pts.clear();*/
rm_points(c->points);
delete c;
}
static inline void rm_points(vector<cluster_point> &pts)
{
for(vector<cluster_point>::iterator p = pts.begin();
p != pts.end(); ++p)
put_vector(p->vec);
pts.clear();
}
static void recal_radius(struct cluster_center *ct, int d)
{
double min = -1;
vector<cluster_point> &pts = ct->points;
double *ct_v = ct->vec->vec;
for(vector<cluster_point>::iterator p
= pts.begin(); p != pts.end(); ++p) {
/*double tmp = cal_correlation(p->vec->vec, ct_v, d);
if(tmp < min)
min = tmp;*/
//double tmp = cal_distance(p->vec->vec, ct_v, d);
double tmp = 1 - cal_correlation(p->vec->vec, ct_v, d);
if(tmp > min)
min = tmp;
}
ct->radius = min;
}
=============================================================
main.cpp:
=============================================================
#include <iostream>
#include <cstdlib>
#include <unistd.h>
//#include "fft.h"
#include "correlation.h"
#include "kmeans.h"
using namespace std;
void tst_cluster();
int main()
{
//tst_fft();
//tst_correlation();
tst_cluster();
return 0;
}
void tst_cluster()
{
kmeans_cluster kmeans(8, kmeans_cluster::LEARN);
double d[8] = {1, 2, 3, 4, 5, 6, 7, 20};
kmeans.push_point(d, true);
d[7] = 8;
for(int i = 0; i < 10; ++i) {
++d[7];
kmeans.push_point(d);
}
///
for(int i = 0, j = 0; j < 8; i += 10, ++j)
d[j] = i;
kmeans.push_point(d);
for(int i = 0; i < 7; ++i) {
d[i] += 7;
kmeans.push_point(d);
d[i] -= 7;
}
d[7] += 100;
kmeans.push_point(d, true);
for(int i = 0, j = 0; j < 8; i += 10, ++j)
d[j] = 100 - i;
d[3] = 7;
kmeans.push_point(d, true);
d[3] = 30;
for(int i = 0; i < 8; ++i) {
d[i] -= 7;
kmeans.push_point(d);
d[i] += 7;
}
kmeans.cluster();
unlink("center.txt");
unlink("point.txt");
kmeans.dump_center("center.txt");
kmeans.dump_point("point.txt");
kmeans_cluster match(0, kmeans_cluster::MATCH);
match.load_centers("center.txt");
/*kmeans_cluster match(8, kmeans_cluster::MATCH);
double c1[] = {
1.00000000000000000000,2.00000000000000000000,3.00000000000000000000,4.00000000000000000000,5.00000000000000000000,6.00000000000000000000,7.00000000000000000000,13.00000000000000000000};
double c2[] = {
0.77777777777777779011,10.77777777777777856727,20.77777777777777856727,30.77777777777777856727,40.77777777777777856727,50.77777777777777856727,60.77777777777777856727,70.77777777777777146184};
double c3[] = {
99.22222222222222853816,89.22222222222222853816,79.22222222222222853816,69.22222222222222853816,59.22222222222222143273,49.22222222222222143273,39.22222222222222143273,29.22222222222222143273};
match.push_center(c1, 5.0, 11);
match.push_center(c2, 6.55367204580384132839, 9);
match.push_center(c3, 6.55367204580383777568, 9);*/
for(;;) {
int i = 0;
for(; i < 8 && cin >> d[i]; ++i);
// must be normal return, can't exit(0) for valgrind
if(i != 8) break;
int ret;
ret = match.match(d);
cout << "result is: " << ret << endl;
}
}
=============================================================
correlation.h:
=============================================================
// 2012年 04月 16日 星期一 09:29:30 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#ifndef CORRELATION_H
#define CORRELATION_H
#define DOUBLE_ZERO 1e-15
double cal_correlation(const double *, const double *, int);
double cal_distance(const double *, const double *, int);
#endif
=============================================================
correlation.cpp:
=============================================================
// 2012年 04月 16日 星期一 09:32:09 CST
// author: 李小丹(Li Shao Dan) 字 殊恒(shuheng)
// K.I.S.S
// S.P.O.T
#include <cmath>
#include "correlation.h"
static void cal_average(const double *a, const double *b,
int, double *, double *);
static double cal_solution(double, double, double, double);
//#include <iostream>
//using namespace std;
double cal_correlation(const double *a, const double *b, int d)
{
double av_a;
double av_b;
cal_average(a, b, d, &av_a, &av_b);
if(av_a * av_b == 0.0) {
if(av_a == 0.0 && av_b == 0.0)
return 1;
else
return 0;
}
double num = 0;
double den1 = 0, den2 = 0;
double t1, t2;
for(int i = 0; i < d; ++i) {
//t1 = a[i] + (i + 1) - av_a;
//t2 = b[i] + (i + 1) - av_b;
t1 = a[i] - av_a;
t2 = b[i] - av_b;
num += t1 * t2;
den1 += t1 * t1;
den2 += t2 * t2;
}
return cal_solution(num, sqrt(den1 * den2), av_a, av_b);
}
static void cal_average(const double *a, const double *b,
int d, double *av_a, double *av_b)
{
*av_a = *av_b = 0;
for(int i = 0; i < d; ++i) {
// *av_a += a[i] + (i + 1);
// *av_b += b[i] + (i + 1);
*av_a += a[i];
*av_b += b[i];
}
*av_a /= d;
*av_b /= d;
if(*av_a < DOUBLE_ZERO) *av_a = 0;
if(*av_b < DOUBLE_ZERO) *av_b = 0;
}
static double cal_solution(double num, double den, double av_a, double av_b)
{
double ret;
num = fabs(num);
if(den < DOUBLE_ZERO || num == den) {
if(av_a != av_b)
ret = 0;
else
ret = 1;
} else {
ret = num / den;
}
if(ret > 1) ret = 1;
else if(ret < DOUBLE_ZERO) ret = 0;
return ret;
}
double cal_distance(const double *a, const double *b, int d)
{
double sum = 0;
for(int i = 0; i < d; ++i) {
double tmp = *a - *b;
sum += tmp * tmp;
++a;
++b;
}
return sqrt(sum);
}
=========================================================
编译:
g++ -g -W -Wall -Wextra -o mytest correlation.cpp main.cpp kmeans.cpp
执行:
./mytest
参考:
http://en.wikipedia.org/wiki/K-means_algorithm
http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/correlate/