Python Data Mining

Python -- A Programmer's Guide to Data Mining: Getting Started with Recommender Systems - Classification - 007

In the previous chapter we introduced the nearest-neighbor classification algorithm. Let's review it with a few examples.

The raw data lists twenty top female athletes from the 2008 and 2012 Olympics. The basketball players played in the WNBA; the track athletes ran the marathon at the 2012 Olympics. Although the dataset is small, we can still apply some data mining algorithms to it.

A second file lists the athletes we need to classify. Let's build the classifier!

I'll use the data in the first file to train the classifier, then evaluate it with the data in the test file.

The file format looks roughly like this:
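(Reconstructed sample: the original post showed the file as an image. The column order here is inferred from the parser below, which reads the name from column 0, the class from column 1, and the two measurements from columns 2 and 3, so treat it as an educated guess.)

Asuka Teramoto	Gymnastics	54	66
Brittainey Raven	Basketball	72	162
Chen Nan	Basketball	78	204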

import numpy as np
import pandas as pd

First, read the data from the text file:

def read_data_from_file(filename):
    result = {}
    with open(filename, 'r', encoding='utf-8-sig') as f:
        for line in f.readlines():
            line = line.split('\t')
            # column 0 is the name, column 1 the class, columns 2 and 3 the measurements
            result[line[0]] = {'class': line[1].replace(r'\u', ''),
                               'h': int(line[3].strip()),
                               'w': int(line[2])}
    result = pd.DataFrame(result)
    return result

train_data = read_data_from_file('./datamining/7/athletesTrainingSet.txt')
train_data = train_data.T  # transpose so that each athlete is a row
print(train_data.head())

                       class    h   w
Asuka Teramoto    Gymnastics   66  54
Brittainey Raven  Basketball  162  72
Chen Nan          Basketball  204  78
Gabby Douglas     Gymnastics   90  49
Helalia Johannes       Track   99  65


Now let's standardize the data (z-score: subtract the mean, then divide by the standard deviation):

def normalize(data):
    data_mean = data[['h','w']].mean()
    data_std = data[['h','w']].std()
    data = (data[['h','w']] - data_mean) / data_std
    return data

nor_train_data = normalize(train_data)
print(nor_train_data.head())

                         h         w
Asuka Teramoto    -1.29515  -1.46938
Brittainey Raven  0.939073  0.881631
Chen Nan           1.91655    1.6653
Gabby Douglas    -0.736597  -2.12244
Helalia Johannes -0.527138 -0.032653

Next, look at the modified standardization, which replaces the mean with the median and the standard deviation with the mean absolute deviation from the median:

def correctNormalize(data):
    data_mean = data[['h','w']].median()
    data_std = (data[['h','w']] - data_mean).abs().mean()
    data = (data[['h','w']] - data_mean) / data_std
    return data

cor_train_data = correctNormalize(train_data)
print(cor_train_data.head())

                         h          w
Asuka Teramoto    -1.21842   -1.93277
Brittainey Raven   1.63447    1.09244
Chen Nan           2.88262    2.10084
Gabby Douglas    -0.505201   -2.77311
Helalia Johannes -0.237741 -0.0840336

Next, compute the distances:

# Manhattan distance
def manhattan(v1, v2):
    temp = (v1 - v2).abs()
    if len(temp.shape) > 1:
        return temp.sum(axis=1)
    return temp.sum()

# Euclidean distance
def Euclidean(v1, v2):
    temp = (v1 - v2)**2
    if len(temp.shape) > 1:
        return np.sqrt(temp.sum(axis=1))
    return np.sqrt(np.sum(temp))

print(manhattan(train_data.loc['Asuka Teramoto'][['h','w']], train_data.loc['Brittainey Raven'][['h','w']]))

114

print(Euclidean(train_data.loc['Asuka Teramoto'][['h','w']], train_data.loc['Brittainey Raven'][['h','w']]))

97.67292357659824
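As a sanity check, these match a hand calculation: the Manhattan distance is |66 - 162| + |54 - 72| = 96 + 18 = 114, and the Euclidean distance is sqrt(96^2 + 18^2) = sqrt(9540) ≈ 97.67.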

Next, find the nearest neighbors:

# return the k nearest neighbors
def nearestNeighbor(v1, data, k, method='m'):
    """Return the nearest neighbors of v1."""
    if method == 'm':
        return manhattan(v1, data).sort_values()[:k]
    return Euclidean(v1, data).sort_values()[:k]

# Manhattan distance
print(nearestNeighbor(pd.Series({'h':66, 'w':54}), train_data[['h','w']], 3))
# Euclidean distance
print(nearestNeighbor(pd.Series({'h':66, 'w':54}), train_data[['h','w']], 3, 'e'))

Asuka Teramoto     0.0
Linlin Deng        2.0
Rebecca Tunney    15.0
dtype: float64

Asuka Teramoto     0.0000
Linlin Deng        2.0000
Rebecca Tunney    11.7047
dtype: float64

All that's left is to write a classifier:

def classifier(data, v1, method):
    # use only the single nearest element
    k = 1
    result = nearestNeighbor(v1, data, k, method=method)
    return result

near = classifier(train_data[['h','w']], pd.Series({'h':68, 'w':52}), method='m')
print(train_data.loc[near.index]['class'].values)

['Gymnastics']
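The classifier above hardcodes k = 1, but nearestNeighbor already accepts a k parameter, so a majority-vote variant is a natural extension. Here is a minimal sketch (the name vote_classifier is my own, not from the original post):

def vote_classifier(data, labels, v1, k=3, method='m'):
    # take the k nearest neighbors and let their classes vote
    near = nearestNeighbor(v1, data, k, method=method)
    votes = labels.loc[near.index]
    return votes.value_counts().idxmax()

print(vote_classifier(train_data[['h','w']], train_data['class'], pd.Series({'h':68, 'w':52})))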

Finally, let's use the test dataset to see what the accuracy is:

test_data = read_data_from_file('./datamining/7/athletesTestSet.txt')
test_data = test_data.T
print(len(test_data))

20

near = classifier(train_data[['h','w']], test_data.loc['Aly Raisman'][['h','w']], method='m')
print(train_data.loc[near.index]['class'].values)

['Track']

(Note that this particular prediction is wrong: Aly Raisman is a gymnast.)

Now write a function that measures the accuracy over the whole test set:

def test(test_data, train_data, method='m'):
    i = 0
    for item in test_data.iterrows():
        near = classifier(train_data[['h','w']], item[1][['h','w']], method=method)
        pre = train_data.loc[near.index]['class'].values[0]
        if pre == item[1]['class']:
            i += 1
    print('Accuracy: %s%%' % (i / len(test_data) * 100))

test(test_data, train_data, method='m')

Accuracy: 80.0%

Let's quickly test the accuracy with Euclidean distance:

test(test_data, train_data, method='e')

Accuracy: 80.0%

Because the dataset is small and has very few dimensions, it's hard to see any difference between the two distance metrics.

Next, let's check the accuracy on standardized data. Note in particular that when standardizing the test set, you must use the training set's mean and standard deviation:

data_mean = train_data[['h','w']].mean()
data_std = train_data[['h','w']].std()
nor_test_data = (test_data[['h','w']] - data_mean) / data_std
# re-insert the original class column
nor_test_data.insert(0, 'class', test_data['class'])
print(nor_test_data.head())

nor_train_data.insert(0, 'class', train_data['class'])
print(nor_train_data.head())

                        class         h         w
Aly Raisman        Gymnastics -0.154767 -0.424489
Crystal Langhorne  Basketball   1.59072   1.14285
Diana Taurasi      Basketball  0.962347  0.881631
Erin Thorn         Basketball  0.520156  0.489795
Hannah Whelan      Gymnastics  -0.10822 -0.293877

                        class         h         w
Asuka Teramoto     Gymnastics  -1.29515  -1.46938
Brittainey Raven   Basketball  0.939073  0.881631
Chen Nan           Basketball   1.91655    1.6653
Gabby Douglas      Gymnastics -0.736597  -2.12244
Helalia Johannes        Track -0.527138 -0.032653

test(nor_test_data, nor_train_data, method='e')

Accuracy: 80.0%

The accuracy hasn't changed. Let's also check the accuracy of the modified standardization:

data_mean = train_data[['h','w']].median()
data_std = (train_data[['h','w']] - data_mean).abs().mean()
cor_test_data = (test_data[['h','w']] - data_mean) / data_std
# re-insert the original class column
cor_test_data.insert(0, 'class', test_data['class'])
print(cor_test_data.head())

cor_train_data.insert(0, 'class', train_data['class'])
print(cor_train_data.head())

                        class         h          w
Aly Raisman        Gymnastics  0.237741  -0.588235
Crystal Langhorne  Basketball   2.46657    1.42857
Diana Taurasi      Basketball   1.66419    1.09244
Erin Thorn         Basketball   1.09955   0.588235
Hannah Whelan      Gymnastics  0.297177  -0.420168

                        class         h          w
Asuka Teramoto     Gymnastics  -1.21842   -1.93277
Brittainey Raven   Basketball   1.63447    1.09244
Chen Nan           Basketball   2.88262    2.10084
Gabby Douglas      Gymnastics -0.505201   -2.77311
Helalia Johannes        Track -0.237741 -0.0840336

test(cor_test_data, cor_train_data, method='m')

Accuracy: 80.0%

The Iris Dataset

We can also test on the Iris dataset, which is well known in the data mining field.

Each sample in the Iris dataset has four measurements (sepal length, sepal width, petal length, petal width) plus the species, which is the column we want to predict. The dataset ships with the sklearn library.

from sklearn.datasets import load_iris

iris = load_iris()
print(type(iris))
print(len(iris['data']))

150

There are 150 samples in total, which need to be split into a training set and a test set. You can write your own function to sample the two sets randomly, or use a ready-made helper; a hand-rolled sketch and the sklearn version both follow.
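A minimal hand-rolled split (my own illustration, not from the original post): shuffle the indices with a random permutation, then slice off a test_size fraction.

def my_train_test_split(X, y, test_size=0.2, seed=0):
    # shuffle the indices, then take the first test_size fraction as the test set
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    return X[idx[n_test:]], X[idx[:n_test]], y[idx[n_test:]], y[idx[:n_test]]

The rest of the post uses sklearn's train_test_split: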

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0, test_size=0.2)
print(len(x_train), len(x_test))

120 30

print(x_train[:10])

[[6.4 3.1 5.5 1.8]
 [5.4 3.  4.5 1.5]
 [5.2 3.5 1.5 0.2]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.2]
 [5.2 2.7 3.9 1.4]
 [5.7 3.8 1.7 0.3]
 [6.  2.7 5.1 1.6]
 [5.9 3.  4.2 1.5]
 [5.8 2.6 4.  1.2]]

Similarly, standardize the data, always using the training set's statistics:

mean = x_train.mean(axis=0)
std = x_train.std()  # note: without an axis argument this is the std over all values, not per feature
x_train_nor = (x_train - mean) / std
print(x_train_nor[:10])
x_test_nor = (x_test - mean) / std

[[ 0.26146462  0.02350244  0.84818618  0.28622611]
 [-0.24215904 -0.02685993  0.34456252  0.13513902]
 [-0.34288378  0.2249519  -1.16630846 -0.51957174]
 [ 0.11037752 -0.02685993  0.54601199  0.28622611]
 [ 0.26146462 -0.12758466  0.89854855  0.48767558]
 [-0.34288378 -0.17794703  0.04238832  0.08477665]
 [-0.09107195  0.376039   -1.06558373 -0.46920938]
 [ 0.06001515 -0.17794703  0.64673672  0.18550138]
 [ 0.00965279 -0.02685993  0.19347542  0.13513902]
 [-0.04070958 -0.22830939  0.09275069 -0.01594808]]

Next, compute the distances:

def get_distance(v1, v2, method='e'):
    # Manhattan distance
    if method == 'm':
        temp = abs(v1 - v2)
        if len(temp.shape) > 1:
            return temp.sum(axis=1)
        return temp.sum()
    # Euclidean distance
    temp = (v1 - v2)**2
    if len(temp.shape) > 1:
        return np.sqrt(temp.sum(axis=1))
    return np.sqrt(np.sum(temp))

a = np.array([6.8, 2.5, 5, 2.8])
print(get_distance(x_train, a, method='m'))

[2.5 3.7 8.7 2.3 1.9 4.3 8.2 2.3 3.5 3.7 1.7 9.1 1.3 8.4 8.7 5.6 2.3 1.9
 2.4 1.9 4.  3.  2.8 4.3 1.8 2.2 3.5 1.4 2.6 1.8 3.4 8.6 2.2 3.7 4.5 3.8
 3.  3.1 8.8 9.6 1.8 4.  8.2 8.9 2.1 8.9 2.5 5.4 9.2 2.7 2.9 3.8 8.8 2.4
 2.  2.2 2.  9.  8.9 1.9 2.3 8.7 2.4 8.6 1.9 1.8 9.4 9.1 2.9 8.8 9.2 8.5
 4.  2.1 3.6 8.9 8.6 8.7 3.4 2.7 8.9 8.1 3.  8.2 3.4 5.5 4.6 2.7 9.4 2.5
 8.9 3.7 9.  8.8 2.9 8.8 2.4 3.8 3.9 4.5 1.9 1.3 1.9 1.7 8.3 4.5 1.7 2.2
 8.7 2.5 4.1 2.8 2.8 8.6 8.8 8.7 2.5 3.9 4.5 9.1]

Next, build a classifier:

def nearestNeighbor(v1, data, y, method='m'):
    """Return the label of v1's nearest neighbor."""
    dis = get_distance(v1, data, method=method)
    # index of the smallest distance
    index = np.argmin(dis)
    return y[index]

predict = nearestNeighbor(a, x_train, y_train, method='e')
print('This iris belongs to class ' + str(predict))

This iris belongs to class 2

Finally, look at the accuracy:

def test(x_train, x_test, y_train, y_test, method='e'):
    num = 0
    for i in range(len(x_test)):
        item = x_test[i]
        predict = nearestNeighbor(item, x_train, y_train, method=method)
        if y_test[i] == predict:
            num += 1
    print('Accuracy: %s%%' % (num / len(x_test) * 100))

Accuracy with Euclidean distance:

test(x_train, x_test, y_train, y_test, method='e')

Accuracy: 100.0%

Accuracy with Manhattan distance:

test(x_train, x_test, y_train, y_test, method='m')

Accuracy: 96.66666666666667%

Accuracy on the standardized data with Euclidean distance (the labels are never standardized, so we keep passing y_train and y_test):

test(x_train_nor, x_test_nor, y_train, y_test, method='e')

Accuracy: 100.0%

Accuracy on the standardized data with Manhattan distance:

test(x_train_nor, x_test_nor, y_train, y_test, method='m')

Accuracy: 96.66666666666667%

It seems the accuracy doesn't change before and after standardization here.

Still, when different features are measured on different scales, you need to standardize them so that they vary within a comparable range; otherwise the feature with the largest numeric range dominates the distance. These datasets are simply too small and too low-dimensional for the different standardization methods to show a measurable difference in accuracy.
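To see why standardization matters for distance-based classifiers in general, here is a small illustration of my own (not from the original post) with two features on very different scales:

# two training points: [height in cm, rating on a 1-5 scale]
X = np.array([[160.0, 1.0],
              [171.0, 5.0]])
q = np.array([170.0, 1.0])  # query: tall like point 1, rated like point 0

# raw Euclidean distances: the height column dominates, so point 1 looks nearest
print(np.sqrt(((X - q) ** 2).sum(axis=1)))    # [10.    4.123...]

# z-score the points and the query with the training statistics
mean, std = X.mean(axis=0), X.std(axis=0)
Xs, qs = (X - mean) / std, (q - mean) / std

# after scaling, the rating counts again and point 0 becomes the nearest neighbor
print(np.sqrt(((Xs - qs) ** 2).sum(axis=1)))  # [1.818...  2.008...]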
