20200203_KNN classification algorithm

This was a freelance job for an overseas client; overall, nothing in it is technically difficult.
In this homework, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

import numpy as np
import pandas as pd
%matplotlib inline
# Read the data
test = pd.read_csv('Auto.csv')
# Show the first 5 rows
test.head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0  18.0          8         307.0         130    3504          12.0    70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5    70       1          buick skylark 320
2  18.0          8         318.0         150    3436          11.0    70       1         plymouth satellite
3  16.0          8         304.0         150    3433          12.0    70       1              amc rebel sst
4  17.0          8         302.0         140    3449          10.5    70       1                ford torino
# Show column types and non-null counts
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
# Missing values are stored as '?': replace them with NaN, then drop those rows
test.replace('?', np.nan, inplace=True)
test.dropna(inplace=True)
# Cast horsepower to int
test['horsepower'] = test['horsepower'].astype('int')
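The '?' placeholder can also be handled while loading the file. A minimal sketch, assuming missing values are written as '?' as in the code above; the inline CSV text is made-up sample data standing in for Auto.csv, and the name auto is only for illustration:

```python
import io

import pandas as pd

# Made-up sample standing in for Auto.csv; '?' marks a missing horsepower value.
csv_text = """mpg,cylinders,horsepower,weight
18.0,8,130,3504
25.0,4,?,2046
15.0,8,165,3693
"""

# na_values tells read_csv to parse '?' as NaN, so horsepower is read as a numeric column.
auto = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Drop the rows with a missing value, then cast horsepower to int.
auto = auto.dropna().reset_index(drop=True)
auto["horsepower"] = auto["horsepower"].astype(int)

print(auto.dtypes)
print(len(auto))
```

This avoids the object-typed horsepower column that test.info() reports above, since the column never contains a string.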

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. (10 points)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

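# Compute the median of mpg (instead of hard-coding the threshold)
mpg_median = test['mpg'].median()
# Label each car: 1 if mpg is above the median, otherwise 0
test['mpg01'] = (test['mpg'] > mpg_median).astype(int)
# Check how strongly each numeric column correlates with mpg01
test.drop(columns=['name']).corr()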
import seaborn as sns
g = sns.pairplot(test, hue='mpg01', palette='seismic', diag_kind = 'kde',diag_kws=dict(shade=True))
g.set()
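For question (b), a per-class summary is a quick numeric companion to the plots. A sketch using only the five rows printed by test.head() above, so the numbers carry no statistical weight; the names sample and class_means are just for illustration:

```python
import pandas as pd

# The five rows printed by test.head() above (name column omitted).
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "cylinders": [8, 8, 8, 8, 8],
    "displacement": [307.0, 350.0, 318.0, 304.0, 302.0],
    "horsepower": [130, 165, 150, 150, 140],
    "weight": [3504, 3693, 3436, 3433, 3449],
    "acceleration": [12.0, 11.5, 11.0, 12.0, 10.5],
})

# mpg01 is 1 when mpg is above the median of this sample, otherwise 0.
sample["mpg01"] = (sample["mpg"] > sample["mpg"].median()).astype(int)

# Mean of each feature within each class; a large gap between the two rows
# hints at a feature that separates the classes.
class_means = sample.drop(columns=["mpg"]).groupby("mpg01").mean()
print(class_means)
```

Run on the full data frame, the same groupby gives one row of feature means per class, which is easier to compare than a grid of small plots.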

(c) Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
# Use train_test_split to hold out 20% of the rows as the test set and keep 80% for training
x=test.drop(['mpg01','mpg','name'],axis=1)
y=test['mpg01']
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
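The split above is random, so the scores further down change from run to run. A sketch of a reproducible, class-balanced split; random_state=0 and stratify are my additions rather than part of the original code, and the data is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the 0/1 label (100 rows, balanced classes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0, 1] * 50)

# random_state fixes the shuffle; stratify keeps the 0/1 ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```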

Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)

test.info()
# Import the LDA classifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
numerical=['weight']
X_train1=X_train[numerical]
X_test1=X_test[numerical]
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(X_train1, y_train)
print(lda.score(X_test1, y_test))  # score is the classification accuracy, so test error = 1 - score
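The question asks for the test error, while score returns accuracy; the two are linked by error = 1 - accuracy. A small sketch on synthetic one-feature data (the numbers are generated, not taken from Auto.csv, and all names ending in _d or _demo are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic one-feature data: class 1 has a lower mean "weight" than class 0.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=3600, scale=400, size=100)  # class 0 (low mpg)
x1 = rng.normal(loc=2300, scale=400, size=100)  # class 1 (high mpg)
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# Even rows for training, odd rows for testing (a simple fixed split for the demo).
X_train_d, y_train_d = X[::2], y[::2]
X_test_d, y_test_d = X[1::2], y[1::2]

lda_demo = LinearDiscriminantAnalysis().fit(X_train_d, y_train_d)

accuracy = lda_demo.score(X_test_d, y_test_d)
# Test error is the share of misclassified test rows, i.e. 1 - accuracy.
test_error = np.mean(lda_demo.predict(X_test_d) != y_test_d)
print(accuracy, test_error)
```

The same two lines work unchanged for the QDA and KNN models below, since every scikit-learn classifier exposes score and predict.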

Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Qda = QuadraticDiscriminantAnalysis()
Qda.fit(X_train1, y_train)
print(Qda.score(X_test1, y_test))  # score is the classification accuracy, so test error = 1 - score

Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

numerical=['weight','cylinders']
X_train1=X_train[numerical]
X_test1=X_test[numerical]
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train1,y_train)
from sklearn.metrics import classification_report
print('----------------Train Set----------------------')
print(classification_report(y_train, lr.predict(X_train1)))
print('----------------test set----------------------')
print(classification_report(y_test, lr.predict(X_test1)))
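weight is in the thousands while cylinders is a single digit, and logistic regression solvers tend to behave better on features of similar scale. A hedged sketch that standardizes the features first and reports the test error directly; the scaling step is my suggestion, not part of the original code, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for weight and cylinders; class 1 = lighter cars with fewer cylinders.
rng = np.random.default_rng(0)
n = 100
weight = np.concatenate([rng.normal(3600, 400, n), rng.normal(2300, 400, n)])
cylinders = np.concatenate([rng.choice([6, 8], n), rng.choice([4, 6], n)])
X = np.column_stack([weight, cylinders])
y = np.array([0] * n + [1] * n)

# Fixed split for the demo: even rows train, odd rows test.
X_tr, y_tr, X_te, y_te = X[::2], y[::2], X[1::2], y[1::2]

# Standardize each feature to zero mean and unit variance, then fit logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)

# Test error = 1 - accuracy on the held-out rows.
test_error = 1.0 - clf.score(X_te, y_te)
print(round(test_error, 3))
```

Because the scaler lives inside the pipeline, it is fitted on the training rows only and then applied to the test rows.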

Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

from sklearn.neighbors import KNeighborsClassifier
# Candidate values of K
neighbors = range(1, 30)
# Use weight as the only predictor
numerical = ['weight']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
# Test-set accuracy for each K
knn_acc = []
# Fit one KNeighborsClassifier per K in neighbors, score it on the test set,
# and append the accuracy to knn_acc.
for i in neighbors:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train1, y_train)
    knn_acc.append(model.score(X_test1, y_test))
print(knn_acc)
import matplotlib.pyplot as plt
plt.plot(neighbors, knn_acc, label='accuracy')  # the curve is test accuracy, not AUC
plt.legend()
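To make the effect of K concrete, here is a from-scratch sketch of the KNN vote on a toy one-feature data set, followed by picking the K with the highest test accuracy. This is an illustration only, not how KNeighborsClassifier is implemented; knn_predict and the toy numbers are made up for the example:

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Predict 0/1 labels by majority vote among the k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for q in X_query:
        # Euclidean distance from the query row to every training row.
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        # Indices of the k closest training rows.
        nearest = np.argsort(dists, kind="stable")[:k]
        # Majority vote; with an odd k and two classes the vote cannot tie.
        preds.append(int(y_train[nearest].mean() > 0.5))
    return np.array(preds)


# Toy data: heavy cars are class 0, light cars are class 1,
# plus one unusual class-1 car at 2850 that misleads K=1.
X_tr = np.array([[3500], [3400], [3650], [2100], [2200], [2000], [2850]])
y_tr = np.array([0, 0, 0, 1, 1, 1, 1])
X_te = np.array([[3300], [2300], [2900]])
y_te = np.array([0, 1, 0])

# Test accuracy for each odd K, then the first K with the highest accuracy.
ks = [1, 3, 5]
acc = [float(np.mean(knn_predict(X_tr, y_tr, X_te, k) == y_te)) for k in ks]
best_k = ks[int(np.argmax(acc))]
print(acc, best_k)
```

With K=1 the test car at 2900 copies the label of its single nearest neighbour (the unusual car at 2850) and is misclassified; with K=3 or K=5 the vote outweighs that one point. The same np.argmax call applied to knn_acc above gives the best K for the real data.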