Deep Learning from Scratch (1): Pandas Usage and KNN

Background

This chapter is based on Assignment 1 of the Coursera course python-machine-learning.

References

Contents

  • Pandas usage
  • DataFrame usage
  • Series usage
  • K-nearest neighbors (KNN, k-NearestNeighbor)

0. The breast cancer dataset

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR) # Print the data set description
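
Besides the text description, the object returned by load_breast_cancer exposes the raw arrays directly. A minimal sketch for inspecting them (the variable names n_samples and n_features are only illustrative):

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The returned object behaves like a dictionary of arrays.
n_samples, n_features = cancer.data.shape
print(n_samples, n_features)        # number of rows and feature columns
print(list(cancer.target_names))    # class names; position 0 corresponds to label 0
print(cancer.feature_names[:3])     # first few feature names
```

Label 0 is malignant and label 1 is benign, which is why the counting snippets further down look for 0 when counting malignant samples.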

1. Pandas.DataFrame

Creating a DataFrame:

dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
                             columns=cancer.feature_names)
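
The DataFrame built this way holds only the 30 feature columns. The later snippets that read a target column assume the labels have been joined in, as in the full code in section 4. As an alternative sketch, assuming scikit-learn 0.23 or newer is installed, load_breast_cancer(as_frame=True) returns a frame that already includes the target column (the name cancer_frame is made up for this example):

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True requires scikit-learn >= 0.23 (an assumption about your environment).
cancer_frame = load_breast_cancer(as_frame=True).frame

print(cancer_frame.shape)          # 30 feature columns plus 'target'
print(cancer_frame.columns[-1])    # the label column comes last
```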

Slicing a DataFrame:

# Select all rows of columns 0-29 (the first 30 columns)
X = dataFrame.iloc[:, :30]
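
To make the difference between position-based and label-based slicing concrete, here is a small sketch on a toy frame. The column names and values are invented for illustration and are not part of the breast cancer data:

```python
import pandas as pd

# A tiny illustrative frame; the column names are made up for this example.
toy = pd.DataFrame(
    {"radius": [1.0, 2.0, 3.0], "texture": [4.0, 5.0, 6.0], "target": [0, 1, 0]}
)

# iloc slices by position: columns 0 and 1, the end index is excluded.
first_two_cols = toy.iloc[:, :2]

# loc slices by label: both endpoints are included.
same_by_label = toy.loc[:, "radius":"texture"]

# Selecting a single row returns a Series indexed by column name.
first_row = toy.iloc[0]

print(first_two_cols.equals(same_by_label))
```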

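Counting how often a value occurs in a DataFrame column
One way is to convert the column to a list first (this assumes the target column has already been joined into the DataFrame, as in the full code in section 4):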

malignant_count = list(dataFrame['target']).count(0)

or

malignant_count = list(dataFrame.target).count(0)
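
Converting to a list works, but it is not the only option. As a sketch of a pandas-only alternative, a boolean comparison followed by sum() gives the same count. The toy labels below are hypothetical and stand in for the real target column:

```python
import pandas as pd

# Hypothetical labels standing in for the real target column.
toy = pd.DataFrame({"target": [0, 1, 1, 0, 1]})

# The comparison yields True/False per row; summing counts the True values.
malignant_count = int((toy["target"] == 0).sum())
benign_count = int((toy["target"] == 1).sum())

print(malignant_count, benign_count)
```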

2. Pandas.Series

malignant_count = list(dataFrame['target']).count(0)
benign_count = list(dataFrame['target']).count(1)
series = pd.Series(data=[malignant_count, benign_count], index=["malignant", "benign"])
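
The same labeled Series can probably be built more directly with value_counts(). This is a sketch on hypothetical toy labels, not the assignment's reference solution:

```python
import pandas as pd

# Hypothetical labels standing in for the real target column.
toy_target = pd.Series([0, 1, 1, 0, 1], name="target")

# value_counts() returns a Series indexed by the distinct values.
counts = toy_target.value_counts()

# Put the labels in the order [0, 1], then rename the index to class names.
distribution = counts.reindex([0, 1]).rename({0: "malignant", 1: "benign"})

print(distribution["malignant"], distribution["benign"])
```

The reindex step matters because value_counts() sorts by frequency, so the order of the classes would otherwise depend on the data.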

3. train_test_split()

From the train_test_split docstring:

test_size : float, int or None, optional (default=None)
    If float, should be between 0.0 and 1.0 and represent the proportion
    of the dataset to include in the test split. If int, represents the
    absolute number of test samples. If None, the value is set to the
    complement of the train size. If ``train_size`` is also None, it will
    be set to 0.25.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)
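
To illustrate the two ways of specifying test_size, the sketch below splits a synthetic array with the same number of rows as the breast cancer data. X_demo and y_demo are invented stand-ins, not the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the breast cancer data.
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.arange(569) % 2

# An integer test_size is an absolute number of test samples.
X_train_int, X_test_int, _, _ = train_test_split(
    X_demo, y_demo, test_size=143, random_state=0
)

# A float test_size is a proportion of the dataset.
X_train_frac, X_test_frac, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(len(X_train_int), len(X_test_int))
print(len(X_train_frac), len(X_test_frac))
```

With 569 rows, a proportion of 0.25 is rounded up to 143 test samples, so test_size=143 holds out the same number of rows as the default split.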

4. KNN

The complete code is as follows:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the breast_cancer dataset: 569 samples with 30 features each
cancer = load_breast_cancer()
# Convert the data to a DataFrame; after joining the labels, the shape is (569, 31), with target (0/1) as the last column
dataFrame = pd.DataFrame(data=cancer.data, index=pd.RangeIndex(start=0, stop=569, step=1),
                             columns=cancer.feature_names)
dataTarget = pd.DataFrame(data=cancer.target, index=pd.RangeIndex(start=0, stop=569, step=1), columns=['target'])
finalDataFrame = dataFrame.join(dataTarget)


# Features are the first 30 columns; the label is the target column
X = finalDataFrame.iloc[:, :30]
y = finalDataFrame['target']

# Hold out 143 samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=143, random_state=0)

# Fit a 1-nearest-neighbor classifier on the training set
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
    

# Try a prediction using the mean value of each feature
# (a one-row DataFrame, so the feature names match those seen during fit)
means = X.mean().to_frame().T
label = knn.predict(means)
print('label', label)

    
# Evaluate the accuracy on the test set
score = knn.score(X_test, y_test)
print(score)
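
To show what the classifier does conceptually, here is a simplified NumPy sketch of k-nearest-neighbor prediction with Euclidean distance and a plain majority vote. It is not scikit-learn's implementation, which uses optimized neighbor search and its own tie-breaking. The function name knn_predict and the toy data are invented for this example:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training row.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most frequent one.
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(votes)]

# Toy data: two well-separated clusters (values invented for illustration).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_toy = np.array([0, 0, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=1)
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.1]), k=3)
print(pred_near_origin, pred_near_five)
```

With k=1, as in the code above, the prediction is simply the label of the single closest training sample. Larger values of k smooth the decision by letting several neighbors vote.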