1. UCI Breast Cancer 数据集的下载途径
- Kaggle平台
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data - UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 - Python Sklearn 的数据集
2. 数据探索
该数据集是从乳腺肿块的细针抽吸(FNA)的数字化图像计算特征。 它们描述了图像中存在的细胞核的特征。
属性信息:
1)身份证号码
2)诊断(M =恶性,B =良性)
3-32)
为每个细胞核计算十个实值特征:
a)半径(从中心到周长上的点的距离的平均值)
b)纹理(灰度值的标准偏差)
c)周长
d)面积
e)平滑度(半径长度的局部变化)
f)紧凑度(周长^ 2 /面积-1.0)
g)凹度(轮廓凹部的严重程度)
h)凹点(轮廓的凹入部分的数量)
i)对称
j)分形维数(“海岸线近似”-1)
from sklearn.datasets import load_breast_cancer
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = load_breast_cancer()
data.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
train = pd.DataFrame(data.data, columns = data.feature_names)
train.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius 569 non-null float64
mean texture 569 non-null float64
mean perimeter 569 non-null float64
mean area 569 non-null float64
mean smoothness 569 non-null float64
mean compactness 569 non-null float64
mean concavity 569 non-null float64
mean concave points 569 non-null float64
mean symmetry 569 non-null float64
mean fractal dimension 569 non-null float64
radius error 569 non-null float64
texture error 569 non-null float64
perimeter error 569 non-null float64
area error 569 non-null float64
smoothness error 569 non-null float64
compactness error 569 non-null float64
concavity error 569 non-null float64
concave points error 569 non-null float64
symmetry error 569 non-null float64
fractal dimension error 569 non-null float64
worst radius 569 non-null float64
worst texture 569 non-null float64
worst perimeter 569 non-null float64
worst area 569 non-null float64
worst smoothness 569 non-null float64
worst compactness 569 non-null float64
worst concavity 569 non-null float64
worst concave points 569 non-null float64
worst symmetry 569 non-null float64
worst fractal dimension 569 non-null float64
dtypes: float64(30)
memory usage: 133.5 KB
train.describe()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700