Multi-class prediction with R
In this analysis I'll build a model that predicts whether a tumor is malignant or benign, based on data from a study on breast cancer. Classification algorithms will be used in the modelling process.
The dataset
The data for this analysis refer to 569 patients from a study on breast cancer. The data are available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The variables were computed from a digitized image of a breast mass and describe characteristics of the cell nuclei present in the image. In particular, the variables are the following:
radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter² / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)
type (tumor can be either malignant, M, or benign, B)
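As a rough sketch of how these variables could be read into R: the raw UCI file is a header-less CSV with an ID, the diagnosis, and each of the ten features reported three times (mean, standard error and "worst" value), giving 30 predictors. The column names and the helper `read_wdbc` below are hypothetical, chosen for this illustration only, and the file location in the final comment is an assumption about where the raw file lives.

```r
# Ten nucleus features; each appears as mean, standard error and "worst" value.
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dimension")
stats <- c("mean", "se", "worst")

# Hypothetical column names: id, type, then the 30 predictors
# (all means first, then all standard errors, then all worst values).
wdbc_names <- c("id", "type",
                paste(rep(features, times = length(stats)),
                      rep(stats, each = length(features)), sep = "_"))

# Read a header-less WDBC-style CSV and recode the target as a factor.
read_wdbc <- function(path) {
  d <- read.csv(path, header = FALSE, col.names = wdbc_names)
  d$type <- factor(d$type, levels = c("B", "M"),
                   labels = c("benign", "malignant"))
  d
}

# Example call (needs network access; the URL is an assumption):
# d <- read_wdbc("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data")
```

Recoding `type` as a factor with explicit levels keeps benign as the reference class, which is a design choice of this sketch rather than something the original analysis states.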
Exploratory Analysis
It is essential to have an overview of the dataset. Below is a box plot of each predictor against the target variable (tumor type). The log of each predictor is plotted instead of the actual value, which makes the plot easier to read.
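A plot of this kind could be produced along the following lines. This is a minimal base-R sketch, not the code behind the original figure: the helper names `log_long` and `plot_log_boxplots` are hypothetical, and it assumes a data frame with a factor column `type`, an optional `id` column, and numeric predictors. Since log(0) is -Inf, zero values are treated as missing here, which is an assumption of the sketch.

```r
# Reshape to long format: one row per (observation, predictor) pair,
# holding the natural log of the predictor value.
log_long <- function(d, target = "type", drop = "id") {
  predictors <- setdiff(names(d), c(target, drop))
  out <- lapply(predictors, function(p) {
    v <- log(d[[p]])
    v[!is.finite(v)] <- NA  # log(0) is -Inf; treated as missing (assumption)
    data.frame(type = d[[target]], predictor = p, log_value = v,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

# Draw one box plot per predictor, split by tumor type.
plot_log_boxplots <- function(d, target = "type", drop = "id") {
  long <- log_long(d, target, drop)
  predictors <- unique(long$predictor)
  n <- length(predictors)
  # Lay the panels out in a grid with at most five columns.
  old <- par(mfrow = c(ceiling(n / 5), min(n, 5)), mar = c(2, 2, 2, 1))
  on.exit(par(old))
  for (p in predictors) {
    boxplot(log_value ~ type, data = long[long$predictor == p, ],
            main = p, xlab = "", ylab = "")
  }
  invisible(long)
}
```

Calling `plot_log_boxplots(d)` on such a data frame draws the grid of panels. Predictors whose boxes barely overlap between the two tumor types are the ones most likely to be useful for classification.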