Missing Data in a UCI Dataset
To begin, download the dataset from the UCI Machine Learning Repository. The link is below.
https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset
After downloading the dataset, as long as it is not too big, I like to look at it in a spreadsheet to get a sense of what I am working with.
The dataset has 17 variables in total, and every field except ‘Age’ appears to hold binary values. From here, we’ll open the dataset in a notebook environment to explore it further. For this project, I used Google Colab, which is based on the Jupyter notebook environment and requires no configuration before use.
There are a few ways to pull data into Google Colab from your own machine or storage. For this project, I used Colab's file upload command, which lets you browse your local computer for a file to upload.
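The upload command itself is not shown in the text, so here is a minimal sketch of one common way to do it, assuming the `files.upload()` helper from the `google.colab` module. The wrapper function `upload_from_local` is a hypothetical name added for illustration; it falls back to an empty result when the code runs outside Colab.

```python
def upload_from_local():
    """Open Colab's file picker and return {filename: file_bytes}.

    Hypothetical helper: returns an empty dict when not running in Colab,
    because the google.colab module only exists in that environment.
    """
    try:
        from google.colab import files
    except ImportError:
        print("Not running in Google Colab; skipping the upload dialog.")
        return {}
    # Shows a "Choose Files" button and blocks until the upload finishes.
    return files.upload()


uploaded = upload_from_local()
for name, content in uploaded.items():
    print(f"Uploaded {name} ({len(content)} bytes)")
```

Files uploaded this way are also saved to the notebook's working directory, so they can be opened by file name afterwards.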
Next, we’ll load the necessary libraries.
The next step is to read the data into a DataFrame and explore the variables to see whether any data imputation is needed.
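As a rough sketch of this step, the example below loads pandas, reads a CSV into a DataFrame, and counts missing values per column. To stay self-contained it reads a small in-memory sample rather than the real file; the rows are made up for illustration and the column names are assumptions, not a copy of the dataset. In Colab you would pass the name of the uploaded file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the uploaded CSV: made-up rows, not real records.
# The second row has an empty 'Polyuria' field to simulate a missing value.
sample_csv = io.StringIO(
    "Age,Gender,Polyuria,class\n"
    "40,Male,No,Positive\n"
    "58,Male,,Positive\n"
    "41,Female,Yes,Negative\n"
)

df = pd.read_csv(sample_csv)

# Column names, dtypes, and non-null counts in one summary.
df.info()

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Imputation is only worth considering if at least one value is missing.
needs_imputation = bool(missing_per_column.sum() > 0)
print("Imputation needed:", needs_imputation)
```

If every count in `missing_per_column` is zero, the dataset has no missing values as pandas sees them, though placeholder strings such as "?" or "NA" stored as text would still need a separate check.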