Data Preprocessing: Titanic
What is Data Pre-Processing?
As covered in my last blog, data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, or missing certain behaviors or trends, and it is likely to contain many errors. Data preprocessing is a proven way to resolve such issues and to prepare raw data for further processing.
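As a minimal sketch of what "incomplete" and "inconsistent" can look like in practice, the snippet below builds a tiny made-up table and cleans it with pandas. The rows and the column names `age` and `sex` are invented for illustration and are not taken from the Titanic files, and filling gaps with the median is only one assumed strategy among several.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not from the Titanic files).
raw = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "sex": ["male", "Female", "FEMALE ", "male"],
    }
)

# Incomplete: count the missing values in each column.
missing_per_column = raw.isna().sum()

# Inconsistent: the same category is spelled several ways, so normalize
# whitespace and letter case before using the column.
clean = raw.copy()
clean["sex"] = clean["sex"].str.strip().str.lower()

# One simple (assumed) way to handle gaps: fill missing ages with the median.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(missing_per_column)
print(clean)
```

Whether to fill, drop, or flag missing values depends on the column and on the model you plan to train, so treat the median fill here as a placeholder choice.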
In this blog, we will implement data preprocessing on a data set. I have chosen the Titanic data set, which I downloaded from Kaggle. The dataset is available here: https://www.kaggle.com/c/titanic-gettingStarted/data
Note: Kaggle provides two datasets, a train set and a test set, and we will use both of them in this process.
What is the expected outcome?
The Titanic shipwreck was a massive disaster, so we will implement data preprocessing on this data set to learn the number of survivors and their details.
I will show you how to apply data preprocessing techniques to the Titanic dataset, with a few ideas of my own mixed in.
So let’s get started…
![Image for post](https://miro.medium.com/max/9999/0*LE7Xyr7YMEw_Alob.jpeg)
Importing all the important libraries
First, after loading the data sets onto our system, we will import the libraries we need. In my case, I imported NumPy, pandas, and Matplotlib.
    # importing libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
Importing dataset using Pandas
To work with the data, you can load the CSV either in Excel or in pandas. I will load the CSV data in pandas, and then use a function to view that data in the Jupyter notebook.
    # importing dataset using pandas
    df = pd.read_csv(r'C:\Users\KIIT\Desktop\Internity Internship\Day 4 task\train.csv')
    print(df.shape)  # (number of rows, number of columns)
    df.head()        # first five rows, displayed by the notebook
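Since Kaggle provides both a train and a test file, the loading step can be sketched for the two together. This is a minimal sketch, not the exact code from this post: the helper `load_split` is a hypothetical name, and the two small CSV strings are made-up stand-ins for `train.csv` and `test.csv` so the example runs without the downloaded files. The column names follow the Kaggle Titanic files, trimmed to a few columns.

```python
import io

import pandas as pd

# Made-up stand-ins for train.csv and test.csv (values invented for
# illustration). The test file has no "Survived" column.
TRAIN_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""
TEST_CSV = """PassengerId,Pclass,Sex,Age
4,3,male,34.5
5,2,female,
"""


def load_split(source):
    """Hypothetical helper: read one CSV (path or file-like) into a DataFrame."""
    return pd.read_csv(source)


train_df = load_split(io.StringIO(TRAIN_CSV))
test_df = load_split(io.StringIO(TEST_CSV))

# shape is (rows, columns); head() returns the first rows.
print(train_df.shape)
print(test_df.shape)
print(train_df.head())

# Columns present in train but not in test (the target column).
target_only = sorted(set(train_df.columns) - set(test_df.columns))
print(target_only)
```

With the real files, `load_split("train.csv")` and `load_split("test.csv")` would work the same way, assuming both files sit in the current working directory.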