Essential Commands for Data Preparation with Pandas

If you like to cook, you know this very well: turning on the stove and cooking the food is a tiny part of the whole process. Most of your sweat and tears actually go into preparing the right ingredients.

Cliché, but worth saying again: data preparation is 80% of the work in any data science project. Whether you are building a dashboard, running a simple statistical analysis, or fitting an advanced machine learning model, it all starts with finding the data and transforming it into the right format so the algorithm can take care of the rest.

If you are a Python fan, then pandas is your best friend in your data science journey. Equipped with all the tools, it helps you get through the most difficult parts of a project.

That said, like any new tool, you first need to learn its functionality and how to put it to use. Many data science beginners still struggle to make the best use of Pandas and instead spend much of their time on Stack Overflow. The principal reason, I'd say, is not being able to match Pandas functionality with their analytics needs.

Much of this struggle can be overcome simply by making an inventory of typical data preparation problems and matching them with the appropriate Pandas tools. Below I present a typical data preparation and exploratory analysis workflow and match it with the necessary Pandas functions. I am not trying to document everything under the sun in Pandas, but rather to demonstrate the process of creating your own data wrangling cheatsheet.

Set up

Soon after you fire up your favorite Python IDE, you might want to get started right away and import the necessary libraries. That's fine, but you still need to set up your environment: set the working directory, locate data and other files, etc.

# find out your current directory
import os
os.getcwd()

# if you want to set a different working directory
os.chdir("folder-path")

# to get a list of all files in the directory
os.listdir()
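
As a quick illustration of how these fit together, here is a minimal sketch that switches into a project folder and picks out only the csv files; the folder name "my-project" is a made-up placeholder.

# switch to a hypothetical project folder (assumed to exist)
import os
os.chdir("my-project")

# keep only the csv files from the directory listing
csv_files = [f for f in os.listdir() if f.endswith(".csv")]
print(csv_files)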

Data import

Next up is data import, and this is where you'll be using Pandas for the first time.

Your data may be sitting anywhere in the world: on your local machine, in a SQL database, in the cloud, or even in an online database. And it can be saved in a variety of formats: csv, txt, excel, sav, etc.

Depending on where the data is coming from and its file extension, you'd need different Pandas commands. Below are a couple of examples.

# import pandas and numpy libraries
import pandas as pd
import numpy as np

# import a csv file from local machine
df = pd.read_csv("file_path")

# import a csv file from an online database
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

Data inspection

After importing the data, you'd like to inspect it for a number of things, such as the number of columns and rows, column names, etc.

# description of index, entries, columns, data types, memory info
df.info()

# check out the first few rows
df.head(5)

# number of rows & columns
df.shape

# column names
df.columns

# number of unique values of a column
df["sepal_length"].nunique()

# show unique values of a column
df["sepal_length"].unique()

# number of unique column names
df.columns.nunique()

# value counts of a categorical column
df['species'].value_counts()
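
A useful companion to these, added here as a small extra, is describe(), which summarizes the numeric columns in one call:

# summary statistics (count, mean, std, min, quartiles, max) of numeric columns
df.describe()

# data type of each column
df.dtypes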

Dealing with NA values

Next, check for NA, NaN, or missing values. Some algorithms can handle missing values, but others require that missing values be taken care of before putting the data to use. Regardless, checking for missing values and understanding how to handle them is an essential part of getting to know the data.

# show null/NA values per column
df.isnull().sum()

# show NA values as % of total observations per column
df.isnull().sum()*100/len(df)

# drop all rows containing null
df.dropna()

# drop all columns containing null
df.dropna(axis=1)

# drop columns with fewer than 5 non-NA values
df.dropna(axis=1, thresh=5)

# replace all na values with -9999
df.fillna(-9999)

# fill na values with NaN
df.fillna(np.nan)

# fill na values with a string
df.fillna("data missing")

# fill missing values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True))

# replace na values of a specific column with its mean value
df["columnName"] = df["columnName"].fillna(df["columnName"].mean())

# interpolate missing values (useful in time series)
df["columnName"].interpolate()

Column operations

As is often the case, you may need to perform a wide range of column operations, such as renaming or dropping a column, sorting column values, creating new calculated columns, etc.

# select a column
df["sepal_length"]

# select multiple columns and create a new dataframe X
X = df[["sepal_length", "sepal_width", "species"]]

# select columns by column number
df.iloc[:, [1, 3, 4]]

# drop a column from dataframe X
X = X.drop("sepal_length", axis=1)

# save all column names to a list
df.columns.tolist()

# rename columns
df.rename(columns={"old column1": "new column1", "old column2": "new column2"})

# sort values by column "sepal_width" in ascending order
df.sort_values(by="sepal_width", ascending=True)

# add a new calculated column
df['newcol'] = df["sepal_length"]*2

# create a conditional calculated column
df['newcol'] = ["short" if i < 3 else "long" for i in df["sepal_width"]]

Row operations (sort, filter, slice)

Up until the previous section you have mostly been cleaning up your data, but another important part of data preparation is slicing and filtering the data that goes into the next round of the analytics pipeline.

# select rows 3 to 9 (the end of the slice is exclusive)
df.iloc[3:10]

# select rows 3 to 49 and columns 1 to 3
df.iloc[3:50, 1:4]

# randomly select 10 rows
df.sample(10)

# find rows with specific strings
df[df["species"].isin(["Iris-setosa"])]

# conditional filtering
df[df.sepal_length >= 5]

# filter rows with multiple values, e.g. 0.2, 0.3
df[df["petal_width"].isin([0.2, 0.3])]

# multi-conditional filtering
df[(df.petal_length > 1) & (df.species == "Iris-setosa") | (df.sepal_width < 3)]

# drop a row
df.drop(df.index[1])  # 1 is the row index to be deleted
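
A more readable alternative for the multi-conditional case, a personal preference rather than part of the original list, is query(), which takes the condition as a string:

# the same multi-conditional filter expressed with query()
df.query("(petal_length > 1 and species == 'Iris-setosa') or sepal_width < 3")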

Grouping

Last but not least, you will often need to group data by different categories; this is especially useful in exploratory data analysis and for getting insights on categorical variables.

# data grouped by column "species"
X = df.groupby("species")

# mean values of one column ("sepal_length") grouped by "species"
df.groupby("species")["sepal_length"].mean()

# mean values of ALL numeric columns grouped by "species"
df.groupby("species").mean(numeric_only=True)

# number of unique values per column within each category
df.groupby("species").nunique()

Summary

The purpose of this article was to show some essential Pandas functions needed to make data analysis-ready. In this demonstration, I followed a typical analytics process rather than showing commands in a random fashion, which should allow data scientists to find the right tool, in the right order, within a project. Of course, I did not intend to show every single command required to deal with every single problem in data preparation; rather, the intention was to show how to create an essential Pandas cheatsheet.

Hope this was useful. If you liked this article, feel free to follow me on Twitter.

Translated from: https://towardsdatascience.com/essential-commands-for-data-preparation-with-pandas-ed01579cf214
