DataCamp “Data Scientist with Python track”, Chapter 7: Cleaning Data in Python (study notes)

DeclanNYC

Tags: DataCamp, Python, data science, DS, data processing

Article link: https://blog.csdn.net/weixin_41803041/article/details/84559658

Exploring your data

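During data analysis we regularly run into data that is not “clean”, for example: inconsistent column names, missing data, outliers, duplicate rows, untidy layout, columns that need processing, and column types that signal unexpected data values. The data therefore has to be cleaned before the analysis starts. One way to inspect it is “.info()”, which reports the number of non-null entries and the dtype of every column, so you can see which columns have missing values or a type you did not expect. In the example below there are only 122 population values, while we expected 164.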

# Print the info of df
# (info() prints its summary itself and returns None,
# so wrapping it in print() would add a stray "None" line)
df.info()

# Print the info of df_subset
df_subset.info()
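
As a minimal sketch of what to look for in that output, the snippet below builds a small made-up DataFrame (the name `toy`, its columns, and its values are illustrative assumptions, not the course data). It then derives the number of missing entries per column from the same non-null counts that “.info()” reports.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'population' has two missing entries.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [1.2e6, np.nan, 3.4e6, np.nan],
})

# info() prints the non-null count and dtype of each column.
toy.info()

# count() returns the same non-null counts as a Series,
# so the number of missing entries per column is easy to compute.
non_null = toy.count()
missing = len(toy) - non_null
print(missing)
```

A column whose non-null count is lower than the number of rows, like `population` here, is the signal to investigate further.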

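Besides “.info()” there are many other ways to inspect data. For instance, “.describe()” returns summary statistics for the numeric columns (count, mean, standard deviation, minimum, the quartiles including the median, and maximum), and unusual values among them often point to where the data problems are.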
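
A short sketch of how “.describe()” can expose a problem, using a made-up Series (the name `costs` and its values are assumptions for illustration, with one deliberately wrong entry):

```python
import pandas as pd

# Hypothetical costs; the last value is a deliberate data-entry error.
costs = pd.Series([100.0, 120.0, 90.0, 110.0, 1_000_000.0], name="initial_cost")

summary = costs.describe()
print(summary)

# The median is the '50%' row; a maximum far above it hints at an outlier.
median = summary["50%"]
maximum = summary["max"]
print(maximum / median)
```

When the maximum is several orders of magnitude above the median, as it is here, the column deserves a closer look before any analysis.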

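You can also count how often each value occurs in a column. Passing dropna=False includes missing values in the count, which shows how many entries are missing. (Note that two notations are used on purpose here: if a column name contains no special characters and does not clash with a Python keyword or an existing DataFrame attribute, you can use the shorter dot notation shown in the second example.)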

# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))

# Print the value counts for 'State'
print(df.State.value_counts(dropna=False))

# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))
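
To make the effect of dropna=False concrete, here is a small sketch on invented data (the DataFrame `toy` and its values are assumptions, only the column names are borrowed from the example above). It also checks that dot notation and bracket notation return the same column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing 'State' and two missing 'Site Fill'.
toy = pd.DataFrame({
    "State": ["NY", "NY", np.nan, "NJ"],
    "Site Fill": ["NOT APPLICABLE", np.nan, np.nan, "ON-SITE"],
})

# With dropna=False the NaN entries get their own row in the result.
with_nan = toy["State"].value_counts(dropna=False)
# By default NaN entries are silently left out.
without_nan = toy["State"].value_counts()
print(with_nan)
print(without_nan)

# Dot notation works for 'State'; 'Site Fill' contains a space,
# so it can only be reached with bracket notation.
same_column = toy.State.equals(toy["State"])
print(same_column)
```

The difference between the two totals is the number of missing entries in the column.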

Visual exploratory data analysis

After that, we can inspect the data more concretely by visualizing it with various charts:

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()

*The arguments to note here are kind (the chart type), rot (rotate the axis labels by 70 degrees), and logx and logy (use a log scale for both axes).
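
The following sketch illustrates why a log scale helps with heavily skewed columns. Note that it uses a related but different technique: it log-transforms the values before binning, whereas logx and logy only rescale the plot axes. The array `values` is invented for illustration.

```python
import numpy as np

# Hypothetical, heavily skewed values spanning five orders of magnitude.
values = np.array([1, 2, 5, 10, 20, 50, 100, 1_000, 10_000, 100_000], dtype=float)

# Equal-width bins on the raw values: almost everything lands in the first bin.
linear_counts, _ = np.histogram(values, bins=5)

# Equal-width bins on log10(values): the values spread across the bins.
log_counts, _ = np.histogram(np.log10(values), bins=5)

print(linear_counts)
print(log_counts)
```

On the raw scale nearly all observations collapse into one bar, which hides the shape of the distribution; on the log scale the structure becomes visible.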

A boxplot is another chart for checking data quality; bad points stand out clearly in the resulting plot:

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create the boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90)

# Display the plot
plt.show()
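
The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below assumes the default whisker setting of 1.5 times the interquartile range (IQR) and uses invented data (the DataFrame `toy`, the borough names, and the costs are illustrative, not the course data).

```python
import pandas as pd

# Hypothetical costs per borough; 5000.0 is a deliberate bad point.
toy = pd.DataFrame({
    "Borough": ["BRONX"] * 5 + ["QUEENS"] * 5,
    "initial_cost": [100.0, 110.0, 120.0, 130.0, 5000.0,
                     200.0, 210.0, 220.0, 230.0, 240.0],
})

# Per-group quartiles, broadcast back to the shape of the original column.
grouped = toy.groupby("Borough")["initial_cost"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A value counts as an outlier if it lies more than 1.5 * IQR outside the box.
cost = toy["initial_cost"]
is_outlier = (cost < q1 - 1.5 * iqr) | (cost > q3 + 1.5 * iqr)
outliers = toy[is_outlier]
print(outliers)
```

Listing the flagged rows this way makes it easier to decide whether a point is a genuine extreme value or a data-entry error.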
