Review of pandas DataFrame
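In this section we return to the pandas package. Besides the commands covered earlier (such as the slicing accessor “.loc[]”), the exercise below also uses the new values command. Note that values is written without parentheses, because it is an attribute rather than a method. Also note how log10 is used: np.log10() accepts either a NumPy array or a whole DataFrame.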
# Import numpy
import numpy as np
# Create array of DataFrame values: np_vals
np_vals = df.values
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)
# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)
# Print original and new data containers
for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']:
    print(x, 'has type', type(eval(x)))
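The difference between the two containers can be checked on a small example. This is a minimal sketch, assuming a made-up all-numeric DataFrame called df_demo (not the df from the exercise): np.log10() applied to the array gives an array back, while np.log10() applied to the DataFrame gives a DataFrame that keeps its column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; any all-numeric DataFrame would behave the same way
df_demo = pd.DataFrame({'a': [1.0, 10.0, 100.0], 'b': [1000.0, 10.0, 1.0]})

# .values is an attribute (no parentheses) and returns a NumPy array
vals = df_demo.values

# np.log10 on a NumPy array returns a NumPy array
vals_log10 = np.log10(vals)

# np.log10 on a DataFrame returns a DataFrame with the same labels
df_demo_log10 = np.log10(df_demo)

print(type(vals).__name__)           # ndarray
print(type(df_demo_log10).__name__)  # DataFrame
print(list(df_demo_log10.columns))   # ['a', 'b']
```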
Importing & exporting data
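CSV files come up constantly when working with data, but raw CSV files often have problems such as the following:
If the file has no header row, pass the argument header=None to pd.read_csv() so that the first line is treated as data and the columns are labeled with integers starting from 0, as shown in the figure:
We can likewise use the names keyword, “names=col_names”, to assign the column labels from a col_names list defined beforehand.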
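The effect of header=None and names can be seen on a tiny in-memory file. This is a minimal sketch, assuming made-up CSV content (raw) and made-up labels (col_names); io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row
raw = "1,2,3\n4,5,6\n"

# Default behaviour: the first line is consumed as column labels
default = pd.read_csv(io.StringIO(raw))

# header=None: every line is data, columns are labeled 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=...: supply our own labels for a file without a header row
col_names = ['x', 'y', 'z']
named = pd.read_csv(io.StringIO(raw), header=None, names=col_names)

print(list(default.columns), len(default))
print(list(no_header.columns), len(no_header))
print(list(named.columns), len(named))
```

With the default settings one row of data is lost to the header, which is why the header argument matters for files like this.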
# Import pandas
import pandas as pd
# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
print(df1.head())
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')
# Print the output of df2.head()
print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
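To see what delimiter, header and comment each do, here is a minimal sketch on a made-up messy file held in memory. The file content (messy_text) is an assumption for illustration, not the file_messy from the exercise: three junk lines, a space-separated header on line index 3, and a full-line comment after it.

```python
import io
import pandas as pd

# Hypothetical messy file: 3 junk lines, then the header, then a comment line
messy_text = (
    "collected from source\n"
    "not very useful\n"
    "labels come next\n"
    "name Jan Feb\n"
    "# this comment line should be skipped\n"
    "IBM 156.08 160.01\n"
    "MSFT 45.51 43.08\n"
)

# delimiter=' ' splits on spaces, header=3 uses line index 3 as labels,
# comment='#' drops lines that start with '#'
clean = pd.read_csv(io.StringIO(messy_text), delimiter=' ', header=3, comment='#')
print(clean)

# index=False keeps the row index out of the written file
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])

# Reading the cleaned CSV back needs no extra arguments
round_trip = pd.read_csv(io.StringIO(csv_text))
```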
Visual exploratory data analysis
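This section covers several ways to plot data, leads into plotting the PDF and the CDF, and mentions how to save the generated figures. For the PDF and CDF part, pay attention to how several of the arguments are written: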
# Import matplotlib's pyplot interface
import matplotlib.pyplot as plt
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF (density=True replaces normed=True, which newer matplotlib versions no longer accept)
df.fraction.plot(ax=axes[0], kind='hist', density=True, bins=30, ran