DataCamp "Data Scientist with Python" track, Chapter 9: Manipulating DataFrames with pandas (study notes)

Link to this post: https://blog.csdn.net/weixin_41803041/article/details/84670516

Slicing DataFrames

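This section covers when slicing returns a Series and when it returns a DataFrame, which cleared up a question I had left open earlier. In short, selecting with a single label returns a Series, while selecting with a list of labels or a slice returns a DataFrame.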
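As a minimal sketch of the Series-versus-DataFrame rule, the snippet below uses a tiny made-up frame (the county names and percentages are hypothetical stand-ins, not the course's election data) and checks the type each selection returns.

```python
import pandas as pd

# Hypothetical stand-in for the course's election data
df = pd.DataFrame(
    {"Obama": [52.0, 45.5], "Romney": [47.0, 53.5]},
    index=["Adams", "Bedford"],
)

col_series = df["Obama"]                  # single column label -> Series
col_frame = df[["Obama"]]                 # list of column labels -> DataFrame
row_series = df.loc["Adams"]              # single row label -> Series
row_frame = df.loc[["Adams"]]             # list of row labels -> DataFrame
slice_frame = df.loc["Adams":"Adams", :]  # a slice -> DataFrame, even for one row

for name, obj in [
    ("df['Obama']", col_series),
    ("df[['Obama']]", col_frame),
    ("df.loc['Adams']", row_series),
    ("df.loc[['Adams']]", row_frame),
    ("df.loc['Adams':'Adams', :]", slice_frame),
]:
    print(name, "->", type(obj).__name__)
```

Wrapping a single label in a list is a handy way to keep a result two-dimensional when later code expects a DataFrame.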

Slices can also run in reverse. To do this for hypothetical row labels 'a' and 'b', use a step size of -1, like so: df.loc['b':'a':-1]. In the exercise:

# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter',:]

# Print the p_counties DataFrame
print(p_counties)

# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]

# Print the p_counties_rev DataFrame
print(p_counties_rev)
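To try reverse slicing without the course data, here is a small sketch on a made-up frame (the labels 'a' to 'd' and the column name total are hypothetical). It also shows that label-based slices with .loc include both endpoints.

```python
import pandas as pd

# Hypothetical frame with a sorted, unique index
df = pd.DataFrame({"total": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Forward slice: both 'b' and 'c' are included
forward = df.loc["b":"c"]

# Reverse slice: start from the later label and step by -1
backward = df.loc["c":"b":-1]

print(forward.index.tolist())
print(backward.index.tolist())
```

Note that the start and stop labels swap places in the reverse slice; the step alone does not flip the range.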

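There are also forms for slicing from the first column up to a given column, or from a given column through the last one. These are easy to overlook in earlier exercises: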

# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']

# Print the output of left_columns.head()
print(left_columns.head())

# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']

# Print the output of middle_columns.head()
print(middle_columns.head())

# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, 'Romney':]

# Print the output of right_columns.head()
print(right_columns.head())
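The open-ended column slices above can be checked on a small made-up frame. The column order and values below are assumptions for illustration, not the real election table.

```python
import pandas as pd

# Hypothetical two-row version of the election table
df = pd.DataFrame(
    [[100, 52.0, 47.0, "Obama"], [200, 45.5, 53.5, "Romney"]],
    index=["Adams", "Bedford"],
    columns=["total", "Obama", "Romney", "winner"],
)

left = df.loc[:, :"Obama"]            # first column through 'Obama'
middle = df.loc[:, "Obama":"winner"]  # 'Obama' through 'winner'
right = df.loc[:, "Romney":]          # 'Romney' through the last column

print(list(left.columns))
print(list(middle.columns))
print(list(right.columns))
```

Because these are label slices, the named endpoint column is always part of the result.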

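Next comes the question of special values in a DataFrame, such as 0 and NaN: how to find them, locate them, and change them. In this exercise, results that were too close to call are set to NaN to make later processing easier: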

# Import numpy
import numpy as np

# Create the boolean array: too_close
too_close = election.margin < 1

# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close, 'winner'] = np.nan

# Show the summary; info() prints directly and returns None, so no print() is needed
election.info()
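The same masking pattern can be sketched on made-up data (the counties and margins below are hypothetical). A boolean Series selects the rows, and .loc assigns np.nan only to the 'winner' cells in those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical margins, in percentage points
df = pd.DataFrame(
    {"winner": ["Obama", "Romney", "Obama"], "margin": [12.3, 0.4, 5.0]},
    index=["Adams", "Berks", "Chester"],
)

# Boolean mask: True where the margin is under 1 point
too_close = df["margin"] < 1

# Overwrite 'winner' only in the masked rows
df.loc[too_close, "winner"] = np.nan

# Count how many winners are now missing
n_missing = df["winner"].isna().sum()
print(df)
print("missing winners:", n_missing)
```

Using .loc with both a row mask and a column label in one call avoids chained indexing, which may silently operate on a copy.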

Transforming DataFrames

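This section introduces the ".floordiv()" method, i.e. floor division. It can also be written as "np.floor_divide(df, value)", and either form operates on the whole DataFrame at once. The first is a pandas method; the second is a NumPy function.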
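A minimal sketch comparing the two spellings, plus the // operator, on a made-up frame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings
temps = pd.DataFrame({"mean_f": [71, 68, 55], "dew_f": [59, 61, 43]})

by_pandas = temps.floordiv(10)         # pandas method
by_numpy = np.floor_divide(temps, 10)  # NumPy ufunc applied to the DataFrame
by_operator = temps // 10              # operator form

print(by_pandas)
```

All three apply element-wise to every cell, so no explicit loop over rows or columns is needed.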

Setting & sorting a MultiIndex

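Many columns contain repeated values, for example many rows that share the same date. To make the data more readable, we can set such categorical columns as the index and then sort that index: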

# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])

# Sort the MultiIndex: sales
sales = sales.sort_index()

# Print the sales DataFrame
print(sales)

[Image: the original data]

[Image: the data after set_index]

[Image: the data after sorting the index]
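The three stages (original, after set_index, after sort_index) can be reproduced with a small made-up sales table. The states, months, and egg counts below are hypothetical, not the course dataset.

```python
import pandas as pd

# Hypothetical sales data, deliberately unsorted
sales = pd.DataFrame({
    "state": ["TX", "CA", "TX", "CA"],
    "month": [2, 1, 1, 2],
    "eggs": [55, 47, 132, 110],
})

# Stage 2: move 'state' and 'month' into a two-level MultiIndex
indexed = sales.set_index(["state", "month"])

# Stage 3: sort by 'state' first, then by 'month' within each state
sorted_sales = indexed.sort_index()

print(sales)
print(indexed)
print(sorted_sales)

# With a sorted MultiIndex, an outer-level label selects a whole group
texas = sorted_sales.loc["TX"]
print(texas)
```

Sorting matters beyond readability: slicing a MultiIndex by label ranges generally requires the index to be sorted.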

Pivoting a single variable

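We can also use the pivot method to change the shape of a DataFrame. By specifying index, columns, and values, a single column of data is used to fill the whole reshaped DataFrame. For example, pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors':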

# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday', columns='city', values='visitors')

# Print the pivoted DataFrame
print(visitors_pivot)
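Here is a self-contained sketch of the same pivot on a made-up users table (the cities, weekdays, and counts are hypothetical values chosen for illustration).

```python
import pandas as pd

# Hypothetical long-format data: one row per (weekday, city) pair
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# One row per weekday, one column per city, cells filled from 'visitors'
visitors_pivot = users.pivot(index="weekday", columns="city", values="visitors")
print(visitors_pivot)
```

pivot expects each (index, columns) pair to appear only once; with duplicate pairs it raises an error, and an aggregating method such as pivot_table is the usual alternative.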

The two figures below (not reproduced here) help in understanding what stacking and unstacking do.
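As a rough substitute for the figures, this sketch shows unstack moving an index level into the columns and stack moving it back. The data are hypothetical, and depending on the pandas version, stack may emit a FutureWarning about its changing implementation.

```python
import pandas as pd

# Hypothetical data indexed by (city, weekday), sorted
users = pd.DataFrame({
    "city": ["Austin", "Austin", "Dallas", "Dallas"],
    "weekday": ["Mon", "Sun", "Mon", "Sun"],
    "visitors": [326, 139, 456, 237],
}).set_index(["city", "weekday"]).sort_index()

# unstack: move the 'weekday' index level into the columns (long -> wide)
wide = users.unstack(level="weekday")
print(wide)

# stack: move the 'weekday' column level back into the index (wide -> long)
long_again = wide.stack(level="weekday")
print(long_again)
```

In this small case the round trip returns the rows in the original order because the starting index was already sorted.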