Chapter 5: Getting Started with pandas

Drakens_Africa

Column: Study Notes on "Python for Data Analysis". Tags: python, data analysis

Article link: https://blog.csdn.net/Marco458748194811/article/details/109826865

Relearning Python Data Analysis

1. Introduction to pandas Data Structures

1.1 Series

import numpy as np
import pandas as pd

## Create a Series
obj = pd.Series([4, 7, -5, 3])

## Array representation and index object
obj.values
obj.index

## Label each data point
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

## Indexing by label
obj2['a']
obj2[['c', 'a', 'd']]

## Create a Series from a dict
d = dict((i, i ** 2) for i in range(5))
obj3 = pd.Series(d)

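## Change the order (labels must match the dict's integer keys; a label that is not a key yields NaN)
I = [0, 2, 3, 1, 4]
obj4 = pd.Series(d, index=I)

## Detect missing values
pd.isnull(obj4)
pd.notnull(obj4)

## name attribute
obj4.name = 'population'
obj4.index.name = 'state'

## Modify the index in place (the new index must have the same length as the Series)
obj.index = ['1', '4', '0', '3']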
1.2 DataFrame

## Create a DataFrame from a dict
data = {'state': ['Ohio', 'Ohio', 'Nevada'], 'year': [2000, 2001, 2001], 'pop': [1.5, 1.7, 2.4]}
frame = pd.DataFrame(data)

## Show the first few rows
frame.head()

## Arrange the columns in a given order
pd.DataFrame(data, columns=['year', 'state', 'pop'])

## Specify the index (a column that is not in data, such as 'debt', is filled with NaN)
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three'])

## Get a column
frame2['state']
frame2.state

## Modify data (an array must have as many elements as the frame has rows)
frame2['debt'] = np.arange(3.)
val = pd.Series([-1.2, -1.5, -1.7], index=['one', 'two', 'three'])
frame2['debt'] = val

## Delete a column
frame2['eastern'] = frame2.state == 'Ohio'  ## a new column cannot be created with the frame2.eastern syntax
del frame2['eastern']

## Create a DataFrame from a nested dict
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

## Transpose
frame3.T

## Specify the index explicitly (it takes precedence over the inner-dict keys; labels with no data become NaN)
pd.DataFrame(pop, index=[2001, 2002, 2003])

## Set the name attributes
frame3.index.name = 'year'
frame3.columns.name = 'state'
### They can also be set directly when creating the DataFrame
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

## values attribute
frame3.values
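
Column assignment behaves differently for arrays and for Series, which the snippet above only hints at. Here is a small sketch under the assumption of a three-row frame; `partial` and the `order` column are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=["one", "two", "three"],
)

# A Series is aligned on the index: only 'one' and 'three' are supplied,
# so 'two' has no match and becomes NaN.
partial = pd.Series([-1.2, -1.7], index=["one", "three"])
frame["debt"] = partial

# An array or list is assigned by position and must match the row count.
frame["order"] = np.arange(3.0)
print(frame)
```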

1.3 Index Objects

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index[1:]

labels = pd.Index(np.arange(3))
obj2 = pd.Series([1, 2, 3], index=labels)
obj2.index is labels

dup_labels = pd.Index(['foo', 'bar', 'foo'])

Index objects are immutable; users cannot modify them in place.
Unlike Python sets, a pandas Index can contain duplicate labels.
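
Both properties can be checked directly. This is a short sketch; the flag names `mutated`, `has_duplicates`, and `contains_foo` are made up for illustration.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"  # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

dup_labels = pd.Index(["foo", "bar", "foo"])
has_duplicates = not dup_labels.is_unique

# Set-like membership tests still work on an Index.
contains_foo = "foo" in dup_labels
print(mutated, has_duplicates, contains_foo)
```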

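2. Basic Functionality

2.1 Reindexing

## Create a Series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

## Rearrange according to a new index ('e' has no data and becomes NaN)
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

## Interpolation (forward fill)
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

## Reindex the rows and columns of a DataFrame
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'd'], columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Parameter	Description
index	New sequence to use as the index
method	Interpolation (fill) method
fill_value	Substitute value to use where reindexing introduces missing data

Note: do not read reindex as "change the index labels while leaving the data untouched". It rearranges the existing data to conform to the given index, looking each label up in the original. To change the labels themselves, the most convenient tool is the rename function covered in the next chapter.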
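
The difference between the two operations can be sketched as follows, reusing the Series from this section. The `fill_value=0.0` choice and the upper-case labels are arbitrary, for illustration only.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# reindex: look up existing values by label; 'e' is new, so fill_value is used.
reindexed = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0.0)

# rename: keep the values where they are and change only the labels.
renamed = obj.rename({"d": "D", "b": "B"})

print(reindexed)
print(renamed)
```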

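2.2 Dropping Rows and Columns

## Create the data
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])

## Drop rows by label
data.drop(['Ohio', 'New York'], axis=0)  ## axis=0 is the default; same as data.drop(index=[...])

## Drop rows by position
data.drop(data.index[[0, 3]], axis=0)

## Drop columns by label
data.drop(['two', 'three'], axis=1)  ## same as data.drop(columns=[...])

## Drop columns by position
data.drop(data.columns[[1, 2]], axis=1)

Note: axis=0 drops the rows with the given index labels, and axis=1 drops the given columns. Do not mix them up! The index= and columns= keywords state the intent explicitly and make axis unnecessary.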
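
As a quick check, here is a sketch using the keyword form. It also shows that drop returns a new object and leaves the original untouched, as long as `inplace=True` is not passed.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

without_rows = data.drop(index=["Ohio", "New York"])
without_cols = data.drop(columns=["two", "three"])

# The original frame still has all four rows and columns.
print(data.shape, without_rows.shape, without_cols.shape)
```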

2.3 Indexing, Selection, and Filtering

## Basic Series indexing
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj[2:4]
obj[['b', 'a', 'd']]
obj['b': 'c']  ## slicing a Series by label includes the endpoint
obj[obj > 2]

## Basic DataFrame indexing
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data[['two', 'four']]
data[:2]  ## select rows with a slice
data[data['three'] > 5]  ## select rows with a boolean array
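
The endpoint behaviour is easy to get wrong, so here is a minimal comparison of a label slice and a positional slice on the same Series. Using `iloc` for the positional case is a choice made for clarity in this sketch.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj["b":"c"]      # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]  # positional slice: the stop position is excluded
above_two = obj[obj > 2]     # boolean mask

print(by_label.tolist(), by_position.tolist(), above_two.tolist())
```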

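2.4 Selection with loc and iloc

## Select by label
data.loc['Colorado', ['two', 'three']]

## Select by integer position
data.iloc[[1, 2], [3, 0, 1]]

## Slices, optionally combined with a boolean filter
data.loc[: 'Utah', 'two']
data.iloc[:, : 3][data.three > 5]

Type	Description
`df[val]`	Select a single column or a set of columns from the DataFrame
`df.loc[val]`	Select a single row or a set of rows by label
`df.loc[:, val]`	Select a single column or a subset of columns by label
`df.loc[val1, val2]`	Select both rows and columns by label
`df.iloc[where]`	Select a single row or a subset of rows by integer position
`df.iloc[:, where]`	Select a single column or a subset of columns by integer position
`df.iloc[where_i, where_j]`	Select both rows and columns by integer position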
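
The selections in this section can be verified on the 4x4 example frame. This is a sketch; the result names `row`, `block`, `column_slice`, and `filtered` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

row = data.loc["Colorado", ["two", "three"]]    # one row, two columns, by label
block = data.iloc[[1, 2], [3, 0, 1]]            # rows 1-2, columns 3, 0, 1, by position
column_slice = data.loc[:"Utah", "two"]         # label slice includes 'Utah'
filtered = data.iloc[:, :3][data.three > 5]     # first three columns, filtered rows

print(row, block, column_slice, filtered, sep="\n\n")
```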

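2.5 Integer Indexes

If an axis index contains integers, selection with [] is always label-based rather than positional.
For unambiguous indexing, use df.loc when selecting by label and df.iloc when selecting by integer position.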
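
A minimal sketch of the ambiguity, assuming a Series with the default integer labels 0, 1, 2:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # default integer labels 0, 1, 2

# With integer labels, [] looks up the label -1, which does not exist.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]            # position-based: the last element
label_slice = ser.loc[:1]      # label-based: labels 0 and 1, endpoint included
position_slice = ser.iloc[:1]  # position-based: only the first element

print(raised, last, len(label_slice), len(position_slice))
```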

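2.6 Function Application and Mapping

## Example data, added here so that column 'e' exists
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()
frame.apply(f)  ## default: f is called once per column
frame.apply(f, axis='columns')
## Passing axis='columns' applies f row by row, i.e. max minus min within each row, not per column. Remember this!

def f(x):
    return pd.Series([x.min(), x.max()])
frame.apply(f)

fmt = lambda x: '%.2f' % x
frame.applymap(fmt)  ## element-wise on a DataFrame; renamed to DataFrame.map in pandas 2.1
frame.loc[:, 'e'].map(fmt)

Notes:
1. applymap is a DataFrame method and Series.map is a Series method; apply is available on both. (pandas 2.1 renames DataFrame.applymap to DataFrame.map.)

2. applymap applies a function to every element of a DataFrame; map applies a function to every element of a Series.

3. On a DataFrame, apply passes each column to the function by default (axis=0), or each row with axis='columns'; it does not work element by element. On a Series, apply calls the function on each element.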
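
To make the axis behaviour concrete, here is a sketch with deterministic values instead of random ones (an assumption made so the results are predictable). The element-wise step uses `apply` with `Series.map` per column, which avoids depending on whether the installed pandas calls the DataFrame method `applymap` or `map`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)               # one value per column
per_row = frame.apply(spread, axis="columns")  # one value per row

fmt = lambda x: f"{x:.2f}"
formatted_column = frame["e"].map(fmt)                     # element-wise on a Series
formatted_frame = frame.apply(lambda col: col.map(fmt))    # element-wise on every column

print(per_column, per_row, formatted_frame, sep="\n\n")
```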

2.7 Sorting and Ranking

## Sort by row or column index
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
frame.sort_index()  ## sorts by the row index by default
frame.sort_index(axis=1, ascending=False)  ## sort the columns in descending order

## Sort by value
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame.sort_values(by=['a', 'b'])

## Ranking
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank(method='first')  ## break ties by order of appearance in the data
obj.rank(ascending=False, method='max')  ## rank in descending order; ties get the highest rank in the group

frame.rank(axis='columns')
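
The tie-breaking methods are easiest to compare side by side. A small sketch on the same Series (the default `method='average'` is included for contrast):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                                   # ties share the mean rank
first = obj.rank(method="first")                       # ties ranked by order of appearance
max_desc = obj.rank(ascending=False, method="max")     # descending, ties get the group's highest rank

print(pd.DataFrame({"value": obj, "average": average, "first": first, "max_desc": max_desc}))
```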

2.8 Axis Indexes with Duplicate Labels

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj.index.is_unique
obj['a']  ## returns a Series
obj['c']  ## returns a scalar

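3. Summarizing and Computing Descriptive Statistics

## Example data, added here so the calls below can run
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

## Index label at which the maximum is attained
df.idxmax()

## Summary statistics
df.describe()

Method	Description
`count`	Number of non-NA values
`describe`	Summary statistics
`min, max`	Minimum and maximum values
`argmin, argmax`	Integer positions of the minimum and maximum values
`idxmin, idxmax`	Index labels of the minimum and maximum values
`quantile`	Sample quantile
`median`	Median
`std`	Standard deviation
`skew`	Skewness
`kurt`	Kurtosis
`cummin, cummax`	Cumulative minimum and cumulative maximum
`cumprod`	Cumulative product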
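3.1 Correlation and Covariance

## Correlation and covariance between two columns
returns['MSFT'].corr(returns['IBM'])  ## returns is a DataFrame with one column per ticker
returns['MSFT'].cov(returns['IBM'])

## Correlation matrix and covariance matrix
returns.corr()
returns.cov()

## Correlation of one column with every column
returns.corrwith(returns.IBM)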
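3.2 Unique Values, Value Counts, and Membership

## Compute the unique values
obj = pd.Series(['c', 'a', 'd', 'a'])
uniques = obj.unique()
uniques.sort()  ## sort in place

## Compute the frequency of each value
obj.value_counts()  ## method 1 (sorted by frequency in descending order by default)
obj.value_counts(sort=False)  ## method 2: unsorted (the top-level pd.value_counts is deprecated since pandas 2.1)

## Vectorized set-membership test
mask = obj.isin(['b', 'c'])
obj[mask]

## Index.get_indexer method (the Index must contain unique values)
to_match = pd.Series(['c', 'a', 'b', 'b', 'a', 'c'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)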
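
The three operations can be checked together in a short sketch. The extra lookup of `'z'` is an illustrative addition showing that `get_indexer` returns -1 for a value that is not in the Index.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a"])

counts = obj.value_counts()     # 'a' appears twice, the others once
mask = obj.isin(["b", "c"])     # True only where the value is 'b' or 'c'

to_match = pd.Series(["c", "a", "b", "b", "a", "c"])
unique_vals = pd.Series(["c", "b", "a"])
lookup = pd.Index(unique_vals)

positions = lookup.get_indexer(to_match)   # position of each value within unique_vals
not_found = lookup.get_indexer(["z"])      # values absent from the Index map to -1

print(counts, mask.tolist(), positions, not_found, sep="\n")
```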
