Study Notes on Chapter 12 (Advanced pandas) of "Python for Data Analysis", 2nd Edition

KikuWong

Published 2021-06-26 18:42:55

Tags: python, data analysis

Article link: https://blog.csdn.net/KikuWong/article/details/118253036

Contents

1. Categorical Data
2. Advanced GroupBy Use
- Group Transforms and "Unwrapped" GroupBys
- Grouped Time Resampling

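1. Categorical Data

This section covers the pandas Categorical type: how it can give better performance and memory use in some pandas operations, and some tools for using categorical data in statistics and machine learning.

Background and Motivation

A column often contains repeated instances of a small set of distinct values. The unique and value_counts functions extract the distinct values from an array and compute their frequencies, respectively.
For better performance, such data can be stored in a categorical (dictionary-encoded) representation: each value is stored as an integer code, and a separate array of categories (the dictionary) records what each code means.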

import numpy as np
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values
'''
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object
'''

# See which distinct values the Series contains
pd.unique(values)
'''array(['apple', 'orange'], dtype=object)'''

# Count how often each distinct value occurs
values.value_counts()

'''
apple     6
orange    2
dtype: int64
'''

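# Represent the same data with repeated integer codes:
# the observations are stored as integer keys that reference a dimension table
values = pd.Series([0, 1, 0, 0] * 2)
values
'''
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64
'''

# The dimension table holding the distinct values
# (called the categories, dictionary, or levels of the data)
dim = pd.Series(['apple', 'orange'])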
dim
'''
0     apple
1    orange
dtype: object
'''

# Both the codes and the categories are Series here;
# calling take on the categories with the codes restores the original strings
dim.take(values)
'''
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
'''
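
The manual encoding above can also be derived automatically. As a small sketch (the names raw, codes, uniques, and restored are illustrative and not from the book), pd.factorize returns the integer codes together with the distinct values, and take reverses the encoding:

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(raw)

# uniques plays the role of `dim`, codes the role of the integer `values`
restored = pd.Series(uniques.take(codes))

print(codes.tolist())                    # [0, 1, 0, 0, 0, 1, 0, 0]
print(list(uniques))                     # ['apple', 'orange']
print(restored.tolist() == raw.tolist())  # True
```

This is roughly the bookkeeping that the Categorical type, introduced next, does for you.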

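The Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based categorical representation (encoding).

The astype() method converts the values of a Series to categorical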

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

basket_id	fruit	count	weight
0	0	apple	5	3.858058
1	1	orange	8	2.612708
2	2	apple	4	2.995627
3	3	apple	7	2.614279
4	4	apple	12	2.990859
5	5	orange	8	3.845227
6	6	apple	5	0.033553
7	7	apple	4	0.425778

# astype('category') changes the dtype of the fruit column from object to category
fruit_cat = df['fruit'].astype('category')
fruit_cat
'''
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
'''

# The values of fruit_cat are a pandas.Categorical instance
c = fruit_cat.values
type(c)
'''pandas.core.arrays.categorical.Categorical'''

# Inspect the categories of the Categorical object via Categorical.categories
c.categories
'''Index(['apple', 'orange'], dtype='object')'''

c.codes
'''array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)'''

# Convert the fruit column of the original DataFrame to categorical
df['fruit'] = df['fruit'].astype('category')
df.fruit
'''
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']
'''
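
astype('category') infers the categories from the data. When the set of allowed categories is known up front, a pd.CategoricalDtype can be passed instead. The sketch below uses made-up data (the 'kiwi' entry and the variable names are assumptions for illustration); values outside the declared categories become NaN:

```python
import pandas as pd

# Hypothetical data: 'kiwi' is not one of the declared categories
fruit = pd.Series(['apple', 'orange', 'kiwi', 'apple'])
fruit_dtype = pd.CategoricalDtype(categories=['apple', 'orange'])

fruit_cat = fruit.astype(fruit_dtype)

print(fruit_cat.cat.categories.tolist())  # ['apple', 'orange']
print(fruit_cat.cat.codes.tolist())       # [0, 1, -1, 0]; -1 marks a missing value
print(fruit_cat.isna().tolist())          # [False, False, True, False]
```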

pd.Categorical() builds a Categorical object directly from other Python sequences

# Build a Categorical directly from a Python list
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories
'''
['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
'''

When you already have categorical codes, use the pd.Categorical.from_codes() constructor

# Build a Categorical from existing codes and categories
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2
'''
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
'''
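
Two details of from_codes are worth knowing when the codes come from elsewhere. The following sketch (variable names are illustrative) shows that a code of -1 is read as a missing value and that codes outside the valid range are rejected:

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']

# A code of -1 marks a missing value
with_missing = pd.Categorical.from_codes([0, 1, -1, 2], categories=categories)
print(pd.isna(with_missing).tolist())  # [False, False, True, False]

# Codes must lie between -1 and len(categories) - 1
try:
    pd.Categorical.from_codes([0, 3], categories=categories)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)  # True
```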

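Unless told otherwise, a categorical conversion assumes no particular ordering of the categories, so the categories array may come out in a different order from the input data. When using from_codes or any other constructor, pass ordered=True to indicate that the categories have a meaningful order.
An unordered categorical instance can be made ordered with the as_ordered() method; this marks the existing category order as meaningful and does not sort the data.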

ordered_cat = pd.Categorical.from_codes(codes, categories,
                                        ordered=True)
ordered_cat
'''
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
'''

# as_ordered() marks an unordered categorical as ordered
my_cats_2.as_ordered()
'''
['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
'''
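
What an ordering buys you in practice is easier to see with an example. This is a minimal sketch with made-up size labels (not from the book): an ordered categorical supports min, max, inequality comparisons, and sorting by category order, while the unordered version only supports equality tests.

```python
import pandas as pd

# Hypothetical labels whose order is small < medium < large
sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True))

print(sizes.min(), sizes.max())      # small large
print((sizes > 'small').tolist())    # [True, False, True, False]
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']

# The unordered version rejects inequality comparisons
unordered = sizes.cat.as_unordered()
try:
    unordered > 'small'
    comparison_allowed = True
except TypeError:
    comparison_allowed = False
print(comparison_allowed)            # False
```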

Computations with Categorical Objects

np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5]
'''array([-0.2047,  0.4789, -0.5194, -0.5557,  1.9658])'''

# Bin the 1000 random numbers into quartiles
bins = pd.qcut(draws, 4)
bins
'''
[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]
'''

# Use the labels argument to name the quartiles
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins
'''
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
'''

bins.codes[:10]
'''array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)'''

# Use groupby to extract summary statistics
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index())
results

	quartile	count	min	max
0	Q1	250	-2.949343	-0.685484
1	Q2	250	-0.683066	-0.010115
2	Q3	250	-0.010032	0.628894
3	Q4	250	0.634238	3.927528

# The 'quartile' column of the result keeps the categorical information from bins, including the ordering
results['quartile']
'''
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
'''
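
As a side-by-side sketch (the seed and variable names are arbitrary choices, not from the book), pd.qcut cuts at sample quantiles so each bin receives the same number of observations, whereas pd.cut splits the value range into equal-width bins whose counts generally differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily
data = rng.standard_normal(1000)

# Quantile-based bins: equal counts per bin
quantile_bins = pd.qcut(data, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Equal-width bins: counts depend on the shape of the distribution
width_bins = pd.cut(data, 4)

print(quantile_bins.value_counts())
print(width_bins.value_counts())
```

Both functions return a Categorical with ordered categories, so the results can be used as group keys in the same way as bins above.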

With large datasets, categoricals can give better performance and lower memory use

N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

categories = labels.astype('category')

# Memory used by the object-dtype Series
labels.memory_usage()
'''80000128'''

# Memory used by the category-dtype Series
categories.memory_usage()
'''10000320'''

# The conversion to category is a one-time cost
%time _ = labels.astype('category')
'''Wall time: 988 ms'''
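
Note that memory_usage() without deep=True counts only the 8-byte references of an object column, so it understates the footprint of the strings. A smaller-scale sketch of the same comparison with deep=True (N is reduced here so it runs quickly; that choice is mine, not the book's):

```python
import pandas as pd

N = 100_000  # much smaller than the 10 million rows used above
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# deep=True inspects the actual contents instead of only the references
object_bytes = labels.memory_usage(deep=True)
category_bytes = categories.memory_usage(deep=True)
print(object_bytes, category_bytes)

# With only four categories, each row is stored as a one-byte code
print(categories.cat.codes.dtype)  # int8
```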

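Categorical Methods

The categorical methods of a pandas Series, available through Series.cat, add, remove, replace, rename, and reorder the categories of a Categorical

Method (Series.cat.)	Description
add_categories	Append new (unused) categories at the end of the existing categories
as_ordered	Make the categories ordered
as_unordered	Make the categories unordered
remove_categories	Remove categories, setting any removed values to null
remove_unused_categories	Remove any categories that do not appear in the data
rename_categories	Replace the category names with the given new names; cannot change the number of categories
reorder_categories	Reorder the existing categories as specified (no categories may be added or dropped); can also mark the result as ordered
set_categories	Replace the categories with the given set of new categories; can add or remove categories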

s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s
'''
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
'''

cat_s.cat.codes
'''
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8
'''

# The categories attribute lists the categories
cat_s.cat.categories
'''
Index(['a', 'b', 'c', 'd'], dtype='object')
'''

# set_categories() replaces the set of categories
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2
'''
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
'''

# The data looks unchanged, so use value_counts() to see the effect of the new category

cat_s.value_counts()
'''
d    2
c    2
b    2
a    2
dtype: int64
'''

cat_s2.value_counts()
'''
d    2
c    2
b    2
a    2
e    0
dtype: int64
'''

# In cat_s3 the categories c and d no longer appear in the data
# remove_unused_categories() drops the unused categories
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3
'''
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
'''

cat_s3.cat.remove_unused_categories()
'''
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
'''
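
The remaining methods in the table follow the same pattern. A short sketch, using an illustrative Series of my own, of add_categories, rename_categories, remove_categories, and reorder_categories; each call returns a new Series and leaves the original untouched:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Append an unused category at the end
added = s.cat.add_categories(['e'])
print(added.cat.categories.tolist())    # ['a', 'b', 'c', 'd', 'e']

# Rename via a mapping; the number of categories stays the same
renamed = s.cat.rename_categories({'a': 'A'})
print(renamed.cat.categories.tolist())  # ['A', 'b', 'c', 'd']

# Removing a category turns its values into NaN
removed = s.cat.remove_categories(['d'])
print(int(removed.isna().sum()))        # 2

# Reorder the existing categories and mark them as ordered
reordered = s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)
print(reordered.min(), reordered.max())  # d a
```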

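Creating Dummy Variables for Modeling

When using statistics or machine learning tools, categorical data is often transformed into dummy variables, also known as one-hot encoding.
The pandas.get_dummies() function converts one-dimensional categorical data into a DataFrame of dummy variables with one column per distinct category; each column holds 1 where that category occurs and 0 otherwise. (Recent pandas versions return True/False by default; passing dtype=int yields the 1/0 form shown here.)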

cat_s
'''
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
'''

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
pd.get_dummies(cat_s)

	a	b	c	d
0	1	0	0	0
1	0	1	0	0
2	0	0	1	0
3	0	0	0	1
4	1	0	0	0
5	0	1	0	0
6	0	0	1	0
7	0	0	0	1
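
One reason to convert to a categorical dtype before calling get_dummies is that declared but unused categories still get a column, which keeps the column layout stable across datasets. A minimal sketch with made-up data (the declared category 'c' never occurs):

```python
import pandas as pd

# Hypothetical data: 'c' is a declared category that does not occur
s = pd.Series(['a', 'b', 'a'],
              dtype=pd.CategoricalDtype(categories=['a', 'b', 'c']))

# dtype=int requests 1/0 indicators explicitly
dummies = pd.get_dummies(s, dtype=int)

print(list(dummies.columns))    # ['a', 'b', 'c']
print(dummies.values.tolist())  # [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```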

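2. Advanced GroupBy Use

Group Transforms and "Unwrapped" GroupBys

The GroupBy transform() method is similar to apply but imposes more constraints on the kind of function you can use:

- It can produce a scalar value, which is broadcast to the shape of the group, so the result has the same shape as the input
- It can produce an object of the same shape as the input group
- It must not mutate its input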

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
df

	key	value
0	a	0.0
1	b	1.0
2	c	2.0
3	a	3.0
4	b	4.0
5	c	5.0
6	a	6.0
7	b	7.0
8	c	8.0
9	a	9.0
10	b	10.0
11	c	11.0

# Mean of value for each key
g = df.groupby('key').value
g.mean()
'''key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64'''

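# Pass a lambda to transform():
# the result is a Series the same size as df['value'], with each value replaced by the mean of its 'key' group;
# that is, the scalar returned for each group is broadcast back to that group's rows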
g.transform(lambda x: x.mean())
'''
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
'''

# Pass a string alias to transform() (built-in aggregation functions only);
# the result is the same Series of group means
g.transform('mean')
'''
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
'''

# Like apply, transform works with functions that return a Series, but the result must be the same size as the input
g.transform(lambda x: x * 2)
'''
0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64
'''

g.transform(lambda x: x.rank(ascending=False))
'''
0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64
'''

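# A group transformation built from simple aggregations gives equivalent values with transform() and apply()
# (on pandas 2.x, apply may also prepend the group key to the index unless group_keys=False is set)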
def normalize(x):
    return (x - x.mean()) / x.std()

g.apply(normalize)
'''
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
'''

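# transform(normalize) returns the same values; the real speedup comes from passing built-in
# aggregations such as 'mean' or 'sum' by name, which take a fast path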
g.transform(normalize)
'''
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
'''

# "Unwrapped" group operation: combine fast built-in group aggregations with ordinary arithmetic
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
normalized
'''
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
'''
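
Because transform returns a result aligned with the original rows, it combines naturally with ordinary column arithmetic. Two common uses, sketched on a small made-up frame (not from the book): filling missing values with the group mean and computing each row's share of its group total.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'b'
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'value': [1.0, np.nan, 3.0, 4.0]})
g = df.groupby('key')['value']

# Fill each missing value with the mean of its own group
filled = df['value'].fillna(g.transform('mean'))
print(filled.tolist())  # [1.0, 4.0, 3.0, 4.0]

# Each row's share of its group total
share = filled / filled.groupby(df['key']).transform('sum')
print(share.tolist())   # [0.25, 0.5, 0.75, 0.5]
```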

Grouped Time Resampling

For time series data, the resample method is semantically a group operation based on time intervals.

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times,
                   'value': np.arange(N)})
df

	time	value
0	2017-05-20 00:00:00	0
1	2017-05-20 00:01:00	1
2	2017-05-20 00:02:00	2
3	2017-05-20 00:03:00	3
4	2017-05-20 00:04:00	4
5	2017-05-20 00:05:00	5
6	2017-05-20 00:06:00	6
7	2017-05-20 00:07:00	7
8	2017-05-20 00:08:00	8
9	2017-05-20 00:09:00	9
10	2017-05-20 00:10:00	10
11	2017-05-20 00:11:00	11
12	2017-05-20 00:12:00	12
13	2017-05-20 00:13:00	13
14	2017-05-20 00:14:00	14

# Index df by 'time', then resample to 5-minute bins
df.set_index('time').resample('5min').count()

	value
time
2017-05-20 00:00:00	5
2017-05-20 00:05:00	5
2017-05-20 00:10:00	5
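
To make the "resample is a group operation" point concrete, here is a small sketch (variable names are mine) showing that resampling a time-indexed Series and grouping it with a pd.Grouper of the same frequency produce the same bins and sums:

```python
import numpy as np
import pandas as pd

times = pd.date_range('2017-05-20 00:00', freq='1min', periods=15)
ts = pd.Series(np.arange(15), index=times)

# Time-based resampling ...
via_resample = ts.resample('5min').sum()

# ... and the equivalent groupby on 5-minute intervals of the index
via_groupby = ts.groupby(pd.Grouper(freq='5min')).sum()

print(via_resample.tolist())  # [10, 35, 60]
print(via_groupby.tolist())   # [10, 35, 60]
print(via_resample.index.equals(via_groupby.index))  # True
```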

# A DataFrame holding several time series, distinguished by an additional group key column 'key'
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})
df2[:7]

	time	key	value
0	2017-05-20 00:00:00	a	0.0
1	2017-05-20 00:00:00	b	1.0
2	2017-05-20 00:00:00	c	2.0
3	2017-05-20 00:01:00	a	3.0
4	2017-05-20 00:01:00	b	4.0
5	2017-05-20 00:01:00	c	5.0
6	2017-05-20 00:02:00	a	6.0

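# pd.Grouper() with freq='5min' applies the same resampling to each value of 'key'
# Used this way (without key=), pd.Grouper requires the time to be the index of the Series or DataFrame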
time_key = pd.Grouper(freq='5min')

resampled = (df2.set_index('time').groupby(['key', time_key]).sum())
resampled

		value
key	time
a	2017-05-20 00:00:00	30.0
	2017-05-20 00:05:00	105.0
	2017-05-20 00:10:00	180.0
b	2017-05-20 00:00:00	35.0
	2017-05-20 00:05:00	110.0
	2017-05-20 00:10:00	185.0
c	2017-05-20 00:00:00	40.0
	2017-05-20 00:05:00	115.0
	2017-05-20 00:10:00	190.0

resampled.reset_index()

	key	time	value
0	a	2017-05-20 00:00:00	30.0
1	a	2017-05-20 00:05:00	105.0
2	a	2017-05-20 00:10:00	180.0
3	b	2017-05-20 00:00:00	35.0
4	b	2017-05-20 00:05:00	110.0
5	b	2017-05-20 00:10:00	185.0
6	c	2017-05-20 00:00:00	40.0
7	c	2017-05-20 00:05:00	115.0
8	c	2017-05-20 00:10:00	190.0
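
pd.Grouper also accepts a key argument naming a datetime column, which avoids the set_index step. The sketch below rebuilds the same df2 so that it is self-contained; the by_column name is an illustrative choice:

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

# Group by 'key' and by 5-minute intervals of the 'time' column directly
by_column = (df2.groupby(['key', pd.Grouper(key='time', freq='5min')])['value']
                .sum())

print(len(by_column))  # 9
print(by_column.loc[('a', pd.Timestamp('2017-05-20 00:00'))])  # 30.0
print(by_column.loc[('c', pd.Timestamp('2017-05-20 00:10'))])  # 190.0
```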
