Reading Notes on "Python for Data Analysis": Categorical Data

pnd237

Category: Data Analysis. Tags: python, data analysis, big data, machine learning, artificial intelligence

Article link: https://blog.csdn.net/pnd237/article/details/105495214

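Categorical Data

Background and Goals
The Categorical Type in pandas
Computations with Categorical Objects
- Better Performance with Categoricals
Categorical Methods
- Creating Dummy Variables for Modeling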

The datasets that may be used in this article come from the "Python for Data Analysis" datasets.

Background and Goals

When working with data, a column often contains only a few distinct values, each repeated many times, as in this Series:

import pandas as pd

values = pd.Series(['apple','orange','apple',
                    'orange'] * 2)
print(values)
# 0     apple
# 1    orange
# 2     apple
# 3    orange
# 4     apple
# 5    orange
# 6     apple
# 7    orange
# dtype: object

We can use the value_counts method to count how often each value occurs, and unique to extract the distinct values (each repeated value is returned once):

print(values.unique())
# ['apple' 'orange']
print(values.value_counts())
# orange    4
# apple     4
# dtype: int64

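Many data systems store such repeated values compactly with a dimension table: each distinct value is kept once, and the column itself holds small integer keys that refer to it. In the example below, 0 stands for orange and 1 for apple:

values = pd.Series([0,1,0,0] * 2)
# integer codes that stand in for the repeated strings (the categorical representation)
dim = pd.Series(["orange","apple"])
# the dimension table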
print(values)
# 0    0
# 1    1
# 2    0
# 3    0
# 4    0
# 5    1
# 6    0
# 7    0
# dtype: int64
print(dim)
# 0    orange
# 1     apple
# dtype: object

Passing the integer codes to the dimension table's take method restores the original strings:

print(dim.take(values))
# 0    orange
# 1     apple
# 0    orange
# 0    orange
# 0    orange
# 1     apple
# 0    orange
# 0    orange
# dtype: object
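
Going the other way, from raw strings to codes plus a dimension table, can be sketched with pandas.factorize. This is only an illustration of the idea; the variable names raw, codes, uniques, and restored are chosen here and do not come from the book.

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'orange'] * 2)

# factorize assigns an integer code to each distinct value,
# numbering the values in order of first appearance
codes, uniques = pd.factorize(raw)

# Rebuild the original strings from the two pieces,
# in the same way dim.take(values) does above
restored = pd.Series(uniques.take(codes))

print(codes)          # [0 1 0 1 0 1 0 1]
print(list(uniques))  # ['apple', 'orange']
print(list(restored) == list(raw))  # True
```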

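This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. From here on we use the terms categorical and categories.
In data analysis, the categorical representation can use considerably fewer resources. It also makes some transformations cheap, because they touch only the categories and leave the codes alone. For example:

- Renaming: only the entries of the categories (the dimension table) need to change.
- Adding a new category without changing the order of the existing ones.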
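
A minimal sketch of those two transformations on the plain dimension table from above. The upper-casing rule and the extra 'pear' entry are made up for illustration; the point is only that the codes in values are never rewritten.

```python
import pandas as pd

values = pd.Series([0, 1, 0, 0] * 2)   # integer codes, never modified below
dim = pd.Series(['orange', 'apple'])   # dimension table

# Rename: only the two entries of the dimension table are rewritten
renamed_dim = dim.str.upper()

# Append: the new entry gets the next free code (2), so the
# existing codes 0 and 1 keep their meaning
extended_dim = pd.concat([dim, pd.Series(['pear'])], ignore_index=True)

print(renamed_dim.take(values).tolist())
print(extended_dim.take(values).tolist())
```

The first print shows the renamed values, and the second shows that decoding is unaffected by the appended entry.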

The Categorical Type in pandas

Now we introduce the Categorical type in pandas. Consider the following DataFrame:

import pandas as pd
import numpy as np

np.random.seed(1234)

fruits = ['apple','orange','apple','apple'] * 2
N = len(fruits)
df = pd.DataFrame({"fruit":fruits,"basket_id":np.arange(N),
                   "count":np.random.randint(3,15,size = N),
                   "weight":np.random.uniform(0,4,size=N)})
print(df)
#     fruit  basket_id  count    weight
# 0   apple          0      6  0.602548
# 1  orange          1      9  0.794075
# 2   apple          2      8  3.260652
# 3   apple          3      7  0.635261
# 4   apple          4     11  0.464551
# 5  orange          5     12  0.051630
# 6   apple          6      4  1.947334
# 7   apple          7     10  1.324062

Here df['fruit'] holds Python string objects, stored in a NumPy array with dtype object. Calling the column's astype method converts it to a categorical:

fruit_cat = df['fruit'].astype("category")
print(fruit_cat)
# 0     apple
# 1    orange
# 2     apple
# 3     apple
# 4     apple
# 5    orange
# 6     apple
# 7     apple
# Name: fruit, dtype: category
# Categories (2, object): [apple, orange]
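
Note that astype returns a new Series and leaves df untouched. To convert the column itself, assign the result back, as in this small sketch (it uses a reduced DataFrame with only two of the columns above):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'orange', 'apple', 'apple'] * 2,
                   'basket_id': range(8)})

# astype returns a new Series; assign it back to replace the column
df['fruit'] = df['fruit'].astype('category')

print(df.dtypes)
print(isinstance(df['fruit'].dtype, pd.CategoricalDtype))  # True
```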

The values of fruit_cat are now a pandas.Categorical instance rather than a NumPy array. A Categorical object has categories and codes attributes:

c = fruit_cat.values
print(type(c))
# <class 'pandas.core.arrays.categorical.Categorical'>
print(c.categories)
# Index(['apple', 'orange'], dtype='object')
print(c.codes)
# [0 1 0 0 0 1 0 0]
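
The relationship between the two attributes can be sketched as follows: each code is a position in categories. The example assumes there are no missing values, since pandas encodes those with the code -1, which would index the wrong category here.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'])

# Each code is a position in c.categories, so indexing the categories
# with the codes rebuilds the original values
decoded = c.categories[c.codes]

print(list(decoded))                  # ['apple', 'orange', 'apple', 'apple']
print(dict(enumerate(c.categories)))  # {0: 'apple', 1: 'orange'}
```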

We can also create a Categorical object directly from other Python sequences:

my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])
print(my_categories)
# [foo, bar, baz, foo, bar]
# Categories (3, object): [bar, baz, foo]

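If we already have categorically encoded data (integer codes plus a table of categories, like the dimension table earlier), we can build the object with the from_codes constructor: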

categories = ['foo','bar','baz']
codes = [0,1,2,0,0,1]
my_cats_2 = pd.Categorical.from_codes(codes,categories)
print(my_cats_2)
# [foo, bar, baz, foo, foo, bar]
# Categories (3, object): [foo, bar, baz]

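Unless explicitly told otherwise, a categorical conversion assumes no particular ordering of the categories, so the categories array can come out in a different order depending on the input data. When constructing a Categorical, whether with from_codes or another constructor, we can pass ordered=True to declare that the categories have a meaningful order. An existing unordered instance can be marked as ordered with as_ordered, which keeps the current category order rather than sorting anything:

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered_cat)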
# [foo, bar, baz, foo, foo, bar]
# Categories (3, object): [foo < bar < baz]
print(my_cats_2.as_ordered())
# [foo, bar, baz, foo, foo, bar]
# Categories (3, object): [foo < bar < baz]

The notation [foo < bar < baz] means that foo comes first in the ordering (code 0), bar second, and baz third.
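
What the ordering buys us can be sketched with a small example: an ordered categorical supports min, max, and comparisons against a category. The size labels below are hypothetical and are not taken from the book.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],  # hypothetical labels
    ordered=True))

# min and max follow the declared category order, not alphabetical order
print(sizes.min(), sizes.max())       # small large

# Comparisons against a category also use the declared order
print((sizes > 'small').tolist())     # [True, False, True, False]

# Sorting places the values in category order
print(sizes.sort_values().tolist())   # ['small', 'small', 'medium', 'large']
```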

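Computations with Categorical Objects

In pandas, a Categorical is used in much the same way as the equivalent unencoded data, such as an array of strings, but some functions perform better on it.
For example, suppose we bin 1000 random numbers with pandas.qcut; the result is a Categorical object: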

draws = np.random.randn(1000)
print(draws[:5])
# [ 0.47143516 -1.19097569  1.43270697 -0.3126519  -0.72058873]
bins = pd.qcut(draws,4)
# split the data into 4 quantile bins (cut points at the 25%, 50%, and 75% quantiles)
print(bins)
# [(0.0178, 0.669], (-3.565, -0.624], (0.669, 2.764], (-0.624, 0.0178], (-3.565, -0.624], ..., (0.0178, 0.669], (0.669, 2.764], (0.0178, 0.669], (0.669, 2.764], (-3.565, -0.624]]
# Length: 1000
# Categories (4, interval[float64]): [(-3.565, -0.624] < (-0.624, 0.0178] < (0.0178, 0.669] <
#                                     (0.669, 2.764]]

The categories here are numeric intervals, which are not very readable. The labels argument of qcut gives them names:

bins = pd.qcut(draws,4,labels=['Q1','Q2','Q3','Q4'])
# same quartile bins, now labeled Q1 to Q4
print(bins)
# [Q3, Q1, Q4, Q2, Q1, ..., Q3, Q4, Q3, Q4, Q1]
# Length: 1000
# Categories (4, object): [Q1 < Q2 < Q3 < Q4]
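
As a quick sanity check, the labeled bins can be counted to confirm that qcut splits the data into equal-sized groups. This sketch uses its own random generator and an arbitrary seed, so the draws differ from the ones printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # arbitrary seed, not the one used above
draws = rng.standard_normal(1000)

bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Count the members of each bin; sort_index follows the category order
counts = pd.Series(bins).value_counts().sort_index()
print(counts)
```

With 1000 distinct values and four quantile bins, each bin should hold 250 values.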

We can use the binned data with groupby to aggregate the original values:

res = pd.Series(draws).groupby(bins).\
    agg(["count","min","max"]).\
    reset_index()
print(res)
#   index  count       min       max
# 0    Q1    250 -3.563517 -0.624589
# 1    Q2    250 -0.624230  0.017467
# 2    Q3    250  0.018055  0.668488
# 3    Q4    250  0.669760  2.763844

The index column of the result keeps the original categorical information, including the ordering:

print(res['index'])
# 0    Q1
# 1    Q2
# 2    Q3
# 3    Q4
# Name: index, dtype: category
# Categories (4, object): [Q1 < Q2 < Q3 < Q4]

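Better Performance with Categoricals

For a large dataset with many repeated values, converting a column to categorical can give a substantial performance gain and also uses less memory. Consider a Series with ten million elements: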

N=10000000

labels = pd.Series(["foo",'bar','baz','qux'] * (N // 4))
categories = labels.astype("category")

# compare the memory used by each
print(labels.memory_usage())
# 80000080
print(categories.memory_usage())
# 10000272

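The conversion to categorical is not free, but it is a one-time cost.
groupby operations on a categorical are also noticeably faster, because the underlying algorithms work on the array of integer codes (as with the dimension table) rather than on an array of strings.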
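
A smaller version of the comparison, as a sketch: it uses deep=True so that the strings themselves are counted, and it shows why the categorical is compact (one small integer per row). N is reduced here only so that the example runs quickly; timings are left out because they depend on the machine.

```python
import pandas as pd

N = 100_000   # smaller than the ten million above so it runs quickly
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# deep=True also counts the Python string objects, which the default
# memory_usage() leaves out for object columns
print(labels.memory_usage(deep=True))
print(categories.memory_usage(deep=True))

# With only four categories, one byte per row is enough for the codes
print(categories.cat.codes.dtype)   # int8

# Both representations give the same aggregation result
same = (labels.value_counts().sort_index().tolist()
        == categories.value_counts().sort_index().tolist())
print(same)   # True
```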

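Categorical Methods

A Series containing categorical data has several special methods, reached through the cat accessor, that give convenient access to the categories and codes. Before looking at them, consider this Series: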

s = pd.Series(list('abcd') * 2)
cat_s = s.astype("category")
print(cat_s)
# 0    a
# 1    b
# 2    c
# 3    d
# 4    a
# 5    b
# 6    c
# 7    d
# dtype: category
# Categories (4, object): [a, b, c, d]

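As described earlier for Categorical objects, the codes and categories are available here as cat_s.cat.codes and cat_s.cat.categories, so we do not repeat the details.
Suppose we know that the full set of categories for this data goes beyond the four values observed here. The set_categories method changes the categories: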

actual_categories = list("abcde")
cat_s2 = cat_s.cat.set_categories(actual_categories)
print(cat_s2)
# 0    a
# 1    b
# 2    c
# 3    d
# 4    a
# 5    b
# 6    c
# 7    d
# dtype: category
# Categories (5, object): [a, b, c, d, e]

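After changing the categories the values in the Series look the same, because the fifth category (code 4) never occurs in this data. Only the list of categories has grown.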
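
The difference does show up in operations that use the categories. A short sketch with value_counts, rebuilding the two Series from above:

```python
import pandas as pd

cat_s = pd.Series(list('abcd') * 2).astype('category')
cat_s2 = cat_s.cat.set_categories(list('abcde'))

# The original Series knows nothing about 'e'
print(cat_s.value_counts().sort_index())

# After set_categories, 'e' is reported with a count of 0
print(cat_s2.value_counts().sort_index())
```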
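In large datasets, categoricals are often used to save memory. After filtering, many categories may no longer occur in the data; remove_unused_categories drops the categories that are not observed: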

cat_s3 = cat_s[cat_s.isin(['a','b'])]
print(cat_s3)
# 0    a
# 1    b
# 4    a
# 5    b
# dtype: category
# Categories (4, object): [a, b, c, d]
print(cat_s3.cat.remove_unused_categories())
# 0    a
# 1    b
# 4    a
# 5    b
# dtype: category
# Categories (2, object): [a, b]

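Categorical methods for Series in pandas:

Method	Description
add_categories	Append new (unused) categories at the end of the existing categories
as_ordered	Mark the categories as ordered (their current order is kept)
remove_categories	Remove categories; values that belonged to a removed category become null
remove_unused_categories	Remove every category that does not occur in the data
rename_categories	Replace the category names with the given new names; the number of categories cannot change
reorder_categories	Put the existing categories in a new order (the set of categories must stay the same); can also make the result ordered
set_categories	Replace the categories with the specified new set; can add or remove categories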
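
A few of the methods in the table, sketched on a tiny Series. Each call returns a new Series and leaves s unchanged; the sample values are made up for illustration.

```python
import pandas as pd

s = pd.Series(list('abca'), dtype='category')

added = s.cat.add_categories(['d'])            # categories: a, b, c, d
removed = s.cat.remove_categories(['c'])       # the 'c' row becomes NaN
renamed = s.cat.rename_categories({'a': 'A'})  # values follow the new name
reordered = s.cat.reorder_categories(['c', 'b', 'a'], ordered=True)

print(list(added.cat.categories))   # ['a', 'b', 'c', 'd']
print(removed.isna().tolist())      # [False, False, True, False]
print(renamed.tolist())             # ['A', 'b', 'c', 'A']
print(list(reordered.cat.categories), reordered.cat.ordered)
```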

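Creating Dummy Variables for Modeling

In statistics and machine learning, categorical data is often transformed into dummy variables, also known as one-hot encoding. This produces a DataFrame with one column per distinct category; each column contains 1 where that category occurs and 0 elsewhere.
Consider the earlier example: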

s = pd.Series(list('abcd') * 2)
cat_s = s.astype("category")
print(cat_s)
# 0    a
# 1    b
# 2    c
# 3    d
# 4    a
# 5    b
# 6    c
# 7    d
# dtype: category
# Categories (4, object): [a, b, c, d]

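The pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame of dummy variables. The output below is from an older pandas; since pandas 2.0 the default result holds True/False, and passing dtype=int gives 0/1 as shown: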

print(pd.get_dummies(cat_s))
#    a  b  c  d
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
# 4  1  0  0  0
# 5  0  1  0  0
# 6  0  0  1  0
# 7  0  0  0  1
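
Two options that are often useful when the dummies feed a model, shown as a sketch: prefix gives the columns descriptive names, and dtype=int forces 0/1 integers whatever the pandas version. The prefix 'letter' is an arbitrary choice for this example.

```python
import pandas as pd

cat_s = pd.Series(list('abcd') * 2).astype('category')

# prefix names the columns letter_a, letter_b, ...; dtype=int forces 0/1
dummies = pd.get_dummies(cat_s, prefix='letter', dtype=int)
print(dummies.head(4))

# One-hot property: each row contains exactly one 1
print((dummies.sum(axis=1) == 1).all())   # True
```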