[Pandas] Categorical Data in Pandas

ChenVast

Columns: Big Data Analysis, Data Science. Tags: pandas, categorical data

Article link: https://blog.csdn.net/ChenVast/article/details/83652677


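Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time, or rating via Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. "strongly agree" vs "agree" or "first observation" vs "second observation"), but numerical operations (additions, divisions, ...) are not possible.

All values of categorical data are either in the categories or np.nan. Order is defined by the order of the categories, not by the lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.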
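As a rough illustration of that internal layout, the sketch below inspects the codes and categories of a small Categorical and rebuilds the values from them. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# A categorical with one missing value and one unused category ("c").
cat = pd.Categorical(["b", "a", "b", None], categories=["a", "b", "c"])

codes = cat.codes            # integer codes; -1 marks a missing value
categories = cat.categories  # Index holding the allowed values

# Rebuild the values by looking each code up in the categories array.
rebuilt = [categories[c] if c != -1 else np.nan for c in codes]
print(list(codes), list(categories), rebuilt)
```

Each value is stored once in the categories array and referenced by a small integer, which is where the memory savings for repetitive string columns come from.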

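The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory; see the Memory usage section.
The lexical order of a variable is not the same as the logical order ("one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order; see the Sorting and order section.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).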
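The second case can be sketched as follows. This is a minimal example with made-up data, assuming the logical order "one" < "two" < "three"; it compares a plain string sort with a sort on an ordered categorical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["three", "one", "two", "one"])

# Plain strings sort lexically, which puts "three" before "two".
lexical = s.sort_values().tolist()

# An ordered categorical sorts by the order of its categories instead.
number_order = CategoricalDtype(["one", "two", "three"], ordered=True)
s_cat = s.astype(number_order)
logical = s_cat.sort_values().tolist()
largest = s_cat.max()
print(lexical, logical, largest)
```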

See also the API docs on categoricals.

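Contents

Object creation
Series creation
DataFrame creation
Controlling behavior
Regaining original data
CategoricalDtype
Equality semantics
Description
Working with categories
Renaming categories
Appending new categories
Removing categories
Removing unused categories
Setting categories
Sorting and order
Reordering
Multi-column sorting
Comparisons
Operations
Data munging
Getting
String and datetime accessors
Setting
Merging
Unioning
Concatenation
Getting data in/out
Missing data
Differences to R's factor
Gotchas
Memory usage
Categorical is not a numpy array
dtype in apply
Categorical index
Side effects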

Object creation

Series creation

Categorical Series or columns in a DataFrame can be created in several ways:

By specifying dtype="category" when constructing a Series:

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

By converting an existing Series or column to a category dtype:

In [3]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [4]: df["B"] = df["A"].astype('category')

In [5]: df
Out[5]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling in the docs.

In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [9]: df.head(10)
Out[9]: 
   value    group
0     65  60 - 69
1     49  40 - 49
2     56  50 - 59
3     43  40 - 49
4     43  40 - 49
5     91  90 - 99
6     32  30 - 39
7     87  80 - 89
8     36  30 - 39
9      8    0 - 9

By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.

In [10]: raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"],
   ....:                          ordered=False)
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

In [13]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [14]: df["B"] = raw_cat

In [15]: df
Out[15]: 
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN

Categorical data has a specific category dtype:

In [16]: df.dtypes
Out[16]: 
A      object
B    category
dtype: object

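DataFrame creation

Similar to the previous section, where a single column was converted to categorical, all columns in a DataFrame can be batch converted to categorical either during or after construction.

This can be done during construction by specifying dtype="category" in the DataFrame constructor: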

In [17]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")

In [18]: df.dtypes
Out[18]: 
A    category
B    category
dtype: object

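Note that the categories present in each column differ; the conversion is done column by column, so only labels present in a given column become its categories: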

In [19]: df['A']
Out[19]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): [a, b, c]

In [20]: df['B']
Out[20]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): [b, c, d]

New in version 0.23.0.

Analogously, all columns in an existing DataFrame can be batch converted using DataFrame.astype():

In [21]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [22]: df_cat = df.astype('category')

In [23]: df_cat.dtypes
Out[23]: 
A    category
B    category
dtype: object

This conversion is likewise done column by column:

In [24]: df_cat['A']
Out[24]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): [a, b, c]

In [25]: df_cat['B']
Out[25]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): [b, c, d]

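Controlling behavior

In the examples above where we passed dtype='category', we used the default behavior:

Categories are inferred from the data.
Categories are unordered.

To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.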

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"],
   ....:                             ordered=True)
   ....: 

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b < c < d]

Similarly, a CategoricalDtype can be used with a DataFrame to ensure that categories are consistent among all columns.

In [31]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [32]: cat_type = CategoricalDtype(categories=list('abcd'),
   ....:                             ordered=True)
   ....: 

In [33]: df_cat = df.astype(cat_type)

In [34]: df_cat['A']
Out[34]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): [a < b < c < d]

In [35]: df_cat['B']
Out[35]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): [a < b < c < d]

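Note: To perform table-wise conversion, where all labels in the entire DataFrame are used as categories for each column, the categories parameter can be determined programmatically by categories = pd.unique(df.values.ravel()).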
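A minimal sketch of that table-wise conversion, reusing the two-column frame from the examples above. Sorting the labels is an extra choice made here for a stable category order; it is not required by the note.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Collect every label that occurs anywhere in the frame.
all_labels = pd.unique(df.values.ravel())

# Build one shared dtype and apply it to every column.
shared_type = CategoricalDtype(categories=sorted(all_labels))
df_cat = df.astype(shared_type)

print(df_cat["A"].cat.categories)
print(df_cat["B"].cat.categories)
```

Both columns now carry the same four categories, even though neither column contains all four labels on its own.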

If you already have codes and categories, you can use the from_codes() constructor to save the factorize step that the normal constructor performs:

In [36]: splitter = np.random.choice([0,1], 5, p=[0.5,0.5])

In [37]: s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

Regaining original data

To get back to the original Series or NumPy array, use Series.astype(original_dtype) or np.asarray(categorical):

In [38]: s = pd.Series(["a","b","c","a"])

In [39]: s
Out[39]: 
0    a
1    b
2    c
3    a
dtype: object

In [40]: s2 = s.astype('category')

In [41]: s2
Out[41]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [42]: s2.astype(str)
Out[42]: 
0    a
1    b
2    c
3    a
dtype: object

In [43]: np.asarray(s2)
Out[43]: array(['a', 'b', 'c', 'a'], dtype=object)

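Note: In contrast to R's factor function, categorical data does not convert input values to strings; categories end up with the same data type as the original values.

Note: In contrast to R's factor function, there is currently no way to assign or change labels at creation time. Use categories to change the categories after creation time.

CategoricalDtype

Changed in version 0.21.0.

A categorical's type is fully described by

categories: a sequence of unique values and no missing values
ordered: a boolean

This information can be stored in a CategoricalDtype. The categories argument is optional, which implies that the actual categories should be inferred from whatever is present in the data when the pandas.Categorical is created. The categories are assumed to be unordered by default.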

In [44]: from pandas.api.types import CategoricalDtype

In [45]: CategoricalDtype(['a', 'b', 'c'])
Out[45]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=None)

In [46]: CategoricalDtype(['a', 'b', 'c'], ordered=True)
Out[46]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

In [47]: CategoricalDtype()
Out[47]: CategoricalDtype(categories=None, ordered=None)

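A CategoricalDtype can be used in any place pandas expects a dtype, for example in pandas.read_csv(), pandas.DataFrame.astype(), or the Series constructor.

Note: As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want the default behavior of the categories being unordered and equal to the set of values present in the array. In other words, dtype='category' is equivalent to dtype=CategoricalDtype().

Equality semantics

Two instances of CategoricalDtype compare equal whenever they have the same categories and order. When comparing two unordered categoricals, the order of the categories is not considered.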

In [48]: c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Equal, since order is not considered when ordered=False
In [49]: c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
Out[49]: True

# Unequal, since the second CategoricalDtype is ordered
In [50]: c1 == CategoricalDtype(['a',  'b', 'c'], ordered=True)
Out[50]: False

All instances of CategoricalDtype compare equal to the string 'category'.

In [51]: c1 == 'category'
Out[51]: True

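Warning: Since dtype='category' is essentially CategoricalDtype(None, False), and since all instances of CategoricalDtype compare equal to 'category', all instances of CategoricalDtype compare equal to a CategoricalDtype(None, False), regardless of categories or ordered.

Description

Using describe() on categorical data will produce output similar to that of a Series or DataFrame of type string.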

In [52]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

In [53]: df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

In [54]: df.describe()
Out[54]: 
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

In [55]: df["cat"].describe()
Out[55]: 
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

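Working with categories

Categorical data has a categories and an ordered property, which list the possible values and whether the ordering matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you don't manually specify categories and ordering, they are inferred from the passed arguments.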

In [56]: s = pd.Series(["a","b","c","a"], dtype="category")

In [57]: s.cat.categories
Out[57]: Index(['a', 'b', 'c'], dtype='object')

In [58]: s.cat.ordered
Out[58]: False

It is also possible to pass in the categories in a specific order:

In [59]: s = pd.Series(pd.Categorical(["a","b","c","a"], categories=["c","b","a"]))

In [60]: s.cat.categories
Out[60]: Index(['c', 'b', 'a'], dtype='object')

In [61]: s.cat.ordered
Out[61]: False

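Note: New categorical data is not automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.

Note: The result of unique() is not always the same as Series.cat.categories, because Series.unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present.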

In [62]: s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))

In [63]: s
Out[63]: 
0    b
1    a
2    b
3    c
dtype: category
Categories (4, object): [a, b, c, d]

# categories
In [64]: s.cat.categories
Out[64]: Index(['a', 'b', 'c', 'd'], dtype='object')

# uniques
In [65]: s.unique()
Out[65]: 
[b, a, c]
Categories (3, object): [b, a, c]

Renaming categories

Renaming categories is done by assigning new values to the Series.cat.categories property or by using the rename_categories() method:

In [66]: s = pd.Series(["a","b","c","a"], dtype="category")

In [67]: s
Out[67]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [68]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [69]: s
Out[69]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

In [70]: s.cat.rename_categories([1,2,3])
Out[70]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

In [71]: s
Out[71]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

# You can also pass a dict-like object to map the renaming
In [72]: s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})
Out[72]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

In [73]: s
Out[73]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

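Note: In contrast to R's factor, categorical data can have categories of types other than string.

Note: Be aware that assigning new categories is an in-place operation, while most other operations under Series.cat return a new Series of dtype category by default.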
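The sketch below sticks to rename_categories(), which returns a new Series and leaves the original untouched. It is worth knowing that newer pandas releases (2.0 and later, as far as their release notes indicate) no longer accept direct assignment to s.cat.categories, so the method form is the more portable one. Note also that a dict only renames the categories whose keys actually match, which is why the dict example above, keyed on 1, 2, 3, leaves the "Group ..." categories unchanged.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# A callable is applied to every category.
grouped = s.cat.rename_categories(lambda g: "Group %s" % g)

# A dict renames only the matching categories; others are kept as-is.
partial = s.cat.rename_categories({"a": "x"})

print(list(grouped.cat.categories))
print(list(partial.cat.categories))
print(list(s.cat.categories))  # the original Series is unchanged
```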

Categories must be unique or a ValueError is raised:

In [74]: try:
   ....:     s.cat.categories = [1,1,1]
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Categorical categories must be unique

Categories must also not be NaN or a ValueError is raised:

In [75]: try:
   ....:     s.cat.categories = [1,2,np.nan]
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Categorial categories cannot be null

Appending new categories

Appending categories can be done by using the add_categories() method:

In [76]: s = s.cat.add_categories([4])

In [77]: s.cat.categories
Out[77]: Index(['Group a', 'Group b', 'Group c', 4], dtype='object')

In [78]: s
Out[78]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (4, object): [Group a, Group b, Group c, 4]

Removing categories

Removing categories can be done by using the remove_categories() method. Values which are removed are replaced by np.nan:

In [79]: s = s.cat.remove_categories([4])

In [80]: s
Out[80]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
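The session above removes a category that no value uses, so no NaN shows up. A small sketch with made-up data where the removed category is actually in use:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# "b" is in use, so the affected value becomes missing.
trimmed = s.cat.remove_categories(["b"])

print(trimmed.isna().tolist())
print(list(trimmed.cat.categories))
```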

Removing unused categories

Removing unused categories can also be done:

In [81]: s = pd.Series(pd.Categorical(["a","b","a"], categories=["a","b","c","d"]))

In [82]: s
Out[82]: 
0    a
1    b
2    a
dtype: category
Categories (4, object): [a, b, c, d]

In [83]: s.cat.remove_unused_categories()
Out[83]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]

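Setting categories

If you want to remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use set_categories().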

In [84]: s = pd.Series(["one","two","four", "-"], dtype="category")

In [85]: s
Out[85]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): [-, four, one, two]

In [86]: s = s.cat.set_categories(["one","two","three","four"])

In [87]: s
Out[87]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): [one, two, three, four]

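Note: Be aware that Categorical.set_categories() cannot know whether some category is omitted intentionally, because it is misspelled, or (under Python 3) because of a type difference (e.g. NumPy S1 dtype and Python strings). This can result in surprising behaviour!

Sorting and order

If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.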

In [88]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))

In [89]: s.sort_values(inplace=True)

In [90]: s = pd.Series(["a","b","c","a"]).astype(
   ....:     CategoricalDtype(ordered=True)
   ....: )
   ....: 

In [91]: s.sort_values(inplace=True)

In [92]: s
Out[92]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

In [93]: s.min(), s.max()
Out[93]: ('a', 'c')

You can set categorical data to be ordered by using as_ordered(), or unordered by using as_unordered(). By default these return a new object.

In [94]: s.cat.as_ordered()
Out[94]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

In [95]: s.cat.as_unordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

Sorting will use the order defined by the categories, not any lexical order present on the data type. This is true even for string and numeric data:

In [96]: s = pd.Series([1,2,3,1], dtype="category")

In [97]: s = s.cat.set_categories([2,3,1], ordered=True)

In [98]: s
Out[98]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [99]: s.sort_values(inplace=True)

In [100]: s
Out[100]: 
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [101]: s.min(), s.max()
Out[101]: (2, 1)

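Reordering

Reordering the categories is possible via the Categorical.reorder_categories() and Categorical.set_categories() methods. For Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. This necessarily makes the sort order the same as the categories order.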

In [102]: s = pd.Series([1,2,3,1], dtype="category")

In [103]: s = s.cat.reorder_categories([2,3,1], ordered=True)

In [104]: s
Out[104]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [105]: s.sort_values(inplace=True)

In [106]: s
Out[106]: 
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [107]: s.min(), s.max()
Out[107]: (2, 1)

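Note: Note the difference between assigning new categories and reordering the categories: the first renames categories and therefore the individual values in the Series, but if the first position was sorted last, the renamed value will still be sorted last. Reordering means that the way values are sorted is different afterwards, but not that individual values in the Series are changed.

Note: If the Categorical is not ordered, Series.min() and Series.max() will raise a TypeError. Numeric operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute the mean between two values if the length of an array is even) do not work and raise a TypeError.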
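These restrictions can be checked with a short sketch. The helper raises_type_error is a hypothetical name introduced only for this example.

```python
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 1], ordered=False))

def raises_type_error(operation):
    """Return True if calling `operation` raises a TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False

# min() needs an order, and arithmetic is never defined on categoricals.
min_fails = raises_type_error(lambda: s.min())
add_fails = raises_type_error(lambda: s + 1)

# Once the data is marked as ordered, min() works.
ordered_min = s.cat.as_ordered().min()
print(min_fails, add_fails, ordered_min)
```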

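Multi-column sorting

A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns. The ordering of the categorical is determined by the categories of that column.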

In [108]: dfs = pd.DataFrame({'A' : pd.Categorical(list('bbeebbaa'), categories=['e','a','b'], ordered=True),
   .....:                     'B' : [1,2,1,2,2,1,2,1] })
   .....: 

In [109]: dfs.sort_values(by=['A', 'B'])
Out[109]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Reordering the categories changes a future sort.

In [110]: dfs['A'] = dfs['A'].cat.reorder_categories(['a','b','e'])

In [111]: dfs.sort_values(by=['A','B'])
Out[111]: 
   A  B
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
2  e  1
3  e  2

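Comparisons

Comparing categorical data with other objects is possible in three cases:

Comparing equality (== and !=) to a list-like object (list, Series, array, ...) of the same length as the categorical data.
All comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.
All comparisons of categorical data to a scalar.

All other comparisons, especially "non-equality" comparisons of two categoricals with different categories or of a categorical with any list-like object, will raise a TypeError.

Note: Any "non-equality" comparison of categorical data with a Series, np.array, list, or categorical data with different categories or ordering will raise a TypeError, because a custom categories ordering could be interpreted in two ways: one that takes the ordering into account and one that does not.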

In [112]: cat = pd.Series([1,2,3]).astype(
   .....:     CategoricalDtype([3, 2, 1], ordered=True)
   .....: )
   .....: 

In [113]: cat_base = pd.Series([2,2,2]).astype(
   .....:     CategoricalDtype([3, 2, 1], ordered=True)
   .....: )
   .....: 

In [114]: cat_base2 = pd.Series([2,2,2]).astype(
   .....:     CategoricalDtype(ordered=True)
   .....: )
   .....: 

In [115]: cat
Out[115]: 
0    1
1    2
2    3
dtype: category
Categories (3, int64): [3 < 2 < 1]

In [116]: cat_base
Out[116]: 
0    2
1    2
2    2
dtype: category
Categories (3, int64): [3 < 2 < 1]

In [117]: cat_base2
Out[117]: 
0    2
1    2
2    2
dtype: category
Categories (1, int64): [2]

Comparing to a categorical with the same categories and ordering, or to a scalar, works:

In [118]: cat > cat_base
Out[118]: 
0     True
1    False
2    False
dtype: bool

In [119]: cat > 2
Out[119]: 
0     True
1    False
2    False
dtype: bool

Equality comparisons work with any list-like object of the same length and with scalars:

In [120]: cat == cat_base
Out[120]: 
0    False
1     True
2    False
dtype: bool

In [121]: cat == np.array([1,2,3])
Out[121]: 
0    True
1    True
2    True
dtype: bool

In [122]: cat == 2
Out[122]: 
0    False
1     True
2    False
dtype: bool

This doesn't work because the categories are not the same:

In [123]: try:
   .....:     cat > cat_base2
   .....: except TypeError as e:
   .....:      print("TypeError: " + str(e))
   .....: 
TypeError: Categoricals can only be compared if 'categories' are the same. Categories are different lengths

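If you want to do a "non-equality" comparison of a categorical Series with a list-like object which is not categorical data, you need to be explicit and convert the categorical data back to the original values: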

In [124]: base = np.array([1,2,3])

In [125]: try:
   .....:     cat > base
   .....: except TypeError as e:
   .....:      print("TypeError: " + str(e))
   .....: 
TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.

In [126]: np.asarray(cat) > base
Out[126]: array([False, False, False], dtype=bool)

When you compare two unordered categoricals with the same categories, the order is not considered:

In [127]: c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'], ordered=False)

In [128]: c2 = pd.Categorical(['a', 'b'], categories=['b', 'a'], ordered=False)

In [129]: c1 == c2
Out[129]: array([ True,  True], dtype=bool)

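Operations

Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data: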

In [130]: s = pd.Series(pd.Categorical(["a","b","c","c"], categories=["c","a","b","d"]))

In [131]: s.value_counts()
Out[131]: 
c    2
b    1
a    1
d    0
dtype: int64

Groupby will also show "unused" categories:

In [132]: cats = pd.Categorical(["a","b","b","b","c","c","c"], categories=["a","b","c","d"])

In [133]: df = pd.DataFrame({"cats":cats,"values":[1,2,2,2,3,4,5]})

In [134]: df.groupby("cats").mean()
Out[134]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

In [135]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

In [136]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})

In [137]: df2.groupby(["cats","B"]).mean()
Out[137]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
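Whether unused categories show up can also be requested explicitly. groupby takes an observed argument (available since pandas 0.23; its default has changed between releases, as far as we are aware), so the sketch below passes it in both forms rather than relying on the default. The data is made up for this example.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 3]})

# observed=False keeps every category, including the unused "c".
all_groups = df.groupby("cats", observed=False)["values"].sum()

# observed=True keeps only the categories that occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].sum()

print(all_groups)
print(seen_groups)
```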

Pivot tables:

In [138]: raw_cat = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

In [139]: df = pd.DataFrame({"A":raw_cat,"B":["c","d","c","d"], "values":[1,2,3,4]})

In [140]: pd.pivot_table(df, values='values', index=['A', 'B'])
Out[140]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4

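Data munging

The optimized pandas data access methods .loc, .iloc, .at, and .iat work as normal. The only difference is the return type (for getting) and that only values already in the categories can be assigned.

Getting

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.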

In [141]: idx = pd.Index(["h","i","j","k","l","m","n",])

In [142]: cats = pd.Series(["a","b","b","b","c","c","c"], dtype="category", index=idx)

In [143]: values= [1,2,2,2,3,4,5]

In [144]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)

In [145]: df.iloc[2:4,:]
Out[145]: 
  cats  values
j    b       2
k    b       2

In [146]: df.iloc[2:4,:].dtypes
Out[146]: 
cats      category
values       int64
dtype: object

In [147]: df.loc["h":"j","cats"]
Out[147]: 
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): [a, b, c]

In [148]: df[df["cats"] == "b"]
Out[148]: 
  cats  values
i    b       2
j    b       2
k    b       2

An example where the category type is not preserved is taking one single row: the resulting Series is of dtype object:

# get the complete "h" row as a Series
In [149]: df.loc["h", :]
Out[149]: 
cats      a
values    1
Name: h, dtype: object

Returning a single item from categorical data will also return the value, not a categorical of length "1".

In [150]: df.iat[0,0]
Out[150]: 'a'

In [151]: df["cats"].cat.categories = ["x","y","z"]

In [152]: df.at["h","cats"] # returns a string
Out[152]: 'x'

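Note: This is in contrast to R's factor function, where factor(c(1,2,3))[1] returns a single value factor.

To get a single value Series of type category, pass in a list with a single value: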

In [153]: df.loc[["h"],"cats"]
Out[153]: 
h    x
Name: cats, dtype: category
Categories (3, object): [x, y, z]

String and datetime accessors

The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:

In [154]: str_s = pd.Series(list('aabb'))

In [155]: str_cat = str_s.astype('category')

In [156]: str_cat
Out[156]: 
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]

In [157]: str_cat.str.contains("a")
Out[157]: 
0     True
1     True
2    False
3    False
dtype: bool

In [158]: date_s = pd.Series(pd.date_range('1/1/2015', periods=5))

In [159]: date_cat = date_s.astype('category')

In [160]: date_cat
Out[160]: 
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

In [161]: date_cat.dt.day
Out[161]: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

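Note: The returned Series (or DataFrame) is of the same type as if you had used .str.<method> / .dt.<method> on a Series of that type (and not of type category!).

That means that the values returned from methods and properties on the accessors of a Series, and the values returned from methods and properties on the accessors of this Series converted to type category, will be equal: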

In [162]: ret_s = str_s.str.contains("a")

In [163]: ret_cat = str_cat.str.contains("a")

In [164]: ret_s.dtype == ret_cat.dtype
Out[164]: True

In [165]: ret_s == ret_cat
Out[165]: 
0    True
1    True
2    True
3    True
dtype: bool

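Note: The work is done on the categories and then a new Series is constructed. This has some performance implications if you have a Series of type string where many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series). In this case it can be faster to convert the original Series to one of type category and use .str.<method> or .dt.<property> on that.

Setting

Setting values in a categorical column (or Series) works as long as the value is included in the categories: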

In [166]: idx = pd.Index(["h","i","j","k","l","m","n"])

In [167]: cats = pd.Categorical(["a","a","a","a","a","a","a"], categories=["a","b"])

In [168]: values = [1,1,1,1,1,1,1]

In [169]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)

In [170]: df.iloc[2:4,:] = [["b",2],["b",2]]

In [171]: df
Out[171]: 
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1

In [172]: try:
   .....:     df.iloc[2:4,:] = [["c",3],["c",3]]
   .....: except ValueError as e:
   .....:     print("ValueError: " + str(e))
   .....: 
ValueError: Cannot setitem on a Categorical with a new category, set the categories first

Setting values by assigning categorical data will also check that the categories match:

In [173]: df.loc["j":"k","cats"] = pd.Categorical(["a","a"], categories=["a","b"])

In [174]: df
Out[174]: 
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1

In [175]: try:
   .....:     df.loc["j":"k","cats"] = pd.Categorical(["b","b"], categories=["a","b","c"])
   .....: except ValueError as e:
   .....:     print("ValueError: " + str(e))
   .....: 
ValueError: Cannot set a Categorical with another, without identical categories
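As the first error message above suggests, a new value can be assigned once its category has been added. A minimal sketch with made-up data; both ValueError and TypeError are caught because the exception type for this case is assumed to differ between pandas versions.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))

# Assigning a value outside the categories is rejected.
try:
    s.iloc[0] = "c"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Add the category first, then the assignment goes through.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"
print(rejected, s.tolist(), list(s.cat.categories))
```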

Assigning a Categorical to parts of a column of another type will use the values:

In [176]: df = pd.DataFrame({"a":[1,1,1,1,1], "b":["a","a","a","a","a"]})

In [177]: df.loc[1:2,"a"] = pd.Categorical(["b","b"], categories=["a","b"])

In [178]: df.loc[2:3,"b"] = pd.Categorical(["b","b"], categories=["a","b"])

In [179]: df
Out[179]: 
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a

In [180]: df.dtypes
Out[180]: 
a    object
b    object
dtype: object

Merging

You can concat two DataFrames containing categorical data together, but the categories of these categoricals need to be the same:

In [181]: cat = pd.Series(["a","b"], dtype="category")

In [182]: vals = [1,2]

In [183]: df = pd.DataFrame({"cats":cat, "vals":vals})

In [184]: res = pd.concat([df,df])

In [185]: res
Out[185]: 
  cats  vals
0    a     1
1    b     2
0    a     1
1    b     2

In [186]: res.dtypes
Out[186]: 
cats    category
vals       int64
dtype: object

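In this case the categories are not the same. Note that the session below prints no ValueError: in the pandas version shown here the concatenation succeeds and, as described in the Concatenation section, the result falls back to object dtype: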

In [187]: df_different = df.copy()

In [188]: df_different["cats"].cat.categories = ["c","d"]

In [189]: try:
   .....:     pd.concat([df,df_different])
   .....: except ValueError as e:
   .....:     print("ValueError: " + str(e))
   .....:

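The same applies to df.append(df_different).

See also the section on merge dtypes for notes about preserving merge dtypes and performance.

Unioning

New in version 0.19.0.

If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function will combine a list-like of categoricals. The new categories will be the union of the categories being combined.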

In [190]: from pandas.api.types import union_categoricals

In [191]: a = pd.Categorical(["b", "c"])

In [192]: b = pd.Categorical(["a", "b"])

In [193]: union_categoricals([a, b])
Out[193]: 
[b, c, a, b]
Categories (3, object): [b, c, a]

By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexsorted, use the sort_categories=True argument.

In [194]: union_categoricals([a, b], sort_categories=True)
Out[194]: 
[b, c, a, b]
Categories (3, object): [a, b, c]

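union_categoricals also works with the "easy" case of combining two categoricals with the same categories and order information (i.e. cases where append would also work).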

In [195]: a = pd.Categorical(["a", "b"], ordered=True)

In [196]: b = pd.Categorical(["a", "b", "a"], ordered=True)

In [197]: union_categoricals([a, b])
Out[197]: 
[a, b, a, b, a]
Categories (2, object): [a < b]

The following raises a TypeError because the categories are ordered and not identical.

In [1]: a = pd.Categorical(["a", "b"], ordered=True)
In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True)
In [3]: union_categoricals([a, b])
Out[3]:
TypeError: to union ordered Categoricals, all categories must be the same

New in version 0.20.0.

Ordered categoricals with different categories or orderings can be combined by using the ignore_order=True argument.

In [198]: a = pd.Categorical(["a", "b", "c"], ordered=True)

In [199]: b = pd.Categorical(["c", "b", "a"], ordered=True)

In [200]: union_categoricals([a, b], ignore_order=True)
Out[200]: 
[a, b, c, c, b, a]
Categories (3, object): [a, b, c]

union_categoricals() also works with a CategoricalIndex or a Series containing categorical data, but note that the resulting array will always be a plain Categorical:

In [201]: a = pd.Series(["b", "c"], dtype='category')

In [202]: b = pd.Series(["a", "b"], dtype='category')

In [203]: union_categoricals([a, b])
Out[203]: 
[b, c, a, b]
Categories (3, object): [b, c, a]

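Note: union_categoricals may recode the integer codes for categories when combining categoricals. This is likely what you want, but be aware of it if you rely on the exact numbering of the categories.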

In [204]: c1 = pd.Categorical(["b", "c"])

In [205]: c2 = pd.Categorical(["a", "b"])

In [206]: c1
Out[206]: 
[b, c]
Categories (2, object): [b, c]

# "b" is coded to 0
In [207]: c1.codes
Out[207]: array([0, 1], dtype=int8)

In [208]: c2
Out[208]: 
[a, b]
Categories (2, object): [a, b]

# "b" is coded to 1
In [209]: c2.codes
Out[209]: array([0, 1], dtype=int8)

In [210]: c = union_categoricals([c1, c2])

In [211]: c
Out[211]: 
[b, c, a, b]
Categories (3, object): [b, c, a]

# "b" is coded to 0 throughout, same as c1, different from c2
In [212]: c.codes
Out[212]: array([0, 1, 2, 0], dtype=int8)

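Concatenation

This section describes concatenations specific to the category dtype. See Concatenating objects for a general description.

By default, concatenating Series or DataFrames that contain the same categories results in category dtype; otherwise the result is object dtype. Use .astype or union_categoricals to get a category result.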

# same categories
In [213]: s1 = pd.Series(['a', 'b'], dtype='category')

In [214]: s2 = pd.Series(['a', 'b', 'a'], dtype='category')

In [215]: pd.concat([s1, s2])
Out[215]: 
0    a
1    b
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]

# different categories
In [216]: s3 = pd.Series(['b', 'c'], dtype='category')

In [217]: pd.concat([s1, s3])
Out[217]: 
0    a
1    b
0    b
1    c
dtype: object

In [218]: pd.concat([s1, s3]).astype('category')
Out[218]: 
0    a
1    b
0    b
1    c
dtype: category
Categories (3, object): [a, b, c]

In [219]: union_categoricals([s1.values, s3.values])
Out[219]: 
[a, b, b, c]
Categories (3, object): [a, b, c]

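The following table summarizes the results of Categorical-related concatenations.

arg1	arg2	result
category	category (identical categories)	category
category	category (different categories, both not ordered)	object (dtype is inferred)
category	category (different categories, either one is ordered)	object (dtype is inferred)
category	not category	object (dtype is inferred)

Getting data in/out

You can write data that contains category dtypes to an HDFStore. See the pandas IO documentation on HDFStore for an example and caveats.

It is also possible to write data to and read data from Stata format files. See the pandas IO documentation on Stata for an example and caveats.

Writing to a CSV file will convert the data, effectively removing any information about the categorical (categories and ordering). So if you read back the CSV file, you have to convert the relevant columns back to category and assign the right categories and categories ordering.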

In [220]: s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'd']))

# rename the categories
In [221]: s.cat.categories = ["very good", "good", "bad"]

# reorder the categories and add missing categories
In [222]: s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [223]: df = pd.DataFrame({"cats":s, "vals":[1,2,3,4,5,6]})

# StringIO is assumed to come from the standard library: from io import StringIO
In [224]: csv = StringIO()

In [225]: df.to_csv(csv)

In [226]: df2 = pd.read_csv(StringIO(csv.getvalue()))

In [227]: df2.dtypes
Out[227]: 
Unnamed: 0     int64
cats          object
vals           int64
dtype: object

In [228]: df2["cats"]
Out[228]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: object

# Redo the category
In [229]: df2["cats"] = df2["cats"].astype("category")

In [230]: df2["cats"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"],
   .....:                                inplace=True)
   .....: 

In [231]: df2.dtypes
Out[231]: 
Unnamed: 0       int64
cats          category
vals             int64
dtype: object

In [232]: df2["cats"]
Out[232]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

The same holds for writing to a SQL database with to_sql.
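Since a CategoricalDtype can be passed anywhere pandas expects a dtype, including pandas.read_csv(), the categories and ordering can also be restored while reading instead of afterwards. A minimal sketch, assuming the rating scale below (made up for this example) is known in advance:

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

rating = CategoricalDtype(["bad", "good", "very good"], ordered=True)
df = pd.DataFrame({
    "cats": pd.Series(["good", "bad", "good"]).astype(rating),
    "vals": [1, 2, 3],
})

# Write to an in-memory CSV; the categorical information is lost here.
buffer = StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Pass the dtype again on the way in to restore categories and order.
restored = pd.read_csv(buffer, dtype={"cats": rating})
print(restored["cats"].cat.categories, restored["cats"].cat.ordered)
```

This avoids the separate astype/set_categories step, at the cost of having to keep the dtype definition next to the code that reads the file.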

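Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.

Missing values should not be included in the Categorical's categories, only in the values. Instead, it is understood that NaN is different, and is always a possibility. When working with the Categorical's codes, missing values will always have a code of -1.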

In [233]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# only two categories
In [234]: s
Out[234]: 
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): [a, b]

In [235]: s.cat.codes
Out[235]: 
0    0
1    1
2   -1
3    0
dtype: int8

Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:

In [236]: s = pd.Series(["a", "b", np.nan], dtype="category")

In [237]: s
Out[237]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): [a, b]

In [238]: pd.isna(s)
Out[238]: 
0    False
1    False
2     True
dtype: bool

In [239]: s.fillna("a")
Out[239]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]

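Differences to R's factor

The following differences to R's factor functions can be observed:

R's levels are named categories.
R's levels are always of type string, while categories in pandas can be of any dtype.
It's not possible to specify labels at creation time. Use s.cat.rename_categories(new_labels) afterwards.
In contrast to R's factor function, using categorical data as the sole input to create a new categorical Series will not remove unused categories but create a new categorical Series which is equal to the passed-in one!
R allows missing values to be included in its levels (pandas' categories). Pandas does not allow NaN categories, but missing values can still be in the values.

Gotchas

Memory usage

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.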

In [240]: s = pd.Series(['foo','bar']*1000)

# object dtype
In [241]: s.nbytes
Out[241]: 16000

# category dtype
In [242]: s.astype('category').nbytes
Out[242]: 2016

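Note: If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation.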

In [243]: s = pd.Series(['foo%04d' % i for i in range(2000)])

# object dtype
In [244]: s.nbytes
Out[244]: 16000

# category dtype
In [245]: s.astype('category').nbytes
Out[245]: 20000
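nbytes only counts the array itself, not the Python string objects it refers to. A rough sketch of a fuller comparison using memory_usage(deep=True); the exact byte counts depend on the pandas version and platform, so only the relative sizes are worth reading off. The helper deep_bytes is a hypothetical name for this example.

```python
import pandas as pd

few_unique = pd.Series(["foo", "bar"] * 1000)
all_unique = pd.Series(["foo%04d" % i for i in range(2000)])

def deep_bytes(series):
    """Memory of the values including the strings, excluding the index."""
    return series.memory_usage(index=False, deep=True)

few_plain = deep_bytes(few_unique)
few_cat = deep_bytes(few_unique.astype("category"))
many_plain = deep_bytes(all_unique)
many_cat = deep_bytes(all_unique.astype("category"))

print(few_plain, few_cat)
print(many_plain, many_cat)
```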

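Categorical is not a numpy array

Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level NumPy array dtype. This leads to some problems.

NumPy itself doesn't know about the new dtype: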

In [246]: try:
   .....:     np.dtype("category")
   .....: except TypeError as e:
   .....:     print("TypeError: " + str(e))
   .....: 
TypeError: data type "category" not understood

In [247]: dtype = pd.Categorical(["a"]).dtype

In [248]: try:
   .....:     np.dtype(dtype)
   .....: except TypeError as e:
   .....:      print("TypeError: " + str(e))
   .....: 
TypeError: data type not understood

Dtype comparisons work:

In [249]: dtype == np.str_
Out[249]: False

In [250]: np.str_ == dtype
Out[250]: False

To check if a Series contains categorical data, use hasattr(s, 'cat'):

In [251]: hasattr(pd.Series(['a'], dtype='category'), 'cat')
Out[251]: True

In [252]: hasattr(pd.Series(['a']), 'cat')
Out[252]: False
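An alternative to the hasattr check, offered as a sketch rather than the documented recommendation, is to test the dtype directly against CategoricalDtype. The helper name is_categorical is made up for this example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

def is_categorical(series):
    """Return True if the Series has a category dtype."""
    return isinstance(series.dtype, CategoricalDtype)

cat_result = is_categorical(pd.Series(["a"], dtype="category"))
plain_result = is_categorical(pd.Series(["a"]))
print(cat_result, plain_result)
```

Checking the dtype states the intent more directly than probing for an accessor attribute.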

Using NumPy functions on a Series of type category should not work, as Categoricals are not numeric data (even when .categories is numeric).

In [253]: s = pd.Series(pd.Categorical([1,2,3,4]))

In [254]: try:
   .....:     np.sum(s)
   .....: except TypeError as e:
   .....:      print("TypeError: " + str(e))
   .....: 
TypeError: Categorical cannot perform the operation sum

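Note: If such a function works, please file a bug at https://github.com/pandas-dev/pandas!

dtype in apply

Pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (same as getting a row; getting one element will return a basic type), and applying along columns will also convert to object.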

In [255]: df = pd.DataFrame({"a":[1,2,3,4],
   .....:                    "b":["a","b","c","d"],
   .....:                    "cats":pd.Categorical([1,2,3,2])})
   .....: 

In [256]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[256]: 
0    <class 'int'>
1    <class 'int'>
2    <class 'int'>
3    <class 'int'>
dtype: object

In [257]: df.apply(lambda col: col.dtype, axis=0)
Out[257]: 
a          int64
b         object
cats    category
dtype: object

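Categorical index

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements. See the advanced indexing docs for a more detailed explanation.

Setting the index will create a CategoricalIndex: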

In [258]: cats = pd.Categorical([1,2,3,4], categories=[4,2,3,1])

In [259]: strings = ["a","b","c","d"]

In [260]: values = [4,2,3,1]

In [261]: df = pd.DataFrame({"strings":strings, "values":values}, index=cats)

In [262]: df.index
Out[262]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')

# This now sorts by the categories order
In [263]: df.sort_index()
Out[263]: 
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4

Side effects

Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:

In [264]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])

In [265]: s = pd.Series(cat, name="cat")

In [266]: cat
Out[266]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [267]: s.iloc[0:2] = 10

In [268]: cat
Out[268]: 
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [269]: df = pd.DataFrame(s)

In [270]: df["cat"].cat.categories = [1,2,3,4,5]

In [271]: cat
Out[271]: 
[5, 5, 3, 5]
Categories (5, int64): [1, 2, 3, 4, 5]

Use copy=True to prevent such behaviour, or simply don't reuse Categoricals:

In [272]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])

In [273]: s = pd.Series(cat, name="cat", copy=True)

In [274]: cat
Out[274]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [275]: s.iloc[0:2] = 10

In [276]: cat
Out[276]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

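Note: This also happens in some cases when you supply a NumPy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behavior, while using a string array (e.g. np.array(["a","b","c","a"])) will not.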
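Another way to sidestep the sharing, sketched here as an alternative to copy=True, is to hand the Series an explicit copy of the Categorical. Whether the uncopied case really shares data is assumed to depend on the pandas version, so the sketch only demonstrates the copy.

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# Build the Series from an explicit copy of the Categorical.
s = pd.Series(cat.copy(), name="cat")
s.iloc[0:2] = 10

print(s.tolist())    # the Series was modified
print(cat.tolist())  # the original Categorical is untouched
```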

Reference: http://pandas.pydata.org/pandas-docs/stable/categorical.html#
