python astype category: Python for Data Analysis (11), Advanced Usage of category

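This article shows how to handle categorical data with Python's pandas library: converting a column with `astype('category')`, creating Categorical instances with `pd.Categorical`, and binning values into quartiles with `qcut()`. It also looks at how the Categorical type improves performance and shows how to generate dummy variables with `get_dummies()`.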

This article covers an advanced pandas topic: categorical data (the category dtype).


Categorical data

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

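Background and motivation

A column often contains repeated values drawn from a small set of distinct values.

unique() and value_counts() extract the distinct values from an array and count how often each one occurs.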

values = pd.Series(["apple","orange","apple","apple"] * 2)

values

0 apple

1 orange

2 apple

3 apple

4 apple

5 orange

6 apple

7 apple

dtype: object

pd.unique(values) # list the distinct values

array(['apple', 'orange'], dtype=object)

values.value_counts() # count the occurrences of each value (the top-level pd.value_counts is deprecated in recent pandas)

apple 6

orange 2

dtype: int64

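Dimension tables

A dimension table holds the distinct values, and the primary observations are stored as integer keys that reference it.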

values = pd.Series([0,1,0,0] * 2)

dim = pd.Series(["apple","orange"])

values

0 0

1 1

2 0

3 0

4 0

5 1

6 0

7 0

dtype: int64

dim

0 apple

1 orange

dtype: object

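The take method: restoring the original values from the dictionary-encoded representation

The array of distinct values is called the categories, dictionary, or levels of the data. This representation as integers is known as the categorical or dictionary-encoded representation.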

dim.take(values)

0 apple

1 orange

0 apple

0 apple

0 apple

1 orange

0 apple

0 apple

dtype: object
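The integer keys and the dimension table above were typed by hand. As a minimal sketch, assuming `pd.factorize` is an acceptable way to derive both from raw values (the original text does not use it, and the names `raw`, `codes`, and `uniques` are illustrative), the round trip looks like this:

```python
import pandas as pd

raw = pd.Series(["apple", "orange", "apple", "apple"] * 2)

# factorize returns integer codes plus the distinct values,
# listed in order of first appearance
codes, uniques = pd.factorize(raw)

dim = pd.Series(uniques)      # plays the role of the dimension table
restored = dim.take(codes)    # look each code up in the table

print(codes)
print(list(uniques))
print(restored.tolist() == raw.tolist())
```

The last line checks that looking the codes up in the table gives back the original strings.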

Using the Categorical type

fruits = ["apple","orange","apple","apple"] * 2

N = len(fruits)

df = pd.DataFrame({"fruit": fruits, # values for each column
                   "basket_id": np.arange(N),
                   "count": np.random.randint(3, 15, size=N),
                   "weight": np.random.uniform(0, 4, size=N)},
                  columns=["basket_id", "fruit", "count", "weight"]) # four columns, in this order

df

[Output: an 8-row DataFrame with the columns basket_id, fruit, count, weight]

df["fruit"]

0 apple

1 orange

2 apple

3 apple

4 apple

5 orange

6 apple

7 apple

Name: fruit, dtype: object

Creating a Categorical instance

fruit_cat = df["fruit"].astype("category") # convert the column to the category dtype

fruit_cat # a Series whose values are a pd.Categorical instance

0 apple

1 orange

2 apple

3 apple

4 apple

5 orange

6 apple

7 apple

Name: fruit, dtype: category

Categories (2, object): [apple, orange]

c = fruit_cat.values

c

[apple, orange, apple, apple, apple, orange, apple, apple]

Categories (2, object): [apple, orange]

Two attributes: categories and codes

print(c.categories)

print("-----")

print(c.codes)

Index(['apple', 'orange'], dtype='object')

-----

[0 1 0 0 0 1 0 0]
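The two attributes together fully describe the data: each code is a position in `categories`. A small sketch of that relationship, rebuilding the values by hand (the name `rebuilt` is illustrative):

```python
import pandas as pd

c = pd.Categorical(["apple", "orange", "apple", "apple"] * 2)

# Each code is a position in c.categories, so a positional lookup
# recovers the original values
rebuilt = c.categories.take(c.codes)

print(c.codes.dtype)               # a small integer type
print(list(rebuilt) == list(c))
```

With so few categories, pandas stores the codes as 8-bit integers, which is where the memory savings discussed later come from.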

# Convert a DataFrame column to categorical

df["fruit"] = df["fruit"].astype("category")

df.fruit

0 apple

1 orange

2 apple

3 apple

4 apple

5 orange

6 apple

7 apple

Name: fruit, dtype: category

Categories (2, object): [apple, orange]

Creating pd.Categorical objects from other sequences

my_categories = pd.Categorical(['foo','bar','baz','foo','bar'])

my_categories

[foo, bar, baz, foo, bar]

Categories (3, object): [bar, baz, foo]

When the category codes are already known: from_codes

categories = ["foo","bar","baz"]

codes = [0,1,0,0,1,0,1,0]

my_code = pd.Categorical.from_codes(codes,categories)

my_code

[foo, bar, foo, foo, bar, foo, bar, foo]

Categories (3, object): [foo, bar, baz]
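Two details of `from_codes` are worth knowing when the codes come from an external source. The sketch below, which reuses the `categories` list from above, illustrates that the code -1 stands for a missing value and that a code pointing past the end of the categories is rejected:

```python
import pandas as pd

categories = ["foo", "bar", "baz"]

# -1 is the sentinel for a missing value
with_missing = pd.Categorical.from_codes([0, -1, 2], categories)
print(with_missing.isna().tolist())

# A code outside the range -1 .. len(categories) - 1 is rejected
try:
    pd.Categorical.from_codes([0, 3], categories)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```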

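Specifying an explicit order: ordered=True

Unless told otherwise, a categorical conversion assumes no particular order among the categories. The order can be stated explicitly:

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True) # the order follows the categories list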

ordered_cat

[foo, bar, foo, foo, bar, foo, bar, foo]

Categories (3, object): [foo < bar < baz]

Marking an unordered instance as ordered with as_ordered

# as_ordered() returns an ordered version; the order is that of the existing categories

my_categories.as_ordered()

[foo, bar, baz, foo, bar]

Categories (3, object): [bar < baz < foo]
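What the ordering buys in practice is that order-dependent operations become meaningful. A short sketch, using the same values as `my_categories` above, of what works on the ordered version and fails on the unordered one:

```python
import pandas as pd

cat = pd.Categorical(["foo", "bar", "baz", "foo", "bar"])
ordered = cat.as_ordered()   # category order: bar < baz < foo

# min, max and comparisons follow the category order
print(ordered.min(), ordered.max())
print((ordered > "bar").tolist())

# The same reduction on the unordered instance raises TypeError
try:
    cat.min()
    unordered_ok = True
except TypeError:
    unordered_ok = False
print(unordered_ok)
```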

Computations with Categorical objects

np.random.seed(12345) # set the random seed

draws = np.random.randn(1000)

draws[:5]

array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057])

The qcut() function: quartiles

# Bin the data into quartiles

bins = pd.qcut(draws,4)

bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]

Length: 1000

Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]
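Because `qcut` places its bin edges at sample quantiles, every bin receives the same number of observations. The sketch below checks this on the same seeded data and contrasts it with `pd.cut`, which is not used in the original text and appears here only for comparison: it uses equally wide bins instead.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
draws = np.random.randn(1000)

# qcut: edges are sample quantiles, so the bins hold equal counts
by_quantile = pd.Series(pd.qcut(draws, 4)).value_counts()

# cut: edges are equally spaced, so the counts follow the data's shape
by_width = pd.Series(pd.cut(draws, 4)).value_counts()

print(by_quantile.tolist())
print(sorted(by_width.tolist()))
```

For normally distributed data the equal-width bins are very uneven, with most observations in the middle bins.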

Naming the quartiles with labels

bins = pd.qcut(draws,4,labels=["Q1","Q2","Q3","Q4"])

bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]

Length: 1000

Categories (4, object): [Q1 < Q2 < Q3 < Q4]

bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

Extracting summary statistics with groupby

bins = pd.Series(bins, name="quartile")

results = (pd.Series(draws)

.groupby(bins)

.agg(["count","min","max"]).reset_index()

)

results

[Output: a 4-row DataFrame, one row per quartile, with the columns quartile, count, min, max]

results["quartile"] # the column keeps the original categorical information, including the order

0 Q1

1 Q2

2 Q3

3 Q4

Name: quartile, dtype: category

Categories (4, object): [Q1 < Q2 < Q3 < Q4]

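Better performance with categoricals

If you do a lot of analysis on a particular dataset, converting its repetitive columns to categorical can greatly improve performance and reduce memory use.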

N = 10000000

draws = pd.Series(np.random.randn(N))

labels = pd.Series(["foo","bar","baz","qux"] * (N // 4))

labels

0 foo

1 bar

2 baz

3 qux

4 foo

...

9999995 qux

9999996 foo

9999997 bar

9999998 baz

9999999 qux

Length: 10000000, dtype: object

Converting to categorical

# Convert the labels to categorical

categories = labels.astype("category")

categories

0 foo

1 bar

2 baz

3 qux

4 foo

...

9999995 qux

9999996 foo

9999997 bar

9999998 baz

9999999 qux

Length: 10000000, dtype: category

Categories (4, object): [bar, baz, foo, qux]

Memory comparison

labels.memory_usage()

80000128

categories.memory_usage()

10000320
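Note that `memory_usage()` defaults to `deep=False`, which for an object column counts only the 8-byte pointers (10,000,000 × 8 bytes plus the index gives the 80000128 above) and not the Python strings they point to. A minimal sketch of the deep measurement, with `N` reduced from the article's value as an assumption to keep it quick:

```python
import pandas as pd

N = 100_000  # smaller than the article's N, to keep the sketch quick
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4), dtype=object)
categories = labels.astype("category")

# deep=True also counts the string objects behind each pointer
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = categories.memory_usage(deep=True)

print(obj_bytes, cat_bytes)
print(categories.cat.codes.dtype)   # one small integer per row
```

The exact byte counts depend on the Python and pandas versions, so none are quoted here.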

The cost of the conversion

%time _ = labels.astype("category")

CPU times: user 374 ms, sys: 34.8 ms, total: 409 ms

Wall time: 434 ms
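The conversion is paid once, while later operations on the column can reuse the integer codes. The following is a rough sketch of how one might measure that, assuming a grouped mean as the workload; the helper `timed` and the size `N` are illustrative, and timings vary with the machine and the pandas version, so none are quoted:

```python
import time

import numpy as np
import pandas as pd

N = 200_000  # illustrative size
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(N))
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4), dtype=object)
cats = labels.astype("category")    # one-time conversion cost

def timed(key):
    """Return (seconds, result) for a grouped mean keyed by `key`."""
    start = time.perf_counter()
    result = values.groupby(key, observed=True).mean()
    return time.perf_counter() - start, result

t_obj, mean_obj = timed(labels)
t_cat, mean_cat = timed(cats)

print(f"object key:   {t_obj:.4f} s")
print(f"category key: {t_cat:.4f} s")
```

Both keys produce the same group means; only the time taken to compute them differs.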

Categorical methods

s = pd.Series(["a","b","c","d"] * 2)

cat_s = s.astype("category")

cat_s

0 a

1 b

2 c

3 d

4 a

5 b

6 c

7 d

dtype: category

Categories (4, object): [a, b, c, d]

The cat attribute

The special attribute cat provides access to the categorical methods and attributes, for example:

codes

categories

set_categories

cat_s.cat.codes

0 0

1 1

2 2

3 3

4 0

5 1

6 2

7 3

dtype: int8

cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

When the actual set of categories is larger than the one observed in the data

actual_categories = ["a","b","c","d","e"]

cat_s2 = cat_s.cat.set_categories(actual_categories)

cat_s2

0 a

1 b

2 c

3 d

4 a

5 b

6 c

7 d

dtype: category

Categories (5, object): [a, b, c, d, e]

cat_s2.value_counts()

d 2

c 2

b 2

a 2

e 0

dtype: int64
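`set_categories` also works in the other direction. As a small sketch (the name `narrowed` is illustrative), passing a list that omits categories present in the data turns the affected rows into missing values, which is easy to overlook:

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# Keep only "a" and "b"; rows holding "c" or "d" become missing
narrowed = cat_s.cat.set_categories(["a", "b"])

print(int(narrowed.isna().sum()))
print(narrowed.cat.categories.tolist())
```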

Removing categories that do not appear in the data

cat_s3 = cat_s[cat_s.isin(["a","b"])]

cat_s3

0 a

1 b

4 a

5 b

dtype: category

Categories (4, object): [a, b, c, d]

# c and d no longer occur in the data, so drop them

cat_s3.cat.remove_unused_categories()

0 a

1 b

4 a

5 b

dtype: category

Categories (2, object): [a, b]

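Creating dummy variables: get_dummies()

In machine learning and statistics, categorical data often has to be converted into dummy variables, also known as one-hot encoding.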

cat_s = pd.Series(["a","b","c","d"] * 2, dtype="category")

cat_s

0 a

1 b

2 c

3 d

4 a

5 b

6 c

7 d

dtype: category

Categories (4, object): [a, b, c, d]

pd.get_dummies(cat_s)

[Output: an 8-row DataFrame with one indicator column per category: a, b, c, d]
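One practical consequence of using the category dtype here is that `get_dummies()` creates a column for every declared category, including those that never occur. The sketch below assumes a fifth category "e" and the `prefix` and `dtype` arguments purely for illustration; none of them appear in the original example.

```python
import pandas as pd

# Declare five categories even though "e" never occurs in the data
cat_s = pd.Series(
    pd.Categorical(["a", "b", "c", "d"] * 2, categories=list("abcde"))
)

dummies = pd.get_dummies(cat_s, prefix="cat", dtype=int)

print(dummies.columns.tolist())
print(dummies.sum(axis=1).unique().tolist())   # exactly one 1 per row
print(int(dummies["cat_e"].sum()))             # the unobserved category
```

Fixing the categories up front is a common way to get the same dummy columns from different subsets of a dataset.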
