Pandas Categorical Data: Related Functions (17)

Building on the previous two chapters, this chapter looks at some of the functions and attributes related to Categorical Data.

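17.1 Modifying the categories of a Categorical

The categories of Categorical Data can be replaced, either by direct assignment or with the rename_categories function.

  • Change the categories by assignment or with the set_categories() function. The example below uses assignment.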
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print df,"\n"
print df.fruit.values.categories
print df.fruit.values.codes
df.fruit.values.categories = ["Pearl", "Orange", "Apple"]
print df.fruit.values.categories
print df.fruit.values.codes
print df

Program output:

    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
5   apple    5.0
6  orange    7.5
7  orange    7.3
9   apple    5.2
4   pearl    3.7
8  orange    7.3 

Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1 1 0 2 1]
Index([u'Pearl', u'Orange', u'Apple'], dtype='object')
[0 2 1 0 1 1 0 2 1]
    fruit  price
1   Pearl    5.2
2   Apple    3.5
3  Orange    7.3
5   Pearl    5.0
6  Orange    7.5
7  Orange    7.3
9   Pearl    5.2
4   Apple    3.7
8  Orange    7.3

Compare the data before and after the categories of the categorical data were changed: the codes of the categorical stayed the same, but the values printed for df.fruit changed.

Supplier  Fruit   Price  before <==> after  Fruit   Price
1         apple   5.20         <==>         Pearl   5.20
2         pearl   3.50         <==>         Apple   3.50
3         orange  7.30         <==>         Orange  7.30
5         apple   5.00         <==>         Pearl   5.00
6         orange  7.50         <==>         Orange  7.50
7         orange  7.30         <==>         Orange  7.30
9         apple   5.20         <==>         Pearl   5.20
4         pearl   3.70         <==>         Apple   3.70
8         orange  7.30         <==>         Orange  7.30

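The reason is that the categories of the original categorical variable df.fruit, ["apple","orange","pearl"], were replaced by ["Pearl", "Orange", "Apple"]. Each code is a position in the categories list, so the same codes now map to different labels. When a function such as set_categories() or rename_categories() is used instead of assignment, note that its inplace parameter defaults to False, which leaves the original data untouched; to modify the original categorical data, set inplace to True.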
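The relationship between codes and categories can be sketched on its own, outside a DataFrame. The snippet below is a minimal illustration written for Python 3 and a recent pandas release (an assumption, since the listings in this chapter use Python 2); the variable names before and after are made up for the example. It builds two categoricals from the same codes with pd.Categorical.from_codes and shows that only the labels differ.

```python
import pandas as pd

# The same integer codes as in the example above.
codes = [0, 2, 1, 0, 1, 1, 0, 2, 1]

# Each code is an index into the categories list.
before = pd.Categorical.from_codes(codes, categories=["apple", "orange", "pearl"])
after = pd.Categorical.from_codes(codes, categories=["Pearl", "Orange", "Apple"])

print(list(before))  # labels looked up in the old categories
print(list(after))   # same codes, labels looked up in the new categories
print(list(before.codes) == list(after.codes))
```

Because position 0 held "apple" before and holds "Pearl" afterwards, every row that showed apple now shows Pearl, even though no code changed.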

  • Use the rename_categories function to change the categories.
import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print df[:4],"\n"
print df.fruit.values.categories
print df.fruit.values.codes
df.fruit.values.rename_categories(["Pearl", "Orange", "Apple"],inplace = True)
print df.fruit.values.categories
print df.fruit.values.codes
print df[:4]

Program output:

    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
5   apple    5.0 

Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1 1 0 2 1]
Index([u'Pearl', u'Orange', u'Apple'], dtype='object')
[0 2 1 0 1 1 0 2 1]
    fruit  price
1   Pearl    5.2
2   Apple    3.5
3  Orange    7.3
5   Pearl    5.0
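The listings above rely on assigning to .categories and on inplace = True. As far as I know, both were removed in pandas 2.0, so on a recent release the rename would be written by reassigning the result of the .cat accessor. The following is a sketch under that assumption (Python 3, recent pandas); the Series name fruit and the dict form of the mapping are choices made for this example, not taken from the original code.

```python
import pandas as pd

fruit = pd.Series(["apple", "pearl", "orange", "apple"], dtype="category")

# rename_categories returns a new Series; the original is left unchanged.
# A dict maps old labels to new ones explicitly, which avoids relying on order.
renamed = fruit.cat.rename_categories(
    {"apple": "Pearl", "orange": "Orange", "pearl": "Apple"}
)

print(list(renamed.cat.categories))
print(list(renamed.cat.codes))
print(list(renamed))
```

In a DataFrame the same idea would read df["fruit"] = df["fruit"].cat.rename_categories(...), with the result assigned back to the column.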

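17.2 Adding categories

Adding categories means increasing the number of categories, which can be done with the add_categories function. The example below adds a new kind of fruit to the supply, watermelon.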

import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
df_new = pd.DataFrame({"fruit":["watermelon"] * 3, 
                       "price":[2.75, 2.60, 2.55]},
                       index = [11, 12, 13])
df.fruit.values.add_categories("watermelon", inplace = True)
print "insert datas->\n",df_new
df = df.append(df_new)
print "after insert->\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes

Program output:

insert datas->
         fruit  price
11  watermelon   2.75
12  watermelon   2.60
13  watermelon   2.55
after insert->
         fruit  price
1        apple   5.20
2        pearl   3.50
3       orange   7.30
5        apple   5.00
6       orange   7.50
7       orange   7.30
9        apple   5.20
4        pearl   3.70
8       orange   7.30
11  watermelon   2.75
12  watermelon   2.60
13  watermelon   2.55
categories->
Index([u'apple', u'orange', u'pearl', u'watermelon'], dtype='object')
codes->
[0 2 1 0 1 1 0 2 1 3 3 3]

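Note that add_categories has to be called before the new data is inserted. Otherwise the rows are added, but their codes are not updated: they are all -1 and the values show up as NaN. This is what happens if df.fruit.values.add_categories("watermelon", inplace = True) is placed after the statement that adds the data, df = df.append(df_new):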

import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
df_new = pd.DataFrame({"fruit":["watermelon"] * 3, 
                       "price":[2.75, 2.60, 2.55]},
                       index = [11, 12, 13])

print "insert datas->\n",df_new
df = df.append(df_new)
df.fruit.values.add_categories("watermelon", inplace = True)
print "after insert->\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes

The result is then:

insert datas->
         fruit  price
11  watermelon   2.75
12  watermelon   2.60
13  watermelon   2.55
after insert->
     fruit  price
1    apple   5.20
2    pearl   3.50
3   orange   7.30
5    apple   5.00
6   orange   7.50
7   orange   7.30
9    apple   5.20
4    pearl   3.70
8   orange   7.30
11     NaN   2.75
12     NaN   2.60
13     NaN   2.55
categories->
Index([u'apple', u'orange', u'pearl', u'watermelon'], dtype='object')
codes->
[ 0  2  1  0  1  1  0  2  1 -1 -1 -1]
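On a recent pandas release the same "add the category first, then add the rows" order can be sketched with pd.concat, since DataFrame.append and the inplace parameter are, to my knowledge, no longer available in pandas 2.x. This is an illustrative sketch under that assumption, with a shortened data set; casting the new rows to the existing column's dtype is a choice made here so that the combined column stays categorical.

```python
import pandas as pd

df = pd.DataFrame(
    {"fruit": ["apple", "pearl", "orange"], "price": [5.20, 3.50, 7.30]},
    index=[1, 2, 3],
)
# Register the new category before any watermelon rows exist.
df["fruit"] = df["fruit"].astype("category").cat.add_categories("watermelon")

df_new = pd.DataFrame(
    {"fruit": ["watermelon"] * 2, "price": [2.75, 2.60]}, index=[11, 12]
)
# Give the new rows the same categorical dtype as the existing column.
df_new["fruit"] = df_new["fruit"].astype(df["fruit"].dtype)

combined = pd.concat([df, df_new])
print(combined)
print(list(combined["fruit"].cat.categories))
print(list(combined["fruit"].cat.codes))
```

If the two fruit columns had different dtypes, pd.concat would fall back to a non-categorical column, so there would be no codes to inspect at all.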

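17.3 Removing categories

If the fruit shop stops selling apples, every apple record has to be removed from the fruit column, and apple has to be removed from the categories as well.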

import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "before del 'apple'\n", df
df = df[df.fruit != "apple"]
df.fruit.values.remove_categories("apple", inplace = True)
print "after del 'apple'\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes
print df.fruit

Program output:

before del 'apple'
    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
5   apple    5.0
6  orange    7.5
7  orange    7.3
9   apple    5.2
4   pearl    3.7
8  orange    7.3
after del 'apple'
    fruit  price
2   pearl    3.5
3  orange    7.3
6  orange    7.5
7  orange    7.3
4   pearl    3.7
8  orange    7.3
categories->
Index([u'orange', u'pearl'], dtype='object')
codes->
[1 0 0 0 1 0]
2     pearl
3    orange
6    orange
7    orange
4     pearl
8    orange
Name: fruit, dtype: category
Categories (2, object): [orange, pearl]

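In the code, df = df[df.fruit != "apple"] uses boolean selection to drop every "apple" record, while df.fruit.values.remove_categories("apple", inplace = True) removes the category "apple" from the categories of the fruit column of df. If that statement is commented out, the codes are still encoded against the original categories.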

import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "before del 'apple'\n", df
df = df[df.fruit != "apple"]
#df.fruit.values.remove_categories("apple", inplace = True)
print "after del 'apple'\n", df
print "categories->\n",df.fruit.values.categories
print "codes->\n",df.fruit.values.codes
print df.fruit

The result is:

before del 'apple'
    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
5   apple    5.0
6  orange    7.5
7  orange    7.3
9   apple    5.2
4   pearl    3.7
8  orange    7.3
after del 'apple'
    fruit  price
2   pearl    3.5
3  orange    7.3
6  orange    7.5
7  orange    7.3
4   pearl    3.7
8  orange    7.3
categories->
Index([u'apple', u'orange', u'pearl'], dtype='object')
codes->
[2 1 1 1 2 1]
2     pearl
3    orange
6    orange
7    orange
4     pearl
8    orange
Name: fruit, dtype: category
Categories (3, object): [apple, orange, pearl]

With the category removed, the codes are [1 0 0 0 1 0]; without removing it, the codes are [2 1 1 1 2 1].
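The two steps (drop the rows, then drop the category) can be sketched for a recent pandas release as follows. This assumes Python 3 and a pandas version without the inplace parameter; the shortened data set and the name kept are illustrative only. The .copy() is added so the filtered frame can be modified without a chained-assignment warning.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "fruit": ["apple", "pearl", "orange", "apple", "orange"],
        "price": [5.20, 3.50, 7.30, 5.00, 7.50],
    }
)
df["fruit"] = df["fruit"].astype("category")

# Step 1: boolean selection removes the apple rows, but not the category.
kept = df[df["fruit"] != "apple"].copy()
codes_before = list(kept["fruit"].cat.codes)

# Step 2: remove the now-unused category; codes are re-encoded.
kept["fruit"] = kept["fruit"].cat.remove_categories("apple")
codes_after = list(kept["fruit"].cat.codes)

print(codes_before, codes_after)
print(list(kept["fruit"].cat.categories))
```

The codes shift down by one because orange and pearl move from positions 1 and 2 to positions 0 and 1 once apple is gone.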

17.4 Removing unused categories

Removing unused categories means that when the categories contain entries that never occur in the data, those entries can be dropped from the categories.

import pandas as pd
cat = ["watermelon","pearl","orange", "apple"]
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "1_initial-->"
print "dataframe:\n",df[:3]
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
df.fruit.cat.set_categories(cat, inplace = True)
print "\n2_after set_catgories()-->"
print "dataframe:\n",df[:3]
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
df.fruit.cat.remove_unused_categories(inplace = True)
print "\n3_after remove used categories-->"
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
print "dataframe:\n",df[:3]

Program output:

1_initial-->
dataframe:
    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
categories:
Index([u'apple', u'orange', u'pearl'], dtype='object')
codes:
[0 2 1 0 1 1 0 2 1]

2_after set_catgories()-->
dataframe:
    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
categories:
Index([u'watermelon', u'pearl', u'orange', u'apple'], dtype='object')
codes:
[3 1 2 3 2 2 3 1 2]

3_after remove used categories-->
categories:
Index([u'pearl', u'orange', u'apple'], dtype='object')
codes:
[2 0 1 2 1 1 2 0 1]
dataframe:
    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3

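As the output shows, the data itself did not change; only its categories did. df['fruit'] = df['fruit'].astype('category') turns the fruit column into categorical data and creates the categories ['apple', 'orange', 'pearl']; the statement df.fruit.cat.set_categories(cat, inplace = True) changes the categories to ['watermelon', 'pearl', 'orange', 'apple']; and the statement df.fruit.cat.remove_unused_categories(inplace = True) removes the unused watermelon category, leaving ['pearl', 'orange', 'apple']. The codes are re-encoded at each step to match the new order of the categories.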
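The same sequence can be sketched without inplace for a recent pandas release (an assumption, as above). The names widened and trimmed are hypothetical, and the data is shortened to three rows so the codes are easy to follow.

```python
import pandas as pd

fruit = pd.Series(["apple", "pearl", "orange"], dtype="category")

# Replace the category list; watermelon is declared but never used.
widened = fruit.cat.set_categories(["watermelon", "pearl", "orange", "apple"])

# Drop every category that does not occur in the data.
trimmed = widened.cat.remove_unused_categories()

print(list(widened.cat.categories), list(widened.cat.codes))
print(list(trimmed.cat.categories), list(trimmed.cat.codes))
print(list(trimmed))  # the values themselves are unchanged
```

Both calls return new objects, so fruit keeps its original categories unless the result is assigned back.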

17.5 The value_counts function

The value_counts function counts how many times each category occurs in categorical data, which is one of the typical uses of categorical data.

import pandas as pd
cat = ["watermelon","pearl","orange", "apple"]
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name , "price" : price}, index = idx)
df['fruit'] = df['fruit'].astype('category')
print "dataframe:\n",df
print "categories:\n",df.fruit.values.categories
print "codes:\n",df.fruit.values.codes
print "value_counts()\n", df.fruit.value_counts()

Program output:

dataframe:
    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
5   apple    5.0
6  orange    7.5
7  orange    7.3
9   apple    5.2
4   pearl    3.7
8  orange    7.3
categories:
Index([u'apple', u'orange', u'pearl'], dtype='object')
codes:
[0 2 1 0 1 1 0 2 1]
value_counts()
orange    4
apple     3
pearl     2
dtype: int64
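One detail that ties value_counts back to the previous section: for categorical data the count covers every declared category, including ones that never occur. The sketch below assumes Python 3 and a recent pandas release; adding a watermelon category to the example data is a choice made here to make the zero count visible.

```python
import pandas as pd

name = ["apple", "pearl", "orange", "apple", "orange",
        "orange", "apple", "pearl", "orange"]
fruit = pd.Series(name, dtype="category")

# Declare a category that has no rows yet.
fruit = fruit.cat.add_categories("watermelon")

counts = fruit.value_counts()
print(counts)
```

Calling remove_unused_categories() before value_counts() would drop the zero-count entry from the result.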
