pandas Study Notes: Task 09


晃晃我的半瓶水


Article link: https://blog.csdn.net/qq_41834327/article/details/112340723


import pandas as pd
import numpy as np

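The cat object

Attributes of the cat object

pandas provides the category dtype for handling categorical variables; a Series can be converted to a categorical with the astype method. A categorical Series exposes a cat accessor which, like str, defines attributes and methods for working with the categories. A categorical has two components: the categories themselves, stored as an Index, and a flag indicating whether they are ordered. Both are available as attributes of cat. In addition, every category is assigned a unique integer code determined by its position in cat.categories; the code of each element is available through cat.codes.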

df = pd.read_csv(r'C:\Users\yongx\Desktop\data\learn_pandas.csv',
                usecols = ['Grade','Name','Gender','Height','Weight'])
s = df.Grade.astype('category')
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

s.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x000002949E72ACC0>

s.cat.categories  # view the categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

s.cat.ordered  # check whether the categories are ordered

False
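
The walkthrough above shows categories and ordered but not the integer codes, so here is a minimal sketch on a small hand-made Series (the values are illustrative and not taken from learn_pandas.csv). It suggests that the code of each element is the position of its category in cat.categories, and that a missing value gets the code -1.

```python
import pandas as pd

# Hypothetical sample data standing in for the Grade column
grades = pd.Series(['Freshman', 'Senior', None, 'Sophomore', 'Freshman'])
s_demo = grades.astype('category')

# Categories are inferred from the data and sorted lexically
categories = list(s_demo.cat.categories)

# Each code is the position of the element's category; missing values get -1
codes = s_demo.cat.codes.tolist()

print(categories)
print(codes)
```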

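Adding, removing, and modifying categories

The categories attribute of the cat object lets you query the categories, but they cannot be modified in place: categories are stored in an Index, and an Index does not support item assignment such as index_obj[0] = item. pandas therefore defines several methods on the cat accessor for this purpose.
add_categories: adds new categories.
remove_categories: removes the given categories; values in the Series that belonged to a removed category become missing.
set_categories: replaces the categories with a new set; values in the Series that are not in the new set become missing.
remove_unused_categories: removes categories that do not appear in the Series.
rename_categories: renames categories; the corresponding values in the Series are renamed as well.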
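
To make the immutability point concrete, the following sketch (on made-up data) tries an item assignment on the categories Index and then does the same change through the cat accessor. The variable names are only for illustration.

```python
import pandas as pd

s_demo = pd.Series(['a', 'b', 'a']).astype('category')

# Item assignment on the Index of categories is rejected with a TypeError
try:
    s_demo.cat.categories[0] = 'z'
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print(err)

# The supported route: rename through the cat accessor, which returns a new Series
renamed = s_demo.cat.rename_categories({'a': 'z'})
print(renamed.tolist())
```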

s = s.cat.add_categories('Graduate')
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

s = s.cat.remove_categories('Freshman')
s.cat.categories

Index(['Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

s.head()

0          NaN
1          NaN
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Junior', 'Senior', 'Sophomore', 'Graduate']

s = s.cat.set_categories(['Sophomore', 'PhD'])
s.cat.categories
s.head()

0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'PhD']

print(s.cat.categories,'\n')
s = s.cat.remove_unused_categories()
print(s.cat.categories)

Index(['Sophomore', 'PhD'], dtype='object') 

Index(['Sophomore'], dtype='object')

s = s.cat.rename_categories({'Sophomore':'本科二年级学生'}) # rename the category
s.head()

0        NaN
1        NaN
2        NaN
3    本科二年级学生
4    本科二年级学生
Name: Grade, dtype: category
Categories (1, object): ['本科二年级学生']

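Ordered categories

Establishing an order

as_unordered and reorder_categories convert between ordered and unordered categories. The list passed to reorder_categories must consist of exactly the current categories of the Series: no new category may be added and none may be left out. Pass ordered=True as well; otherwise the categories are rearranged but the ordered flag is left unchanged, so an unordered Series stays unordered.

**Note:** Categories cannot be modified directly. If you prefer not to pass ordered=True, first convert the Series with s.cat.as_ordered() and then use reorder_categories to adjust the relative order of the categories.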

# First convert to an ordered categorical
s = df.Grade.astype('category')
s = s.cat.reorder_categories(['Freshman','Sophomore','Junior','Senior'], ordered=True)
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

# Convert the ordered categorical back to an unordered one
s.cat.as_unordered().head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']
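
The alternative route from the note, as_ordered() followed by reorder_categories without ordered=True, can be sketched as follows. The data are made up, and the last call is included only to contrast what happens on a Series that was never marked as ordered.

```python
import pandas as pd

s_demo = pd.Series(['Junior', 'Freshman', 'Senior', 'Sophomore']).astype('category')
new_order = ['Freshman', 'Sophomore', 'Junior', 'Senior']

# Mark the Series as ordered first; at this point the order is still the lexical one
s_ord = s_demo.cat.as_ordered()

# Reordering now keeps the ordered flag without passing ordered=True
s_ord = s_ord.cat.reorder_categories(new_order)
print(s_ord.cat.ordered)
print(list(s_ord.cat.categories))

# On an unordered Series the same call only rearranges the categories
s_unord = s_demo.cat.reorder_categories(new_order)
print(s_unord.cat.ordered)
```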

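Sorting and comparison

Once a column has been converted to category and given an order, sort_index and sort_values sort by that order. With the order in place, comparisons can be performed as well. Comparisons on a categorical fall into two groups:

== or != comparisons: the other operand can be a scalar, or a Series or list of the same length.
>, >=, <, <= comparisons: the operands are as in the first group, but every element taking part must belong to the categories of the original Series, and a Series operand must also have the same index as the original Series.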

df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman','Sophomore',
                                           'Junior','Senior'], ordered=True)
# Sort by the values of the Grade column
df.sort_values('Grade').head()

	Grade	Name	Gender	Height	Weight
0	Freshman	Gaopeng Yang	Female	158.9	46.0
105	Freshman	Qiang Shi	Female	164.5	52.0
96	Freshman	Changmei Feng	Female	163.8	56.0
88	Freshman	Xiaopeng Han	Female	164.1	53.0
81	Freshman	Yanli Zhang	Female	165.1	52.0

# Set Grade as the index, then sort by the index
df.set_index('Grade').sort_index().head()

	Name	Gender	Height	Weight
Grade
Freshman	Gaopeng Yang	Female	158.9	46.0
Freshman	Qiang Shi	Female	164.5	52.0
Freshman	Changmei Feng	Female	163.8	56.0
Freshman	Xiaopeng Han	Female	164.1	53.0
Freshman	Yanli Zhang	Female	165.1	52.0
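
Since the first rows of both outputs are all Freshman, the effect of the declared order is hard to see there. A small sketch on made-up data contrasts alphabetical sorting of plain strings with sorting of an ordered categorical:

```python
import pandas as pd

grades = pd.Series(['Senior', 'Freshman', 'Junior', 'Sophomore'])

# Plain strings sort alphabetically
alphabetical = grades.sort_values().tolist()

# An ordered categorical sorts by the declared category order
order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
as_cat = grades.astype(pd.CategoricalDtype(order, ordered=True))
by_rank = as_cat.sort_values().tolist()

print(alphabetical)
print(by_rank)
```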

# An == comparison against a scalar is broadcast to every element
res1 = df.Grade == 'Sophomore'
res1.head()

0    False
1    False
2    False
3     True
4     True
Name: Grade, dtype: bool

# Compare the Series against a list of the same length
res2 = df.Grade == ['PhD']*df.shape[0]
res2.head()

0    False
1    False
2    False
3    False
4    False
Name: Grade, dtype: bool

# Compare using an ordering operator
res3 = df.Grade <= 'Sophomore'
res3.head()

0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool

# Shuffle the column, then compare element by element
res4 = df.Grade <= df.Grade.sample(frac = 1).reset_index(drop = True)
res4.head()

0     True
1     True
2     True
3     True
4    False
Name: Grade, dtype: bool
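
The difference between the two groups of comparisons shows up when the other operand is not one of the categories. The sketch below uses made-up data; 'PhD' plays the role of a value outside the categories, as in the == example above.

```python
import pandas as pd

order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
s_demo = pd.Series(['Freshman', 'Senior', 'Sophomore']).astype(
    pd.CategoricalDtype(order, ordered=True))

# Equality against a value outside the categories is simply False everywhere
eq_result = (s_demo == 'PhD').tolist()

# An ordering comparison against a value outside the categories raises
try:
    s_demo <= 'PhD'
    ordering_failed = False
except TypeError as err:
    ordering_failed = True
    print(err)

# A value inside the categories is compared by rank, not alphabetically
le_result = (s_demo <= 'Sophomore').tolist()
print(eq_result, le_result)
```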

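Interval categories

Building intervals with cut and qcut

Interval sequences can be built with cut and qcut. Both functions bin the numeric values of a sequence, replacing each value with the interval it falls in. The main parameters are:

bins: the key parameter. An integer n splits the range from the minimum to the maximum of the data into n equal-width bins. Because intervals are left-open and right-closed by default, the minimum would otherwise be excluded; pandas handles this by subtracting 0.001*(max-min) from the left edge of the lowest bin. For example, splitting the sequence [1,2] into two bins gives (1 - 0.001 * (2-1), 1.5] for the first bin and (1.5, 2] for the second. To get left-closed, right-open intervals, set right=False; the adjustment is then to add 0.001*(max-min) to the right edge of the highest bin. bins also accepts a list of explicit bin edges, in which case np.inf can stand for infinity.
labels: names for the bins.
retbins: whether to also return the bin edges; not returned by default.

qcut is used in much the same way as cut, with some parameters renamed: the bins parameter of cut becomes q in qcut, where q stands for quantile. When q is an integer n, the data are binned at the n-quantiles; a list of floats is also accepted and gives the quantile cut points directly.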

s = pd.Series([1,2])
# Split into two bins; the left edge becomes 1 - 0.001 * (2-1)
pd.cut(s,bins = 2)

0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

# Make the intervals left-closed and right-open
pd.cut(s, bins = 2, right = False)

0      [1.0, 1.5)
1    [1.5, 2.001)
dtype: category
Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)]

# Pass explicit bin edges: 1 falls in (-inf, 1.2] and 2 falls in (1.8, 2.2]
pd.cut(s, bins=[-np.inf, 1.2, 1.8, 2.2, np.inf])

0    (-inf, 1.2]
1     (1.8, 2.2]
dtype: category
Categories (4, interval[float64]): [(-inf, 1.2] < (1.2, 1.8] < (1.8, 2.2] < (2.2, inf]]

# Name the bins and also return the bin edges
s = pd.Series([1,2])
res = pd.cut(s, bins = 2, labels = ['small', 'big'], retbins = True)
res[0]

0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

# View the bin edges
res[1]

array([0.999, 1.5  , 2.   ])
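
The 0.001*(max-min) adjustment described above can be checked on another range through retbins. This is a sketch on made-up values, chosen so that the adjustment is 0.01; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 10])

# Default (right-closed): the lowest edge is lowered by 0.001 * (max - min)
_, edges_right = pd.cut(values, bins=2, retbins=True)

# right=False (left-closed): the highest edge is raised instead
_, edges_left = pd.cut(values, bins=2, right=False, retbins=True)

print(edges_right)
print(edges_left)
```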

# Use qcut to bin at the tertiles (3-quantiles)
s = df.Weight
pd.qcut(s, q = 3).head()

0    (33.999, 48.0]
1      (55.0, 89.0]
2      (55.0, 89.0]
3    (33.999, 48.0]
4      (55.0, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 48.0] < (48.0, 55.0] < (55.0, 89.0]]

# Use qcut with explicit quantile cut points
pd.qcut(s, q = [0, 0.2, 0.8, 1]).head()

0      (44.0, 69.4]
1      (69.4, 89.0]
2      (69.4, 89.0]
3    (33.999, 44.0]
4      (69.4, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 44.0] < (44.0, 69.4] < (69.4, 89.0]]
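
To see how cut and qcut differ in practice, it may help to count how many values land in each bin. The sketch below uses a made-up, deliberately skewed sample: equal-width bins end up very uneven, while quantile bins hold the same number of values.

```python
import pandas as pd

# Hypothetical skewed data: many small values and a few large ones
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100, 200, 300, 400])

# cut: four equal-width bins, so most values land in the first bin
width_counts = pd.cut(values, bins=4).value_counts().sort_index().tolist()

# qcut: four quantile bins, so each bin holds the same number of values
quantile_counts = pd.qcut(values, q=4).value_counts().sort_index().tolist()

print(width_counts)
print(quantile_counts)
```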

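Constructing general intervals

An interval is defined by its left endpoint, its right endpoint, and which endpoints are closed; the closed state is one of right, left, both, or neither.
Attributes:

mid: midpoint;
length: length;
right: right endpoint;
left: left endpoint;
closed: which endpoints are closed

Membership and overlap tests:

in: tests whether a value lies in the interval
overlaps: tests whether two intervals intersect

Four ways to build a pd.IntervalIndex:

from_breaks: similar to cut or qcut, except that from_breaks takes user-supplied break points, whereas cut and qcut compute them.
from_arrays: takes a list of left endpoints and a list of right endpoints; suited to intervals that may overlap and whose start and end points are known.
from_tuples: takes a list of (start, end) tuples.
interval_range: builds intervals from the four defining parameters of an interval sequence: start, end, number of intervals (periods), and interval length (freq)

**Note:** When a list of Interval objects is passed directly as pd.IntervalIndex([...],closed=...), every interval is coerced to the given closed type, because a pd.IntervalIndex can only hold Interval objects of a single closed type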

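# Practice
# Recall how the first term, last term, number of terms, and common difference of an arithmetic sequence are related, and write down the identity linking the four parameters of interval_range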

my_interval = pd.Interval(0, 1, 'right')
my_interval

Interval(0, 1, closed='right')

0.5 in my_interval

True

# Check whether the two intervals overlap
my_interval_2 = pd.Interval(0.5, 1.5, 'left')
my_interval.overlaps(my_interval_2)

True
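
The attributes listed earlier, and the role of the closed state in membership and overlap tests, can be sketched on a few hand-picked intervals (all values here are illustrative):

```python
import pandas as pd

iv = pd.Interval(0, 1, closed='right')

# Endpoints, midpoint, length, and closed state
print(iv.left, iv.right, iv.mid, iv.length, iv.closed)

# With closed='right' the left endpoint is excluded and the right one included
left_in = 0 in iv
right_in = 1 in iv

# Intervals that only touch at an open endpoint do not overlap
touching_open = pd.Interval(0, 1, closed='left').overlaps(
    pd.Interval(1, 2, closed='left'))
# Intervals that share a closed endpoint do overlap
touching_closed = pd.Interval(0, 1, closed='both').overlaps(
    pd.Interval(1, 2, closed='both'))

print(left_in, right_in, touching_open, touching_closed)
```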

# Build adjacent intervals from break points with from_breaks
pd.IntervalIndex.from_breaks([1,3,6,10], closed = 'both')

IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')

# Build intervals from arrays of left and right endpoints with from_arrays
pd.IntervalIndex.from_arrays(left=[1, 3, 6, 10],
                             right=[5, 4, 9, 11],
                             closed='neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

# Build intervals from (start, end) tuples with from_tuples
pd.IntervalIndex.from_tuples([(1,5),(3,4), (6,9),(10,11)], closed = 'neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

# Build intervals directly with pd.interval_range from the defining parameters
pd.interval_range(start = 1, end = 5, periods = 8)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

pd.interval_range(end = 5, periods = 8, freq = 0.5)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')
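
One candidate answer to the practice question, by analogy with an arithmetic sequence of break points, is end = start + periods * freq. The sketch below checks it against the numbers used in the two calls above; it is a quick check on this one example, not a proof.

```python
import pandas as pd

start, end, periods, freq = 1, 5, 8, 0.5

# Candidate identity linking the four parameters
assert end == start + periods * freq

# Leaving out any one of the parameters should describe the same intervals
from_start_end = pd.interval_range(start=start, end=end, periods=periods)
from_end_freq = pd.interval_range(end=end, periods=periods, freq=freq)
from_start_freq = pd.interval_range(start=start, periods=periods, freq=freq)

same_1 = from_start_end.equals(from_end_freq)
same_2 = from_start_end.equals(from_start_freq)
print(same_1, same_2)
```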

# When built directly from a list of Interval objects, every interval is coerced to the given `closed` type
pd.IntervalIndex([my_interval, my_interval_2], closed = 'left')

IntervalIndex([[0.0, 1.0), [0.5, 1.5)],
              closed='left',
              dtype='interval[float64]')

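Attributes and methods of intervals

To analyze the result of cut or qcut interval by interval, first convert it to an IntervalIndex.
Commonly used IntervalIndex attributes:

left : left endpoints
right : right endpoints
mid : midpoints
length : interval lengths

Two commonly used IntervalIndex methods:

contains: tests, interval by interval, whether each one contains a given value.
overlaps: tests, interval by interval, whether each one intersects a given pd.Interval object.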

id_interval = pd.IntervalIndex(pd.cut(s, 3))

# Look at the first five elements of id_interval
id_demo = id_interval[:5]
id_demo

IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0], (33.945, 52.333], (70.667, 89.0]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')

# Left endpoints
id_demo.left

Float64Index([33.945, 52.333, 70.667, 33.945, 70.667], dtype='float64')

# Right endpoints
id_demo.right

Float64Index([52.333, 70.667, 89.0, 52.333, 89.0], dtype='float64')

# Midpoints (mean of the left and right endpoints)
id_demo.mid

Float64Index([43.138999999999996, 61.5, 79.8335, 43.138999999999996, 79.8335], dtype='float64')

# Interval lengths
id_demo.length

Float64Index([18.387999999999998, 18.334000000000003, 18.333,
              18.387999999999998, 18.333],
             dtype='float64')

# Test whether each interval contains a given value
id_demo.contains(4)

array([False, False, False, False, False])

# Test whether each interval intersects another Interval object
id_demo.overlaps(pd.Interval(40,60))

array([ True,  True, False,  True, False])
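
Wrapping the whole cut result in pd.IntervalIndex gives one interval per row. When only the distinct bins are needed, the categories of the result are, as far as I can tell, already an IntervalIndex. A sketch on made-up weights:

```python
import pandas as pd

weights = pd.Series([34, 50, 61, 72, 89])  # hypothetical values
binned = pd.cut(weights, bins=3)

# One interval per row: wrap the whole result in an IntervalIndex
per_row = pd.IntervalIndex(binned)

# One interval per distinct bin: the categories are already an IntervalIndex
per_bin = binned.cat.categories

print(len(per_row), len(per_bin))
print(per_row.contains(50))
```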


df = pd.DataFrame({'A':['a','b','c','a'],
                  'B':['cat', 'cat', 'dog','cat']})
pd.crosstab(df.A, df.B)

B	cat	dog
A
a	2	0
b	1	0
c	0	1

df.B = df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna = False)

B	cat	dog	sheep
A
a	2	0	0
b	1	0	0
c	0	1	0

df.A.unique()

array(['a', 'b', 'c'], dtype=object)

df.B.unique()

array(['cat', 'dog'], dtype=object)

for (x,y) in zip(df.A.unique(),df.B.unique()):
    print(x,y)

a cat
b dog

# Reference solution, annotated
def my_crosstab(s1, s2, dropna = False):
    # For a categorical input with dropna=False, take every category from cat.categories;
    # otherwise take only the values that actually occur, via unique()
    index1 = (s1.cat.categories if s1.dtype.name=='category' and not
             dropna else s1.unique())
    index2 = (s2.cat.categories if s2.dtype.name=='category' and not
             dropna else s2.unique())
    
    # Build a zero matrix with one row per s1 category and one column per s2 category
    res = pd.DataFrame(np.zeros((index1.shape[0], index2.shape[0])),
                      index=index1, columns=index2)
    
    # Count each (s1, s2) pair by incrementing the matching cell
    for i,j in zip(s1,s2):
        res.at[i,j] += 1
    # Name the row and column axes after the inputs and convert the counts to int
    res = res.rename_axis(index = s1.name, columns = s2.name).astype('int')
    return res

df = pd.DataFrame({'A':['a','b','c','a'],
                  'B':['cat','cat','dog','cat']})

df.B = df.B.astype('category').cat.add_categories('sheep')
my_crosstab(df.A, df.B)

B	cat	dog	sheep
A
a	2	0	0
b	1	0	0
c	0	1	0


df = pd.read_csv(r'C:\Users\yongx\Desktop\data\diamonds.csv')
df.head(3)

	carat	cut	clarity	price
0	0.23	Ideal	SI2	326
1	0.21	Premium	SI1	326
2	0.23	Good	VS1	327

s_obj, s_cat = df.cut, df.cut.astype('category')

# Timing syntax: %timeit -n <number of loops> <statement>
%timeit -n 30 s_obj.nunique()

3.02 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)

%timeit -n 30 s_cat.nunique()

1.15 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
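
%timeit is an IPython/Jupyter magic and is not available in a plain Python script. A rough equivalent with the standard-library timeit module might look like the sketch below. The data are a synthetic stand-in for the cut column (diamonds.csv is not assumed to be available), and the timings will vary from machine to machine, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the cut column of diamonds.csv
rng = np.random.default_rng(0)
levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
s_obj = pd.Series(rng.choice(levels, size=50_000)).astype(object)
s_cat = s_obj.astype('category')

# timeit.timeit calls the function `number` times and returns the total seconds
t_obj = timeit.timeit(s_obj.nunique, number=30)
t_cat = timeit.timeit(s_cat.nunique, number=30)

print(f'object:   {t_obj / 30 * 1e3:.3f} ms per call')
print(f'category: {t_cat / 30 * 1e3:.3f} ms per call')
```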

df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good','Very Good','Premium','Ideal'], ordered = True)
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'], ordered = True)
res = df.sort_values(['cut','clarity'], ascending=[False,True])
res.head()

	carat	cut	clarity	price
315	0.96	Ideal	I1	2801
535	0.96	Ideal	I1	2826
551	0.97	Ideal	I1	2830
653	1.01	Ideal	I1	2844
718	0.97	Ideal	I1	2856

df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])

# Get the integer codes through cat.codes
df.cut.cat.codes

# Use replace to get the integer codes for clarity
clarity_cat = df.clarity.cat.categories
df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat)))))

df.head(3)

	carat	cut	clarity	price
0	0.23	Ideal	6	326
1	0.21	Premium	5	326
2	0.23	Good	3	327
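
Because codes follow the position of each category, reversing the category order changes the codes while the values stay the same. The sketch below illustrates this on a made-up three-level subset of the cut grades, and also shows that mapping each category to its position reproduces cat.codes.

```python
import pandas as pd

order = ['Fair', 'Good', 'Ideal']  # hypothetical subset of the cut levels
cut_demo = pd.Series(['Good', 'Ideal', 'Fair']).astype(
    pd.CategoricalDtype(order, ordered=True))

codes_before = cut_demo.cat.codes.tolist()

# Reverse the category order; the values stay the same but the codes change
reversed_demo = cut_demo.cat.reorder_categories(order[::-1])
codes_after = reversed_demo.cat.codes.tolist()

# Mapping each category to its position reproduces cat.codes
mapping = {cat: i for i, cat in enumerate(reversed_demo.cat.categories)}
mapped = [mapping[v] for v in reversed_demo]

print(codes_before, codes_after, mapped)
```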

# Quantile lists must include both boundaries, 0 and 1
q = [0, 0.2, 0.4, 0.6, 0.8, 1]
# Explicit bin edges must cover both ends, here with -np.inf and np.inf
point = [-np.inf, 1000, 3500, 5500,18000, np.inf]
avg = df.price/df.carat

# Bin by explicit bin edges
df['avg_cut'] = pd.cut(avg, bins=point, labels=[
    'Very Low','Low','Mid','Height','Very High'])
# Bin by quantiles
df['avg_qcut'] = pd.qcut(avg, q=q,labels=[
    'Very Low','Low','Mid','Height','Very High'])

df.head()

	carat	cut	clarity	price	avg_cut	avg_qcut
0	0.23	Ideal	6	326	Low	Very Low
1	0.21	Premium	5	326	Low	Very Low
2	0.23	Good	3	327	Low	Very Low
3	0.29	Premium	4	334	Low	Very Low
4	0.31	Good	6	335	Low	Very Low
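
The relationship between edges and labels is easy to get wrong: n + 1 edges (or quantile points) define n bins, so n labels are needed. A sketch on made-up price-per-carat values, one per bin, with illustrative label names:

```python
import numpy as np
import pandas as pd

# Hypothetical price-per-carat values, chosen so that one falls in each bin
avg_demo = pd.Series([500, 2000, 4000, 6000, 20000])

edges = [-np.inf, 1000, 3500, 5500, 18000, np.inf]
names = ['Very Low', 'Low', 'Mid', 'High', 'Very High']

# Six edges define five bins, so five labels are required
labelled = pd.cut(avg_demo, bins=edges, labels=names)
print(labelled.tolist())

# The same rule applies to the list of quantile points passed to qcut
q_labelled = pd.qcut(avg_demo, q=[0, 0.2, 0.4, 0.6, 0.8, 1], labels=names)
print(q_labelled.tolist())
```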

print(df.avg_cut.unique(),'\n\n')
print(df.avg_cut.cat.categories, '\n\n')
df.avg_cut = df.avg_cut.cat.remove_categories(['Very Low', 'Very High'])
df.avg_cut.head()

['Low', 'Mid', 'Height']
Categories (3, object): ['Low' < 'Mid' < 'Height']

Index(['Very Low', 'Low', 'Mid', 'Height', 'Very High'], dtype='object')

0    Low
1    Low
2    Low
3    Low
4    Low
Name: avg_cut, dtype: category
Categories (3, object): ['Low' < 'Mid' < 'Height']

interval_avg = pd.IntervalIndex(pd.qcut(avg,q = q))
right = interval_avg.right.to_series().reset_index(drop=True)
left = interval_avg.left.to_series().reset_index(drop=True)
length = interval_avg.length.to_series().reset_index(drop=True)
