Python Data Analysis - pandas Basics - 4 - Working with Other Data Types

Caspian

Published 2021-11-26 09:50:32


Article link: https://blog.csdn.net/weixin_44020827/article/details/121552006


I. Text Operations

1. Text methods

Basic syntax: Series.str.<method>()

import pandas as pd
series=pd.Series(['a','ab','abb'])
series.str.upper()
>
0      A
1     AB
2    ABB
dtype: object

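Most text-processing methods in pandas share their names and behavior with Python's built-in string methods, with a few differences:

Methods such as replace and split can accept regular expressions, whereas the built-in string methods cannot. Note that in pandas 2.0 and later, str.replace treats the pattern as a literal string unless regex=True is passed.

series.str.replace(r'^ab','f',regex=True)
# The regular expression matches a leading ab in each element; the matched part is replaced with f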
>
0     a
1     f
2    fb
dtype: object
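
To make the difference between regex and literal replacement concrete, here is a minimal sketch that reuses the same three strings. The variable names (`as_regex`, `as_literal`, `builtin_result`) are only illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'ab', 'abb'])

# regex=True: '^ab' is interpreted as "ab at the start of the string"
as_regex = s.str.replace(r'^ab', 'f', regex=True)

# regex=False: '^ab' is searched for literally, so nothing matches here
as_literal = s.str.replace('^ab', 'f', regex=False)

# Python's built-in str.replace always treats the pattern literally
builtin_result = 'abb'.replace('^ab', 'f')

print(as_regex.tolist())    # ['a', 'f', 'fb']
print(as_literal.tolist())  # ['a', 'ab', 'abb']
print(builtin_result)       # abb
```

Passing `regex` explicitly keeps the behavior the same across pandas versions, since the default has changed over time.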

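2. pandas-specific text methods

The following text methods have no direct counterpart among Python's built-in string methods:

cat	Element-wise string concatenation.
get	Extract the element at the given position from each string.
get_dummies	Split each string by a separator and return a DataFrame of dummy variables.
contains	Return a boolean array indicating whether each string contains the given pattern.
repeat	Repeat each string the given number of times.
pad	Pad each string on the left, right, or both sides (with whitespace by default).
wrap	Wrap each string into lines no longer than the given width.
slice	Extract a substring from each string.
slice_replace	Replace the sliced part of each string with another string.
findall	Find all occurrences of the pattern in each string.
match	Apply re.match to each element with the given regular expression.
extract	Extract the capture groups from the first match of the regular expression.
extractall	Extract the capture groups from all matches of the regular expression.
len	Compute the length of each string.
normalize	Return the Unicode normal form of each string.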

series.str.cat(sep='-')
>'a-ab-abb'
series.str.get(0)
>
0    a
1    a
2    a
dtype: object

series.str.contains('ab')
>
0    False
1     True
2     True
dtype: bool

series.str.repeat(2)
>
0        aa
1      abab
2    abbabb
dtype: object
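
The examples above cover cat, get, contains, and repeat. As a rough sketch of a few of the other methods in the table, the snippet below applies them to the same Series. The second Series (`tags`) and its `'|'` separator are made-up data for illustrating get_dummies.

```python
import pandas as pd

s = pd.Series(['a', 'ab', 'abb'])

lengths = s.str.len()                               # 1, 2, 3
padded = s.str.pad(4, side='left', fillchar='*')    # '***a', '**ab', '*abb'
b_runs = s.str.findall('b')                         # [], ['b'], ['b', 'b']
replaced = s.str.slice_replace(0, 1, 'X')           # 'X', 'Xb', 'Xbb'

# extract returns the first capture group; rows without a match become NaN
extracted = s.str.extract(r'a(b+)', expand=False)   # NaN, 'b', 'bb'

# get_dummies splits on the separator and builds one indicator column per token
tags = pd.Series(['x|y', 'y', 'x'])                 # hypothetical data
dummies = tags.str.get_dummies(sep='|')

print(dummies)
```

With `expand=False` and a single capture group, extract returns a Series rather than a one-column DataFrame.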

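3. Indexing into strings

Indexing through .str returns the character or slice at the given position within each element of the Series.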

series.str[0]
>
0    a
1    a
2    a
dtype: object

series.str[0:3]
>
0      a
1     ab
2    abb
dtype: object
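
A few edge cases are worth seeing alongside the examples above. This is a small sketch on the same data; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'ab', 'abb'])

# Negative positions count from the end of each string
last_char = s.str[-1]      # 'a', 'b', 'b'

# A position past the end of a string yields NaN instead of raising IndexError
second_char = s.str[1]     # NaN, 'b', 'b'

# Slices accept a step, so [::-1] reverses each string
reversed_strs = s.str[::-1]  # 'a', 'ba', 'bba'

print(second_char)
```

Unlike plain Python indexing, an out-of-range position does not raise, so a missing-value check may be needed afterwards.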

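II. Category Operations

category is the categorical data type, corresponding to categorical variables in statistics. Categories may be ordered, but numeric operations (such as addition) are not supported on them.

1. Creation

(1) Specify the dtype when constructing the Series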

series=pd.Series(['a','b','c','c'],dtype='category')
series
>
0    a
1    b
2    c
3    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

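(2) Convert the dtype of an existing Series

For an existing Series, simply convert its dtype to category. Values that are not among the specified categories become NaN.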

series=pd.Series(['a','b','c','c'])
series1=series.astype('category')
series1
>
0    a
1    b
2    c
3    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

# Create via CategoricalDtype, which fixes both the set of categories and their order.
from pandas.api.types import CategoricalDtype
cat_type=CategoricalDtype(categories=['b','c','d'],ordered=True)
series1=series.astype(cat_type)
series1
>
0    NaN
1      b
2      c
3      c
dtype: category
Categories (3, object): ['b' < 'c' < 'd']
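
An ordered category supports comparisons and min/max, which an unordered one does not. Here is a minimal sketch; the size labels and the names `size_type` and `sizes` are hypothetical and not from the example above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordered categories: small < medium < large
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['medium', 'small', 'large', 'small']).astype(size_type)

# Each value is stored as an integer code that indexes into the categories
codes = sizes.cat.codes            # 1, 0, 2, 0

# Comparisons follow the declared order, not alphabetical order
bigger_than_small = sizes > 'small'   # True, False, True, False

smallest = sizes.min()             # 'small'
largest = sizes.max()              # 'large'

print(bigger_than_small.tolist(), smallest, largest)
```

Alphabetically, 'large' would sort before 'small'; the declared order is what makes `max()` return 'large'.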

(3) Create directly from a Categorical object

raw_cat=pd.Categorical(['a','b','c','c'],categories=['a','b','c'],ordered=False)
series=pd.Series(raw_cat)
series
>
0    a
1    b
2    c
3    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

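(4) Create with the cut function

The cut function bins the continuous values of a Series into intervals, turning them into discrete data. It is typically used for histograms. With right=False, each interval is closed on the left and open on the right.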

series=pd.Series(range(5))
series1=pd.cut(series,[0,2,4,6],right=False)
series1
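# The left column is the index (which equals the element's value here); the right column is the bin the element falls into.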
>
0    [0, 2)
1    [0, 2)
2    [2, 4)
3    [2, 4)
4    [4, 6)
dtype: category
Categories (3, interval[int64, left]): [[0, 2) < [2, 4) < [4, 6)]
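
Two details of cut are easy to miss: the default right=True excludes the left edge of the first bin, and the labels parameter can name the bins. The sketch below shows both on the same data; the label names 'low', 'mid', and 'high' are made up for illustration.

```python
import pandas as pd

s = pd.Series(range(5))

# Default right=True gives (0, 2], (2, 4], (4, 6]; the value 0 falls outside and becomes NaN
default_bins = pd.cut(s, bins=[0, 2, 4, 6])

# right=False gives [0, 2), [2, 4), [4, 6); labels replace the interval notation
labeled = pd.cut(s, bins=[0, 2, 4, 6], right=False, labels=['low', 'mid', 'high'])

# Count how many values fall into each bin, as one would for a histogram
counts = labeled.value_counts()

print(default_bins.isna().tolist())  # [True, False, False, False, False]
print(labeled.tolist())              # ['low', 'low', 'mid', 'mid', 'high']
print(counts['low'], counts['mid'], counts['high'])  # 2 2 1
```

If the smallest value must be kept with the default right=True, `include_lowest=True` is the usual option.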

2. Operations

(1) Rename

series=pd.Series(['a','b','c','c'],dtype='category')
series1=series.cat.rename_categories([1,2,3])
series1.cat.categories
>Index([1, 2, 3], dtype='int64')

(2) Add, remove, and set categories

series.cat.add_categories(['e'])
>
0    a
1    b
2    c
3    c
dtype: category
Categories (4, object): ['a', 'b', 'c', 'e']

series.cat.remove_categories(['c'])
>
0      a
1      b
2    NaN
3    NaN
dtype: category
Categories (2, object): ['a', 'b']
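
These methods return a new Series and leave the original unchanged. A value can only be assigned to a categorical Series if it is already one of its categories, so add_categories is typically called before writing a new value. Here is a rough sketch of that workflow; the names `extended` and `trimmed` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'c'], dtype='category')

# add_categories returns a new Series; s itself still has only a, b, c
extended = s.cat.add_categories(['e'])

# 'e' is now a valid category, so the assignment is accepted
extended.iloc[0] = 'e'

# 'a' no longer appears in the data; drop categories that have no values
trimmed = extended.cat.remove_unused_categories()

print(extended.tolist())                # ['e', 'b', 'c', 'c']
print(trimmed.cat.categories.tolist())  # ['b', 'c', 'e']
```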

# Add, remove, and change categories in a single step, optionally setting whether they are ordered
series.cat.set_categories(['c','d'],ordered=True)
>
0    NaN
1    NaN
2      c
3      c
dtype: category
Categories (2, object): ['c' < 'd']

# Sort
series.sort_values()
>
0    a
1    b
2    c
3    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
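
In the output above the category order happens to match alphabetical order, so the effect of the category order is not visible. As a small sketch, reordering the categories shows that sort_values follows the category order rather than the string order; the name `reordered` is illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'c'], dtype='category')

# Declare the order c < b < a without changing the values themselves
reordered = s.cat.reorder_categories(['c', 'b', 'a'], ordered=True)

# Sorting now places 'c' first, because it is the lowest category
sorted_values = reordered.sort_values()

print(sorted_values.tolist())  # ['c', 'c', 'b', 'a']
```

reorder_categories requires exactly the same set of categories as before; to add or drop categories at the same time, set_categories is the method to use.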