pandas: Text Data

Most recently recommended article published on 2023-06-22 09:27:18

我叫陈叉叉叉叉

Category: pandas. Tags: python

Article link: https://blog.csdn.net/wwqnmdhmp/article/details/106921239


import pandas as pd 
import numpy as np 
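pd.set_option('display.max_columns', 1000)  # maximum number of columns shown (so no columns are hidden)
pd.set_option("display.max_colwidth", 1000)  # maximum width of each column (so values and column names are not truncated)
pd.set_option('display.width', 1000)  # total display width per row (so rows do not wrap)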

1. The string dtype is the future

# pd.Series([1,'1']).astype('string') raised an error in the pandas version used here, so convert to 'str' first
pd.Series([1,'1']).astype('str').astype('string')

0    1
1    1
dtype: string
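As a small sketch of the conversion above, the snippet below compares the dtype before and after. The variable names `mixed` and `as_string` are made up for illustration; going through `'str'` first is simply the route the original code takes, and newer pandas releases may also accept a direct `astype('string')`.

```python
import pandas as pd

# A Series mixing an int and a str is stored with object dtype
mixed = pd.Series([1, '1'])

# Convert every element to Python str first, then to the dedicated string dtype
as_string = mixed.astype('str').astype('string')

print(mixed.dtype)         # object
print(as_string.dtype)     # the pandas string extension dtype
print(as_string.tolist())  # both elements are now text
```
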

2. str.split: splitting strings

s.str.split(pat=None, n=-1, expand=False)
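pat = the delimiter to split on; n = the maximum number of splits (-1 means no limit); expand = whether to spread the pieces across separate columns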

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s

0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string

s.str.split(pat='_',n=1,expand=True)

	0	1
0	a	b_c
1	c	d_e
2	<NA>	<NA>
3	f	g_h
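To make the effect of `expand` and `n` more concrete, here is a minimal sketch on the same Series. The names `lists`, `wide` and `right` are illustrative, and `rsplit` is included only as a related method that counts splits from the right.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# expand=False (the default): each row holds a list of pieces
lists = s.str.split('_')

# expand=True with no limit on n: one column per piece
wide = s.str.split('_', expand=True)

# rsplit counts the n splits from the right-hand side instead
right = s.str.rsplit('_', n=1, expand=True)

print(lists)
print(wide)
print(right)
```
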

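With dtype="string" the example below raised an error (one of its values is a list, not text), so it uses object dtype.
str[i] selects the i-th element when the value is a list and the i-th character when the value is a string; the example uses str[2].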

pd.Series(['a_b_c', ['a','b','c']], dtype="object").str[2]

0    b
1    c
dtype: object
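The indexing behaviour can be sketched a little further. The snippet below assumes the same object-dtype Series and shows that a position outside the value yields a missing value rather than an error; `s_obj`, `first`, `last` and `missing` are illustrative names.

```python
import pandas as pd

s_obj = pd.Series(['a_b_c', ['a', 'b', 'c']], dtype="object")

# Position 0: first character of the string, first element of the list
first = s_obj.str[0]

# str.get is the method form of the same lookup; negative positions count from the end
last = s_obj.str.get(-1)

# A position beyond the end gives NaN instead of raising IndexError
missing = s_obj.str[10]

print(first.tolist())
print(last.tolist())
print(missing.isna().tolist())
```
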

3. str.cat(): concatenating strings

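Signature: s.str.cat(others=None, sep=None, na_rep=None, join='left')
others = other Series or columns to concatenate element by element; when it is None, all values of the Series are joined into one string. sep = the separator placed between pieces. na_rep = the text that replaces missing values. join = how to align the inputs when their indexes differ.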

s.str.cat()

'a_b_cc_d_ef_g_h'

s.str.cat(s+'n',sep='|',na_rep='*')

0    a_b_c|a_b_cn
1    c_d_e|c_d_en
2             *|*
3    f_g_h|f_g_hn
dtype: string
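The `join` parameter only matters when the indexes differ, which the examples above do not show. The following is a small sketch with two made-up Series, `a` and `b`, whose indexes only partly overlap.

```python
import pandas as pd

a = pd.Series(['x', 'y', 'z'], index=[0, 1, 2], dtype="string")
b = pd.Series(['1', '2', '3'], index=[2, 1, 3], dtype="string")

# join='left' keeps the index of `a`; labels missing from `b` are filled with na_rep
left = a.str.cat(b, sep='-', na_rep='?', join='left')

# join='outer' keeps every label that appears in either Series
outer = a.str.cat(b, sep='-', na_rep='?', join='outer')

print(left)
print(outer.sort_index())
```

Values are matched by index label, not by position, so `'z'` (label 2) is paired with `'1'`.
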

4. str.replace(): replacing substrings

s.str.replace(pat, repl, n=-1, case=None, flags=0, regex=True)
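pat = the pattern to be replaced; repl = the replacement text; n = the maximum number of replacements; regex = whether pat is treated as a regular expression (True by default in the pandas version shown here; pandas 2.0 changed the default to False). By contrast, the general Series.replace method does not use regular expressions by default.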

s = pd.Series(['AAbb','aabb',pd.NA,'ddff','12ba'],dtype='string')
s

0    AAbb
1    aabb
2    <NA>
3    ddff
4    12ba
dtype: string

s.str.replace('a','**')

0      AAbb
1    ****bb
2      <NA>
3      ddff
4     12b**
dtype: string
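Because the default of `regex` differs between pandas versions, passing it explicitly makes the intent clear. The sketch below reuses the same Series; the result names `literal`, `digits` and `anycase` are illustrative.

```python
import pandas as pd

s = pd.Series(['AAbb', 'aabb', pd.NA, 'ddff', '12ba'], dtype='string')

# Literal replacement: 'a' is treated as plain text
literal = s.str.replace('a', '**', regex=False)

# Regular-expression replacement: each run of digits becomes '#'
digits = s.str.replace(r'\d+', '#', regex=True)

# Case-insensitive replacement: each 'a' or 'A' becomes '*'
anycase = s.str.replace('a', '*', case=False, regex=True)

print(literal)
print(digits)
print(anycase)
```
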

5. str.extract

pd.Series(['10-87', '10-88', '10-89'],dtype="string").str.extract(r'([\d]{2})-([\d]{2})')

	0	1
0	10	87
1	10	88
2	10	89
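Each capture group in the pattern becomes one output column. As a sketch, named groups give the columns readable names, and a row that does not match yields missing values. The sample value `'bad'` and the group names `prefix` and `suffix` are made up for illustration.

```python
import pandas as pd

codes = pd.Series(['10-87', '10-88', 'bad'], dtype="string")

# Named groups become column names; the non-matching row is filled with <NA>
parts = codes.str.extract(r'(?P<prefix>\d{2})-(?P<suffix>\d{2})')

# With a single group, expand=False returns a Series instead of a DataFrame
prefix_only = codes.str.extract(r'(\d{2})-', expand=False)

print(parts)
print(prefix_only)
```
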

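6. str.extractall

Unlike extract, which returns only the first match in each string, extractall finds every match and returns the result with a MultiIndex (even when a string contains only one match).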

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)
s.str.extractall(two_groups)

		letter	digit
	match
A	0	a	1
A	1	a	2
B	0	b	1
C	0	c	1
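The second index level, named `match`, numbers the matches within each original row. One possible way to work with it is sketched below; `per_row` and `first_only` are illustrative names.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

all_matches = s.str.extractall(two_groups)

# Count how many matches each original row produced
per_row = all_matches.groupby(level=0).size()

# Selecting match 0 recovers the first match per row, as str.extract would return
first_only = all_matches.xs(0, level='match')

print(per_row)
print(first_only)
```
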

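7. str.contains and str.match

str.contains checks whether each string contains a given regex pattern anywhere; str.match checks whether the pattern matches at the start of the string.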

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains('a', na=False)

0    False
1    False
2     True
3    False
4    False
dtype: boolean
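The difference between the two methods shows up on a value such as `'03c'`. The sketch below runs both on the same Series and adds `str.fullmatch`, a related method that requires the whole string to match; the result names are illustrative.

```python
import pandas as pd

s = pd.Series(['1', None, '3a', '3b', '03c'], dtype="string")

# contains: '3' may appear anywhere in the string
has_3 = s.str.contains('3', na=False)

# match: '3' must appear at the start of the string
starts_3 = s.str.match('3', na=False)

# fullmatch: the entire string must equal the pattern
exactly_3a = s.str.fullmatch('3a', na=False)

print(has_3.tolist())
print(starts_3.tolist())
print(exactly_3a.tolist())
```
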

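8. Other methods

str.strip: removes whitespace from both ends
str.lower, str.upper: convert to lowercase or uppercase
str.swapcase, str.capitalize: swap the case of every letter, or capitalize the first letter
str.isnumeric: tests whether every character in the string is numeric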

pd.Series(['1', None, ' 3a', '3 b', ' 03c  '], dtype="string").str.strip()

0       1
1    <NA>
2      3a
3     3 b
4     03c
dtype: string

pd.Series(['a1', None, 'b 3a', '3 b', 'B 03c  '], dtype="string").str.capitalize()

0         A1
1       <NA>
2       B 3a
3        3 b
4    B 03c  
dtype: string

pd.Series(['a1', None, 'b 3a', '3 b', 'B 03c  '], dtype="string").str.swapcase()

0         A1
1       <NA>
2       B 3A
3        3 B
4    b 03C  
dtype: string

pd.Series(['a1', None, '30 ', '3 b', 'B 03c  '], dtype="string").str.strip().str.isnumeric()

0    False
1     <NA>
2     True
3    False
4    False
dtype: boolean
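These methods are often chained into a small cleaning step. The following is a minimal sketch on made-up values; it also illustrates that `str.isnumeric` looks at characters only, so a sign or a decimal point makes it return False.

```python
import pandas as pd

raw = pd.Series([' Apple ', 'BANANA', None, ' 42 '], dtype="string")

# Strip surrounding whitespace, then normalize the case
clean = raw.str.strip().str.lower()

# isnumeric after stripping; the missing value stays missing
is_num = raw.str.strip().str.isnumeric()

# '-' and '.' are not numeric characters
signed = pd.Series(['-1', '1.5', '12'], dtype="string").str.isnumeric()

print(clean)
print(is_num)
print(signed)
```
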

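# Columns in the CSV: 人员编号 = person ID, 姓名 = name, 国籍 = nationality, 性别 = gender
df = pd.read_csv('data/String_data_one.csv',index_col='人员编号',dtype='string')
# Build "<name>:<nationality>国人，性别<gender>" by concatenating three string columns row by row
df['姓名'].str.cat([':'+df['国籍']+'国人，','性别'+df['性别']])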

人员编号
1         aesfd:2国人，性别男
2        fasefa:5国人，性别女
3         aeagd:4国人，性别女
4           aef:4国人，性别男
5           eaf:1国人，性别女
             ...       
1996        sdf:5国人，性别男
1997         hx:1国人，性别男
1998        drg:5国人，性别女
1999    zfgzdrg:5国人，性别男
2000       fsdf:3国人，性别女
Name: 姓名, Length: 2000, dtype: string

df = pd.read_csv('data/String_data_two.csv')
df

	col1	col2	col3
0	鄂尔多斯市第2例确诊患者治愈出院	19	363.6923
1	云南新增2例，累计124例	-67	-152.281
2	武汉协和医院14名感染医护出院	-86	325.6221
3	山东新增9例，累计307例	-74	-204.9313
4	上海开学日期延至3月	-95	4.05
...	...	...	...
495	四川新增34例，累计142例	-55	55.8904
496	广西新增20例，累计78例	-99	133.6509
497	河北新增17例，累计65例	-54	69.6604
498	全国31省区市新增确诊1737例，累计7711例	70	-336.9622
499	上海新增5例，累计101例	-95	157.951

500 rows × 3 columns

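# Keep the rows whose text mentions 北京 (Beijing) or 上海 (Shanghai); alternation matches the literal names,
# whereas a character class such as [北京]{2} would also match 京北 or 北北
df[df['col1'].str.contains(r'北京|上海')].head()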

	col1	col2	col3
4	上海开学日期延至3月	-95	4.05
5	北京新增25例确诊病例，累计确诊253例	-4	-289.1719
6	上海新增10例，累计243例	2	-73.7105
36	上海新增14例累计233例	-55	-83
40	上海新增14例累计233例	-88	-99
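A character class such as `[北京]{2}` matches any two consecutive characters drawn from the set, in any order, while the alternation `北京|上海` matches only the literal city names. The sketch below shows the difference on a few made-up titles (they are not taken from the CSV).

```python
import pandas as pd

# Made-up sample titles; '海上日出' contains the characters of 上海 in reverse order
titles = pd.Series(['北京新增25例', '上海开学日期延至3月', '海上日出', '河北新增17例'],
                   dtype="string")

# Character classes: any two consecutive characters from each set
by_class = titles.str.contains(r'[北京]{2}|[上海]{2}')

# Alternation: the literal words only
by_alt = titles.str.contains(r'北京|上海')

print(by_class.tolist())
print(by_alt.tolist())
```

The third title is matched by the character-class pattern but not by the alternation, which is the false positive to watch for.
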

df['col2'][~(df['col2'].str.replace(r'-?\d+', 'True', regex=True) == 'True')]  # replace each integer with 'True'; values that do not become exactly 'True' are not valid integers

309    0-
396    9`
485    /7
Name: col2, dtype: object
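The same check can be written more directly. The sketch below uses a small made-up column in place of `df['col2']` and shows two alternatives: `str.fullmatch` with the integer pattern, and `pd.to_numeric` with `errors='coerce'`. Note that `pd.to_numeric` also accepts values such as `'1.5'`, so the two are not strictly equivalent.

```python
import pandas as pd

# Made-up stand-in for df['col2'], including the three malformed values shown above
col2 = pd.Series(['19', '-67', '0-', '9`', '/7'], dtype="object")

# Option 1: a value is a valid integer only if the whole string matches the pattern
bad_by_regex = col2[~col2.str.fullmatch(r'-?\d+')]

# Option 2: let pandas try the conversion; failures become NaN
coerced = pd.to_numeric(col2, errors='coerce')
bad_by_coerce = col2[coerced.isna()]

print(bad_by_regex.tolist())
print(bad_by_coerce.tolist())
```
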
