python按行数分割文本,根据两列中的文本拆分行（Python，Pandas）

最新推荐文章于 2023-01-14 20:57:57 发布

晓暮云君

最新推荐文章于 2023-01-14 20:57:57 发布

阅读量378

点赞数

文章标签： python按行数分割文本

This is my dataframe (with many more letters and a length of ~35.5k) and stuff where the – are other relevant strings). All the variables are strings and ['C1','C2'] is the MultiIndex.

tmp

C1 C2 C3 C4 C5 Start End C8

A 1 - - - 12 14 -

A 2 - - - 1,4,7 3,6,10 -

A 3 - - - 16,19 17,21 -

A 4 - - - 22 24 -

I need it to become this (split every row that contains commas maintaining everything else):

C1 C2 C3 C4 C5 Start End C8 Appearance

A 1 - - - 12 14 - 1

A 2 - - - 1 3 - 1

A 2 - - - 4 6 - 2

A 2 - - - 7 10 - 3

A 3 - - - 16 17 - 1

A 3 - - - 19 21 - 2

A 4 - - - 22 24 - 1

as

s = tmp['Start'].str.split(',').apply(Series, 1).stack()

s.index = s.index.droplevel(-1)

s.name = 'Start

del tmp['Start']

final = tmp.join(s)

But then the result is much larger than it should! I get thousands of repeats and this is just trying to split 'Start'. I can't even imagine trying to do so for both Start and End (every comma in 'Start' implies a comma in 'End'.

Lengths:

tmp = 35568

s = 35676

final = 293408

解决方案

You can create new df from s1 and s2 and then join. Also better is use parameter expand=True in str.split and delete multiple columns by drop:

For creating column Appearance use groupby by index with cumcount.

s1 = tmp['Start'].str.split(',', expand=True).stack()

s1.index = s1.index.droplevel(-1)

s1.name = 'Start'

s2 = tmp['End'].str.split(',', expand=True).stack()

s2.index = s2.index.droplevel(-1)

s2.name = 'End'

tmp.drop(['Start', 'End'], inplace=True, axis=1)

df = pd.DataFrame({'s1':s1, 's2':s2}, index=s1.index)

final = tmp.join(df)

final['Appearance'] = final.groupby(final.index).cumcount() + 1

print (final)

C1 C2 C3 C4 C5 C8 s1 s2 Appearance

0 A 1 - - - - 12 14 1

1 A 2 - - - - 1 3 1

1 A 2 - - - - 4 6 2

1 A 2 - - - - 7 10 3

2 A 3 - - - - 16 17 1

2 A 3 - - - - 19 21 2

3 A 4 - - - - 22 24 1

EDIT by comment:

You can try reset_index first:

print (tmp)

C3 C4 C5 Start End C8

C1 C2

A 1 - - - 12 14 -

2 - - - 1,4,7 3,6,10 -

3 - - - 16,19 17,21 -

4 - - - 22 24 -

tmp.reset_index(inplace=True)

print (tmp)

C1 C2 C3 C4 C5 Start End C8

0 A 1 - - - 12 14 -

1 A 2 - - - 1,4,7 3,6,10 -

2 A 3 - - - 16,19 17,21 -

3 A 4 - - - 22 24 -

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python按行数分割文本,根据两列中的文本拆分行（Python，Pandas）

This is my dataframe (with many more letters and a length of ~35.5k) and stuff where the – are other relevant strings). All the variables are strings and ['C1','C2'] is the MultiIndex.tmpC1 C2 ...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。