python按行数分割文本,根据两列中的文本拆分行(Python,Pandas)

This is my dataframe (with many more letters and a length of ~35.5k) and stuff where the – are other relevant strings). All the variables are strings and ['C1','C2'] is the MultiIndex.

tmp

C1 C2 C3 C4 C5 Start End C8

A 1 - - - 12 14 -

A 2 - - - 1,4,7 3,6,10 -

A 3 - - - 16,19 17,21 -

A 4 - - - 22 24 -

I need it to become this (split every row that contains commas maintaining everything else):

C1 C2 C3 C4 C5 Start End C8 Appearance

A 1 - - - 12 14 - 1

A 2 - - - 1 3 - 1

A 2 - - - 4 6 - 2

A 2 - - - 7 10 - 3

A 3 - - - 16 17 - 1

A 3 - - - 19 21 - 2

A 4 - - - 22 24 - 1

as

s = tmp['Start'].str.split(',').apply(Series, 1).stack()

s.index = s.index.droplevel(-1)

s.name = 'Start

del tmp['Start']

final = tmp.join(s)

But then the result is much larger than it should! I get thousands of repeats and this is just trying to split 'Start'. I can't even imagine trying to do so for both Start and End (every comma in 'Start' implies a comma in 'End'.

Lengths:

tmp = 35568

s = 35676

final = 293408

解决方案

You can create new df from s1 and s2 and then join. Also better is use parameter expand=True in str.split and delete multiple columns by drop:

For creating column Appearance use groupby by index with cumcount.

s1 = tmp['Start'].str.split(',', expand=True).stack()

s1.index = s1.index.droplevel(-1)

s1.name = 'Start'

s2 = tmp['End'].str.split(',', expand=True).stack()

s2.index = s2.index.droplevel(-1)

s2.name = 'End'

tmp.drop(['Start', 'End'], inplace=True, axis=1)

df = pd.DataFrame({'s1':s1, 's2':s2}, index=s1.index)

final = tmp.join(df)

final['Appearance'] = final.groupby(final.index).cumcount() + 1

print (final)

C1 C2 C3 C4 C5 C8 s1 s2 Appearance

0 A 1 - - - - 12 14 1

1 A 2 - - - - 1 3 1

1 A 2 - - - - 4 6 2

1 A 2 - - - - 7 10 3

2 A 3 - - - - 16 17 1

2 A 3 - - - - 19 21 2

3 A 4 - - - - 22 24 1

EDIT by comment:

You can try reset_index first:

print (tmp)

C3 C4 C5 Start End C8

C1 C2

A 1 - - - 12 14 -

2 - - - 1,4,7 3,6,10 -

3 - - - 16,19 17,21 -

4 - - - 22 24 -

tmp.reset_index(inplace=True)

print (tmp)

C1 C2 C3 C4 C5 Start End C8

0 A 1 - - - 12 14 -

1 A 2 - - - 1,4,7 3,6,10 -

2 A 3 - - - 16,19 17,21 -

3 A 4 - - - 22 24 -

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值