Python regular expression split example: splitting paragraphs with a Python regular expression

How would one write a regular expression in Python to split text into paragraphs?

A paragraph break is defined as 2 line breaks (\n). Any number of spaces/tabs may appear together with the line breaks, and it should still be treated as a single paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of (?P...) stuff).

Examples:

the_str = 'paragraph1\n\nparagraph2'

# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'

# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'

# should yield ['p1', '\n\tp2']

The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re

paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

but that is ugly. Anything better?

EDIT:

Suggestions rejected:

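r'\s*?\n\s*?\n\s*?' -> The lazy trailing \s*? matches nothing, so example 2 yields '\tp3' instead of 'p3'. Making the quantifiers greedy does not help: \s includes \n, so a paragraph break could then swallow more than 2 \ns, and example 3 would yield ['p1', 'p2'].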

Solution

Unfortunately there's no nice way to write "space but not a newline".

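I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable: (?x) (?: [ \t\r\f\v]* \n ){2} [ \t\r\f\v]*

The quantifiers are greedy on purpose: a lazy trailing [ \t\r\f\v]*? would match nothing and leave leading whitespace on the next paragraph.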
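As a rough sketch of the x-modifier idea, the pattern can be spread over several lines with comments. The names PARAGRAPH_BREAK_VERBOSE and split_paragraphs_verbose are made up for this illustration, and the quantifiers are greedy so that whitespace before the next paragraph is consumed by the separator:

```python
import re

# Hypothetical name; re.VERBOSE ignores whitespace outside character classes,
# so the literal space inside [ \t\r\f\v] is still significant.
PARAGRAPH_BREAK_VERBOSE = re.compile(r"""
    (?: [ \t\r\f\v]* \n ){2}   # two newlines, each preceded by optional non-newline whitespace
    [ \t\r\f\v]*               # optional non-newline whitespace before the next paragraph
""", re.VERBOSE)


def split_paragraphs_verbose(text):
    """Split text on two newlines that may be surrounded by spaces/tabs."""
    return PARAGRAPH_BREAK_VERBOSE.split(text)


print(split_paragraphs_verbose('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'))
```

Whether this reads better than the one-line version is a matter of taste; it matches the same strings as the pattern in the question.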

You could also try creating a subrule just for the character class and interpolating it three times.
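The subrule suggestion might look something like the following. This is only a sketch: WS, PARAGRAPH_BREAK and split_paragraphs are names invented here, and the character class is the one from the question, used with a greedy quantifier:

```python
import re

# Subrule: any run of whitespace that \s would match in ASCII, except the newline.
WS = r'[ \t\r\f\v]*'

# Interpolate the subrule three times around the two required newlines.
PARAGRAPH_BREAK = re.compile(WS + r'\n' + WS + r'\n' + WS)


def split_paragraphs(text):
    """Split text into paragraphs on exactly two newlines plus optional spaces/tabs."""
    return PARAGRAPH_BREAK.split(text)


for sample in ('paragraph1\n\nparagraph2',
               'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3',
               'p1\n\n\n\tp2'):
    print(split_paragraphs(sample))
```

Compiling the pattern once also avoids rebuilding it on every call if many strings are split.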
