Python Word Boundary Matching

cx元

Last modified: 2022-04-21 18:05:44

First published: 2022-04-21 17:18:17

Article link: https://blog.csdn.net/qq_45436365/article/details/124326489

I have recently been working on an NLP project that involves string processing.

Requirement

Replace English abbreviations with their full forms, for example pls with please and BTW with by the way.

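Points to note

Only standalone words may be replaced; letters that happen to appear inside a longer word must be left alone. For example, a word that contains pls, such as plsgiocephalic, should stay unchanged rather than become pleasegiocephalic. This is where word-boundary matching in Python regular expressions comes in.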
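The pitfall is easy to reproduce with plain str.replace, which has no notion of word boundaries. A minimal sketch; the sample sentence is invented for illustration:

```python
# Naive substring replacement: every occurrence of "pls" is replaced,
# including the one buried inside a longer word.
text = "pls check plsgiocephalic"  # sample input, made up for this demo
naive = text.replace("pls", "please")
print(naive)  # please check pleasegiocephalic
```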

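Solution

Word boundaries are matched with \b. Whichever side of the pattern it is written on is the side where a boundary is required. For example:

Matching the left boundary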

import re
# s is the input string to process
def replaceAcronyms(s):
    # Replace txs with thanks; re.I makes the match case-insensitive
    findAcro = re.compile(r"\btxs", re.I)
    s = re.sub(findAcro, "thanks", s)
    return s
test = "aaTXSbb,txsbb,TXS"
print(replaceAcronyms(test))

output:
aaTXSbb,thanksbb,thanks

Matching the right boundary

import re
# s is the input string to process
def replaceAcronyms(s):
    # Replace txs with thanks; re.I makes the match case-insensitive
    findAcro = re.compile(r"txs\b", re.I)
    s = re.sub(findAcro, "thanks", s)
    return s
test = "aaTXSbb,aatxs,TXS"
print(replaceAcronyms(test))

output:
aaTXSbb,aathanks,thanks

Matching both boundaries

# The re module is required
import re
# s is the input string to process
def replaceAcronyms(s):
    # Replace txs with thanks; re.I makes the match case-insensitive
    findAcro = re.compile(r"\btxs\b", re.I)
    s = re.sub(findAcro, "thanks", s)
    return s
test = "aaTXSbb,txs,TXS"
print(replaceAcronyms(test))

output:
aaTXSbb,thanks,thanks
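The same two-sided pattern can be extended to a whole table of abbreviations, which is closer to the requirement stated at the top. The following is only a sketch: the ACRONYMS table, the expand_acronyms name, and the sample sentence are assumptions made for illustration, not part of the original project.

```python
import re

# Hypothetical abbreviation table (keys in lower case); extend as needed.
ACRONYMS = {
    "pls": "please",
    "btw": "by the way",
    "txs": "thanks",
}

# One alternation wrapped in \b...\b. re.escape keeps any special
# characters in the keys from being read as regex syntax.
_PATTERN = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(k) for k in ACRONYMS),
    re.I,
)

def expand_acronyms(s):
    # Matching ignores case, so look the matched text up in lower case.
    return _PATTERN.sub(lambda m: ACRONYMS[m.group(0).lower()], s)

result = expand_acronyms("BTW, pls check plsgiocephalic, txs")
print(result)  # by the way, please check plsgiocephalic, thanks
```

Compiling a single pattern means the input is scanned once, instead of once per abbreviation.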

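*** Note ***
If matching both boundaries does not work, for example when the word to match is held in a variable, the pattern can be built as follows. re.escape keeps any special characters in the word literal, and the variable is named word so that it does not shadow the built-in str.

import re
word = "your str"
findAcro = re.compile(r"\b%s\b" % re.escape(word), re.I)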
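A separate, common reason for a \b pattern not matching is a missing r prefix: in an ordinary string literal, "\b" is the backspace character rather than a regex word boundary. Whether this is the failure the note above refers to is an assumption. A quick check:

```python
import re

# Without the r prefix, "\b" is a single backspace character (0x08).
assert "\b" == "\x08"
# With the r prefix it is two characters: a backslash followed by "b".
assert len(r"\b") == 2

not_raw = re.compile("\btxs\b", re.I)  # searches for backspace + txs + backspace
raw = re.compile(r"\btxs\b", re.I)     # searches for txs as a whole word

print(not_raw.sub("thanks", "txs"))  # txs (unchanged)
print(raw.sub("thanks", "txs"))      # thanks
```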