python提取字符串中的　中文　日文　韩文

最新推荐文章于 2022-12-06 21:30:36 发布

luoganttcc

最新推荐文章于 2022-12-06 21:30:36 发布

阅读量5.2k

点赞数 2

分类专栏：自然语言处理

本文链接：https://blog.csdn.net/luoganttcc/article/details/100926800

版权

自然语言处理专栏收录该内容

35 篇文章 3 订阅

订阅专栏

import imp
imp.reload(sys)
 
s=""" 
 en: Regular expression is a powerful tool for manipulating text. 
 zh: 汉语是世界上最优美的语言，正则表达式是一个很有用的工具 
 jp: 正規表現は非常に役に立つツールテキストを操作することです。 
 jp-char: あアいイうウえエおオ 
 kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다. 
 """ 
print ("原始utf8字符" )
#utf8 
print ("--------" )
print( repr(s) )
print( "--------\n" )

原始utf8字符
--------
' \n en: Regular expression is a powerful tool for manipulating text. \n zh: 汉语是世界上最优美的语言，正则表达式是一个很有用的工具 \n jp: 正規表現は非常に役に立つツールテキストを操作することです。 \n jp-char: あアいイうウえエおオ \n kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다. \n '
--------

非ansi

#非ansi 
re_words=re.compile(r"[\x80-\xff]+") 
#m = re_words.search(s,0) 
m1=re.findall(re_words, s)


print ("非ansi字符" )
print ("--------" )
print (m1 )
#print (m.group() )
print ("--------\n" )

非ansi字符
--------
[]
--------

中文

re_words = re.compile(u"[\u4e00-\u9fa5]+") 
#m = re_words.search(s) 
m1=re.findall(re_words, s)
#print(''.join(m1))
print( "unicode 中文" )
print(m1)
print( "--------" )

unicode 中文
['汉语是世界上最优美的语言', '正则表达式是一个很有用的工具', '正規表現', '非常', '役', '立', '操作']
--------

韩文

 
#unicode korean 
re_words=re.compile(u"[\uac00-\ud7ff]+") 
#m = re_words.search(s,0) 
m1=re.findall(re_words, s)
print( "unicode 韩文" )
print(m1)
print( "--------\n" )

unicode 韩文
['정규', '표현식은', '매우', '유용한', '도구', '텍스트를', '조작하는', '것입니다']
--------

日文片假名

#unicode japanese katakana 
re_words=re.compile(u"[\u30a0-\u30ff]+") 
#m = re_words.search(s,0) 
m1=re.findall(re_words, s)
print( "unicode 日文 片假名" )
print ("--------" )

print(m1)
print( "--------\n" )

unicode 日文 片假名
--------
['ツールテキスト', 'ア', 'イ', 'ウ', 'エ', 'オ']
--------

日文平假名

#unicode japanese hiragana 
re_words=re.compile(u"[\u3040-\u309f]+") 
#m = re_words.search(s,0) 
m1=re.findall(re_words, s)
print( "unicode 日文 平假名" )
print ("--------" )

print(m1)
print( "--------\n" )

unicode 日文 平假名
--------
['は', 'に', 'に', 'つ', 'を', 'することです', 'あ', 'い', 'う', 'え', 'お']
--------

标点符号

#unicode cjk Punctuation 
re_words=re.compile(u"[\u3000-\u303f\ufb00-\ufffd]+") 
#m = re_words.search(s,0) 
m1=re.findall(re_words, s)
print( "unicode 标点符号" )
print ("--------" )

print(m1)
print( "--------\n" )

unicode 标点符号
--------
['，', '。']
--------

luoganttcc

关注

2
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
python提取字符串中的　中文　日文　韩文

import impimp.reload(sys) s=""" en: Regular expression is a powerful tool for manipulating text. zh: 汉语是世界上最优美的语言，正则表达式是一个很有用的工具 jp: 正規表現は非常に役に立つツールテキストを操作することです。 jp-char: あアいイうウえエおオ kr:정...
复制链接

扫一扫