NLP (4): Regular Expressions

巷中人

Original article: http://www.cnblogs.com/peng8098/p/nlp_4.html

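* + ?
* : zero or more occurrences
+ : one or more occurrences
? : zero or one occurrence
re.search() scans the string for the first location where the pattern matches. It returns a match object (which is truthy) on success and None otherwise, not a literal True.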

import re

# Matching helper. Inputs: the text and the pattern (i.e. the regular expression)
def text_match(text,patterns):
    if re.search(patterns,text):
        return 'Found a match!'
    else:
        return 'Not matched!'

# Tests
print(text_match('ac','ab?'))
print(text_match('abc','ab?'))
print(text_match('abbc','ab?'))

print(text_match('ac','ab*'))
print(text_match('abc','ab*'))
print(text_match('abbc','ab*'))

print(text_match('ac','ab+'))
print(text_match('abc','ab+'))
print(text_match('abbc','ab+'))

print(text_match('abbc','ab{2}'))

print(text_match('aabbbbc','ab{3,5}?'))

Output:

Found a match!
Found a match!
Found a match!
Found a match!
Found a match!
Found a match!
Not matched!
Found a match!
Found a match!
Found a match!
Found a match!
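
Since `re.search()` hands back a match object rather than a boolean, the matched text itself can be inspected. The following is a minimal sketch of that idea; the helper name `matched_text` is a hypothetical one introduced here, not part of the original article. It shows what each quantifier actually consumes, including the difference between the greedy `{3,5}` and the lazy `{3,5}?`.

```python
import re

def matched_text(text, pattern):
    """Return the substring matched by pattern, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(matched_text('abbc', 'ab?'))          # ab
print(matched_text('abbc', 'ab*'))          # abb
print(matched_text('ac', 'ab+'))            # None
print(matched_text('aabbbbc', 'ab{3,5}'))   # abbbb (greedy: takes as many b's as allowed)
print(matched_text('aabbbbc', 'ab{3,5}?'))  # abbb  (lazy: stops at the minimum of 3)
```

Both `ab{3,5}` and `ab{3,5}?` report a match on the same input, so a yes/no helper like `text_match` cannot tell them apart; only the matched substring does.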

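$ ^ .
$ : end of the string
^ : start of the string
. : any character except a newline
\w : a letter, digit, or underscore
\s : a whitespace character (space, tab, newline, etc.)
\S : a non-whitespace character
\b : a word boundary (the empty position between a \w character and a non-\w character)
\B : a position that is not a word boundary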

import re
def text_match(text,patterns):
    if re.search(patterns,text):
        return 'Found a match!'
    else:
        return 'Not matched!'

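# Starts with a and ends with c, with anything in between
print(text_match('abbc',r'^a.*c$'))

# Starts with one or more word characters
print(text_match('Tuffy eats pie, Loki eats peas!',r'^\w+'))

# Ends with one or more \w followed by zero or more non-whitespace characters; the \S* after \w+ picks up trailing punctuation
print(text_match('Tuffy eats pie, Loki eats peas!',r'\w+\S*$'))

# Contains a u in the middle of a word (not at either edge)
print(text_match('Tuffy eats pie, Loki eats peas!',r'\Bu\B'))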

Output:

Found a match!
Found a match!
Found a match!
Found a match!
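
To make the difference between `\b`, `\B` and `\s` more concrete, here is a small sketch on the same sentence. The variable names are illustrative choices made here, not taken from the original article.

```python
import re

text = 'Tuffy eats pie, Loki eats peas!'

# \b is the empty position between a word character and a non-word character,
# so \bp only matches a p that begins a word.
words_starting_with_p = re.findall(r'\bp\w*', text)
print(words_starting_with_p)   # ['pie', 'peas']

# \B matches only where there is no such boundary, so the u must sit inside a word.
inner_u_positions = [m.start() for m in re.finditer(r'\Bu\B', text)]
print(inner_u_positions)       # [1]

# \s matches any whitespace character, not only the space character.
whitespace = re.findall(r'\s', 'a b\tc\nd')
print(whitespace)              # [' ', '\t', '\n']
```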

String matching
re.search(pattern,text) : checks whether pattern occurs anywhere in text
re.finditer(pattern,text) : iterates over every non-overlapping match of pattern in text

import re

patterns = ['Tuffy','Pie','Loki']
text = 'Tuffy eats pie, Loki eats peas!'

# Search for each string
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern,text))
    # For a case-insensitive search, pass flags=re.IGNORECASE
    if re.search(pattern,text):
        print('Found!')
    else:
        print('Not Found!')

# Search for a string and report where each occurrence is
pattern = 'eats'
for match in re.finditer(pattern,text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d'%(text[s:e],s,e))

Output:

Searching for "Tuffy" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Searching for "Pie" in "Tuffy eats pie, Loki eats peas!" ->
Not Found!
Searching for "Loki" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Found "eats" at 6:10
Found "eats" at 21:25
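
The search for "Pie" fails above only because matching is case-sensitive by default. A minimal sketch of the same search with the `re.IGNORECASE` flag follows; the variable names are chosen here for illustration.

```python
import re

text = 'Tuffy eats pie, Loki eats peas!'

# Default behaviour: 'Pie' does not match the lowercase 'pie'.
case_sensitive = re.search('Pie', text)
print(case_sensitive)                # None

# With re.IGNORECASE the same pattern matches regardless of letter case.
case_insensitive = re.search('Pie', text, flags=re.IGNORECASE)
print(case_insensitive.group(0))     # pie
print(case_insensitive.span())       # (11, 14)
```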

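Dates and character sets (or character ranges)
\d : a digit
re.compile() : compiles a pattern string into a reusable regular expression object
Everything inside square brackets [] is in an OR relationship: the set matches any one of the listed characters, and a leading ^ inside the brackets negates the set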

import re
url = 'http://www.awdawd.com/da/wda/2019/7/2/wda.html'

# YYYY/MM/DD (month and day may be one or two digits)
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})'
print('Date found in the URL :',re.findall(date_regex,url))

# Returns False if the string contains any character other than letters, digits, or '.'
def is_allowed_specific_char(string):
    charRe = re.compile(r'[^a-zA-Z0-9.]')
    string = charRe.search(string)
    return not bool(string)

print(is_allowed_specific_char('adIDHihdHDIh.'))
print(is_allowed_specific_char('*#$%^&!{}'))

Output:

Date found in the URL : [('2019', '7', '2')]
True
False
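
As a variation on the two snippets above, the date fields can be given names, and the character check can be written positively with `fullmatch()`. This is only a sketch: `date_pattern`, `date_parts`, `allowed` and `only_allowed_chars` are names introduced here, and unlike `is_allowed_specific_char` the positive version rejects the empty string.

```python
import re

url = 'http://www.awdawd.com/da/wda/2019/7/2/wda.html'

# Named groups make the extracted fields self-describing.
date_pattern = re.compile(r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})')
m = date_pattern.search(url)
date_parts = {name: int(value) for name, value in m.groupdict().items()}
print(date_parts)   # {'year': 2019, 'month': 7, 'day': 2}

# Positive form of the character check: the whole string must consist of
# one or more allowed characters (letters, digits, or '.').
allowed = re.compile(r'[a-zA-Z0-9.]+')

def only_allowed_chars(string):
    return allowed.fullmatch(string) is not None

print(only_allowed_chars('adIDHihdHDIh.'))   # True
print(only_allowed_chars('*#$%^&!{}'))       # False
print(only_allowed_chars(''))                # False (the negated-set version returns True here)
```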

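Finding all five-letter words and replacing a word with its abbreviation

import re

# Replace a word with its abbreviation
street = '21 Ramkrishna Road'
print(re.sub('Road','Rd',street))

# Find every word that is exactly 5 characters long
text = 'Tuffy eats pie, Loki eats bread!'
print(re.findall(r'\b\w{5}\b',text))

Output: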

21 Ramkrishna Rd
['Tuffy', 'bread']
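
One caveat with the plain `re.sub('Road','Rd',...)` call is that it also rewrites "Road" when it is only part of a longer word. The sketch below uses a made-up address (not from the original article) to show how the `\b` anchors already used in the five-letter search avoid that.

```python
import re

# Hypothetical address chosen to contain 'Road' inside another word.
street = '21 Roadhouse Road'

# Without boundaries, the substring inside 'Roadhouse' is replaced too.
naive = re.sub('Road', 'Rd', street)
print(naive)      # 21 Rdhouse Rd

# With \b on both sides, only the standalone word is replaced.
bounded = re.sub(r'\bRoad\b', 'Rd', street)
print(bounded)    # 21 Roadhouse Rd
```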

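A regex-based tokenizer

import re

raw = 'I am big!  It\'s the pictures that got small.'

# Split on one or more spaces
print(re.split(r' +',raw))

# Split on runs of characters that are not letters, digits, or underscores
print(re.split(r'\W+',raw))

# Tokenize by matching instead of splitting: a word, or a non-whitespace character followed by word characters
print(re.findall(r'\w+|\S\w*',raw))

Output: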

['I', 'am', 'big!', "It's", 'the', 'pictures', 'that', 'got', 'small.']
['I', 'am', 'big', 'It', 's', 'the', 'pictures', 'that', 'got', 'small', '']
['I', 'am', 'big', '!', 'It', "'s", 'the', 'pictures', 'that', 'got', 'small', '.']
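
The matching approach splits "It's" into "It" and "'s". If contractions should stay in one piece, one possible variation is sketched below. The pattern and the names `TOKEN_RE` and `tokenize` are assumptions made here for illustration, not something the original article proposes.

```python
import re

raw = "I am big!  It's the pictures that got small."

# Either a word with an optional apostrophe part (It's, don't),
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)

print(tokenize(raw))
# ['I', 'am', 'big', '!', "It's", 'the', 'pictures', 'that', 'got', 'small', '.']
```

Whether "It's" should be one token or two depends on what the downstream step expects, so neither pattern is the right one in general.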

A regex-based stemmer

import re

# A home-made stemmer: strip one common suffix, if present
def stem(word):
    split = re.findall(r'^(.*?)(ing|ly|ed|ies|ive|es|s|ment)?$',word)
    stem = split[0][0]
    return stem

# Tokenize with the regex from the previous section
raw = 'Keep your friends close, but your enemies closer.'
tokens = re.findall(r'\w+|\S\w*',raw)
print(tokens)

# Tests
for t in tokens:
    print("'",stem(t),"'")

Output:

['Keep', 'your', 'friends', 'close', ',', 'but', 'your', 'enemies', 'closer', '.']
' Keep '
' your '
' friend '
' close '
' , '
' but '
' your '
' enem '
' closer '
' . '
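
To see what the stemmer's pattern is doing, it helps to look at both capture groups rather than only the first. The sketch below reuses the same regular expression; the names `SUFFIX_RE` and `split_suffix` and the extra test words are introduced here. It also shows that a purely suffix-based rule over-stems short words.

```python
import re

# Same pattern as in stem(): a lazy prefix followed by an optional known suffix.
SUFFIX_RE = re.compile(r'^(.*?)(ing|ly|ed|ies|ive|es|s|ment)?$')

def split_suffix(word):
    """Return (stem, suffix); suffix is '' when none of the listed endings applies."""
    stem, suffix = SUFFIX_RE.findall(word)[0]
    return stem, suffix

for word in ['friends', 'enemies', 'closer', 'sing', 'bed']:
    print(word, '->', split_suffix(word))
# friends -> ('friend', 's')
# enemies -> ('enem', 'ies')
# closer -> ('closer', '')
# sing -> ('s', 'ing')
# bed -> ('b', 'ed')
```

The last two results are over-stemmed: the rule has no notion of a minimum stem length or of which words really carry a suffix, which is the usual limitation of this kind of hand-written stemmer.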
