python REfO模块使用入门

python REfO模块使用入门

在国内基本找不到关于REfO的使用的博客,转载一篇外网的使用实例
链接: Regular Expressions for Objects.
For work I recently needed to do something that is very similar to regexes, but with a twist: it should operate on lists of objects, not only on strings. Luckily, Python came to the rescue with REfO, a library for doing just this.

My usecase was selecting phrases from Part-of-Speech (POS) annotated text. The text was lemmatized and tagged using SpaCy and it resulted in lists of the following form:

s = [['i', 'PRON'], ['look', 'VERB'], ['around', 'ADP'], ['me', 'PRON'], ['and', 'CCONJ'], ['see', 'VERB'], ['that', 'ADP'], ['everyone', 'NOUN'], ['be', 'VERB'], ['run', 'VERB'], ['around', 'ADV'], ['in', 'ADP'], ['a', 'DET'], ['hurry', 'NOUN']]

From these sentences we want to extract human action phrases and noun phrases, which are defined as follows, using regex-like notation:

human_action = ("he"|"she"|"i"|"they"|"we") ([VERB] [ADP])+ 
noun_phrase = [DET]? ([ADJ] [NOUN])+

Translated to English this means that human actions are defined as 1st and 3rd person, singular and plural pronouns followed by repeated groups of verbs and ad­po­si­tions (in, to, during). Noun phrases are composed of an optional determiner (a, an, the) followed by repeated groups of adjectives and nouns.

Most standard regex libraries won’t help you with this, because they work only on strings. But this problem is still perfectly well described by regular grammars, so after a bit of Googling I found REfO and it’s super simple to use, albeit you have to read the source code, because it doesn’t really have doc­u­men­ta­tion.

REfO is a bit more verbose than normal regular ex­pres­sions, but at least it tries to stay close to usual regex notions. Lazy repetition (*) is done using the refo.Star operator, while greedy one (+) is refo.Plus . The only new operator is refo.Predicate, which takes a function which takes a parameter and matches if that function returns true when called with the element at that position. Using this we will build the functions we need:

def pos(pos):    
	return refo.Predicate(lambda x: x[1] == pos) 
    
def humanpron():    
	return refo.Predicate(lambda x: x[1] == 'PRON' and x[0] in {'i', 'he', 'she', 'we', 'they'})

For matching POS, we use a helper to create a function that will match the given tag. For matching human pronouns, we also check the words, not just the POS tag.

np = refo.Question(pos('DET')) + refo.Plus(refo.Question(pos('ADJ')) + pos('NOUN'))
humanAction = humanpron() + refo.Plus(pos('VERB') + pos('ADP'))

Then we just compose our functions and con­cate­nate them and we got what we wanted. Using them is simple. You either call refo.search, which finds the first match or refo.finditer which returns an iterable over all matches.

for match in refo.finditer(humanAction, s):   
	start = match.start()    
    end = match.end()    
    print(s[start:end])
[[u'i', u'PRON'], [u'look', u'VERB'], [u'around', u'ADP']]

So, it’s always good to Google around for a solution, because my first instict to whip up a parser in Parsec would have lead to a much more com­pli­cat­ed solution. This is nice, elegant, short and efficient.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值