Chapter 2 Regular Expressions, Text Normalization, Edit Distance

This post is a set of reading notes on Speech and Language Processing (ed3). It covers the basic patterns of regular expressions, disjunction, grouping, and precedence, as well as tokenization and normalization in text normalization, and the concept of edit distance. Regular expressions are used to specify text search strings, while text normalization converts text into a more convenient, standard form. Edit distance measures the similarity of two strings based on the number of insertion, deletion, and substitution operations needed to turn one into the other.

Reading notes on Speech and Language Processing, 3rd edition (ed3).

text normalization: converting text to a more convenient, standard form.

  • tokenization: the task of separating out (tokenizing) words from running text.
  • lemmatization: the task of determining that two words have the same root, despite their surface differences.
  • stemming: strip suffixes from the end of the word.
  • sentence segmentation: breaking up a text into individual sentences, using cues like
    periods or exclamation points.

edit distance: measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.
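
As a quick preview (a minimal sketch, not the book's later minimum edit distance algorithm with its cost details), this count can be computed with dynamic programming, assuming each insertion, deletion, and substitution costs 1:

def edit_distance(source, target):
    # dp[i][j] = edits needed to turn source[:i] into target[:j]
    n, m = len(source), len(target)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all characters of source[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all characters of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or copy)
    return dp[n][m]

print(edit_distance("graffe", "giraffe"))  # 1 (one insertion)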

2.1 Regular Expressions

regular expression (RE): an algebraic notation for characterizing a set of strings, a language for specifying text search strings.

Python code:

#find the first match: lookbehind (?<=...) and lookahead (?=...) extract the
#text between <h1> and </h1> without including the tags themselves
import re
key = "<html><body><h1>hello world</h1></body></html>"
p1 = r"(?<=<h1>).+?(?=</h1>)"  # non-greedy .+? keeps the match inside one tag pair
pattern1 = re.compile(p1)
matcher1 = re.search(pattern1, key)
print(matcher1.group(0))  # hello world

#find all matches: \b marks a word boundary, so "Columna" is not matched
key = "Column 1 Column 2 Column 3 Columna"
p2 = r"\bColumn\b"  # raw string: \b is a word boundary, not a backspace
pattern2 = re.compile(p2)
print(pattern2.findall(key))  # ['Column', 'Column', 'Column']

2.1.1 Basic Regular Expression Patterns

Regular expressions are case sensitive.

[]: disjunction; matches any one of the characters inside the brackets

/[wW]/: w or W

/[A-Z]/: an upper case letter

/[a-z]/: a lower case letter

When a caret ^ is the first symbol within a [], it means negation:

/[^A-Z]/: not an upper case letter

/[^Ss]/: neither ‘S’ nor ‘s’

/[^\.]/: not a period

/[e^]/: either ‘e’ or ‘^’

/a^b/: the pattern ‘a^b’

? means “the preceding character or nothing”:

/woodchucks?/: woodchuck or woodchucks

/colou?r/: color or colour

Kleene * (pronounced “cleany star”): zero or more occurrences of the immediately previous character or regular expression

/a*/: any string of zero or more a's

/aa*/: one or more a's (an a followed by zero or more a's)

/[ab]*/: zero or more a's or b's

Kleene +: one or more occurrences of the immediately preceding character or regular expression

/./: a wildcard expression that matches any single character (except a carriage return)

/beg.n/: begin, begun …

/aardvark.*aardvark/: to find any line in which a particular word, for example, aardvark, appears twice.

Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are the caret ^ and the dollar sign $. The caret ^ matches the start of a line. The pattern /^The/ matches the word The only at the start of a line.

Thus, the caret ^ has three uses:

to match the start of a line,

to indicate a negation inside of square brackets,

and just to mean a caret.

The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by $) is a useful pattern for matching a space at the end of a line, and /^The dog\.$/ matches a line that contains only the phrase The dog.

\b matches a word boundary. /\bthe\b/ matches the word the but not the word other.

\B matches a non-boundary

In Python string literals, \b is interpreted as a backspace character, so \b and \B seem not to work; write them as \\b and \\B, or use raw strings such as r"\bthe\b".
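
A short Python sketch of the anchors and boundary patterns above (the example strings are my own):

import re
# ^ and $ anchor a pattern to the start and end of the string (or line).
print(re.search(r"^The dog\.$", "The dog."))          # a Match covering the whole line
print(re.search(r"^The dog\.$", "He said The dog."))  # None: "The" is not at the start
# \b is a word boundary, \B a non-boundary; note the raw strings.
print(re.findall(r"\bthe\b", "other the"))  # ['the'] (the standalone word)
print(re.findall(r"\Bthe\B", "other the"))  # ['the'] (the "the" inside "other")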

2.1.2 Disjunction, Grouping, and Precedence

The disjunction operator (pipe) |: /cat|dog/ matches either the string cat or the string dog.

/gupp(y|ies)/ to match the string guppy or the string guppies

/(Column [0-9]+ *)*/ matches the string Column 1 Column 2 Column 3

operator precedence hierarchy, from highest to lowest:

Parenthesis              ()
Counters                 * + ? {}
Sequences and anchors    the ^my end$
Disjunction              |

Thus, because counters have a higher precedence than sequences, /the*/ matches theeeee but not thethe. Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not theny.
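
A small Python check of these precedence rules (the example strings are my own):

import re
# Counters bind tighter than sequences: /the*/ is "th" followed by e*, so it
# matches "theeeee" but matches "thethe" only as two separate "the" strings.
print(re.findall(r"the*", "theeeee thethe"))    # ['theeeee', 'the', 'the']
# Sequences bind tighter than disjunction: /the|any/ is (the)|(any), not th(e|a)ny,
# so "theny" only contains a match for "the".
print(re.findall(r"the|any", "the any theny"))  # ['the', 'any', 'the']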

We say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning of the ? qualifier. The operator *? is a Kleene star that matches as little text as possible. The operator +? is a Kleene plus that matches as little text as possible.
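
For example, in Python (the <h1> snippet mirrors the code near the top of these notes):

import re
s = "<h1>once</h1><h1>again</h1>"
print(re.findall(r"<h1>.*</h1>", s))   # greedy: ['<h1>once</h1><h1>again</h1>']
print(re.findall(r"<h1>.*?</h1>", s))  # non-greedy: ['<h1>once</h1>', '<h1>again</h1>']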

2.1.3 A Simple Example

The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The. Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:

  • Increasing precision (minimizing false positives)
  • Increasing recall (minimizing false negatives)
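
The simple example in the book refines a pattern for the word the through exactly this trade-off; a minimal Python version of that progression (the test sentence is my own):

import re
text = "The other one is there; the end."
print(re.findall(r"the", text))          # misses "The" (false negative); matches inside "other", "there" (false positives)
print(re.findall(r"[tT]he", text))       # fixes the missed "The", still matches inside "other" and "there"
print(re.findall(r"\b[tT]he\b", text))   # word boundaries remove those false positives: ['The', 'the']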

2.1.4 A More Complex Example

2.1.5 More Operators

RE    Expansion       Match                          First Matches
\d    [0-9]           any digit                      Party of 5 (the 5)
\D    [^0-9]          any non-digit                  Blue moon
\w    [a-zA-Z0-9_]    any alphanumeric/underscore    Daiyu
\W    [^\w]           a non-alphanumeric             !!!!
\s    [ \r\t\n\f]     whitespace (space, tab)        in Concord (the space)
\S    [^\s]           Non-whitespace                 in Concord

RE       Match
*        zero or more occurrences of the previous char or expression
+        one or more occurrences of the previous char or expression
?        exactly zero or one occurrence of the previous char or expression
{n}      n occurrences of the previous char or expression
{n,m}    from n to m occurrences of the previous char or expression
{n,}     at least n occurrences of the previous char or expression
{,m}     up to m occurrences of the previous char or expression
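
A quick Python illustration of \d and the counters (the example strings are my own):

import re
print(re.findall(r"\d{2,4}", "in 1999 and 2024, on day 5"))  # ['1999', '2024']: "5" is too short for {2,4}
print(re.findall(r"\ba{2,3}\b", "a aa aaa aaaa"))            # ['aa', 'aaa']
print(re.findall(r"\ba{3,}\b", "a aa aaa aaaa"))             # ['aaa', 'aaaa']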

2.1.6 Regular Expression Substitution, Capture Groups, and ELIZA

substitution

s/regexp1/pattern/: replace a string matched by the regular expression regexp1 with pattern.

number operator

s/([0-9]+)/<\1>/: add angle brackets to integers. For example, change the 35 boxes to the <35> boxes.
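
In Python, the same substitution can be written with re.sub (a minimal sketch):

import re
# \1 in the replacement string refers to what the first capture group matched.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))  # the <35> boxes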

/the (.*)er they were, the \1er they will be/

will match the bigger they were, the bigger they will be but not the bigger they were, the faster they will be.

\1 will be replaced by whatever string matched the first item in parentheses.

This use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. If you match two different sets of parentheses, \2 means whatever matched the second capture group. Thus

/the (.*)er they (.*), the \1er we \2/

will match the faster they ran, the faster we ran but not the faster they ran, the faster we ate.

Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).

/(?:some|a few) (people|cats) like some \1/

will match some cats like some cats but not some cats like some a few.
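
Checking this last example in Python (a minimal sketch using the strings above):

import re
# (?:some|a few) is non-capturing, so group 1 is (people|cats) and \1 must repeat it.
p = re.compile(r"(?:some|a few) (people|cats) like some \1")
print(bool(p.search("some cats like some cats")))   # True
print(bool(p.search("some cats like some a few")))  # False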

2.1.7 Lookahead assertions

The operator (?= pattern) is true if pattern occurs at that point, but is zero-width, i.e. the match pointer doesn’t advance. The operator (?! pattern) only returns true if a pattern does not match, but again is zero-width and doesn’t advance the cursor.

/(?<=pattern)/ is a lookbehind assertion: it is true only if pattern immediately precedes the current position (in the code near the top of these notes, (?<=<h1>) requires a preceding <h1>).

/(?=pattern)/ is the corresponding lookahead: it is true only if pattern immediately follows the current position (above, (?=</h1>) requires a following </h1>).
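
For instance, a negative lookahead can match a word at the start of a line only when it does not begin with Volcano (a minimal sketch; the example strings are my own):

import re
# ^(?!Volcano) succeeds only when the line does not start with "Volcano";
# the lookahead is zero-width, so [A-Za-z]+ still starts matching at position 0.
p = re.compile(r"^(?!Volcano)[A-Za-z]+")
print(p.search("Krakatoa erupted"))  # matches 'Krakatoa'
print(p.search("Volcano eruption"))  # None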

2.2 Words

corpus (plural corpora): a computer-readable collection of text or speech.

Punctuation is critical for finding boundaries of things (commas, periods, colons) and for identifying some aspects of meaning (question marks, exclamation marks, quotation marks). For some tasks, like part-of-speech tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if they were separate words.

An utterance is the spoken correlate of a sentence.

This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses.

We also sometimes keep disfluencies around. Disfluencies like uh or um are actually helpful in speech recognition in predicting the upcoming word, because they may signal that the speaker is restarting the clause or idea, and so for speech recognition they are treated as regular words. Because people use different disfluencies they can also be a cue to speaker identification.

A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense. The wordform is the full inflected or derived form of the word.

Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|. Tokens are the total number N of running words.

The relationship between the number of types |V| and number of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978) after its discoverers (in linguistics and information retrieval respectively). It is shown in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.

|V| = kN^β    (2.1)
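
A minimal sketch of counting tokens and types in Python; the regex-based tokenization is a crude stand-in for real tokenization, and the example sentence is my own choice:

import re
text = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"\w+", text.lower())  # crude word tokenization, lowercased
types = set(tokens)
print(len(tokens), len(types))             # 16 tokens (N), 14 types (|V|)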
