Note 1: Basic Text Processing

Regular Expressions: Disjunctions

Online regular expression test

1. Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit
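
As a rough illustration of the bracket patterns above, here is a minimal sketch using Python's standard `re` module. The sample sentence and variable names are made up for this example and are not part of the original notes.

```python
import re

# [wW] matches exactly one character: either "w" or "W".
woodchuck = re.compile(r"[wW]oodchuck")
# [1234567890] matches exactly one digit character.
digit = re.compile(r"[1234567890]")

# Hypothetical sample text, chosen only for illustration.
text = "Woodchuck or woodchuck? I saw 2 of them."

woodchuck_hits = woodchuck.findall(text)
digit_hits = digit.findall(text)

print(woodchuck_hits)  # ['Woodchuck', 'woodchuck']
print(digit_hits)      # ['2']
```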


2. Ranges [A-Z]

Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
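
The range patterns can be tried on the same example strings. This is only a sketch with Python's `re` module; `re.search` returns the first (leftmost) match, which is the character each range picks out first.

```python
import re

# (pattern, example string) pairs taken from the table above.
examples = [
    (r"[A-Z]", "Drenched Blossoms"),
    (r"[a-z]", "my beans were impatient"),
    (r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
]

# re.search scans left to right and stops at the first match.
first_matches = [re.search(pattern, s).group() for pattern, s in examples]

print(first_matches)  # ['D', 'm', '1']
```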


3. Negations [^Ss]  

    The caret means negation only when it is the first character inside []

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither 'S' nor 's'         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
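
A small sketch of negated sets in Python's `re` module follows. One caveat that depends on the regex engine: in Python, an unescaped `^` outside brackets is always treated as an anchor, so matching the literal text "a^b" needs the escaped form `a\^b`. The test strings are assumptions chosen for illustration.

```python
import re

# First character that is NOT an upper case letter.
first_not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# First character that is neither 'S' nor 's'.
first_not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()

# A caret that is not first inside [] is a literal caret,
# so [^e^] rejects both "e" and "^".
neither_e_nor_caret = re.findall(r"[^e^]", "e^x")

# Python-specific: outside brackets, ^ is an anchor, so it must be escaped
# to match a literal caret.
escaped = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")

print(first_not_upper)      # y
print(first_not_s)          # I
print(neither_e_nor_caret)  # ['x']
print(escaped)              # a^b
print(unescaped)            # None
```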


4. The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        same as [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
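
The pipe can be combined with bracket sets, as in the last pattern above. A minimal sketch in Python's `re` module, with a made-up sample sentence:

```python
import re

# Either alternative may match; each one allows an upper or lower case first letter.
animal = re.compile(r"[gG]roundhog|[Ww]oodchuck")

# Hypothetical sample text.
text = "Groundhog is another name for the woodchuck."
animal_hits = animal.findall(text)

# For single characters, a|b|c and [abc] accept exactly the same inputs.
pipe_hits = re.findall(r"a|b|c", "cabbage")
set_hits = re.findall(r"[abc]", "cabbage")

print(animal_hits)            # ['Groundhog', 'woodchuck']
print(pipe_hits == set_hits)  # True
```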



Regular Expressions: ? * + .

Pattern    Matches                       Example
colou?r    Optional previous char        color, colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+       b, then 2 or more a           baa baaa baaaa baaaaa
beg.n      . is any single char          begin begun beg3n
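
To see which strings each of these patterns accepts or rejects, here is a sketch in Python's `re` module. The helper `accepts` and the candidate word lists are hypothetical, introduced only for this illustration; `re.fullmatch` requires the whole word to match.

```python
import re

def accepts(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# ? makes the previous character optional (zero or one "u").
optional_u = accepts(r"colou?r", ["color", "colour", "colouur"])

# oo*h! needs one "o", then zero or more further "o"s.
star = accepts(r"oo*h!", ["h!", "oh!", "ooh!", "ooooh!"])

# o+h! needs one or more "o"s, so it accepts the same strings as oo*h!.
plus = accepts(r"o+h!", ["h!", "oh!", "ooh!", "ooooh!"])

# baa+ needs "ba" followed by one or more "a"s.
sheep = accepts(r"baa+", ["ba", "baa", "baaaa"])

# . stands for exactly one character of any kind.
dot = accepts(r"beg.n", ["begin", "begun", "beg3n", "began", "begn"])

print(optional_u)  # ['color', 'colour']
print(star)        # ['oh!', 'ooh!', 'ooooh!']
print(plus)        # ['oh!', 'ooh!', 'ooooh!']
print(sheep)       # ['baa', 'baaaa']
print(dot)         # ['begin', 'begun', 'beg3n', 'began']
```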


Anchors: ^ $

^ : matches at the start of a line

$ : matches at the end of a line

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 "Hello"
\.$           The end.
.$            The end?  The end!
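
The anchor patterns can be checked with a short sketch in Python's `re` module. Note one Python-specific detail: by default `^` and `$` anchor to the start and end of the whole string, and the `re.MULTILINE` flag makes them apply to every line. The multi-line sample text is made up for this example.

```python
import re

# ^[A-Z]: the string must start with an upper case letter.
starts_upper = bool(re.search(r"^[A-Z]", "Palo Alto"))

# ^[^A-Za-z]: the string must start with a non-letter.
starts_nonletter = bool(re.search(r"^[^A-Za-z]", '1 "Hello"'))

endings = ["The end.", "The end?", "The end!"]
# \.$ requires a literal period at the end.
ends_with_period = [s for s in endings if re.search(r"\.$", s)]
# .$ accepts any single character at the end.
ends_with_anything = [s for s in endings if re.search(r".$", s)]

# With re.MULTILINE, ^ matches at the start of each line.
lines = "Palo Alto\nmenlo park\nStanford"  # hypothetical text
line_starts = re.findall(r"^[A-Z]", lines, re.MULTILINE)

print(starts_upper, starts_nonletter)  # True True
print(ends_with_period)                # ['The end.']
print(ends_with_anything)              # ['The end.', 'The end?', 'The end!']
print(line_starts)                     # ['P', 'S']
```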


Two kinds of errors:

Type I (false positives): matching strings that we should not have matched

Type II (false negatives): not matching strings that we should have matched

Reducing these errors involves two antagonistic efforts:

Increasing accuracy or precision (minimizing false positives)

Increasing coverage or recall (minimizing false negatives)
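
To make the two error types concrete, here is a sketch that searches for the word "the" in a made-up sentence. The sentence is hypothetical, and the word-boundary token `\b` is a Python `re` feature not covered in the notes above; it is used here only as one possible way to tighten the pattern.

```python
import re

# Hypothetical sample text containing "The", "the", "other" and "theology".
text = "The other day the theology student saw the cat."

# Naive pattern: misses "The" (false negative) and also hits the "the"
# inside "other" and "theology" (false positives).
naive = re.findall(r"the", text)

# Allowing both cases removes the false negative (better recall),
# but the false positives remain.
case_fixed = re.findall(r"[tT]he", text)

# Requiring word boundaries removes the false positives (better precision).
strict = re.findall(r"\b[tT]he\b", text)

print(len(naive))       # 4
print(len(case_fixed))  # 5
print(strict)           # ['The', 'the', 'the']
```

Each refinement trades one kind of error against the other, so in practice a pattern is usually adjusted in several rounds like this.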


Word Tokenization





