How to extract word stems in MATLAB: English stemming algorithms - Lovins, Porter

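There are many ways to stem English words; in practice the task can draw on machine learning, data mining and other fields.

This article covers a few of the original algorithms that are easy to implement: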

Lovins (1968)

Porter (1980)

Porter2 (2000)

1. Lovins

The Lovins stemmer is the earliest of these implementations.

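1.1. Overview

The algorithm is built from the following components:

ending: a word suffix. There are 294 endings; the full list is in Appendix A.

condition: the condition under which an ending may be removed. Each ending is associated with one condition, and there are 29 conditions in total; see Appendix B.

transformation: a rule for recoding the final letters of the stem once the ending has been removed. There are 35 transformations; see Appendix C.

The algorithm has two steps:

Scan the ending list from longest to shortest and pick the first ending that matches the end of the word and whose condition is satisfied.

Apply the transformations to the remaining stem to bring its final letters into the proper form.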

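1.2. Example

Step 1

Take the word nationally. Scanning the ending list from longest to shortest, the first match is ationally B, in group .09.

Condition B is "Minimum stem length = 3": the part left after removing the ending must be at least 3 letters long.

Removing ationally from nationally leaves only n, so the condition fails.

Scanning continues and finds ionally A in group .07. Condition A is "No restrictions on stem".

So ionally is chosen as the ending.

Step 2

The stem of nationally is nat. No transformation applies to it, so it is output unchanged.

As another example, take sitting. Step 1 gives the stem sitt, and step 2 applies transformation 1 (remove one of a doubled letter), so the final output is sit.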
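The two steps above can be sketched in Python. This is a minimal illustration, not a full Lovins stemmer: it carries only a handful of the 294 endings, four of the 29 conditions and only transformation 1. The names `ENDINGS`, `CONDITIONS`, `strip_ending`, `undouble` and `lovins_sketch` are made up for this example, and the guard against stripping the whole word is an assumption of the sketch.

```python
# Illustrative subset only: the real tables are in Appendix A, B and C.
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ational": "B",
    "ings": "N",
    "ing": "N",
    "ally": "B",
    "ly": "B",
    "s": "W",
}


def _cond_n(stem):
    # N: minimum stem length 4 after s**, elsewhere 3
    if len(stem) >= 3 and stem[-3] == "s":
        return len(stem) >= 4
    return len(stem) >= 3


CONDITIONS = {
    "A": lambda stem: True,                            # no restrictions
    "B": lambda stem: len(stem) >= 3,                  # minimum stem length 3
    "N": _cond_n,
    "W": lambda stem: not stem.endswith(("s", "u")),   # not after s or u
}

DOUBLES = set("bdglmnprst")


def strip_ending(word):
    """Step 1: remove the longest ending whose condition holds."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if not stem:
                continue  # assumption of this sketch: never strip the whole word
            if CONDITIONS[ENDINGS[ending]](stem):
                return stem
    return word


def undouble(stem):
    """Step 2, transformation 1 only: remove one of a doubled final letter."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        return stem[:-1]
    return stem


def lovins_sketch(word):
    return undouble(strip_ending(word.lower()))


print(lovins_sketch("nationally"))  # nat
print(lovins_sketch("sitting"))     # sit
```

Keeping the endings and conditions in tables mirrors how the algorithm is specified, so extending the sketch is mostly a matter of filling in the remaining rows.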

1.Appendix.A List of endings

.11.

alistically B arizability A izationally B

.10.

antialness A arisations A arizations A entialness A

.09.

allically C antaneous A antiality A arisation A

arization A ationally B ativeness A eableness E

entations A entiality A entialize A entiation A

ionalness A istically A itousness A izability A

izational A

.08.

ableness A arizable A entation A entially A

eousness A ibleness A icalness A ionalism A

ionality A ionalize A iousness A izations A

lessness A

.07.

ability A aically A alistic B alities A

ariness E aristic A arizing A ateness A

atingly A ational B atively A ativism A

elihood E encible A entally A entials A

entiate A entness A fulness A ibility A

icalism A icalist A icality A icalize A

ication G icianry A ination A ingness A

ionally A isation A ishness A istical A

iteness A iveness A ivistic A ivities A

ization F izement A oidally A ousness A

.06.

aceous A acious B action G alness A

ancial A ancies A ancing B ariser A

arized A arizer A atable A ations B

atives A eature Z efully A encies A

encing A ential A enting C entist A

eously A ialist A iality A ialize A

ically A icance A icians A icists A

ifully A ionals A ionate D ioning A

ionist A iously A istics A izable E

lessly A nesses A oidism A

.05.

acies A acity A aging B aical A

alist A alism B ality A alize A

allic BB anced B ances B antic C

arial A aries A arily A arity B

arize A aroid A ately A ating I

ation B ative A ators A atory A

ature E early Y ehood A eless A

elity A ement A enced A ences A

eness E ening E ental A ented C

ently A fully A ially A icant A

ician A icide A icism A icist A

icity A idine I iedly A ihood A

inate A iness A ingly B inism J

inity CC ional A ioned A ished A

istic A ities A itous A ively A

ivity A izers F izing F oidal A

oides A otide A ously A

.04.

able A ably A ages B ally B

ance B ancy B ants B aric A

arly K ated I ates A atic B

ator A ealy Y edly E eful A

eity A ence A ency A ened E

enly E eous A hood A ials A

ians A ible A ibly A ical A

ides L iers A iful A ines M

ings N ions B ious A isms B

ists A itic H ized F izer F

less A lily A ness A ogen A

ward A wise A ying B yish A

.03.

acy A age B aic A als BB

ant B ars O ary F ata A

ate A eal Y ear Y ely E

ene E ent C ery E ese A

ful A ial A ian A ics A

ide L ied A ier A ies P

ily A ine M ing N ion Q

ish C ism B ist A ite AA

ity A ium A ive A ize F

oid A one R ous A

.02.

ae A al BB ar X as B

ed E en F es E ia A

ic A is A ly B on S

or T um U us V yl R

s' A 's A

.01.

a A e A i A o A

s W y B

1.Appendix.B List of conditions

A No restrictions on stem

B Minimum stem length = 3

C Minimum stem length = 4

D Minimum stem length = 5

E Do not remove ending after e

F Minimum stem length = 3 and do not remove ending after e

G Minimum stem length = 3 and remove ending only after f

H Remove ending only after t or ll

I Do not remove ending after o or e

J Do not remove ending after a or e

K Minimum stem length = 3 and remove ending only after l, i or u*e

L Do not remove ending after u, x or s, unless s follows o

M Do not remove ending after a, c, e or m

N Minimum stem length = 4 after s**, elsewhere = 3

O Remove ending only after l or i

P Do not remove ending after c

Q Minimum stem length = 3 and do not remove ending after l or n

R Remove ending only after n or r

S Remove ending only after dr or t, unless t follows t

T Remove ending only after s or t, unless t follows o

U Remove ending only after l, m, n or r

V Remove ending only after c

W Do not remove ending after s or u

X Remove ending only after l, i or u*e

Y Remove ending only after in

Z Do not remove ending after f

AA Remove ending only after d, f, ph, th, l, er, or, es or t

BB Minimum stem length = 3 and do not remove ending after met or ryst

CC Remove ending only after l

1.Appendix.C List of transformations

1 remove one of double b, d, g, l, m, n, p, r, s, t

2 iev -> ief

3 uct -> uc

4 umpt -> um

5 rpt -> rb

6 urs -> ur

7 istr -> ister

7a metr -> meter

8 olv -> olut

9 ul -> l except following a, o, i

10 bex -> bic

11 dex -> dic

12 pex -> pic

13 tex -> tic

14 ax -> ac

15 ex -> ec

16 ix -> ic

17 lux -> luc

18 uad -> uas

19 vad -> vas

20 cid -> cis

21 lid -> lis

22 erid -> eris

23 pand -> pans

24 end -> ens except following s

25 ond -> ons

26 lud -> lus

27 rud -> rus

28 her -> hes except following p, t

29 mit -> mis

30 ent -> ens except following m

31 ert -> ers

32 et -> es except following n

33 yt -> ys

34 yz -> ys

2. Porter

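2.1. Overview

Vowels and consonants

Vowels and consonants are defined slightly differently from the usual convention:

Vowel - A, E, I, O, U, and Y when it follows a consonant

Consonant - any letter other than A, E, I, O, U, and other than a Y that follows a consonant (so a Y at the start of a word or after a vowel is a consonant)

Grouping the letters of a word

A run of consecutive vowels is treated as a vowel group V and a run of consecutive consonants as a consonant group C, so any word can be written as alternating V and C groups. For example: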

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC

porter -> p/o/rt/e/r -> CVCVC

application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC

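apple -> a/ppl/e -> VCV

Putting this together, every word can be written in terms of VC groups as: $$ [C](VC)^m[V] $$

Here the square brackets mark optional parts and m is the number of times the VC group repeats. The parameter m (the measure) plays a role similar to the stem length in the Lovins conditions and is used in the tests that follow.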
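As a rough illustration of the grouping, the following Python sketch labels each letter as vowel or consonant using the definitions above, collapses runs into groups and counts the VC pairs. The function names `is_consonant`, `cv_pattern` and `measure` are chosen for this example only.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word):
    """Collapse runs of vowels/consonants into a string such as 'CVCVC'."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        label = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != label:
            pattern.append(label)
    return "".join(pattern)


def measure(word):
    """m: the number of VC groups in the word."""
    return cv_pattern(word).count("VC")


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, cv_pattern(w), measure(w))
```

Because the pattern strictly alternates between C and V, counting the occurrences of "VC" in it gives m directly.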

Rules

The Porter algorithm is driven by rules of the form:

(condition) S1 -> S2

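The condition is tested on the stem that remains after S1 is removed. Besides m, it can use other features:

m - the number of VC groups in the stem

* - a wildcard matching any sequence of characters, used in combination with a letter, v, d or o

Uppercase letter - a literal letter; for example, *S means the stem ends with S

v - a vowel; *v* means the stem contains a vowel

d - a double consonant; *d means the stem ends with two identical consonants

o - cvc; *o means the stem ends consonant-vowel-consonant, where the second c is not W, X or Y

S1 is the suffix of the word and S2 is the suffix that replaces it.

Unlike Lovins, which produces its output in a single pass, Porter passes a word through a chain of rules before producing the final stem.

For example, hopping first matches the rule (*v*) ING ->, which gives hopp.

The rule (*d and not (*L or *S or *Z)) -> single letter then turns hopp into hop.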
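The condition features and the hopping example can be sketched as small Python predicates. This is only an illustration under the definitions given above; the names `contains_vowel`, `ends_double_consonant`, `ends_cvc` and `strip_ing` are hypothetical, and `strip_ing` covers just the two rules used in the example rather than all of Step 1b.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with two identical consonants."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends cvc, where the last c is not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def strip_ing(word):
    """(*v*) ING ->, followed by the double-consonant cleanup rule."""
    if word.endswith("ing"):
        stem = word[:-3]
        if contains_vowel(stem):
            # (*d and not (*L or *S or *Z)) -> single letter
            if ends_double_consonant(stem) and stem[-1] not in "lsz":
                stem = stem[:-1]
            return stem
    return word


print(strip_ing("hopping"))  # hop
print(strip_ing("falling"))  # fall (double l is kept)
print(strip_ing("sing"))     # sing (the stem "s" has no vowel)
```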

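Procedure

The algorithm applies the rules from top to bottom. A few rules are special: when they fire, additional rules have to be processed.

Because there are many rules, they are grouped into steps. The grouping is mainly a logical one (although an implementation can also use it for optimization), and the algorithm simply runs the steps from first to last: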

do Step_1a

do Step_1b (if the second or third rule of Step 1b fires, some extra rules are applied)

do Step_1c

do Step_2

do Step_3

do Step_4

do Step_5a

do Step_5b

See the appendix for the details of each step.

2.2. Examples

2.Appendix Step 1a

SSES -> SS

IES -> I

SS -> SS

S ->
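A minimal Python sketch of Step 1a, assuming lowercase input. The rules have no condition, so the step reduces to replacing the longest matching suffix; the list `STEP_1A` and the function `step_1a` are names chosen for this example.

```python
# Ordered so that the first match is also the longest match.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
```

The SS -> SS rule looks like a no-op, but it stops the S -> rule from stripping the final s of words such as caress.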

2.Appendix Step 1b

(m>0) EED -> EE

(*v*) ED ->

(*v*) ING ->

If the second or third of the rules in Step 1b is successful, the following is done:

AT -> ATE

BL -> BLE

IZ -> IZE

(*d and not (*L or *S or *Z)) -> single letter

(m=1 and *o) -> E

2.Appendix Step 1c

(*v*) Y -> I

2.Appendix Step 2

(m>0) ATIONAL -> ATE

(m>0) TIONAL -> TION

(m>0) ENCI -> ENCE

(m>0) ANCI -> ANCE

(m>0) IZER -> IZE

(m>0) ABLI -> ABLE

(m>0) ALLI -> AL

(m>0) ENTLI -> ENT

(m>0) ELI -> E

(m>0) OUSLI -> OUS

(m>0) IZATION -> IZE

(m>0) ATION -> ATE

(m>0) ATOR -> ATE

(m>0) ALISM -> AL

(m>0) IVENESS -> IVE

(m>0) FULNESS -> FUL

(m>0) OUSNESS -> OUS

(m>0) ALITI -> AL

(m>0) IVITI -> IVE

(m>0) BILITI -> BLE
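The Step 2 rules lend themselves to a table-driven sketch in Python. The version below assumes lowercase input and assumes that only the longest matching suffix is considered, so that if its condition fails no shorter suffix is tried. `STEP_2`, `measure` and `step_2` are illustrative names, not part of any library.

```python
STEP_2 = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def measure(stem):
    """m: count VC groups; y is a vowel only after a consonant."""
    labels = []
    for ch in stem:
        if ch in "aeiou":
            vowel = True
        elif ch == "y":
            vowel = bool(labels) and labels[-1] == "C"
        else:
            vowel = False
        labels.append("V" if vowel else "C")
    # Each vowel-to-consonant transition closes one VC group.
    return sum(1 for a, b in zip(labels, labels[1:]) if a == "V" and b == "C")


def step_2(word):
    for suffix in sorted(STEP_2, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + STEP_2[suffix]
            return word  # longest match failed its condition: stop here
    return word


for w in ("relational", "conditional", "rational", "digitizer"):
    print(w, "->", step_2(w))
```

Steps 3 and 4 have the same shape, so the same function could be reused with a different table and threshold.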

2.Appendix Step 3

(m>0) ICATE -> IC

(m>0) ATIVE ->

(m>0) ALIZE -> AL

(m>0) ICITI -> IC

(m>0) ICAL -> IC

(m>0) FUL ->

(m>0) NESS ->

2.Appendix Step 4

(m>1) AL ->

(m>1) ANCE ->

(m>1) ENCE ->

(m>1) ER ->

(m>1) IC ->

(m>1) ABLE ->

(m>1) IBLE ->

(m>1) ANT ->

(m>1) EMENT ->

(m>1) MENT ->

(m>1) ENT ->

(m>1 and (*S or *T)) ION ->

(m>1) OU ->

(m>1) ISM ->

(m>1) ATE ->

(m>1) ITI ->

(m>1) OUS ->

(m>1) IVE ->

(m>1) IZE ->

2.Appendix Step 5a

(m>1) E ->

(m=1 and not *o) E ->

2.Appendix Step 5b

(m > 1 and *d and *L) -> single letter
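Step 5 combines the measure with the *o and *d tests, which the following Python sketch illustrates for lowercase input. The helper names `_labels`, `measure`, `ends_cvc`, `step_5a` and `step_5b` are made up for this example.

```python
def _labels(stem):
    """Per-letter 'C'/'V' labels; y is a vowel only after a consonant."""
    labels = []
    for ch in stem:
        if ch in "aeiou":
            vowel = True
        elif ch == "y":
            vowel = bool(labels) and labels[-1] == "C"
        else:
            vowel = False
        labels.append("V" if vowel else "C")
    return labels


def measure(stem):
    labels = _labels(stem)
    return sum(1 for a, b in zip(labels, labels[1:]) if a == "V" and b == "C")


def ends_cvc(stem):
    """*o: ends consonant-vowel-consonant, last consonant not w, x or y."""
    return _labels(stem)[-3:] == ["C", "V", "C"] and stem[-1] not in "wxy"


def step_5a(word):
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def step_5b(word):
    # (m > 1 and *d and *L) -> single letter
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word


for w in ("probate", "rate", "cease"):
    print(w, "->", step_5a(w))
for w in ("controll", "roll"):
    print(w, "->", step_5b(w))
```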
