Word Normalization and Stemming

weixin_34306446

Original link: http://www.cnblogs.com/chuanlong/archive/2013/04/01/2992785.html

Well, today I learned about word normalization and stemming.

After word tokenization, we should normalize the tokens, for example by stemming, so that different forms of a word map to one normal form. For example, "are" and "is" should map to "be", and "windows" should map to "window". We can then use Linux command-line tools to try this out.
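To make the idea of a "normal form" concrete, here is a toy sketch in shell. The function name `normalize_toy` and both rules are my own assumptions for illustration: a tiny hand-written table for forms of "be", plus a naive rule that strips a final "s" from longer words. It is not a real stemmer and it will get many words wrong (for example, "always").

```shell
# Toy normalizer: expects one lowercase word per line on stdin.
# Rule 1: map a few irregular forms of "be" to "be".
# Rule 2: strip a trailing plural "s" from words of 5+ letters,
#         unless the word ends in "ss" (so "glass" is left alone).
normalize_toy() {
  sed -E \
    -e 's/^(am|are|is|was|were)$/be/' \
    -e 's/^([a-z]{3,}[^s])s$/\1/'
}

printf 'are\nis\nwindows\nglass\n' | normalize_toy
```

With this input, the sketch prints "be", "be", "window" and "glass", one per line.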

First, split the text so that every word is on its own line, and display the result.
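A minimal sketch of this step, assuming `tr` is the tool used (the exact command is not shown in this post) and using a sample sentence of my own. The wrapper name `words_per_line` is only for illustration.

```shell
# Put each word on its own line.
# -c complements the set, so it selects everything that is NOT a letter;
# -s squeezes each run of those characters into a single newline.
words_per_line() {
  tr -sc 'A-Za-z' '\n'
}

printf 'To be, or not to be: that is the question.\n' | words_per_line
```

On a real corpus you would redirect a file into the function instead of using `printf`.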

Next, translate the capital letters to lowercase.
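Case folding can be sketched with `tr` as well. The function name `to_lowercase` and the sample text are assumptions for illustration, and the ranges only cover ASCII letters.

```shell
# Map every ASCII uppercase letter to its lowercase counterpart.
to_lowercase() {
  tr 'A-Z' 'a-z'
}

printf 'The Windows are OPEN\n' | to_lowercase
```

After this step, "The" and "the" are counted as the same word.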

grep, whose name stands for "globally search for a regular expression and print the matching lines", lets you filter the word list with a regular expression.
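As a small sketch of filtering with grep, the pattern below keeps only the words that end in "ing". The function name `ending_in_ing` and the word list are made up for illustration.

```shell
# Keep only the lines (words) that end in "ing".
# The "$" anchors the match at the end of the line.
ending_in_ing() {
  grep 'ing$'
}

printf 'walking\nking\nwindow\nsinging\n' | ending_in_ing
```

Note that "king" passes this filter too, although it is not a verb form.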

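Then, you should sort the words before counting them with uniq -c, because uniq only merges adjacent identical lines. Afterwards, you can sort by count with sort -n; the order is ascending by default unless you add '-r'.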
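The counting step might look like the sketch below. The function name `count_words` and the input are assumptions; the input is expected to have one word per line already.

```shell
# Count how often each word occurs, most frequent first.
# sort       groups identical words so that they are adjacent,
# uniq -c    collapses each group and prefixes it with its count,
# sort -n -r orders the lines numerically by count, largest first.
count_words() {
  sort | uniq -c | sort -n -r
}

printf 'to\nbe\nor\nnot\nto\nbe\nto\n' | count_words
```

Here "to" (3 occurrences) comes first and "be" (2) second. Dropping `-r` gives the default ascending order.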

Note: when viewing the output in a pager such as less, you can press 'G' to go to the end and 'g' to go to the beginning.

You know, some words that end in "ing" are not the ones we want to find, so we can modify the regular expression to leave them out.
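The modified pattern is not shown in this post, so the sketch below is only one possible refinement, and an assumption on my part: require a vowel somewhere before the final "ing". That drops short words such as "king" and "thing". The function name `ing_with_vowel` is hypothetical.

```shell
# Keep words ending in "ing" only if a vowel appears earlier in the word.
# "[aeiou].*" must match before the final "ing", so "king" is rejected
# (its only vowel is the "i" of "ing" itself).
ing_with_vowel() {
  grep '[aeiou].*ing$'
}

printf 'walking\nking\nthing\nsinging\n' | ing_with_vowel
```

With this input, only "walking" and "singing" are printed. Words like "nothing" still slip through.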

Well, the result is better, even though some unwanted words are still included. There is still a long road ahead of me.

In conclusion, we should first segment the text into words, and then normalize those words.

Reposted from: https://www.cnblogs.com/chuanlong/archive/2013/04/01/2992785.html
