Word Tokenization

Well, after listening to the class, it's worth taking some notes.

In this class, the professor showed us how to count the words in various corpora using Linux programs, and explained that different languages pose different tokenization problems. For example, word segmentation is a crucial step for Chinese, and maximum matching is a relatively good algorithm for Chinese, though not for English.
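As a sketch of the maximum matching idea: scan left to right and, at each position, greedily take the longest string that appears in the dictionary (falling back to a single character when nothing matches). The function name and toy vocabulary below are my own illustration, not from the lecture.

```python
def max_match(text, dictionary):
    """Greedy maximum matching: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy example (hypothetical dictionary):
vocab = {"他", "特别", "喜欢", "北京烤鸭", "北京", "烤鸭"}
print(max_match("他特别喜欢北京烤鸭", vocab))
# → ['他', '特别', '喜欢', '北京烤鸭']
```

Note how the greedy rule prefers "北京烤鸭" over splitting it into "北京" + "烤鸭"; this longest-first bias is exactly why the algorithm works poorly for English, where short words dominate.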

Some Linux programs

less a.txt : displays a.txt one screen at a time.

 

tr -sc 'A-Za-z' '\n' < a.txt | less : replaces every run of non-letter characters (not just periods and commas) with a newline, so the file is displayed one word per line.

If you want to sort the words, just append "| sort".

If you want to count each unique word, append "| uniq -c" (it collapses adjacent duplicate lines, which is why the input must be sorted first).

If you want to rank words by frequency, append "| sort -n -r": -n sorts numerically by the count, and -r reverses the order so the most frequent words come first.
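The full pipeline is therefore `tr -sc 'A-Za-z' '\n' < a.txt | sort | uniq -c | sort -n -r`. A small Python sketch of the same computation (the sample text stands in for a.txt; it is my own example, not from the lecture):

```python
import re
from collections import Counter

text = "The cat sat. The cat ran!"  # stands in for the contents of a.txt

# Like tr -sc 'A-Za-z' '\n': keep only maximal runs of letters, one token each.
tokens = re.findall(r"[A-Za-z]+", text)

# Like sort | uniq -c: count how many times each word occurs.
counts = Counter(tokens)

# Like sort -n -r: print words by decreasing frequency.
for word, n in counts.most_common():
    print(n, word)
```

As in the shell pipeline, this is case-sensitive, so "The" and "the" would be counted as different words; lowercasing first is a common extra step.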

 

Reposted from: https://www.cnblogs.com/chuanlong/archive/2013/03/31/2991846.html
