OK, I ran the script I wrote for this SO question, with a few small changes: it now uses log probabilities to avoid underflow, and it reads multiple files as the corpus.
For my corpus I downloaded a bunch of files from Project Gutenberg, with no real method: I just grabbed all the English-language files from etext00, etext01 and etext02.
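As a rough illustration of reading several files as one corpus, here is a minimal sketch that pools word counts across files. It is not the original script: the function name `count_words` and the tokenization rule (lowercase runs of a-z) are assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path

def count_words(paths):
    """Return a Counter of lowercase alphabetic words pooled over all files."""
    counts = Counter()
    for path in paths:
        # errors="ignore" skips undecodable bytes, which old plain-text
        # files sometimes contain (an assumption about the input).
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts
```

It could be called with something like `count_words(Path("corpus").glob("*.txt"))`, where `corpus` is a hypothetical directory holding the downloaded files.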
Here are the results; I saved the top three for each combination.
expertsexchange: 97 possibilities
- experts exchange -23.71
- expert sex change -31.46
- experts ex change -33.86
penisland: 11 possibilities
- pen island -20.54
- penis land -22.64
- pen is land -25.06
choosespain: 28 possibilities
- choose spain -21.17
- chooses pain -23.06
- choose spa in -29.41
kidsexpress: 15 possibilities
- kids express -23.56
- kid sex press -32.65
- kids ex press -34.98
childrenswear: 34 possibilities
- children swear -19.85
- childrens wear -25.26
- child ren swear -32.70
dicksonweb: 8 possibilities
- dickson web -27.09
- dick son web -30.51
- dicks on web -33.63
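
For anyone who wants to see roughly how rankings like these can be produced, here is a minimal sketch, assuming a plain unigram model: enumerate every split whose pieces are all known words, then score each split by the sum of the log probabilities of its words. Summing logs instead of multiplying raw probabilities is what avoids underflow. The function names and the toy counts are invented for the example; the scores above came from the actual script and corpus, so this will not reproduce those numbers.

```python
import math
from collections import Counter

def segmentations(text, vocab, max_len=20):
    """Yield every way to split text into words that all appear in vocab."""
    if not text:
        yield []
        return
    for i in range(1, min(len(text), max_len) + 1):
        head = text[:i]
        if head in vocab:
            for rest in segmentations(text[i:], vocab, max_len):
                yield [head] + rest

def log_score(words, counts, total):
    """Sum of log unigram probabilities (natural log) for one split."""
    return sum(math.log(counts[w] / total) for w in words)

def top_splits(text, counts, n=3):
    """Return (number of possible splits, the n best splits by log score)."""
    total = sum(counts.values())
    vocab = {w for w, c in counts.items() if c > 0}
    candidates = list(segmentations(text, vocab))
    candidates.sort(key=lambda ws: log_score(ws, counts, total), reverse=True)
    return len(candidates), candidates[:n]

# Toy counts, made up for illustration only.
toy_counts = Counter({"pen": 50, "island": 20, "penis": 2,
                      "land": 30, "is": 30, "the": 868})
n_possible, best = top_splits("penisland", toy_counts)
for words in best:
    print("-", " ".join(words))
```

The recursive enumeration is exponential in the worst case, which is fine for short domain-style strings like these; a longer input would call for memoization or a Viterbi-style dynamic program.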