1. Download a word list from Sogou or another Chinese dictionary site (a disease dictionary is used as the example here).
2. Convert it with the 奥创 dictionary converter (奥创字库转换) and save the result as words.txt, the file name that word.php reads.
Attachment: word.php source code:
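<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
date_default_timezone_set('Asia/Shanghai');
set_time_limit(0);

// Turn off output buffering so progress messages appear immediately.
$buffer = ini_get('output_buffering');
if ($buffer && ob_get_level() > 0) {
    ob_end_flush();
}

echo 'Processing the new word list...' . PHP_EOL;
flush();
// words.txt must be UTF-8; the /u pattern in dealChinese() expects UTF-8 input.
$filename = "words.txt";
$content = file_get_contents($filename);
if ($content === false) {
    exit("Cannot read $filename" . PHP_EOL);
}
// Split on any line ending, not only "\r\n".
$arr1 = preg_split('/\r\n|\r|\n/', trim($content));
// Keep only the Chinese characters of each entry; drop entries that become empty.
foreach ($arr1 as $key => $value) {
    $value = dealChinese($value);
    if (!empty($value)) {
        $arr1[$key] = $value;
    } else {
        unset($arr1[$key]);
    }
}
// Deduplicate after cleaning, because cleaning can make two entries identical.
$arr1 = array_unique($arr1);

echo 'Processing the existing dictionary...' . PHP_EOL;
flush();
$filename2 = "unigram.txt";
$content2 = file_get_contents($filename2);
if ($content2 === false) {
    exit("Cannot read $filename2" . PHP_EOL);
}
// Extract every run of Chinese characters from the existing dictionary, one per element.
$content2 = dealChinese($content2, "\r\n");
$arr2 = explode("\r\n", $content2);

echo 'Removing entries that already exist...' . PHP_EOL;
flush();
$array_diff = array_diff($arr1, $arr2);

echo 'Formatting the dictionary...' . PHP_EOL;
flush();
// Each entry is written as "word<TAB>1" followed by an "x:1" line.
$words = '';
foreach ($array_diff as $word) {
    $words .= $word . "\t1" . PHP_EOL . "x:1" . PHP_EOL;
}
//echo $words;
// FILE_APPEND adds to words_new.txt on every run; delete the file first to start fresh.
file_put_contents('words_new.txt', $words, FILE_APPEND);
echo 'done!';

function dealChinese($str, $join = '') {
    // Match every run of Chinese characters (U+4E00 to U+9FFF).
    preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches);
    // Join the matched runs back together with $join.
    return join($join, $matches[0]);
}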
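To see what the dealChinese() filter in the listing above does to individual entries, here is a minimal standalone sketch. The sample strings are made up for illustration and are not taken from a real dictionary, and the file is assumed to be saved as UTF-8 so that the /u pattern matches.

```php
<?php
// Same filter as in word.php: keep only runs of characters in U+4E00 to U+9FFF.
function dealChinese($str, $join = '') {
    preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches);
    return join($join, $matches[0]);
}

// Hypothetical sample entries (assumption, for illustration only).
$samples = ["感冒 ganmao", "ABC", "高血压(1级)"];

foreach ($samples as $s) {
    $clean = dealChinese($s);
    if ($clean === '') {
        // Entries without any Chinese characters are dropped, as in word.php.
        echo "skipped: $s" . PHP_EOL;
        continue;
    }
    // Same two-line layout that word.php writes for each entry.
    echo $clean . "\t1" . PHP_EOL . "x:1" . PHP_EOL;
}
```

Note that the third sample becomes 高血压级: the filter removes the non-Chinese characters and glues the remaining runs together, so entries containing punctuation or digits may need a manual check.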
3. Run word.php (it writes its output to words_new.txt) to generate the final dictionary, then place it in /usr/local/mmseg3/etc under the name unigram.txt.
4. Change into the mmseg3/etc directory and build the dictionary by running: /usr/local/mmseg3/bin/mmseg -u unigram.txt
5. Rebuild the index (aaa is the index name here): /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf aaa --rotate
6. Run /usr/local/coreseek/bin/search -c /usr/local/coreseek/etc/csft.conf 发烧 (发烧 means "fever"). If the search returns matches, the dictionary has been built successfully.
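Before running mmseg -u, it can help to confirm that the dictionary file really follows the two-line layout that word.php writes (word, tab, number, then an x: line). The following is a rough sketch under that assumption only; checkUnigram is a hypothetical helper, not part of mmseg or Coreseek, and the sample text is invented.

```php
<?php
// Hypothetical helper: returns the 1-based line numbers that break the
// "word<TAB>number" / "x:number" pairing written by word.php.
function checkUnigram(string $text): array {
    $lines = preg_split('/\r\n|\r|\n/', rtrim($text));
    $bad = [];
    foreach ($lines as $i => $line) {
        // Even indexes hold the entry line, odd indexes hold the "x:" line.
        $pattern = ($i % 2 === 0) ? '/^[^\t]+\t\d+$/u' : '/^x:\d+$/';
        if (!preg_match($pattern, $line)) {
            $bad[] = $i + 1;
        }
    }
    // An odd line count means the last entry has no "x:" line after it.
    if (count($lines) % 2 !== 0) {
        $bad[] = count($lines) + 1;
    }
    return $bad;
}

// Invented sample: the second entry is missing its tab and frequency.
$sample = "发烧\t1\nx:1\n感冒\nx:1\n";
print_r(checkUnigram($sample));
```

For a real file, the text could be loaded with file_get_contents('unigram.txt') and passed to the helper; an empty result means every line matched the expected layout.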