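I recently used SCWS for Chinese word segmentation. The one thing that took some figuring out was adding a custom dictionary, so here is a summary.
The dictionary format is: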
# WORD TF IDF ATTR
学五 14.01 5.92 n
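To make the four-field layout concrete, here is a minimal sketch that parses and builds lines in that format. The helper names `parse_dict_line` and `format_dict_line` are hypothetical (they are not part of SCWS), and the sketch assumes that lines starting with `#` are comments and that fields are separated by spaces or tabs.

```php
<?php
// Hypothetical helper (not part of SCWS): parse one "WORD TF IDF ATTR" line.
// Returns null for blank lines, for lines starting with '#' (assumed to be
// comments), and for lines that do not have exactly four valid fields.
function parse_dict_line(string $line): ?array
{
    $line = trim($line);
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    // Assumption: fields are separated by runs of spaces or tabs.
    $fields = preg_split('/[ \t]+/', $line);
    if (count($fields) !== 4 || !is_numeric($fields[1]) || !is_numeric($fields[2])) {
        return null;
    }
    return [
        'word' => $fields[0],
        'tf'   => (float) $fields[1],
        'idf'  => (float) $fields[2],
        'attr' => $fields[3],
    ];
}

// Hypothetical helper: build a tab-separated dictionary line from its fields.
function format_dict_line(string $word, float $tf, float $idf, string $attr = 'n'): string
{
    return sprintf("%s\t%.2f\t%.2f\t%s", $word, $tf, $idf, $attr);
}
```

Running every line of a generated dictionary through `parse_dict_line` before loading it is a cheap way to catch entries where the TF or IDF lookup failed.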
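The TF and IDF values for a word can be looked up at: http://www.xunsearch.com/scws/demo/get_tfidf.php
I wrote a small script for this myself: put the words to look up into a txt file, one per line, and it queries them in a batch.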
<?php
// Helper libraries from the author's own project (not shown in this post).
require_once __DIR__ . '/func/my_curl_function.php';
require_once __DIR__ . '/func/simple_html_dom.php';

$url = 'http://www.xunsearch.com/scws/demo/get_tfidf.php';
$header = array(
    "Referer:http://www.xunsearch.com/scws/demo/get_tfidf.php",
    "Cookie:PHPSESSID=1fuk5j3ckb7n55s5j4cltk2sd3"
);

$f = fopen('tfidf.txt', 'r');   // input: one word per line
$html = new simple_html_dom();  // only needed for the commented-out DOM parsing below

while (!feof($f)) {
    $contents = trim(fgets($f));
    if ($contents == "") {
        continue;               // skip blank lines
    }
    sleep(1);                   // throttle requests to the demo site

    $data['data'] = urlencode($contents);
    $returndata = my_curl_post($url, $data, $header);
    //var_dump($returndata);

    // Extract TF and IDF from the response with a regular expression
    if (!preg_match('/WORD=.*? TF=(.*?) IDF=(.*?)<br \/>/', $returndata, $ret)) {
        continue;               // no match: skip this word instead of writing empty values
    }
    $tf = $ret[1];
    $idf = $ret[2];

    // One dictionary line: WORD <tab> TF <tab> IDF <tab> n
    $line = $contents . "\t" . $tf . "\t" . $idf . "\tn\r\n";
    echo "\n";
    echo $line;
    //$html->load($returndata);
    //$ps=$html->find('p');
    file_put_contents("tfidf.out", $line, FILE_APPEND);
}
fclose($f);
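The extraction step can also be pulled out into a small function, which makes it possible to check the regular expression without hitting the demo site. This is only a sketch: `extract_tfidf` is a hypothetical name, and the response layout (`WORD=... TF=... IDF=...<br />`) is inferred from the regular expression in the script, not from any documentation of the service.

```php
<?php
// Hypothetical helper: pull TF and IDF out of the demo page's response.
// Returns null when the expected "WORD=... TF=... IDF=...<br />" text is absent.
function extract_tfidf(string $response): ?array
{
    $pattern = '/WORD=.*? TF=(.*?) IDF=(.*?)<br \/>/';
    if (preg_match($pattern, $response, $m) !== 1) {
        return null;
    }
    // Values are kept as strings so they are written out exactly as received.
    return ['tf' => trim($m[1]), 'idf' => trim($m[2])];
}

// Usage with a made-up response string, for illustration only.
$sample = 'WORD=学五 TF=14.01 IDF=5.92<br />';
$result = extract_tfidf($sample);
if ($result !== null) {
    echo $result['tf'] . "\t" . $result['idf'] . "\n";
}
```

Returning null on a failed match lets the caller decide whether to skip the word or retry the request.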
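Once your own dictionary has been generated, it can be added.
All it really takes is one extra call: $so->add_dict('path', dictionary type); where the second argument is a constant such as SCWS_XDICT_TXT.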
function scws_text($string)
{
    $so = scws_new();
    $so->set_charset('utf8');
    $so->set_ignore(true);      // drop punctuation from the results

    // Stock dictionaries first, then the custom text dictionary
    $so->add_dict('/usr/local/scws/etc/dict.xdb', SCWS_XDICT_XDB);
    $so->add_dict('/usr/local/scws/etc/dict.utf8.xdb', SCWS_XDICT_XDB);
    $so->add_dict('/usr/local/scws/etc/mydict.txt', SCWS_XDICT_TXT);

    $so->send_text($string);

    // Collect the segmented words, separated by single spaces
    $text = "";
    while ($tmp = $so->get_result()) {
        foreach ($tmp as $key => $value) {
            $text .= $value['word'] . " ";
        }
    }
    $so->close();
    return $text;
}
With that added, the custom dictionary is ready to use.
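Since scws_text() returns the words joined by spaces with a trailing space, it can be handy to turn that string back into an array. A minimal sketch, where `words_from_segmented` is a hypothetical helper and the sample input is made up:

```php
<?php
// Hypothetical helper: split the space-separated output of scws_text()
// into an array of words, ignoring the trailing space and empty pieces.
function words_from_segmented(string $text): array
{
    return preg_split('/ +/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Usage with a made-up segmented string, for illustration only.
$words = words_from_segmented("中文 分词 词典 ");
echo count($words) . "\n";
```

If a word from the custom dictionary shows up as a single element of the array rather than being split into single characters, the dictionary was picked up.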
This article is reposted from 努力的C's 51CTO blog. Original link: http://blog.51cto.com/fulin0532/1952455