PHP: extracting relevant tags/keywords from a block of text

Latest recommended article published 2023-07-23 18:07:09

weixin_39748928


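A very naive approach is to strip common stopwords from the text, leaving the more meaningful words such as "Standards" and "JSON". You will still get a lot of noise, though, so you might consider a service like OpenCalais, which performs a fairly sophisticated analysis of your text.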

Update:

OK, the link in my earlier answer pointed to an implementation, but you asked for one, so here is a simple one:

function stopWords($text, $stopwords) {
    // Trim line breaks and surrounding spaces from the stopwords and lowercase them
    $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

    // Replace every digit and non-word char with a comma
    $pattern = '/[0-9\W]/';
    $text = preg_replace($pattern, ',', $text);

    // Create an array from $text
    $text_array = explode(",", $text);

    // Remove whitespace and lowercase the words in $text
    $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

    // Keep every term that is not a stopword
    $keywords = array();
    foreach ($text_array as $term) {
        if (!in_array($term, $stopwords)) {
            $keywords[] = $term;
        }
    }

    // Drop the empty strings left behind by consecutive separators
    return array_filter($keywords);
}

// stop_words.txt holds one stopword per line
$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this, along with the contents of stop_words.txt, in this Gist.

Running the above on the sample text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)
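
The gaps in the keys of this array (0, 4, 6, ...) arise because array_filter drops the empty strings but preserves the original keys. If a consecutively indexed list is preferred, array_values can reindex the result. A minimal sketch, using a made-up sample array rather than the real intermediate data:

```php
<?php
// Sample data (an assumption for illustration): empty strings are what
// consecutive separators leave behind after the explode() step.
$keywords = ["requirements", "", "", "linux", "", "apache"];

// array_filter() removes the empty strings but keeps the original keys.
$filtered = array_filter($keywords);
print_r(array_keys($filtered));

// array_values() reindexes the surviving entries starting from zero.
$reindexed = array_values($filtered);
print_r($reindexed);
```

Whether to reindex depends on the caller: the original keys are harmless for iteration, but they matter if the result is encoded as JSON, where a non-sequential array becomes an object.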

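So, as I said, this is somewhat naive and could use more optimization (it is also slow), but it does pull the more relevant keywords out of the text. You would also need to fine-tune the stopword list. Capturing terms like Web 2.0 will be very difficult, so I think you are better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this service to gather information from documents.

Also, for a client-side implementation, you could do almost the same thing in JavaScript, and it would probably be cleaner (although it might be slow on the client).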
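
One possible way to address the speed issue, offered as a rough sketch rather than a definitive implementation: in_array scans the whole stopword list for every term, so flipping the list into an array keyed by word lets isset do a hash lookup instead. The function name fastStopWords and the short inline stopword list are assumptions made for illustration; neither appears in the original answer.

```php
<?php
// Hypothetical faster variant of stopWords(); the name is illustrative only.
function fastStopWords(string $text, array $stopwords): array
{
    // Normalize the stopwords, then flip them into a word => index map
    // so membership can be checked with isset() instead of in_array().
    $lookup = array_flip(array_map(
        function ($x) { return strtolower(trim($x)); },
        $stopwords
    ));

    // Split on runs of digits and non-word characters, dropping empty pieces.
    $terms = preg_split('/[0-9\W]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $keywords = [];
    foreach ($terms as $term) {
        if (!isset($lookup[$term])) {
            $keywords[] = $term;
        }
    }
    return $keywords;
}

// Tiny inline stopword list, assumed here instead of reading stop_words.txt.
$stopwords = ["with", "on", "of", "and"];
print_r(fastStopWords("Comfortable with JSON and PHP 5", $stopwords));
```

Unlike the original function, this variant returns a consecutively indexed list, because preg_split with PREG_SPLIT_NO_EMPTY never produces the empty strings that array_filter had to remove.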
