aho-corasick php: an Aho-Corasick multi-keyword string search library in PHP

Aho Corasick in PHP

This is a small library which implements the Aho-Corasick string search algorithm.

It's coded in pure PHP and self-contained in a single file, ahocorasick.php.

It's useful when you want to search for many keywords all at once. It's faster than simply calling strpos many times, and it's much faster than calling preg_match_all with something like /keyword1|keyword2|...|keywordn/.
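For reference, the "calling strpos many times" baseline could look roughly like the sketch below. The function name find_all_with_strpos is made up for illustration and is not part of this library. It returns [keyword, position] pairs, scanning the whole text once per keyword, which is the cost that grows as the keyword list gets longer.

```php
<?php
// Hypothetical baseline (not part of ahocorasick.php):
// scan the text once per keyword with strpos().
function find_all_with_strpos(string $haystack, array $needles): array
{
    $found = [];
    foreach ($needles as $needle) {
        if ($needle === '') {
            continue; // skip empty needles, they would match everywhere
        }
        $offset = 0;
        // Step one character past each hit so overlapping occurrences are found too.
        while (($pos = strpos($haystack, $needle, $offset)) !== false) {
            $found[] = [$needle, $pos];
            $offset = $pos + 1;
        }
    }
    return $found;
}

$hits = find_all_with_strpos('a carted mart lot one blue ted', ['art', 'cart', 'ted']);
// $hits is grouped by keyword: art@3, art@10, cart@2, ted@5, ted@27
```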

I originally wrote this to use with F5Bot, since it's searching for the same set of a few thousand keywords over and over again.

Usage

It's designed to be really easy to use. You create the ahocorasick object, add your keywords, call finalize() to finish setup, and then search your text. It'll return an array of the keywords found and their positions in the search text.

Create, add keywords, and finalize():

    require('ahocorasick.php');

    $ac = new ahocorasick();
    $ac->add_needle('art');
    $ac->add_needle('cart');
    $ac->add_needle('ted');
    $ac->finalize();

Call search() to perform the actual search. It'll return an array of matches.

    $found = $ac->search('a carted mart lot one blue ted');
    print_r($found);

$found will be an array with these elements:

    [0] => Array
        (
            [0] => cart
            [1] => 2
        )
    [1] => Array
        (
            [0] => art
            [1] => 3
        )
    [2] => Array
        (
            [0] => ted
            [1] => 5
        )
    [3] => Array
        (
            [0] => art
            [1] => 10
        )
    [4] => Array
        (
            [0] => ted
            [1] => 27
        )
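
Since each match is a two-element array of keyword and position, the result is easy to post-process. As a small sketch, the helper below groups positions by keyword. The name group_positions is invented for this example and is not part of the library, and $found is hard-coded to the output shown above so the snippet runs on its own.

```php
<?php
// Hypothetical helper, not part of ahocorasick.php.
// Turns [[keyword, position], ...] into [keyword => [positions...]].
function group_positions(array $found): array
{
    $byKeyword = [];
    foreach ($found as [$keyword, $position]) {
        $byKeyword[$keyword][] = $position;
    }
    return $byKeyword;
}

// Same data as the search() result listed above.
$found = [['cart', 2], ['art', 3], ['ted', 5], ['art', 10], ['ted', 27]];
$grouped = group_positions($found);
// $grouped: ['cart' => [2], 'art' => [3, 10], 'ted' => [5, 27]]
```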

See example.php for a complete example.

Speed

A simple benchmarking program is included which compares various alternatives.

    $ php benchmark.php
    Loaded 3000 keywords to search on a text of 19377 characters.

    Searching with strpos...
    time: 0.38440799713135

    Searching with preg_match...
    time: 5.6817619800568

    Searching with preg_match_all...
    time: 5.0735609531403

    Searching with aho corasick...
    time: 0.054709911346436

Note: the regex solutions are actually slightly broken. They won't work if you have a keyword that is a prefix or suffix of another. But hey, who really uses regex when it's not slightly broken?
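
A quick sketch of that breakage, using the needles from the usage example: preg_match_all() only reports non-overlapping matches, so once the alternation has consumed "cart", anything that overlaps it is never seen.

```php
<?php
$text = 'a carted mart lot one blue ted';

// PREG_OFFSET_CAPTURE makes each match a [matched text, byte offset] pair.
preg_match_all('/art|cart|ted/', $text, $m, PREG_OFFSET_CAPTURE);
$pairs = $m[0];

// $pairs holds cart@2, art@10, ted@27.
// The overlapping art@3 and ted@5 are missing.
```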

Also keep in mind that building the search tree (the add_needle() and finalize() calls) takes time. So you'll get the best speed-up if you're reusing the same keywords and calling search() many times.
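
To give a feel for where that setup time goes, here is a minimal sketch of the textbook Aho-Corasick algorithm. It is not the code from ahocorasick.php: the class name TinyAhoCorasick and all of its internals are assumptions made for illustration, and only the three method names mirror the usage above. add_needle() grows a trie, finalize() does a breadth-first pass to compute failure links, and search() then walks the text once.

```php
<?php
// Illustrative sketch only, NOT the library's implementation.
class TinyAhoCorasick
{
    private array $next = [[]]; // $next[state][char] => child state (the trie)
    private array $fail = [0];  // failure link of each state
    private array $out  = [[]]; // needles that end at each state

    // Setup, part 1: insert the needle into the trie, one state per new character.
    public function add_needle(string $needle): void
    {
        if ($needle === '') {
            return;
        }
        $state = 0;
        foreach (str_split($needle) as $ch) {
            if (!isset($this->next[$state][$ch])) {
                $this->next[] = [];
                $this->fail[] = 0;
                $this->out[]  = [];
                $this->next[$state][$ch] = count($this->next) - 1;
            }
            $state = $this->next[$state][$ch];
        }
        $this->out[$state][] = $needle;
    }

    // Setup, part 2: breadth-first pass that computes failure links. Call once.
    public function finalize(): void
    {
        $queue = array_values($this->next[0]); // depth-1 states fail to the root
        for ($i = 0; $i < count($queue); $i++) {
            $state = $queue[$i];
            foreach ($this->next[$state] as $ch => $child) {
                $ch = (string) $ch; // PHP turns numeric string keys into ints
                // Follow failure links until a state has an edge for $ch.
                $f = $this->fail[$state];
                while ($f !== 0 && !isset($this->next[$f][$ch])) {
                    $f = $this->fail[$f];
                }
                $this->fail[$child] = $this->next[$f][$ch] ?? 0;
                // Inherit needles that end at the fallback state (shorter suffixes).
                $this->out[$child] = array_merge($this->out[$child], $this->out[$this->fail[$child]]);
                $queue[] = $child;
            }
        }
    }

    // One pass over the text; returns [needle, position] pairs.
    public function search(string $text): array
    {
        $found = [];
        $state = 0;
        for ($i = 0, $len = strlen($text); $i < $len; $i++) {
            $ch = $text[$i];
            while ($state !== 0 && !isset($this->next[$state][$ch])) {
                $state = $this->fail[$state];
            }
            $state = $this->next[$state][$ch] ?? 0;
            foreach ($this->out[$state] as $needle) {
                $found[] = [$needle, $i - strlen($needle) + 1];
            }
        }
        return $found;
    }
}

$ac = new TinyAhoCorasick();
foreach (['art', 'cart', 'ted'] as $needle) {
    $ac->add_needle($needle);
}
$ac->finalize();
$found = $ac->search('a carted mart lot one blue ted');
// $found: cart@2, art@3, ted@5, art@10, ted@27
```

The search loop never rescans the text, no matter how many needles were added, which is why paying the setup cost once and then calling search() on many texts works out well.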
