Aho-Corasick string matching in C#

Reposted on September 27, 2007, 11:16:00

Introduction

In this article, I describe an efficient implementation of the Aho-Corasick algorithm for pattern matching. Put simply, the algorithm searches a text for a specified set of keywords. The code is useful when you have a set of keywords and want to find all occurrences of any keyword in a text, or just check whether any of the keywords is present. It pays off especially when you have a large number of keywords that don't change often, because in that case it is much more efficient than the simpler approaches you can build directly on the .NET class library.

Aho-Corasick algorithm

In this section, I'll try to describe the concept behind the algorithm. For more information and a more exact explanation, please take a look at the links at the end of this article. The algorithm consists of two parts. The first part builds a tree from the keywords you want to search for, and the second part searches the text using the previously built tree (a state machine). Searching is very efficient because it only moves between the states of the state machine: if the current character matches, it follows the goto function; otherwise, it follows the fail function.

Tree building

In the first phase of tree building, the keywords are added to the tree. In my implementation, I use the class StringSearch.TreeNode, which represents one letter. The root node is used only as a placeholder and contains links to other letters. The links created in this first step represent the goto function, which returns the next state when a character matches.
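The first phase can be sketched as follows. This is a minimal illustration, not the article's actual StringSearch.TreeNode: the names TrieNode and TrieBuilder are hypothetical, and the sketch uses generic collections, which require .NET 2.0 or later.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for StringSearch.TreeNode: one node per letter.
public class TrieNode
{
    public readonly char Letter;      // letter on the link leading to this node
    public readonly TrieNode Parent;  // null for the root placeholder
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public readonly List<string> Results = new List<string>(); // keywords ending here

    public TrieNode(TrieNode parent, char letter)
    {
        Parent = parent;
        Letter = letter;
    }

    // The goto function: next state for a matching character, or null if none.
    public TrieNode GetTransition(char c)
    {
        TrieNode next;
        return Children.TryGetValue(c, out next) ? next : null;
    }
}

public static class TrieBuilder
{
    // Phase one: insert every keyword, creating goto links where they are missing.
    public static TrieNode Build(IEnumerable<string> keywords)
    {
        TrieNode root = new TrieNode(null, '\0'); // placeholder, holds no letter
        foreach (string keyword in keywords)
        {
            TrieNode node = root;
            foreach (char c in keyword)
            {
                TrieNode next = node.GetTransition(c);
                if (next == null)
                {
                    next = new TrieNode(node, c);
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.Results.Add(keyword); // mark the end of this keyword
        }
        return root;
    }
}
```

Keywords that share a prefix share nodes: for he, she, and his, the root gets two children (h and s), and the h node branches into e and i.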

During the second phase, the fail and output functions are computed. The fail function is used when a character does not match, and the output function returns the keywords found at each reached state. For example, with keywords such as she and his and the text "SHIS", the fail function is used to move from the "SHE" branch to the "HIS" branch after the first two characters (because the third character does not match). The second phase traverses all the nodes using BFS (breadth-first search). The functions are calculated in this order because the fail function of a node is calculated from the fail function of its parent node.
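Both phases together can be sketched like this. The names AcNode and AcAutomaton are hypothetical, and merging the results of the failure target into each node is one common way to realize the output function; the article's own source may do it differently.

```csharp
using System.Collections.Generic;

// Hypothetical node type; the article's TreeNode may be laid out differently.
public class AcNode
{
    public readonly Dictionary<char, AcNode> Children = new Dictionary<char, AcNode>();
    public readonly List<string> Results = new List<string>(); // output function
    public AcNode Failure;                                      // fail function
}

public static class AcAutomaton
{
    public static AcNode Build(IEnumerable<string> keywords)
    {
        AcNode root = new AcNode();
        root.Failure = root;

        // Phase one: add keywords, creating goto links.
        foreach (string keyword in keywords)
        {
            AcNode node = root;
            foreach (char c in keyword)
            {
                AcNode next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new AcNode();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.Results.Add(keyword);
        }

        // Phase two: breadth-first traversal, so that a parent's fail link
        // is always known before its children are processed.
        Queue<AcNode> queue = new Queue<AcNode>();
        foreach (AcNode child in root.Children.Values)
        {
            child.Failure = root; // depth-one nodes always fall back to the root
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            AcNode parent = queue.Dequeue();
            foreach (KeyValuePair<char, AcNode> edge in parent.Children)
            {
                AcNode child = edge.Value;
                AcNode fallback = parent.Failure;
                AcNode target;
                // Follow fail links until a state with a matching goto link is
                // found, or the root is reached without one.
                while (!fallback.Children.TryGetValue(edge.Key, out target) && fallback != root)
                    fallback = fallback.Failure;
                child.Failure = target ?? root;
                // Keywords ending at the fail target also end at this node.
                child.Results.AddRange(child.Failure.Results);
                queue.Enqueue(child);
            }
        }
        return root;
    }
}
```

With the keywords he, she, and his, the node reached by "sh" gets a fail link to the node reached by "h", which is the jump used in the "SHIS" example above. The node reached by "she" ends up reporting both she and he.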

Building of the keyword tree (figure 1 - after the first step, figure 2 - tree with the fail function)

Searching

As I already mentioned, searching only means traversing the previously built keyword tree (state machine). To demonstrate how this algorithm works, let's look at the commented method which returns all the matches of the specified keywords:

// Searches passed text and returns all occurrences of any keyword
// Returns array containing positions of found keywords
public StringSearchResult[] FindAll(string text)
{
  ArrayList ret=new ArrayList(); // List containing results
  TreeNode ptr=_root;            // Current node (state)
  int index=0;                   // Index in text

  // Loop through characters
  while(index<text.Length)
  {
    // Find next state (if no transition exists, fail function is used)
    // walks through tree until transition is found or root is reached
    TreeNode trans=null;
    while(trans==null)
    {
      trans=ptr.GetTransition(text[index]);
      if (ptr==_root) break;
      if (trans==null) ptr=ptr.Failure;
    }
    if (trans!=null) ptr=trans;

    // Add results from node to output array and move to next character
    foreach(string found in ptr.Results)
      ret.Add(new StringSearchResult(index-found.Length+1,found));
    index++;
  }
  
  // Convert results to array
  return (StringSearchResult[])ret.ToArray(typeof(StringSearchResult));
}

Algorithm complexity

The complexity of the first part matters less, because it is executed only once for a given set of keywords. The complexity of the second part is O(m+z), where m is the length of the text and z is the number of keyword occurrences found (in simple terms, it is very fast, and its speed doesn't drop quickly for longer texts or many keywords).

Performance comparison

To show how efficient this algorithm is, I created a test application which compares this algorithm with two other simple methods that can be used for this purpose. The first algorithm uses the String.IndexOf method to search the text for all the keywords, and the second algorithm uses regular expressions - for example, for keywords he, she, and his, it creates a regular expression (he|she|his). The following graphs show the results of tests for two texts of different sizes. The number of used keywords is displayed on the X axis and the time of search is displayed on the Y axis.
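The two baseline methods can be sketched roughly as follows. This is not the code of the test application; BaselineSearch and its method names are hypothetical, and the Regex.Escape call is an addition so that keywords containing regex metacharacters are matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helpers; the article's test application is not shown in the text.
public static class BaselineSearch
{
    // IndexOf baseline: scans the whole text once per keyword.
    public static int CountWithIndexOf(string text, string[] keywords)
    {
        int count = 0;
        foreach (string keyword in keywords)
        {
            if (keyword.Length == 0) continue; // an empty keyword matches everywhere
            int pos = text.IndexOf(keyword, StringComparison.Ordinal);
            while (pos >= 0)
            {
                count++;
                pos = text.IndexOf(keyword, pos + 1, StringComparison.Ordinal);
            }
        }
        return count;
    }

    // Regex baseline: one alternation such as (he|she|his).
    public static Regex BuildAlternation(string[] keywords)
    {
        string[] escaped = new string[keywords.Length];
        for (int i = 0; i < keywords.Length; i++)
            escaped[i] = Regex.Escape(keywords[i]);
        return new Regex("(" + string.Join("|", escaped) + ")");
    }

    public static int CountWithRegex(string text, string[] keywords)
    {
        return BuildAlternation(keywords).Matches(text).Count;
    }
}
```

The two baselines do not count the same thing. Regex matches never overlap, so for the text "ushers" the alternation reports only she, while the IndexOf loop finds both she and he.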

The interesting thing is that for fewer than 70 keywords, it is better to use the simple method based on String.IndexOf. Regular expressions are almost always slower than the other two methods. I also compiled the test under both .NET 1.1 and .NET 2.0 to see the difference. Although my measuring method may not be very precise, it looks like .NET 2.0 is a bit faster (about 5-10%), and the method with regular expressions gives much better results there (about 60% faster).

Two charts comparing the speed of the three described algorithms - Aho-Corasick (green), IndexOf (blue), and Regex (yellow)

How to use the code

I decided to implement this algorithm when I had to ban some words in a community web page (vulgarisms etc.). This is a typical use case because searching should be really fast, but blocked keywords don't change often (and the creation of the keyword tree can be slower).

The search algorithm is implemented in the file StringSearch.cs. I created an interface that represents any search algorithm (so it is easy to replace it with another implementation). This interface is called IStringSearchAlgorithm, and it contains a property Keywords (gets or sets the keywords to search for) and methods for searching. The method FindAll returns all occurrences of the keywords in the passed text, and FindFirst returns the first match. Matches are represented by the StringSearchResult structure, which contains the found keyword and its position in the text. The last method is ContainsAny, which returns true when the passed text contains any keyword. The class that implements the Aho-Corasick algorithm is called StringSearch.
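Based on this description, the public surface might look roughly like the sketch below. It is reconstructed from the prose and from the way the types are used in this article, not copied from StringSearch.cs; in particular, the return type of FindFirst and the private field names are assumptions.

```csharp
// Match record; members inferred from usages such as r.Keyword and r.Index
// and from the constructor call new StringSearchResult(index, keyword).
public struct StringSearchResult
{
    private readonly int _index;
    private readonly string _keyword;

    public StringSearchResult(int index, string keyword)
    {
        _index = index;
        _keyword = keyword;
    }

    // Position of the first character of the match in the searched text.
    public int Index { get { return _index; } }

    // The keyword that was found.
    public string Keyword { get { return _keyword; } }
}

// Any search algorithm; an implementation can be swapped without touching callers.
public interface IStringSearchAlgorithm
{
    // Gets or sets the keywords to search for.
    string[] Keywords { get; set; }

    // Returns all occurrences of the keywords in the text.
    StringSearchResult[] FindAll(string text);

    // Returns the first match (return type assumed).
    StringSearchResult FindFirst(string text);

    // Returns true when the text contains any of the keywords.
    bool ContainsAny(string text);
}
```

Code that depends only on IStringSearchAlgorithm keeps working if StringSearch is later replaced by another implementation, for example one based on String.IndexOf for small keyword sets.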

Initialization

The following example shows how to load keywords from a database and create a StringSearch instance:

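using System.Collections;    // ArrayList
using System.Data;           // CommandBehavior
using System.Data.SqlClient; // SqlConnection, SqlCommand, SqlDataReader

// Initialize DB connection
SqlConnection conn = new SqlConnection(connectionString);
SqlCommand cmd = new SqlCommand("SELECT BlockedWord" + 
                                " FROM BlockedWords",conn);
conn.Open();

// Read list of banned words
// (CloseConnection closes conn when the reader is disposed)
ArrayList listWords = new ArrayList();
using(SqlDataReader reader = 
  cmd.ExecuteReader(CommandBehavior.CloseConnection))
{
  while(reader.Read()) 
    listWords.Add(reader.GetString(0));
}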
string[] arrayWords = (string[])listWords.ToArray(typeof(string));

// Create search algorithm instance
IStringSearchAlgorithm searchAlg = new StringSearch();
searchAlg.Keywords = arrayWords;

You can also use the StringSearch constructor which takes an array of keywords as parameter.

Searching

Searching the passed text for keywords is even easier. The following sample shows how to write all the matches to the console output:

// Find all matching keywords  
StringSearchResult[] results=searchAlg.FindAll(textToSearch);

// Write all results  
foreach(StringSearchResult r in results)
{
  Console.WriteLine("Keyword='{0}', Index={1}", r.Keyword, r.Index);
}

Conclusion

This implementation of the Aho-Corasick search algorithm is very efficient if you want to find a large number of keywords in a text of any length, but if you want to search for only a few keywords, it is better to use a simple method like String.IndexOf. The code can be compiled under both .NET 1.1 and .NET 2.0 without any modifications. If you want to learn more about this algorithm, take a look at the link in the next section. It was very useful to me during the implementation, and it explains the theory behind the algorithm.

Links and references

Future work and history

  • 12/03/2005 - First version of this article published at CodeProject.

About Tomas Petricek


I'm a student from Prague, the capital city of the Czech Republic. I live here and study at Charles University in Prague (Faculty of Mathematics and Physics, computer science). I have been a Microsoft MVP for Visual C# since July 2004, and I'm a member of the Skilldrive.com group. My hobbies include photography, fractals, and of course many things related to computers (except fixing them). My favorite writers are Terry Pratchett and Philip K. Dick, and I like paintings by M. C. Escher.

