字符串相似度算法及应用

最新推荐文章于 2023-08-17 11:06:14 发布

mingspy

最新推荐文章于 2023-08-17 11:06:14 发布

阅读量5.8k

点赞数

分类专栏： C/C++ 文章标签：算法 distance matrix delete c# c

本文链接：https://blog.csdn.net/mingspy/article/details/7615855

版权

C/C++ 专栏收录该内容

17 篇文章 0 订阅

订阅专栏

Levenshtein Distance 简介

字符串相似度的算法还是比较经典的DP算法，看到有两篇文章写的比较不错，他们的介绍也非常详细，值得学习。文章地址如下:

文章1 http://blog.csdn.net/orbit/article/details/6649322 (C++) ，

文章2 http://www.cnblogs.com/ivanyb/archive/2011/11/25/2263356.html (C#)。

文章1中实现Levenshtein Distance的算法已经比较简洁了，不过他是在Stack上申请的距离数组，如果处理特别长字符串，会耗尽Stack，所以改为在Heap上申请内存。

//Implement Levenshtein Distance
// author mingspy 5/30 2012
#include "stdafx.h"
#include <iostream>
#include <cmath>

#define Element(arr, col, i, j) arr[(i)*(col)+j]


template<typename T>
inline const T& min(const T& i, const T& j, const T& k)
{
    return std::min(std::min(i, j), k);
}

int LevenshteinDistance(const char * str1, int len1, const char * str2, int len2)
{
    if(!str1 || !str2)
    {
        _ASSERT(false);
        return INT_MAX;
    }

    // new a matrix with [len1, len2] to save result.
    int * distance = new int[(len1+1)*(len2+1)];
    memset(distance, 0, (len1+1)*(len2+1) * sizeof(int));

    // Initialize the matrix first row.
    for(int i = 0; i < len1+1; i++)
    {
        distance[i] = i;
    }

    // Initialize the matrix first column.
    for(int i= 0; i< len2+1; i++)
    {
        distance[(len1+1)*i] = i;
    }

    // Using dynamic programming to calculate 
    // the distance between str1 and str2.
    int dist = 0;
    for(int i = 1; i < len2 + 1; i++)
    {
        for(int j = 1; j < len1 + 1; j++)
        {
            if(str1[j - 1] == str2[i - 1])
            {
                dist = 0;
            }
            else
            {
                dist = 1;
            }

            Element(distance, len1 + 1, i, j) = min(
                Element(distance, len1 + 1, i - 1, j -1) + dist, 
                Element(distance, len1 + 1, i - 1, j) + 1, 
                Element(distance, len1 + 1, i, j-1) + 1);
        }
    }

#ifdef _DEBUG
    // print distance
    for(int i = 0; i < len2 + 1; i++)
    {
        for(int j = 0; j < len1 + 1; j++)
        {
           std::cout<<Element(distance, len1 + 1, i, j)<<" ";
        }
        std::cout<<std::endl;
    }
#endif

    dist = Element(distance,len1 + 1, len2, len1);
    delete [] distance;
    return dist;
}

算法中占用(M+1)*(N+1)的额外内存，让人感觉很是浪费，如果能减少内存的使用那么就相当完美了，在CodeProject找到一篇文章，只用了M+N个额外空间:

http://www.codeproject.com/Articles/8265/More-than-strcmp-similarity-in-strings

应用

最后说一下我是在什么地方使用这个算法的。比如有一幅固定大小的图片，在图片上固定位置有比较规则排列的字符，写一个算法提取出这些字符串。

这是一个简单的字符串识别的小程序，不涉及到字符串的形变和噪声(很少噪声)。我的做法就相当简单:

1.定位到图片字符串的位置。

2.剪切字符串所在位置的图片。

3.二值化图片。

4.切割出每个字符。

5.用每个字符的特征串与字符库的特征串进行比较，得出字符串所代表的字符。

第5步的比较算法就用到了字符串相识度算法。由于二值化后的图片还可能存在噪声，我又不想使用去噪的算法，所以对切割后的图片，提取特征码之后，采用相似度匹配字符串库，这样就得到了比较理想的结果。

mingspy

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
字符串相似度算法及应用

Levenshtein Distance 简介字符串相似度的算法还是比较经典的DP算法，看到有两篇文章写的比较不错，他们的介绍也非常详细，值得学习。文章地址如下:文章1 http://blog.csdn.net/orbit/article/details/6649322 (C++) ，文章2 http://www.cnblogs.com/ivanyb/archive/2011/11/2
复制链接

扫一扫