Longest Common Substring Computation with a Small Improvement

Latest recommended article published on 2022-01-31 22:20:54

決心

Category: Data Mining

Article link: https://blog.csdn.net/u010910436/article/details/52057938

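In practice, judging the similarity of two Strings requires removing punctuation, and possibly stop words as well. Runs of consecutive digits should also be weighted down: if both strings contain "2016", for example, that should count as only one unit of similarity.
Concrete code follows. Stop-word removal will be posted later; it needs a stop-word list plus a data structure and scanning algorithm that approach index-lookup efficiency. Another option is word segmentation, where part-of-speech tags help with removing stop words, but that is a different approach from this one. Below is the string conversion, which turns a String into a String[]. Because punctuation is removed and each run of consecutive digits is merged into one token, there is less to compute, and the result is about 1/3 faster than the plain longest common substring computation.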

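import java.util.ArrayList;

public static String[] convertDigit(String str)
{
    // Remove punctuation with a regex (Unicode category P)
    str = str.replaceAll("\\p{P}", "");

    // Merge each run of consecutive digits into a single token
    ArrayList<String> find = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++)
    {
        char c = str.charAt(i);
        if (Character.isDigit(c))
        {
            // Still inside a digit run: keep accumulating
            sb.append(c);
        } else
        {
            // A non-digit ends the current digit run, if any
            if (sb.length() > 0)
            {
                find.add(sb.toString());
                sb.setLength(0);
            }
            find.add(String.valueOf(c));
        }
    }
    // Flush a digit run that reaches the end of the string
    if (sb.length() > 0)
    {
        find.add(sb.toString());
    }

    return find.toArray(new String[0]);
}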
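To make the effect of the conversion concrete, here is a minimal self-contained sketch that runs the same tokenization on a sample string. The class name `DigitTokenizerDemo` and the sample input are assumptions chosen for illustration; they are not part of the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; only the tokenization idea comes from the post.
class DigitTokenizerDemo {
    // Strips punctuation, then emits one token per character,
    // except that each run of consecutive digits becomes a single token.
    static String[] convertDigit(String input) {
        String cleaned = input.replaceAll("\\p{P}", "");
        List<String> tokens = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < cleaned.length(); i++) {
            char c = cleaned.charAt(i);
            if (Character.isDigit(c)) {
                digits.append(c);
            } else {
                if (digits.length() > 0) {
                    tokens.add(digits.toString());
                    digits.setLength(0);
                }
                tokens.add(String.valueOf(c));
            }
        }
        // Flush a digit run that ends the string.
        if (digits.length() > 0) {
            tokens.add(digits.toString());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The comma and exclamation mark are dropped; "2016" stays one token.
        System.out.println(Arrays.toString(convertDigit("ab2016,c9!")));
        // prints [a, b, 2016, c, 9]
    }
}
```

Eight characters remain after punctuation removal, but only five tokens reach the comparison step, which is where the reduction in work comes from.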

Next, take the two String[] arrays produced by the method above and compute their longest common substring.

// Longest common substring over token arrays: find cells where the row and column
// tokens match, then use dynamic programming to find the length of the longest
// diagonal of consecutive matches. A run of consecutive digits is one token, so its weight is 1.
public static int GetSimilarLengthImprove(String[] s, String[] t)
{
    int n = s.length; // number of tokens in s
    int m = t.length; // number of tokens in t
    // An empty array has nothing in common with anything
    if (n == 0 || m == 0)
    {
        return 0;
    }
    // score[i][j]: length of the common run that ends at s[i] and t[j]
    int[][] score = new int[n][m];
    int best = 0; // longest run seen so far

    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < m; j++)
        {
            if (s[i].equals(t[j]))
            {
                // Extend the run on the diagonal, or start a new one on the border
                score[i][j] = (i == 0 || j == 0) ? 1 : score[i - 1][j - 1] + 1;
                best = Math.max(best, score[i][j]);
            }
            // Otherwise score[i][j] stays 0: a mismatch breaks the run
        }
    }
    return best;
}
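The two steps can be put together as in the following sketch, which tokenizes two strings and measures the longest common run of tokens. The class `TokenSimilarityDemo`, the helper names, and in particular the normalization in `similarity` (dividing by the shorter token count) are assumptions for illustration; the post itself only returns the raw length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; names and the normalization step are assumptions.
class TokenSimilarityDemo {
    // One token per character, except that a run of digits is a single token.
    static String[] tokenize(String input) {
        String cleaned = input.replaceAll("\\p{P}", "");
        List<String> tokens = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        for (char c : cleaned.toCharArray()) {
            if (Character.isDigit(c)) {
                digits.append(c);
                continue;
            }
            if (digits.length() > 0) {
                tokens.add(digits.toString());
                digits.setLength(0);
            }
            tokens.add(String.valueOf(c));
        }
        if (digits.length() > 0) {
            tokens.add(digits.toString());
        }
        return tokens.toArray(new String[0]);
    }

    // Length of the longest contiguous run of tokens shared by s and t.
    static int longestCommonRun(String[] s, String[] t) {
        // run[i][j]: common run ending at s[i-1] and t[j-1]; row and column 0 are padding.
        int[][] run = new int[s.length + 1][t.length + 1];
        int best = 0;
        for (int i = 1; i <= s.length; i++) {
            for (int j = 1; j <= t.length; j++) {
                if (s[i - 1].equals(t[j - 1])) {
                    run[i][j] = run[i - 1][j - 1] + 1;
                    best = Math.max(best, run[i][j]);
                }
            }
        }
        return best;
    }

    // Assumed normalization: common run length divided by the shorter token count.
    static double similarity(String a, String b) {
        String[] s = tokenize(a);
        String[] t = tokenize(b);
        int shorter = Math.min(s.length, t.length);
        return shorter == 0 ? 0.0 : (double) longestCommonRun(s, t) / shorter;
    }

    public static void main(String[] args) {
        // Tokens: [a, b, c, 2016] and [x, b, c, 2016]; the shared run is [b, c, 2016].
        System.out.println(longestCommonRun(tokenize("abc2016!"), tokenize("xbc2016"))); // prints 3
        System.out.println(similarity("abc2016!", "xbc2016")); // prints 0.75
    }
}
```

Compared character by character, the shared "bc2016" would score 6, with the year alone contributing 4. With digit runs merged it scores 3, so the year counts once, as the post intends.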