An English Word Segmentation Program

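This is an English word segmentation program. It turns input such as: Books in tuneBoxes are for Chinese-Children! into: Book in tune Box are for Chinese child. In other words, it converts plurals to singular forms, splits run-together words at capitalized first letters, and so on. The plural-to-singular handling should be fairly thorough and covers most common cases. My earlier approach to splitting on capital letters was somewhat flawed; the current version keeps all-uppercase words such as NEC intact.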
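The case-based splitting rule can be tried out in isolation. The following is a minimal sketch, not the author's code: the class name `CaseSplitSketch` and the method `splitOnCase` are hypothetical names introduced here, and the logic assumes the same rule as the full method that follows (all-uppercase tokens are kept whole, every other token is cut before each character that is not a lowercase letter).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of the original post.
class CaseSplitSketch {

    /** Cuts one token before each character that is not a lowercase letter. */
    static List<String> splitOnCase(String token) {
        List<String> parts = new ArrayList<String>();
        if (token.isEmpty()) {
            return parts;
        }

        // Tokens made up entirely of uppercase letters (e.g. "NEC") are kept whole.
        boolean allUpperCase = true;
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isUpperCase(token.charAt(i))) {
                allUpperCase = false;
                break;
            }
        }
        if (allUpperCase) {
            parts.add(token);
            return parts;
        }

        // Otherwise each piece is one leading character plus the lowercase run after it.
        int start = 0;
        while (start < token.length()) {
            int end = start + 1;
            while (end < token.length() && Character.isLowerCase(token.charAt(end))) {
                end++;
            }
            parts.add(token.substring(start, end));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnCase("tuneBoxes")); // [tune, Boxes]
        System.out.println(splitOnCase("NEC"));       // [NEC]
        System.out.println(splitOnCase("NECBox"));    // [N, E, C, Box]
    }
}
```

Under this rule a mixed token such as NECBox is still cut into single letters before Box; only tokens that are uppercase throughout are preserved.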

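    // Imports needed at the top of the enclosing source file:
    // import java.util.Enumeration;
    // import java.util.StringTokenizer;
    // import java.util.Vector;

    /**
     * Splits a string into words.
     *
     * @param source
     *            the string to split
     * @return the resulting words
     */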
    public String[] fenci(String source) {
        /* Set of delimiter characters */
        String delimiters = " \t\n\r\f~!@#$%^&*()_+|`1234567890-=\\{}[]:\";'<>?,./'";

        /* Split on the delimiters */
        StringTokenizer stringTokenizer = new StringTokenizer(source,
                delimiters);
        Vector<String> vector = new Vector<String>();

        /* Split on capital letters */
        while (stringTokenizer.hasMoreTokens()) {
            String token = stringTokenizer.nextToken();

            /* Tokens that are entirely uppercase (e.g. NEC) are kept as they are */
            boolean allUpperCase = true;
            for (int i = 0; i < token.length(); i++) {
                if (!Character.isUpperCase(token.charAt(i))) {
                    allUpperCase = false;
                    break;
                }
            }
            if (allUpperCase) {
                vector.addElement(token);
                continue;
            }

            /* Tokens that are not entirely uppercase */
            int index = 0;
            while (index < token.length()) {
                /* Advance to the next character that is not a lowercase letter */
                do {
                    index++;
                } while (index < token.length()
                        && Character.isLowerCase(token.charAt(index)));
                vector.addElement(token.substring(0, index));
                token = token.substring(index);
                index = 0;
            }
        }

        /*
         * Convert plurals to singular forms, following the rules in this document:
         * http://ftp.haie.edu.cn/Resource/GZ/GZYY/DCYFWF/NJSYYY/421b0061ZW_0015.htm
         */
        for (int i = 0; i < vector.size(); i++) {
            String token = (String) vector.elementAt(i);
            if (token.equalsIgnoreCase("feet")) {
                token = "foot";
            } else if (token.equalsIgnoreCase("geese")) {
                token = "goose";
            } else if (token.equalsIgnoreCase("lice")) {
                token = "louse";
            } else if (token.equalsIgnoreCase("mice")) {
                token = "mouse";
            } else if (token.equalsIgnoreCase("teeth")) {
                token = "tooth";
            } else if (token.equalsIgnoreCase("oxen")) {
                token = "ox";
            } else if (token.equalsIgnoreCase("children")) {
                token = "child";
            } else if (token.endsWith("men")) {
                token = token.substring(0, token.length() - 3) + "man";
            } else if (token.endsWith("ies")) {
                token = token.substring(0, token.length() - 3) + "y";
            } else if (token.endsWith("ves")) {
                if (token.equalsIgnoreCase("knives")
                        || token.equalsIgnoreCase("wives")
                        || token.equalsIgnoreCase("lives")) {
                    token = token.substring(0, token.length() - 3) + "fe";
                } else {
                    token = token.substring(0, token.length() - 3) + "f";
                }
            } else if (token.endsWith("oes") || token.endsWith("ches")
                    || token.endsWith("shes") || token.endsWith("ses")
                    || token.endsWith("xes")) {
                token = token.substring(0, token.length() - 2);
            } else if (token.endsWith("s")) {
                token = token.substring(0, token.length() - 1);
            }

            /* Store the processed token */
            vector.setElementAt(token, i);
        }

        /* Convert to an array */
        String[] array = new String[vector.size()];
        Enumeration enumeration = vector.elements();
        int index = 0;
        while (enumeration.hasMoreElements()) {
            array[index] = (String) enumeration.nextElement();
            index++;
        }

        /* Print the result */
        for (int i = 0; i < array.length; i++) {
            System.out.print(array[i] + " ");
        }

        /* Return the result */
        return array;
    }
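The plural-to-singular step can also be looked at on its own. The sketch below is one possible rewrite, not the author's code: the class `SingularizeSketch`, the method `singularize`, and the lookup table `IRREGULAR` are hypothetical names introduced here. It assumes the same rule order as the method above, with the irregular plurals moved into a map and the suffix rules left case-sensitive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the original post.
class SingularizeSketch {

    // Irregular plurals, looked up case-insensitively.
    private static final Map<String, String> IRREGULAR = new HashMap<String, String>();
    static {
        IRREGULAR.put("feet", "foot");
        IRREGULAR.put("geese", "goose");
        IRREGULAR.put("lice", "louse");
        IRREGULAR.put("mice", "mouse");
        IRREGULAR.put("teeth", "tooth");
        IRREGULAR.put("oxen", "ox");
        IRREGULAR.put("children", "child");
    }

    /** Applies the irregular table first, then the suffix rules in order. */
    static String singularize(String token) {
        String irregular = IRREGULAR.get(token.toLowerCase(Locale.ROOT));
        if (irregular != null) {
            return irregular;
        }
        int n = token.length();
        if (token.endsWith("men")) {
            return token.substring(0, n - 3) + "man";
        }
        if (token.endsWith("ies")) {
            return token.substring(0, n - 3) + "y";
        }
        if (token.endsWith("ves")) {
            // knives, wives and lives take "fe"; every other "-ves" word takes "f".
            boolean feWord = token.equalsIgnoreCase("knives")
                    || token.equalsIgnoreCase("wives")
                    || token.equalsIgnoreCase("lives");
            return token.substring(0, n - 3) + (feWord ? "fe" : "f");
        }
        if (token.endsWith("oes") || token.endsWith("ches") || token.endsWith("shes")
                || token.endsWith("ses") || token.endsWith("xes")) {
            return token.substring(0, n - 2);
        }
        if (token.endsWith("s")) {
            return token.substring(0, n - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(singularize("Books"));    // Book
        System.out.println(singularize("Boxes"));    // Box
        System.out.println(singularize("Children")); // child
        System.out.println(singularize("knives"));   // knife
    }
}
```

Because the rules look only at suffixes, a word that is already singular but ends in s is shortened too: `singularize("class")` returns `clas`. Handling such words would need an exception list, which this sketch does not attempt.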

 

This article is from a CSDN blog. Please cite the source when reposting: http://blog.csdn.net/petercheng456/archive/2005/07/04/412390.aspx
