Chinese Action Extraction with the Stanford Parser

vincent2610

Reposted from twenz

Chinese Action Extraction (Action Mining) with the Stanford Parser

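Problem

Action extraction means collecting, from open-source text, the actions of a given person or organization. Each action consists of a subject, a predicate, and an object. The subject is one of a given set of words naming the target of the extraction (a person, organization, or group); the predicate and object together describe the action.

For example, to extract the actions of the Taliban (塔利班) from the sentence “塔利班制造了这起爆炸。” (“The Taliban carried out this bombing.”), the expected result is “塔利班：制造爆炸” (“Taliban: carry out bombing”). If the Taliban also goes by other names (for example 基地组织, “al-Qaeda”), or if there are key members whose actions can be taken to represent the Taliban, then actions with those names as the subject should be extracted as well.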
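The alias handling described above can be sketched as a small lookup table that maps every alias back to one canonical target name. This is only an illustration: the class `AliasTable`, its methods, and the alias `塔利班武装` are hypothetical and do not come from the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps each alias of a target back to its canonical name.
class AliasTable {
    // alias -> canonical target name, kept in insertion order
    private final Map<String, String> aliasToTarget = new LinkedHashMap<>();

    /** Registers a target under its own name and under each alias. */
    void add(String target, String... aliases) {
        aliasToTarget.put(target, target);
        for (String alias : aliases) {
            aliasToTarget.put(alias, target);
        }
    }

    /** Returns the canonical target whose name or alias occurs in the token, or null. */
    String resolve(String token) {
        for (Map.Entry<String, String> e : aliasToTarget.entrySet()) {
            if (token.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        AliasTable table = new AliasTable();
        table.add("塔利班", "塔利班武装"); // illustrative alias only
        System.out.println(table.resolve("塔利班武装")); // prints 塔利班
    }
}
```

With such a table, a subject found under any alias can be reported under the canonical name, so all actions end up grouped under one target.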

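Method

Action extraction of this kind is clearly sentence-level work, and statistical machine learning methods may not perform very well on it (a personal impression).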

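1. Select the data (the data source, such as news articles).

2. Split the text into sentences.

3. Filter for relevant sentences (find the sentences that contain a target name, by direct string matching).

4. Segment words (add the target names to the dictionary; ICTCLAS is used here).

5. Parse the syntax (Stanford Parser).

6. Extract actions (look for regularities and apply rule matching or similar methods; there is plenty of room for optimization here, and only a simple match-and-search is given in this post).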
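Steps 2 and 3 are not covered by the code in this post. A minimal sketch of them might look like the following, assuming that sentences end at 。, ！ or ？ and that relevance is decided by plain substring matching. The class name `SentenceFilter` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of steps 2 and 3: sentence splitting and filtering.
class SentenceFilter {
    /** Step 2: split after Chinese sentence-final punctuation, keeping the punctuation. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[。！？])")) {
            s = s.trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    /** Step 3: keep only the sentences that contain at least one target name. */
    static List<String> filter(List<String> sentences, List<String> targets) {
        List<String> relevant = new ArrayList<>();
        for (String sentence : sentences) {
            for (String target : targets) {
                if (sentence.contains(target)) {
                    relevant.add(sentence);
                    break; // one match is enough
                }
            }
        }
        return relevant;
    }

    public static void main(String[] args) {
        String text = "塔利班制造了这起爆炸。警方正在调查。";
        List<String> sentences = splitSentences(text);
        System.out.println(filter(sentences, Arrays.asList("塔利班")));
    }
}
```

Substring matching is crude (it cannot tell a target name from a longer word that happens to contain it), but it matches the "direct matching" described in step 3 and keeps the expensive parsing step limited to candidate sentences.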

Code

import java.util.Arrays;
import java.util.Collection;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;
import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

public class Dependency {
    // Constructor form used by older Stanford Parser releases; later releases
    // load the model with LexicalizedParser.loadModel(...) instead.
    LexicalizedParser lp = new LexicalizedParser("xinhuaFactored.ser.gz");
    Tree parser;
    TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs;

    public Dependency() {}

    // Step 5a: parse an already segmented sentence into a phrase-structure tree.
    public void parserToTree(String[] sent) {
        parser = lp.apply(Arrays.asList(sent));
    }

    // Step 5b: convert the tree into typed dependencies.
    public void getGrammaticalStructure() {
        gs = gsf.newGrammaticalStructure(parser);
    }

    // Print all typed dependencies, useful when looking for new rules.
    public void outputRelation() {
        Collection<TypedDependency> tdl = gs.typedDependenciesCollapsedTree();
        System.out.println(tdl);
    }

    // Relations treated as "the dependent is the subject of the governor".
    // advmod and nn are included because the parser often mislabels subjects.
    private boolean relationSubj(String rel) {
        return rel.contains("subj") || rel.contains("advmod") || rel.contains("nn");
    }

    // Relations treated as "the dependent is an object of the governor".
    private boolean relationObj(String rel) {
        return rel.contains("obj") || rel.contains("comp");
    }

    // Step 6: print "agent:verb:objects" for the first verb governing the agent.
    public void extractAction(String agent) {
        Collection<TypedDependency> tdl = gs.typedDependenciesCollapsedTree();
        int act = -1; // token index of the verb whose subject is the agent

        // Pass 1: find a dependency whose dependent is the agent in a subject-like relation.
        for (TypedDependency td : tdl) {
            String dependent = td.dep().toString();
            if (dependent.contains(agent) && relationSubj(td.reln().toString())) {
                act = td.gov().index();
                System.out.print(agent + ":");
                System.out.print(td.gov().value() + ":");
                break;
            }
        }
        if (act == -1) return; // the agent is not the subject of anything

        // Pass 2: collect every object-like dependent of that verb.
        for (TypedDependency td : tdl) {
            if (td.gov().index() == act && relationObj(td.reln().toString())) {
                System.out.print(td.dep().value());
            }
        }
        System.out.println();
    }

    // Convenience wrapper: the sentence must already be segmented with spaces.
    public void extractAction(String sentence, String agent) {
        String[] sent = sentence.split(" ");
        System.out.println(sent.length); // debug output: number of tokens
        if (sent.length > 55) return;    // skip long sentences, which tend to make the parser abort
        this.parserToTree(sent);
        this.getGrammaticalStructure();
        this.extractAction(agent);
    }

    public static void main(String[] args) {
        Dependency dep = new Dependency();
        //String sentence = "塔利班 制造 了 这 起 爆炸 。";
        String sentence = "昨日 ， 圣元 又 遭 网友 质疑 ， 称 其 利用 专家 之 名 ， 制造 奶粉 安全 的 舆论 。";
        String[] sent = sentence.split(" ");
        dep.parserToTree(sent);
        dep.getGrammaticalStructure();
        dep.outputRelation();
        dep.extractAction("网友");
    }
}

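The code here covers only steps 5 and 6. In testing, it can extract some simple, well-formed actions. Raising precision or recall requires continually revising the rules.

Analysis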
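To experiment with the matching rules without loading a parser model, the same two-pass search can be run on hand-written dependency edges. The sketch below is a simplification (it uses only subj and obj/comp relations), the `Dep` type is a hypothetical stand-in for the parser's typed dependencies, and the edges in `main` are written by hand for illustration rather than taken from actual parser output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, parser-free sketch of the two-pass rule matching in step 6.
class TripleRuleSketch {
    /** One dependency edge: rel(gov-govIndex, dep). Stand-in for a typed dependency. */
    static final class Dep {
        final String rel;
        final int govIndex;
        final String gov;
        final String dep;

        Dep(String rel, int govIndex, String gov, String dep) {
            this.rel = rel;
            this.govIndex = govIndex;
            this.gov = gov;
            this.dep = dep;
        }
    }

    /** Returns "agent:verb:objects", or null if the agent is not a subject. */
    static String extract(List<Dep> deps, String agent) {
        int verbIndex = -1;
        String verb = null;
        // Pass 1: find the verb that takes the agent as its subject.
        for (Dep d : deps) {
            if (d.dep.contains(agent) && d.rel.contains("subj")) {
                verbIndex = d.govIndex;
                verb = d.gov;
                break;
            }
        }
        if (verbIndex == -1) {
            return null;
        }
        // Pass 2: append every object-like dependent of that verb.
        StringBuilder result = new StringBuilder(agent).append(":").append(verb).append(":");
        for (Dep d : deps) {
            if (d.govIndex == verbIndex && (d.rel.contains("obj") || d.rel.contains("comp"))) {
                result.append(d.dep);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Hand-written edges for "塔利班 制造 了 这 起 爆炸 。" (not real parser output).
        List<Dep> deps = Arrays.asList(
                new Dep("nsubj", 2, "制造", "塔利班"),
                new Dep("asp", 2, "制造", "了"),
                new Dep("det", 6, "爆炸", "这"),
                new Dep("dobj", 2, "制造", "爆炸"));
        System.out.println(extract(deps, "塔利班")); // prints 塔利班:制造:爆炸
    }
}
```

Keeping the rule logic separate from the parser in this way makes it easier to collect problem sentences as small test cases and adjust the rules against them.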

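1. The Stanford Parser can use a lot of memory when building parse trees, so the JVM memory settings in Eclipse need adjusting: under Run > Run... > Arguments > VM arguments, enter -Xms256M -Xmx800M. The right sizes depend on the actual workload and on the machine. Exceptions also occur fairly often, and no general fix has been found yet. For example, long sentences can produce “FactoredParser: exceeded MAX_ITEMS work limit [200000 items]; aborting.” For a workaround to this one, see http://blog.amelielee.com/archives/140: add Test.MAX_ITEMS = 3000000; to the constructor in the code above.

2. The dependency trees have a high error rate, which affects the final results. The parser from Harbin Institute of Technology (哈工大) is said to perform fairly well and may be worth trying.

3. Because parsing quality is poor, the dependencies that carry an action are not limited to the subject-predicate and predicate-object relations labeled subj and obj, so the rules need further exploration. Many implicit and passive actions also still need handling. For example, “塔利班遭受政府军的严厉打击。” (“The Taliban suffered a severe strike by government forces.”) is passive, and “发动了这次袭击的塔利班武装分子撤出了阿富汗” (“The Taliban militants who launched this attack withdrew from Afghanistan”) implies the action “发动袭击” (“launch attack”).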
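One possible way to start on the passive case is to keep a small list of verbs that mark their subject as the receiver of an action and to report the target as a patient when it is the subject of such a verb. The sketch below is an assumption, not part of the original method: the verb list, the `Dep` type, the AGENT/PATIENT labels, and the hand-written edges are all illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: tell active subjects from subjects of "suffer"-type verbs.
class PassiveRuleSketch {
    /** One dependency edge: rel(gov-govIndex, dep). Stand-in for a typed dependency. */
    static final class Dep {
        final String rel;
        final int govIndex;
        final String gov;
        final String dep;

        Dep(String rel, int govIndex, String gov, String dep) {
            this.rel = rel;
            this.govIndex = govIndex;
            this.gov = gov;
            this.dep = dep;
        }
    }

    // Verbs assumed to mark their subject as the receiver; illustrative, not exhaustive.
    static final Set<String> PASSIVE_VERBS = new HashSet<>(Arrays.asList("遭", "遭受", "遭到"));

    /** Returns "AGENT:target:verb+objects" or "PATIENT:target:objects", or null if no subject edge matches. */
    static String extract(List<Dep> deps, String target) {
        for (Dep s : deps) {
            if (!s.rel.contains("subj") || !s.dep.contains(target)) {
                continue;
            }
            boolean passive = PASSIVE_VERBS.contains(s.gov);
            StringBuilder action = new StringBuilder();
            if (!passive) {
                action.append(s.gov); // an active verb is part of the action itself
            }
            for (Dep o : deps) {
                if (o.govIndex == s.govIndex && o.rel.contains("obj")) {
                    action.append(o.dep);
                }
            }
            return (passive ? "PATIENT" : "AGENT") + ":" + target + ":" + action;
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-written edges for "塔利班 遭受 政府军 的 严厉 打击 。" (not real parser output).
        List<Dep> deps = Arrays.asList(
                new Dep("nsubj", 2, "遭受", "塔利班"),
                new Dep("dobj", 2, "遭受", "打击"),
                new Dep("assmod", 6, "打击", "政府军"),
                new Dep("amod", 6, "打击", "严厉"));
        System.out.println(extract(deps, "塔利班")); // prints PATIENT:塔利班:打击
    }
}
```

Implicit actions inside relative clauses, such as “发动袭击” in the second example, would need a different rule that also follows clause-modifier relations, and that is not attempted here.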