Java: Detecting the Language of a Text
1. Detecting the language with language-detector
Source code: https://github.com/optimaize/language-detector
2. Supported languages
71 built-in language profiles
3. Overview
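The software works from language profiles, each built from general-purpose text in one language. N-grams (https://baike.baidu.com/item/n%E5%85%83%E8%AF%AD%E6%B3%95/19133139) are extracted from that text, and those n-grams are what a profile stores.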
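To make the idea of an n-gram profile concrete, here is a minimal sketch that counts character n-grams in a piece of text. The class and method names (NgramProfileSketch, countNgrams) are hypothetical and use only the JDK; this is not the library's own extractor or profile format, which may normalize and filter text differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not the library's extractor or profile format.
class NgramProfileSketch {

    /** Counts every character n-gram of length n that occurs in the text. */
    static Map<String, Integer> countNgrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Slide a window of n characters over the text, one position at a time.
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, countNgrams("the theme", 3) records the trigram "the" twice and every other trigram once. A real profile is built the same way in spirit, but from far more text.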
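Detection is unreliable when the input text is very short or noisy.
The default algorithm is not suited to text written in several languages. You can try splitting the text (by sentence or paragraph) and detecting each part separately. At best, running detection on the whole text only tells you the dominant language.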
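The splitting approach for mixed-language text could be sketched roughly as follows. The detector is passed in as a plain Function<String, String> that maps a text to a language code; that parameter, the class name, and the blank-line paragraph rule are all assumptions made for illustration. In practice the function would wrap a call to the library's detector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: splits on blank lines and detects each paragraph on its own.
class PerParagraphDetectionSketch {

    /** Returns one detected language code per non-empty paragraph, in order. */
    static List<String> detectPerParagraph(String text, Function<String, String> detector) {
        List<String> languages = new ArrayList<>();
        // A paragraph boundary is assumed to be a line break, optional whitespace, and another line break.
        for (String paragraph : text.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                languages.add(detector.apply(trimmed));
            }
        }
        return languages;
    }
}
```

Keep in mind that each paragraph is shorter than the whole text, so the accuracy caveat about short input applies to every part.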
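The software does not cope well with input in a language it does not expect (and support). For example, if you load only the English and Chinese profiles but the text is Japanese, the detector still picks its answer from the loaded profiles.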
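The toy classifier below shows why this happens with any closed-set approach: it can only answer with one of the languages it was given. It scores by shared character bigrams, which is a deliberate simplification and not the library's actual algorithm; the class name, method names, and sample texts are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy closed-set classifier for illustration; not how the library scores text.
class ClosedSetDetectionSketch {

    /** Collects the distinct character bigrams of a text. */
    static Set<String> bigrams(String text) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            result.add(text.substring(i, i + 2));
        }
        return result;
    }

    /**
     * Returns the key of the sample that shares the most bigrams with the text.
     * The answer is always one of the given keys (ties go to the first one in
     * iteration order), or null only when no samples are given at all.
     */
    static String closest(String text, Map<String, String> samplesByLanguage) {
        Set<String> input = bigrams(text);
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, String> entry : samplesByLanguage.entrySet()) {
            Set<String> shared = bigrams(entry.getValue());
            shared.retainAll(input);
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With only an "en" and a "zh" sample loaded, Japanese input shares no bigrams with either sample, yet closest still returns one of the two keys. The usual remedy is to load a profile for every language you expect to encounter.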
4. Getting started
Maven dependency
<dependency>
    <groupId>com.optimaize.languagedetector</groupId>
    <artifactId>language-detector</artifactId>
    <version>0.6</version>
</dependency>
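Using the built-in languages
import java.util.List;
import com.optimaize.langdetect.DetectedLanguage;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;
// load all built-in language profiles (throws IOException):
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
// build the language detector:
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();
// create a text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
// query:
TextObject textObject = textObjectFactory.forText("my text");
// get the candidate languages with their probabilities:
List<DetectedLanguage> probabilities = languageDetector.getProbabilities(textObject);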
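Creating a custom language profile
import java.io.File;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileBuilder;
import com.optimaize.langdetect.profiles.LanguageProfileWriter;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;
// create the text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forIndexingCleanText();
// load your training text (note the trailing space so the two pieces do not run together):
TextObject inputText = textObjectFactory.create()
        .append("this is my ")
        .append("training text");
// create the profile:
LanguageProfile languageProfile = new LanguageProfileBuilder("en")
        .ngramExtractor(NgramExtractors.standard())
        .minimalFrequency(5) // adjust to your training data
        .addText(inputText)
        .build();
// store it to disk if you like (the target directory is passed as a File; throws IOException):
new LanguageProfileWriter().writeToDirectory(languageProfile, new File("c:/foo/bar"));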
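The minimalFrequency setting is easier to tune with a picture of what it does. The sketch below assumes the setting drops n-grams that occur fewer than the given number of times in the training text; the helper itself (MinimalFrequencySketch.prune) is hypothetical and uses only the JDK, so treat it as an illustration of the idea and not as the library's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; the library applies its own filtering internally.
class MinimalFrequencySketch {

    /** Returns a copy of the counts without n-grams seen fewer than minimalFrequency times. */
    static Map<String, Integer> prune(Map<String, Integer> counts, int minimalFrequency) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minimalFrequency) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

A higher threshold gives a smaller profile with fewer accidental n-grams, but on a small training text it can remove almost everything, which is why the value needs adjusting per data set.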
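Adding a custom language
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.optimaize.langdetect.DetectedLanguage;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;
// load all built-in language profiles (copied defensively, in case the returned list is not modifiable):
List<LanguageProfile> languageProfiles = new ArrayList<>(new LanguageProfileReader().readAllBuiltIn());
// point to the profile file you generated earlier:
File yourConf = new File("path/to/your/profile");
LanguageProfile languageProfile = new LanguageProfileReader().read(yourConf);
// add it to the built-in profiles:
languageProfiles.add(languageProfile);
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();
// create a text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
// query:
TextObject textObject = textObjectFactory.forText("my text");
// get the candidate languages with their probabilities:
List<DetectedLanguage> probabilities = languageDetector.getProbabilities(textObject);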