Multilingual Language Detection

Ai_践行者

Category: Artificial Intelligence. Tags: language detection

Article link: https://blog.csdn.net/qq_41424519/article/details/81740448

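This post shows how to implement a multilingual language-detection model in Python, relying mainly on `CountVectorizer` and `MultinomialNB` from the `sklearn` library. It first defines a `LanguageDetector` class that preprocesses the text by removing noise such as URLs, @mentions, and #hashtags, and then uses character n-grams as features. The data for training and evaluation comes from the file 'data.csv', which `train_test_split` divides into a training set and a test set. Finally, it shows the model's prediction for a given text and its overall score.

Summary generated automatically by CSDN.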
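
To make the steps in the summary concrete, here is a minimal, self-contained sketch of the same idea: strip noise, extract character n-grams, and fit a naive Bayes classifier. It is not the author's code. The `remove_noise` helper, its regular expression, and the toy English/Spanish sentences are assumptions made up for illustration, and the real post reads its data from 'data.csv' instead.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumed noise pattern: URLs, @mentions and #hashtags.
NOISE_PATTERN = re.compile(r"http\S+|@\w+|#\w+")


def remove_noise(document):
    """Hypothetical preprocessor: drop noise tokens, then lowercase."""
    # A custom preprocessor replaces CountVectorizer's own lowercasing
    # step, so the lowercasing is done here explicitly.
    return NOISE_PATTERN.sub("", document).lower()


vectorizer = CountVectorizer(
    analyzer="char_wb",      # character n-grams inside word boundaries
    ngram_range=(1, 2),      # unigrams and bigrams
    max_features=1000,       # keep the 1000 most frequent n-grams
    preprocessor=remove_noise,
)

# The analyzer pads each word with a space before slicing n-grams.
ngrams = vectorizer.build_analyzer()("ab")

# Toy corpus invented for this sketch (not the post's data.csv).
texts = [
    "The weather is nice today",
    "I would like a cup of tea",
    "Where is the train station",
    "This book is very interesting @reader",
    "El clima es agradable hoy",
    "Quisiera una taza de café",
    "Dónde está la estación de tren",
    "Este libro es muy interesante #libros",
]
labels = ["en", "en", "en", "en", "es", "es", "es", "es"]

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=1, stratify=labels
)

classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(x_train), y_train)

# Mean accuracy on the held-out half, and a single prediction.
accuracy = classifier.score(vectorizer.transform(x_test), y_test)
prediction = classifier.predict(
    vectorizer.transform(["This is an English sentence"])
)[0]
print(accuracy, prediction)
```

With a corpus this small the score says little. The point is only the order of the steps: the vectorizer is fitted on the training split alone and then reused to transform the test split and any new text.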

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 25 22:02:48 2018

@author: Administrator
"""

import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

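class LanguageDetector():
    # Defaults: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
    def __init__(self, classifier=None):
        # Create the default classifier per instance instead of sharing one default object
        self.classifier = classifier if classifier is not None else MultinomialNB()
        self.vectorizer = CountVectorizer(
            lowercase=True,      # lowercase the text (ignored by scikit-learn when a custom preprocessor is set)
            analyzer='char_wb',  # tokenise by character n-grams within word boundaries
            ngram_range=(1, 2),  # use n-grams of size 1 and 2
            max_features=1000,   # keep the most common 1000 n-grams
            preprocessor=self._remove_noise
        )
    # For the model to perform well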