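Here is one way to do this with sklearn. In the past I would have used LabelBinarizer(), but it does not work inside a pipeline because it no longer accepts X, y as input.
If you are new to them, pipelines can be a bit confusing, but they really just process the data in steps before handing it to the classifier. Here, I convert X into an n-gram "matrix" (a table) of word and character tokens, and then pass that to the classifier.

import numpy as np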
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
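
# Toy training data: one short text per row, with its category label
X = np.array([['AI'],
              ['Artificial Intelligence'],
              ['VR'],
              ['Virtual Reality'],
              ['Mobile application'],
              ['Desktop softwares']])
y = np.array(['Artificial Intelligence', 'Artificial Intelligence',
              'Virtual Reality', 'Virtual Reality', 'Application', 'Application'])

# 'union' stacks word n-gram and character n-gram counts side by side;
# 'lreg' is then fitted on the combined feature matrix
pipeline = Pipeline(steps=[
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1, 2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2, 5)))
    ])),
    ('lreg', LogisticRegression())
])

# CountVectorizer expects a 1-D sequence of strings, hence ravel()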
pipeline.fit(X.ravel(), y)
print(pipeline.predict(['web application', 'web app', 'dog', 'super intelligence']))
Predictions:
^{pr2}$