Classifying text with Python: categorizing businesses using text analysis in Python

I'm a newbie to AI and want to work through the exercise below. Can you please suggest a way to achieve it using Python?

Scenario -

I have a list of business areas for some companies, such as:

1. AI

2. Artificial Intelligence

3. VR

4. Virtual reality

5. Mobile application

6. Desktop softwares

and I want to categorize them as follows:

Technology ---> Category

1. AI ---> Category Artificial Intelligence

2. Artificial Intelligence ---> Category Artificial Intelligence

3. VR ---> Category Virtual Reality

4. Virtual reality ---> Category Virtual Reality

5. Mobile application ---> Category Application

6. Desktop softwares ---> Category Application

That is, when I receive a text like AI or Artificial Intelligence, the program must recognize AI and Artificial Intelligence as the same thing and put both keywords under the Artificial Intelligence category.

My current approach uses a lookup table, but I want to apply text classification to the technologies/businesses in the input above using Python, so that I can separate the technologies into categories without the lookup table.

Please suggest any relevant approach.

Solution

Here's one approach using sklearn. In the past I would have used LabelBinarizer(), but it won't work in a pipeline because it no longer accepts X, y as inputs.

If you are a newbie, pipelines can be a bit confusing, but essentially they just process the data in steps before passing it to a classifier. Here, I convert X into an n-gram "matrix" (a table) of word and character token counts, and then pass that table to a classifier.
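To make the n-gram idea concrete, here is a minimal plain-Python sketch of what word and character n-grams look like. The helpers `word_ngrams` and `char_ngrams` are hypothetical names written for illustration; they only approximate what CountVectorizer does (its real tokenizer, for example, drops single-character words by default and has its own whitespace handling).

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return word n-grams of length min_n..max_n from lowercased text."""
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def char_ngrams(text, min_n=2, max_n=5):
    """Return character n-grams of length min_n..max_n from lowercased text."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


print(word_ngrams("Mobile application"))
print(char_ngrams("AI"))

# Character n-grams shared between a new input and a training example
shared = set(char_ngrams("web app")) & set(char_ngrams("Mobile application"))
print(sorted(shared))
```

Shared character n-grams such as "app" are what let a classifier relate an unseen string like "web app" to a training example like "Mobile application", even though no whole word matches.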

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion

# Training inputs: one business/technology string per row
X = np.array([['AI'],
              ['Artificial Intelligence'],
              ['VR'],
              ['Virtual Reality'],
              ['Mobile application'],
              ['Desktop softwares']])

# Target category for each row of X
y = np.array(['Artificial Intelligence', 'Artificial Intelligence',
              'Virtual Reality', 'Virtual Reality', 'Application', 'Application'])

pipeline = Pipeline(steps=[
    # Combine word n-gram and character n-gram features side by side
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1,2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2,5)))
    ])),
    # Classifier trained on the combined feature matrix
    ('lreg', LogisticRegression())
])

# CountVectorizer expects a 1-D sequence of strings, so flatten X
pipeline.fit(X.ravel(), y)

print(pipeline.predict(['web application', 'web app', 'dog', 'super intelligence']))

Predicts:

['Application' 'Application' 'Virtual Reality' 'Artificial Intelligence']
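Note that 'dog' is still assigned a category in the output above, because the classifier always returns one of the labels it was trained on. One possible way to handle this, sketched below, is to look at predict_proba and fall back to a placeholder label when the model is not confident. The helper `predict_with_threshold`, the "Unknown" label, and the 0.5 cutoff are all assumptions made for illustration; a suitable cutoff would have to be tuned on real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Same toy training data as in the answer above
texts = ['AI', 'Artificial Intelligence', 'VR', 'Virtual Reality',
         'Mobile application', 'Desktop softwares']
labels = ['Artificial Intelligence', 'Artificial Intelligence',
          'Virtual Reality', 'Virtual Reality', 'Application', 'Application']

pipeline = Pipeline(steps=[
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1, 2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2, 5))),
    ])),
    ('lreg', LogisticRegression()),
])
pipeline.fit(texts, labels)


def predict_with_threshold(model, inputs, threshold=0.5, fallback='Unknown'):
    """Return the top label per input, or `fallback` if its probability is below `threshold`."""
    probs = model.predict_proba(inputs)   # shape: (n_inputs, n_classes)
    best = probs.argmax(axis=1)           # index of the most likely class per row
    return [
        str(model.classes_[col]) if probs[row, col] >= threshold else fallback
        for row, col in enumerate(best)
    ]


queries = ['web application', 'web app', 'dog', 'super intelligence']
print(predict_with_threshold(pipeline, queries, threshold=0.5))
```

With only six training examples the probabilities are not very meaningful, so treat this as a pattern to adapt once more labeled data is available rather than as a reliable filter.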
