Vectorizing Text Features with CountVectorizer and the N-Gram Model

Learning objectives:

  1. Learn and understand the two concrete feature-scaling methods: min-max (interval) scaling and standardization
  2. Understand and use CountVectorizer and the N-Gram model to build vectorized representations of text features
  3. Grasp the basic ideas of linear regression and quadratic polynomial regression
  4. Master how to compute information entropy and mutual information
  5. Practice each of the above with Python programming exercises.

Learning content:

1: Use the CountVectorizer feature-extraction method and the N-Gram model in scikit-learn to build vectorized representations of documents
2: Use scikit-learn to apply the two feature-scaling methods (standardization and min-max scaling) and remove differences in scale between features. Requirement: run the experiment on both a small self-defined 2-D array and the Iris dataset.
3: Use scikit-learn to plot linear and quadratic regression curves and compare how well they fit (regression analysis of pizza price against pizza size)
4: Compute information entropy and mutual information (the data may be self-defined); the formulas involved are sketched right after this list
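As a quick reference for item 4, these are the standard textbook definitions of entropy and mutual information (added here for convenience; with the natural logarithm the values are in nats, which matches the code further below):

H(X) = -\sum_i p(x_i) \log p(x_i)

I(X; Y) = H(X) + H(Y) - H(X, Y)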


Experiment steps and code:

1:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Jobs was the chairman of Apple Inc., and he was very famous',
    'I like to use apple computer',
    'And I also like to eat apple'
]

# Plain bag-of-words term counts
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
print(" ")

# Show NLTK's English stop-word list, then vectorize again with stop words removed
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)
print(" ")
vectorizer = CountVectorizer(stop_words='english')
print("after stopwords removal:  ", vectorizer.fit_transform(corpus).todense())
print("after stopwords removal:  ", vectorizer.vocabulary_)
print(" ")

# Vectorize the documents in n-gram mode (unigrams and bigrams)
vectorizer = CountVectorizer(ngram_range=(1, 2))
print("N-gram mode:     ", vectorizer.fit_transform(corpus).todense())
print(" ")
print("N-gram mode:         ", vectorizer.vocabulary_)

Results: (output screenshot omitted)
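As a small optional follow-up (my own addition; the sentence is made up purely for illustration), the fitted bigram vectorizer can also encode unseen text against the vocabulary it has already learned, with out-of-vocabulary terms simply dropped:

# Optional check: encode a new, hypothetical sentence with the n-gram
# vocabulary learned above; terms not in the vocabulary are ignored.
new_doc = ['I like to eat apple computer']
print(vectorizer.transform(new_doc).todense())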

2:
from sklearn import preprocessing
import numpy as np

# A small self-defined 2-D array with one extreme value in the first column
X = np.array([[0, 0],
              [0, 0],
              [100, 1],
              [1, 1]])

# Standardize by hand: subtract the column mean and divide by the column standard deviation
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X1 = (X - X_mean) / X_std
print(X1)
print("")

# The same standardization with sklearn's preprocessing.scale
X_scale = preprocessing.scale(X)
print(X_scale)

(Output screenshot omitted)
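As an aside (my own addition, not one of the required steps), sklearn's StandardScaler performs the same z-score standardization but also stores the fitted per-column mean and standard deviation, so the transform can be reused on new data later:

from sklearn.preprocessing import StandardScaler

# Sketch of an alternative to preprocessing.scale: fit() stores the column
# means and standard deviations, transform() applies them.
scaler = StandardScaler()
print(scaler.fit_transform(X))      # should match X1 and X_scale above
print(scaler.mean_, scaler.scale_)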

from sklearn import datasets, preprocessing

# Standardize the Iris features (zero mean, unit variance per column)
iris = datasets.load_iris()
X_scale = preprocessing.scale(iris.data)
print(X_scale)

from sklearn.preprocessing import MinMaxScaler

# Min-max scaling of the same self-defined 2-D array into the [0, 1] interval
data = [[0, 0],
        [0, 0],
        [100, 1],
        [1, 1]]

scaler = MinMaxScaler()
print(scaler.fit(data))
print(scaler.transform(data))

# Min-max scaling of the Iris features
data = iris.data
scaler = MinMaxScaler()
print(scaler.fit(data))
print(scaler.transform(data))

(Output screenshots omitted)
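To make the min-max mapping explicit, it can also be reproduced by hand; the NumPy check below is my own addition and simply restates MinMaxScaler's default formula X' = (X - X_min) / (X_max - X_min):

import numpy as np

# Hand-rolled min-max scaling of the Iris features; should agree with the
# scaler.transform(data) output above.
data = np.asarray(iris.data, dtype=float)
data_minmax = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(np.allclose(data_minmax, scaler.transform(data)))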

3:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from matplotlib.font_manager import FontProperties

# SimSun font so that the Chinese title and axis labels render on Windows
font_set = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=20)


def runplt():
    # Set up the "pizza price vs. diameter" figure
    plt.figure()
    plt.title(u'披萨的价格和直径', fontproperties=font_set)
    plt.xlabel(u'直径(inch)', fontproperties=font_set)
    plt.ylabel(u'价格(美元)', fontproperties=font_set)
    plt.axis([0, 25, 0, 25])
    plt.grid(True)
    return plt


# Training and test data: pizza diameters (inches) and prices (dollars)
X_train = [[6], [8], [10], [14], [18]]
y_train = [[7], [9], [13], [17.5], [18]]
X_test = [[7], [9], [11], [15]]
y_test = [[8], [12], [15], [18]]

plt1 = runplt()
plt1.scatter(X_train, y_train, s=40)

# Fit a simple linear regression and draw the fitted line
xx = np.linspace(0, 26, 5)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
yy = regressor.predict(xx.reshape(xx.shape[0], 1))

plt.plot(xx, yy, label="linear equation")

# Expand each input to [1, x, x^2] and fit a quadratic regression on the expanded features
quadratic_featurizer = PolynomialFeatures(degree=2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)

regressor_quadratic = LinearRegression()
regressor_quadratic.fit(X_train_quadratic, y_train)

xx = np.linspace(0, 26, 100)
print(xx.shape)  # (100,)
print(xx.shape[0])  # 100
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
print(xx.reshape(xx.shape[0], 1).shape)  # (100, 1)

plt.plot(xx, regressor_quadratic.predict(xx_quadratic), 'r-', label="quadratic equation")
plt.legend(loc='upper left')
plt.show()

# Compare goodness of fit on the test set via R-squared
X_test_quadratic = quadratic_featurizer.transform(X_test)
print('linear equation  r-squared', regressor.score(X_test, y_test))
print('quadratic equation r-squared', regressor_quadratic.score(X_test_quadratic, y_test))

(Output screenshots omitted)
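As a cross-check of my own (not part of the original comparison), R-squared can be computed by hand as 1 - SS_res / SS_tot, which should reproduce what regressor.score reports for the linear model on the test set:

import numpy as np

# Manual R^2 for the plain linear model on the test set, as a sketch to
# confirm the value returned by regressor.score(X_test, y_test).
y_true = np.array(y_test).ravel()
y_pred = regressor.predict(X_test).ravel()
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print('manual linear r-squared:', 1 - ss_res / ss_tot)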

4:
import numpy as np
from math import log

print("Example 1")


# Entropy of a discrete probability distribution: H(X) = -sum(p * log p)
def calc_ent(x):
    ent = 0.0
    for p in x:
        ent -= p * np.log(p)
    return ent


x1 = np.array([0.4, 0.2, 0.2, 0.2])
x2 = np.array([1])
x3 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # uniform distribution over five outcomes
print("Entropy of x1:", calc_ent(x1))
print("Entropy of x2:", calc_ent(x2))
print("Entropy of x3:", calc_ent(x3))
print("")

print("例2")
def calcShannoEnt(dataSet):
    length, dataDict=float(len(dataSet)),{}
    for data in dataSet:
        try:dataDict[data]+=1
        except:dataDict[data] =1
        return sum([-d/length*log(d/length) for  d in  list(dataDict.values())])
print("x1的信息熵:",calcShannoEnt(['A','B','C','D','A']))
print("x2的信息熵:",calcShannoEnt(['A','A','A','A','A']))
print("x3的信息熵:",calcShannoEnt(['A','B','C','D','E']))

print("")
print("例3")
Ent_x4=calcShannoEnt(['3','4','5','5','3','2','2','6','6','1'])
Ent_x5=calcShannoEnt(['7','2','1','3','2','8','9','1','2','0'])
Ent_x4x5=calcShannoEnt(['37','42','51','53','32','28','29','61','62','10'])
MI_x4_x5=Ent_x4+Ent_x5+Ent_x4x5
print("x4和x5之间的互信息",MI_x4_x5)

(Output screenshot omitted)
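As an optional sanity check (my own addition), sklearn.metrics.mutual_info_score computes the mutual information of two label sequences directly, in nats, so it should agree with the H(x4) + H(x5) - H(x4, x5) value computed above:

from sklearn.metrics import mutual_info_score

# mutual_info_score takes the two label sequences and returns I(X;Y) in nats;
# it should match the manual entropy-based value printed above.
x4 = ['3', '4', '5', '5', '3', '2', '2', '6', '6', '1']
x5 = ['7', '2', '1', '3', '2', '8', '9', '1', '2', '0']
print("sklearn mutual_info_score:", mutual_info_score(x4, x5))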

