Python is currently one of the hottest programming languages: its applications are extremely broad, it shows up in almost every domain, and it offers wide career options.
For anyone working in the internet industry, then, learning Python is no longer optional but essential. The traditional approach of just reading books and watching videos rarely meets expectations: you learn fast and forget just as fast.
So a better way to get started with Python is to "begin with the end in mind" and let real cases drive an optimal learning path.
Case 1: Count Vector
(1) Purpose:
Count Vector is the most basic feature extraction method for text: a common class of numeric feature-counting techniques.
For each training text, it considers only the frequency with which each word occurs in that text.
In one sentence: count how many times each word appears in each document.
(2) Principle:
Suppose a corpus C contains D documents {d1, d2, ..., dD} and N distinct words in total. These N words form the dictionary of the raw input, from which we can build a matrix M of size D × N.
Suppose the corpus is:
- D1: He is a boy.
- D2: She is a girl, good girl.
Then we can build the following 2 × 7 matrix (rows are documents, columns are dictionary words):

|    | he | is | a | boy | she | girl | good |
|----|----|----|---|-----|-----|------|------|
| D1 | 1  | 1  | 1 | 1   | 0   | 0    | 0    |
| D2 | 0  | 1  | 1 | 0   | 1   | 2    | 1    |
(3) Practice:
✅ Method 1: hand-roll a Count Vector
texts = ["He is a boy", "She is a girl, good girl"]

# Build the vocabulary: every distinct word across all texts
word_set = set()
for text in texts:
    for word in text.strip().split(' '):
        word_set.add(word.strip(','))
print(word_set)

# Assign each word an integer id
word_id_dict = {}
for word_id, word in enumerate(word_set):
    word_id_dict[word.strip()] = word_id
print(word_id_dict)

# Map each text to the list of word ids it contains
res_list = []
for text in texts:
    t_list = []
    for word in text.strip().split(' '):
        word = word.strip(',')
        if word in word_id_dict:
            t_list.append(word_id_dict[word])
    res_list.append(t_list)
print(res_list)

# Accumulate the ids into one count vector per text
for res in res_list:
    result = [0] * len(word_set)
    for wordid in res:
        result[wordid] += 1
    print(result)
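Because word_set is an unordered set, the ids it yields (and hence the column order of the printed vectors) can differ between runs; up to that reordering, the two printed vectors match the rows of the 2 × 7 matrix above.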
✅ Method 2: reuse scikit-learn's ready-made API
sklearn.feature_extraction.text provides four text feature extraction classes (a sketch of how they relate follows this list):
- CountVectorizer (used in this case)
- TfidfVectorizer
- TfidfTransformer
- HashingVectorizer
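As a quick orientation, here is a minimal sketch of how the other three relate to CountVectorizer. This is an illustrative addition using the same toy corpus as below, not part of the original walkthrough:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer, HashingVectorizer

texts = ["he is a boy", "she is a girl, good girl"]

# TfidfVectorizer = CountVectorizer followed by TfidfTransformer, in one step
tfidf_direct = TfidfVectorizer().fit_transform(texts)

# The equivalent two-step pipeline: raw counts, then TF-IDF reweighting
counts = CountVectorizer().fit_transform(texts)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# HashingVectorizer is stateless: it hashes words straight to column indices
# and stores no vocabulary, so it scales to very large corpora
hashed = HashingVectorizer(n_features=8).transform(texts)

print(tfidf_direct.toarray().round(2))   # identical to tfidf_two_step
print(hashed.toarray().round(2))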
from sklearn.feature_extraction.text import CountVectorizer
texts = ["he is a boy", "she is a girl, good girl"]
cv = CountVectorizer()  # create the bag-of-words structure
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the following two lines:
# cv.fit(texts)
# cv_fit=cv.transform(texts)
print(cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
# ['boy' 'girl' 'good' 'he' 'is' 'she']  the dictionary built from the corpus
# ('a' is missing: the default token_pattern only keeps tokens of 2+ characters)
print(cv.vocabulary_)
# {'he': 3, 'is': 4, 'boy': 0, 'she': 5, 'girl': 1, 'good': 2}
# dict form -- key: word, value: that word's (feature's) index,
# which is also its column number in the term-frequency matrix
# Reference: https://blog.csdn.net/weixin_38278334/article/details/82320307
print(cv_fit)
# (0, 3) 1  -> (document 0, vocabulary index 3): term frequency 1
# (0, 4) 1
# (0, 0) 1
# (1, 4) 1
# (1, 5) 1
# (1, 1) 2
# (1, 2) 1
print(cv_fit.toarray())
# toarray() converts the sparse result into a dense matrix representation:
# [
# [1 0 0 1 1 0]
# [0 2 1 0 1 1]
# ]
print(cv_fit.toarray().sum(axis=0))  # total count of each word across all documents
# [1 2 1 1 2 1]
(Reference: the official documentation for sklearn.feature_extraction.text.CountVectorizer)
(4) Summary
The Count Vector method considers only term frequency (TF) and ignores how informative a word is across documents, which is one-sided. This motivates the TF-IDF method.
Case 2: TF-IDF
(1) Mini case: compute the similarity of two strings
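As a refresher before the code (these are the standard definitions; the original leaves them implicit): for a term t in document d within a corpus of N documents,
TF(t, d) = (occurrences of t in d) / (total words in d)
IDF(t) = log10(N / df(t)), where df(t) is the number of documents containing t
TF-IDF(t, d) = TF(t, d) × IDF(t)
The hand-rolled code below uses a smoothed variant, IDF(t) = log10((N + 10) / (df(t) + 1)), so that words occurring in every document do not collapse to a weight of exactly zero.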
(2) Practice
✅ Method 1: hand-roll it
import math
import pandas as pd

docA = "The cat sat on my face"
docB = "The dog sat on my bed"

# Bag of words for each document
bowA = docA.split(" ")
bowB = docB.split(" ")

# Shared vocabulary and per-document word counts
wordSet = set(bowA).union(set(bowB))
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1
# print(pd.DataFrame([wordDictA, wordDictB]))

def computeTF(wordDict, bow):
    # Term frequency: each word's count divided by the document length
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
# print(tfBowA)

def computeIDF(docList):
    # Smoothed inverse document frequency over a list of count dicts
    N = len(docList)
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1  # document frequency df(t)
    for word, val in idfDict.items():
        # smoothed IDF: log10((N + 10) / (df + 1))
        idfDict[word] = math.log10((N + 10) / (float(val) + 1))
    return idfDict

idfs = computeIDF([wordDictA, wordDictB])

def computeTFIDF(tfBow, idfs):
    # TF-IDF weight for every word in one document
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = idfs[word] * val
    return tfidf

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)
print(pd.DataFrame([tfidfBowA, tfidfBowB]))
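The mini case asked for a similarity score, so the two weight dicts still need to be compared. A minimal follow-up sketch using cosine similarity (this helper is an illustrative addition, not part of the original tutorial; it relies on both dicts sharing the same wordSet keys):

def cosine_similarity(vecA, vecB):
    # Both dicts are keyed by the same vocabulary, so we can pair values by key
    dot = sum(vecA[w] * vecB[w] for w in vecA)
    normA = math.sqrt(sum(v * v for v in vecA.values()))
    normB = math.sqrt(sum(v * v for v in vecB.values()))
    return dot / (normA * normB)

print(cosine_similarity(tfidfBowA, tfidfBowB))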
✅ Method 2: use an existing library (the Jieba toolkit)
# https://github.com/fxsjy/jieba
% git clone https://github.com/fxsjy/jieba.git
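Jieba is a Chinese word-segmentation library whose jieba.analyse module ships a ready-made TF-IDF keyword extractor. A minimal sketch (the sample sentence is the one from jieba's own README):

import jieba
import jieba.analyse

text = "我来到北京清华大学"

# Segment the sentence into words (default/accurate mode)
print(jieba.lcut(text))
# ['我', '来到', '北京', '清华大学']

# Extract the top-K keywords ranked by TF-IDF weight
for word, weight in jieba.analyse.extract_tags(text, topK=3, withWeight=True):
    print(word, weight)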