Python is currently one of the hottest programming languages: its applications are extremely broad, it shows up in almost every domain, and it offers wide career options.
For anyone working in the internet industry, then, learning Python is no longer optional but essential. The traditional approach of just reading books and watching videos rarely meets expectations: you learn fast and forget just as fast.
So a better way to get started with Python is to "begin with the end in mind" and let real cases drive an optimal learning path.
Case 1: Count Vector
(1) Purpose:
Count Vector is the most basic feature extraction method for text: a common class of numeric feature-counting techniques.
For each training text, it considers only the frequency with which each word occurs in that text.
In one sentence: count how many times each word appears in each document.
(2) Principle:
Suppose a corpus C contains D documents {d1, d2, ..., dD} and N distinct words in total. These N words form the dictionary of the raw input, from which we can build a matrix M of size D × N.
Suppose the corpus is:
- D1: He is a boy.
- D2: She is a girl, good girl.
Then we can build the following 2 × 7 matrix (rows are documents, columns are dictionary words):

|    | he | is | a | boy | she | girl | good |
|----|----|----|---|-----|-----|------|------|
| D1 | 1  | 1  | 1 | 1   | 0   | 0    | 0    |
| D2 | 0  | 1  | 1 | 0   | 1   | 2    | 1    |
(3) Practice:
✅ Method 1: hand-roll a Count Vector
texts = ["He is a boy", "She is a girl, good girl"]

# Build the vocabulary: every distinct word across all texts
word_set = set()
for text in texts:
    for word in text.strip().split(' '):
        word_set.add(word.strip(','))
print(word_set)

# Assign each word an integer id
word_id_dict = {}
for word_id, word in enumerate(word_set):
    word_id_dict[word.strip()] = word_id
print(word_id_dict)

# Map each text to the list of word ids it contains
res_list = []
for text in texts:
    t_list = []
    for word in text.strip().split(' '):
        word = word.strip(',')
        if word in word_id_dict:
            t_list.append(word_id_dict[word])
    res_list.append(t_list)
print(res_list)

# Accumulate the ids into one count vector per text
for res in res_list:
    result = [0] * len(word_set)
    for wordid in res:
        result[wordid] += 1
    print(result)
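Because word_set is an unordered set, the ids it yields (and hence the column order of the printed vectors) can differ between runs; up to that reordering, the two printed vectors match the rows of the 2 × 7 matrix above.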
✅ Method 2: reuse scikit-learn's ready-made API
sklearn.feature_extraction.text provides four text feature extraction classes (a sketch of how they relate follows this list):
- CountVectorizer (used in this case)
- TfidfVectorizer
- TfidfTransformer
- HashingVectorizer
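As a quick orientation, here is a minimal sketch of how the other three relate to CountVectorizer. This is an illustrative addition using the same toy corpus as below, not part of the original walkthrough:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer, HashingVectorizer

texts = ["he is a boy", "she is a girl, good girl"]

# TfidfVectorizer = CountVectorizer followed by TfidfTransformer, in one step
tfidf_direct = TfidfVectorizer().fit_transform(texts)

# The equivalent two-step pipeline: raw counts, then TF-IDF reweighting
counts = CountVectorizer().fit_transform(texts)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# HashingVectorizer is stateless: it hashes words straight to column indices
# and stores no vocabulary, so it scales to very large corpora
hashed = HashingVectorizer(n_features=8).transform(texts)

print(tfidf_direct.toarray().round(2))   # identical to tfidf_two_step
print(hashed.toarray().round(2))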
from sklearn.feature_extraction.text import CountVectorizer
texts = ["he is a boy", "she is a girl, good girl"]
cv = CountVectorizer()  # create the bag-of-words structure
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the following two lines:
# cv.fit(texts)
# cv_fit=cv.transform(texts)
print(cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
# ['boy' 'girl' 'good' 'he' 'is' 'she']  the dictionary built from the corpus
# ('a' is missing: the default token_pattern only keeps tokens of 2+ characters)
print(cv.vocabulary_)
# {'he': 3, 'is': 4, 'boy': 0, 'she': 5, 'girl': 1, 'good': 2}
# dict form -- key: word, value: that word's (feature's) index,
# which is also its column number in the term-frequency matrix
# Reference: https://blog.csdn.net/weixin_38278334/article/details/82320307
print(cv_fit)
# (0, 3) 1  -> (document 0, vocabulary index 3): term frequency 1
# (0, 4) 1
# (0, 0) 1
# (1, 4) 1
# (1, 5) 1
# (1, 1) 2
# (1, 2) 1
print(cv_fit.toarray())
# toarray() converts the sparse result into a dense matrix representation:
# [
# [1 0 0 1 1 0]
# [0 2 1 0 1 1]
# ]
print(cv_fit.toarray().sum(axis=0))  # total count of each word across all documents
# [1 2 1 1 2 1]
(Reference: the official documentation for sklearn.feature_extraction.text.CountVectorizer)
(4) Summary
The Count Vector method considers only term frequency (TF) and ignores how informative a word is across documents, which is one-sided. This motivates the TF-IDF method.
Case 2: TF-IDF
(1) Mini case: compute the similarity of two strings
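As a refresher before the code (these are the standard definitions; the original leaves them implicit): for a term t in document d within a corpus of N documents,
TF(t, d) = (occurrences of t in d) / (total words in d)
IDF(t) = log10(N / df(t)), where df(t) is the number of documents containing t
TF-IDF(t, d) = TF(t, d) × IDF(t)
The hand-rolled code below uses a smoothed variant, IDF(t) = log10((N + 10) / (df(t) + 1)), so that words occurring in every document do not collapse to a weight of exactly zero.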
(2) Practice
✅ Method 1: hand-roll it
import math
import pandas as pd

docA = "The cat sat on my face"
docB = "The dog sat on my bed"

# Bag of words for each document
bowA = docA.split(" ")
bowB = docB.split(" ")

# Shared vocabulary and per-document word counts
wordSet = set(bowA).union(set(bowB))
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1
# print(pd.DataFrame([wordDictA, wordDictB]))

def computeTF(wordDict, bow):
    # Term frequency: each word's count divided by the document length
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
# print(tfBowA)

def computeIDF(docList):
    # Smoothed inverse document frequency over a list of count dicts
    N = len(docList)
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1  # document frequency df(t)
    for word, val in idfDict.items():
        # smoothed IDF: log10((N + 10) / (df + 1))
        idfDict[word] = math.log10((N + 10) / (float(val) + 1))
    return idfDict

idfs = computeIDF([wordDictA, wordDictB])

def computeTFIDF(tfBow, idfs):
    # TF-IDF weight for every word in one document
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = idfs[word] * val
    return tfidf

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)
print(pd.DataFrame([tfidfBowA, tfidfBowB]))
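The mini case asked for a similarity score, so the two weight dicts still need to be compared. A minimal follow-up sketch using cosine similarity (this helper is an illustrative addition, not part of the original tutorial; it relies on both dicts sharing the same wordSet keys):

def cosine_similarity(vecA, vecB):
    # Both dicts are keyed by the same vocabulary, so we can pair values by key
    dot = sum(vecA[w] * vecB[w] for w in vecA)
    normA = math.sqrt(sum(v * v for v in vecA.values()))
    normB = math.sqrt(sum(v * v for v in vecB.values()))
    return dot / (normA * normB)

print(cosine_similarity(tfidfBowA, tfidfBowB))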
✅ Method 2: use an existing library (the Jieba toolkit)
# https://github.com/fxsjy/jieba
% git clone https://github.com/fxsjy/jieba.git
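Jieba is a Chinese word-segmentation library whose jieba.analyse module ships a ready-made TF-IDF keyword extractor. A minimal sketch (the sample sentence is the one from jieba's own README):

import jieba
import jieba.analyse

text = "我来到北京清华大学"

# Segment the sentence into words (default/accurate mode)
print(jieba.lcut(text))
# ['我', '来到', '北京', '清华大学']

# Extract the top-K keywords ranked by TF-IDF weight
for word, weight in jieba.analyse.extract_tags(text, topK=3, withWeight=True):
    print(word, weight)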