Text Similarity Matching - Task 3

Latest recommended article published on 2023-02-01 11:12:50

lauqasim

Categories: Deep Learning, python | Tags: deep learning, artificial intelligence

Article link: https://blog.csdn.net/try763738799/article/details/128827687

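This post shows how to extract statistical features from text with Python: text length, word count after segmentation, word differences, longest common substring length, and TF-IDF similarity. These features are important in text similarity tasks.

(Summary generated automatically by CSDN.)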

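Task 3: Text similarity (statistical features)
- Step 1: Compute text statistical features for query1 and query2
  - Text length of query1 and query2
  - Word count of query1 and query2
  - Word differences between query1 and query2
  - Length of the longest common substring of query1 and query2
  - TF-IDF similarity between query1 and query2
- Step 2: Given the similarity labels, which of the features above is the most discriminative?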

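Text statistical features are numeric values obtained by counting properties of a text, and they can be used to describe it. Basic text features include:

    Text length: the number of characters or words in the text
    Character frequency: how many times, or how often, each character occurs in the text
    Word frequency: how many times, or how often, each word occurs in the text
    Sentence length: the average length of the sentences in the text
    Sentence count: the number of sentences in the text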
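
As a rough illustration of the features listed above, the sketch below computes the character-level ones for a single string using only the standard library. The function name `basic_text_stats` and the set of sentence terminators are assumptions made for this example, not part of the original task code. Word frequency is left out because it needs a tokenizer such as jieba.

```python
import re
from collections import Counter


def basic_text_stats(text):
    """Return a few unsupervised, character-level statistics for one text."""
    # Split on common Chinese and English sentence terminators (an assumption;
    # real data may need a richer rule set)
    sentences = [s for s in re.split(r"[。！？!?]+", text) if s.strip()]
    avg_len = sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0
    return {
        "length": len(text),           # number of characters
        "char_freq": Counter(text),    # occurrences of each character
        "sentence_count": len(sentences),
        "avg_sentence_length": avg_len,
    }


stats = basic_text_stats("我手机丢了，我想换个手机")
print(stats["length"], stats["sentence_count"], stats["char_freq"].most_common(3))
```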

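All of these features are unsupervised, independent of language and model, and fast to compute. In Task 3 we use Python to compute the basic statistical features of similar and dissimilar text pairs.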
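
One simple way to compare similar and dissimilar pairs is to average a feature separately for each label. The sketch below does this with the standard library. The helper `mean_by_label` and the derived feature `len_diff` (absolute length difference) are hypothetical names introduced for this example. The six rows are the sample rows printed in step 1 of this post, which is far too few to draw any conclusion, so treat the numbers as a demonstration of the mechanics only.

```python
from collections import defaultdict

# (q1_length, q2_length, label) from the sample rows printed in step 1
rows = [
    (16, 15, 1), (12, 11, 1),   # train
    (11, 12, 1), (11, 14, 0),   # valid
    (9, 8, 0), (10, 11, 1),     # test
]


def mean_by_label(values, labels):
    """Average a feature separately for each label value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


labels = [label for _, _, label in rows]
# Hypothetical derived feature: absolute difference between the two lengths
len_diff = [abs(q1 - q2) for q1, q2, _ in rows]

print("q1_length:", mean_by_label([q1 for q1, _, _ in rows], labels))
print("len_diff:", mean_by_label(len_diff, labels))
```

With the DataFrames used in this post, `train.groupby('label')[['q1_length', 'q2_length']].mean()` should give the same kind of per-label summary. A feature whose per-label means differ clearly relative to its spread is a candidate answer for step 2.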

1. Append the sentence lengths of query1 and query2 to the dataset

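# train, valid and test are assumed to be pandas DataFrames that are already
# loaded and contain the columns query1 and query2
def text_len(data):
    # Character-level length of each query
    data['q1_length'] = data['query1'].apply(len)
    data['q2_length'] = data['query2'].apply(len)

text_len(train)
print(train[:2])
text_len(valid)
print(valid[:2])
text_len(test)
print(test[:2])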

   query1  query2  label  q1_length  q2_length
0  喜欢打篮球的男生喜欢什么样的女生  爱打篮球的男生喜欢什么样的女生  1  16  15
1  我手机丢了，我想换个手机  我想买个新手机，求推荐  1  12  11
   query1  query2  label  q1_length  q2_length
0  开初婚未育证明怎么弄？  初婚未育情况证明怎么开？  1  11  12
1  谁知道她是网络美女吗？  爱情这杯酒谁喝都会醉是什么歌  0  11  14
   query1  query2  label  q1_length  q2_length
0  谁有狂三这张高清的  这张高清图，谁有  0  9  8
1  英雄联盟什么英雄最好  英雄联盟最好英雄是什么  1  10  11

2. Append the token counts of query1 and query2 after word segmentation

import jieba

def text_jieba_count(data):
    # Number of tokens produced by jieba segmentation
    data['q1_count'] = data['query1'].apply(lambda x: len(jieba.lcut(x)))
    data['q2_count'] = data['query2'].apply(lambda x: len(jieba.lcut(x)))

3. Count the word differences between the texts

import jieba
import pandas as pd

def text_compare(data):
    # Tokenize both queries
    data['q1_words'] = data['query1'].apply(jieba.lcut)
    data['q2_words'] = data['query2'].apply(jieba.lcut)
    common_words, q1_other, q2_other = [], [], []
    for ls1, ls2 in zip(data['q1_words'], data['q2_words']):
        common = set(ls1) & set(ls2)
        common_words.append(list(common))
        # Keep the leftover words as lists, so the lengths below count words
        # rather than the characters of a space-joined string
        q1_other.append([w for w in ls1 if w not in common])
        q2_other.append([w for w in ls2 if w not in common])
    # Build each column in one step; this avoids chained data[col][i] = ... writes
    data['common_words'] = pd.Series(common_words, index=data.index)
    data['q1_other'] = pd.Series(q1_other, index=data.index)
    data['q2_other'] = pd.Series(q2_other, index=data.index)
    data['common_words_len'] = data['common_words'].apply(len)
    data['q1_other_len'] = data['q1_other'].apply(len)
    data['q2_other_len'] = data['q2_other'].apply(len)
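
The three counts above can also be folded into a single normalized score. The sketch below computes the Jaccard similarity of the two word sets, which is an addition for illustration and not something the original task asks for. The helper name `word_overlap` is hypothetical, and the token lists are copied from the jieba output printed in step 5 of this post, so the example runs without jieba installed.

```python
def word_overlap(words1, words2):
    """Summarize how two token lists overlap, ignoring repeated words."""
    set1, set2 = set(words1), set(words2)
    common = set1 & set2
    union = set1 | set2
    return {
        "common": len(common),
        "only_in_1": len(set1 - set2),
        "only_in_2": len(set2 - set1),
        # Jaccard similarity: shared words divided by all distinct words
        "jaccard": len(common) / len(union) if union else 0.0,
    }


q1 = ['喜欢', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生']
q2 = ['爱', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生']
print(word_overlap(q1, q2))
```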

4. Length of the longest common substring

def getLongestSameStr(str1, str2):
    # Work on the shorter string so that fewer windows need checking
    if len(str1) > len(str2):
        str1, str2 = str2, str1

    # Try window lengths from the full length of str1 down to 1
    for length in range(len(str1), 0, -1):
        # Collect every window of this length that also occurs in str2, so that
        # several equally long common substrings are all kept
        resList = []
        # The range includes the last window, which ends at the end of str1
        for start in range(len(str1) - length + 1):
            testStr = str1[start:start + length]
            if testStr in str2:
                resList.append(testStr)
        # The first length that yields a hit is the longest one
        if resList:
            return resList
    return []


def count_common(data):
    max_lens = []
    for s1, s2 in zip(data['query1'], data['query2']):
        coList = getLongestSameStr(s1, s2)
        print(coList)
        # default=0 covers pairs that share no character at all
        max_lens.append(max((len(x) for x in coList), default=0))
    # Assign the whole column at once instead of chained data[col][i] = ... writes
    data['common_words_max'] = max_lens

First two outputs on train:

['我想', '手机']
['大家觉得']
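
When only the length is needed, the window search above can be replaced by the standard dynamic-programming approach to the longest common substring, which runs in time proportional to the product of the two lengths. The sketch below is an alternative offered for comparison, not the method used in this post, and the function name `longest_common_substring_len` is an assumption.

```python
def longest_common_substring_len(s1, s2):
    """Length of the longest contiguous substring shared by s1 and s2."""
    # prev[j] holds the length of the longest common suffix of the prefix of s1
    # processed so far and s2[:j]
    prev = [0] * (len(s2) + 1)
    best = 0
    for ch1 in s1:
        curr = [0] * (len(s2) + 1)
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                # Extend the common suffix that ended one character earlier
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


print(longest_common_substring_len("我手机丢了，我想换个手机", "我想买个新手机，求推荐"))
```

For this pair the longest shared substrings are 我想 and 手机, so the function returns 2.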

5. TF-IDF similarity of the texts

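import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tf_idf_cosine(data):
    text1 = data['query1'].tolist()
    text2 = data['query2'].tolist()
    # Note: each column is joined into one long document, so the result is a
    # single similarity between the two columns, not one value per row
    words1 = jieba.lcut(' '.join(text1))
    words2 = jieba.lcut(' '.join(text2))
    print(words1, words2)
    corpus = [' '.join(words1), ' '.join(words2)]
    print('corpus', corpus)
    # Fit a TF-IDF model on the two documents (tokens are separated by spaces)
    cv = TfidfVectorizer(tokenizer=lambda s: s.split())
    cv.fit(corpus)
    vectors1 = cv.transform([corpus[0]]).toarray()
    vectors2 = cv.transform([corpus[1]]).toarray()
    print(vectors1)
    print(vectors2)
    # Cosine similarity between the two TF-IDF vectors
    print(cosine_similarity(vectors1, vectors2))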

['喜欢', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生', ' ', '我', '手机', '丢', '了', '，', '我想', '换个', '手机'] ['爱', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生', ' ', '我想', '买个', '新手机', '，', '求', '推荐']
corpus ['喜欢打篮球的男生喜欢什么样的女生我手机丢了，我想换个手机', '爱打篮球的男生喜欢什么样的女生我想买个新手机，求推荐']
[[0.25744981 0. 0.25744981 0.18317766 0.36635532 0.18317766
0.25744981 0.18317766 0.51489962 0.18317766 0.25744981 0.
0. 0. 0. 0.18317766 0.36635532 0.18317766]]
[[0. 0.30760228 0. 0.21886156 0.21886156 0.21886156
0. 0.21886156 0. 0.21886156 0. 0.30760228
0.30760228 0.30760228 0.30760228 0.21886156 0.43772311 0.21886156]]
[[0.48108657]]
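
The `tf_idf_cosine` function above joins each column into one document, so it yields a single score for the whole dataset. As a per-row feature, the similarity would instead be computed for each query pair. The sketch below shows one way to do that with the standard library only. It reimplements the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn's `TfidfVectorizer` uses by default, rather than calling scikit-learn itself. The helper names `fit_idf`, `tfidf_vector` and `cosine` are hypothetical, and the token lists are copied from the jieba output printed above.

```python
import math
from collections import Counter


def fit_idf(token_docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(token_docs)
    df = Counter(term for doc in token_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen during fitting are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Tokenized (query1, query2) pairs for the first two train rows
pairs = [
    (['喜欢', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生'],
     ['爱', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生']),
    (['我', '手机', '丢', '了', '，', '我想', '换个', '手机'],
     ['我想', '买个', '新手机', '，', '求', '推荐']),
]

# Fit IDF on every query, then score each pair separately
idf = fit_idf([tokens for pair in pairs for tokens in pair])
sims = [cosine(tfidf_vector(a, idf), tfidf_vector(b, idf)) for a, b in pairs]
print(sims)
```

Fitting the IDF weights on all queries while scoring each pair on its own keeps rare words informative. In practice the weights would be fitted on the full training set rather than on two rows.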