[NLP-code] Checking Character Coverage

Author: 深度学习视觉

Category: Machine Learning

Article link: https://blog.csdn.net/lucky_kai/article/details/104771594

WeChat official account: 深度学习视觉

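import operator
import re

import pandas as pd
from tqdm import tqdm

# Load every token in the vocabulary file (getTokenDict is defined further down)
dict_path = '../bertModel/vocab.txt'
token_dict = getTokenDict(dict_path)

# Build the sentences: one string per row, concatenating the two query columns
train_data_file = './tcdata/train.csv'
train_data = pd.read_csv(train_data_file)
sentences = train_data[['query1', 'query2']].apply(lambda x: x['query1'] + x['query2'], axis=1).values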

Related functions

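# Build a frequency dictionary over the text
def build_vocab(sentences):
    '''
    sentences: list of sentences, each a plain string such as "w1w2w3w4,w5w6,w7."
    return: dict mapping each character (key) to its frequency (value)
    '''
    vocab = {}
    for sentence in tqdm(sentences):
        # Iterating over a string yields single characters, so counts are per character
        for word in sentence:
            vocab[word] = vocab.get(word, 0) + 1
    return vocab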

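def check_coverage(vocab, embeddings_index):
    '''
    Measure how much of the text is covered by embeddings_index.
    return: list of (character, frequency) pairs that are not covered, most frequent first
    '''
    iv = {}   # in vocabulary
    oov = {}  # out of vocabulary
    k = 0     # total occurrences of covered characters
    i = 0     # total occurrences of uncovered characters
    for word in tqdm(vocab):
        try:
            # The character from the text is present in the embedding index
            iv[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(iv) / len(vocab)))
    print('Found embeddings for {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x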
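
To make the two printed percentages concrete, here is a minimal sketch on toy data. It is a compact, dependency-free variant (no tqdm, no printing), not the code above: the names build_vocab_counter and coverage_stats, the two sentences, and the seven-character token dictionary are all made up for illustration.

```python
from collections import Counter


def build_vocab_counter(sentences):
    """Count how often each character occurs across all sentences."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)  # a string is iterated character by character
    return vocab


def coverage_stats(vocab, token_dict):
    """Return (share of distinct characters covered,
               share of all character occurrences covered,
               uncovered (character, count) pairs, most frequent first)."""
    covered_types = sum(1 for ch in vocab if ch in token_dict)
    covered_count = sum(n for ch, n in vocab.items() if ch in token_dict)
    total_count = sum(vocab.values())
    oov = sorted(
        ((ch, n) for ch, n in vocab.items() if ch not in token_dict),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return covered_types / len(vocab), covered_count / total_count, oov


# Hypothetical toy inputs, not taken from the real train.csv or vocab.txt
toy_sentences = ["花呗怎么还款", "花呗如何还钱"]
toy_token_dict = {ch: idx for idx, ch in enumerate("花呗怎么还款如")}

vocab = build_vocab_counter(toy_sentences)
vocab_cov, text_cov, oov = coverage_stats(vocab, toy_token_dict)
print("{:.2%} of vocab, {:.2%} of all text".format(vocab_cov, text_cov))
print(oov)
```

Here 7 of the 9 distinct characters are in the dictionary, but they account for 10 of the 12 characters in the text, which is why the "all text" figure is the higher of the two.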

def getTokenDict(dict_path, encoding='utf-8'):
    '''
    dict_path: vocabulary file with one token per line.
    return: dict mapping each token to its zero-based line index
    '''
    token_dict = {}
    with open(dict_path, encoding=encoding) as reader:
        for line in reader:
            token = line.strip()
            token_dict[token] = len(token_dict)

    return token_dict
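
The loader maps each line of the vocabulary file to its position in the file. The sketch below reproduces that logic on a temporary four-line file so the result is easy to inspect; the function name load_token_dict and the file contents are assumptions for illustration, not the real vocab.txt.

```python
import os
import tempfile


def load_token_dict(dict_path, encoding="utf-8"):
    """Read one token per line and map it to its zero-based line index."""
    token_dict = {}
    with open(dict_path, encoding=encoding) as reader:
        for line in reader:
            token = line.strip()
            token_dict[token] = len(token_dict)
    return token_dict


# Write a tiny, made-up vocabulary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "vocab.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[UNK]\n花\n呗\n")
    demo_dict = load_token_dict(demo_path)

print(demo_dict)
```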

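def clean_numbers(x):
    '''
    Replace runs of digits with '#' placeholders of matching length
    (runs of five or more digits become '#####'); single digits are left unchanged.
    '''
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x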
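
A short sketch of what the digit replacement does to a string. The function is repeated so the snippet runs on its own, and the sample strings are made up. Whether the '#' placeholders themselves appear in a given vocabulary is an assumption to verify before relying on this step to improve coverage.

```python
import re


def clean_numbers(x):
    """Replace digit runs with '#' placeholders; order matters, longest runs first."""
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x


# Hypothetical sample strings
cleaned = clean_numbers("订单123456于2020年发货，共12件")
single = clean_numbers("第3名")
print(cleaned)
print(single)
```

The six-digit run collapses to five '#' characters, the four-digit and two-digit runs keep their length, and the lone digit passes through untouched.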
