Daily Summary (2022/06/01)

yodala

Category: Daily Summary. Tags: natural language processing, artificial intelligence, nlp

Article link: https://blog.csdn.net/yodala/article/details/125090530

Text Classification

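The news corpus maps category codes (directory names) to topics as follows, ten categories in total:

C000007 Automobiles
C000008 Finance
C000010 IT
C000013 Health
C000014 Sports
C000016 Travel
C000020 Education
C000022 Recruitment
C000023 Culture
C000024 Military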

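The Data folder contains a training set (train) and a test set (test). The train directory holds documents that are already classified, 6000 per category. The test directory holds 20000 documents drawn from all categories, which participants must assign to categories automatically with an algorithm of their own design.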
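A minimal sketch of reading the training set into parallel lists of texts and labels. It assumes that each category is a subdirectory of train named by its code (for example train/C000007) with one plain-text file per document. That layout, the helper name `load_corpus`, and the default encoding are assumptions for illustration, not details given above.

```python
import os

# Category codes and topics, taken from the list above.
CATEGORIES = {
    "C000007": "Automobiles",
    "C000008": "Finance",
    "C000010": "IT",
    "C000013": "Health",
    "C000014": "Sports",
    "C000016": "Travel",
    "C000020": "Education",
    "C000022": "Recruitment",
    "C000023": "Culture",
    "C000024": "Military",
}


def load_corpus(root, encoding="utf-8"):
    """Hypothetical helper: read root/<code>/<file> into (texts, labels)."""
    texts, labels = [], []
    for code in sorted(CATEGORIES):
        class_dir = os.path.join(root, code)
        if not os.path.isdir(class_dir):
            continue  # skip categories that are not present on disk
        for name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, name)
            if not os.path.isfile(path):
                continue
            # errors="ignore" drops undecodable bytes; the real encoding
            # of the corpus is an assumption and may need to be changed.
            with open(path, "r", encoding=encoding, errors="ignore") as f:
                texts.append(f.read())
            labels.append(code)
    return texts, labels
```

Calling `load_corpus("Data/train")` would then return one text and one category code per document, ready to be passed to whichever feature extractor and classifier are chosen.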

Scoring criteria

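$F_1=\frac{2PR}{P+R}$
where $P$ is precision and $R$ is recall.
The $F_1$ value is computed separately for each category, and the final score is the mean of the $F_1$ values over the 10 categories.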
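The scoring rule can be sketched in plain Python as follows, with precision and recall computed per category and the per-category F1 values averaged. The function name `macro_f1` is illustrative, and treating an undefined F1 (no predictions and no true documents for a category) as 0 is an assumption; the original scoring script may handle that case differently.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=None):
    """Mean of the per-category F1 values (macro-averaged F1)."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true category was t
            fn[t] += 1  # a document of category t was missed
    scores = []
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        # Assumption: an undefined F1 counts as 0.
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For the competition, `labels` would be the ten category codes, so that a category the classifier never predicts still contributes a 0 to the average.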

word2vec model

# -*- coding: utf-8 -*-
"""
A word2vec text file starts with a header line giving the number of vectors
(one per token) and the number of dimensions. This lets gensim allocate
memory up front when loading the model; more vectors and more dimensions
mean more memory is held. GloVe files have no such header, so this script
inserts that line at the top of a GloVe embeddings file.
"""

import shutil
from sys import platform

import gensim
import smart_open


def prepend_line(infile, outfile, line):
    """
    Write `line` to outfile, then bulk-copy infile after it with shutil.
    (source: http://stackoverflow.com/a/10850588/610569)
    """
    with open(infile, 'r', encoding='UTF-8') as old:
        with open(outfile, 'w', encoding='UTF-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)

def prepend_slow(infile, outfile, line):
    """
    Slower way to prepend the line: re-create the input file line by line.
    """
    with open(infile, 'r', encoding='UTF-8') as fin:
        with open(outfile, 'w', encoding='UTF-8') as fout:
            fout.write(str(line) + "\n")
            # Separate loop name so the `line` argument is not shadowed.
            for row in fin:
                fout.write(row)

def get_lines(glove_file_name):
    """Return the number of vectors and dimensions in a file in GloVe format."""
    # smart_open.open replaces the deprecated smart_open.smart_open.
    with smart_open.open(glove_file_name, 'r', encoding='UTF-8') as f:
        num_lines = sum(1 for _ in f)
    with smart_open.open(glove_file_name, 'r', encoding='UTF-8') as f:
        # Each line is "<token> <v1> ... <vN>", so drop one field for the token.
        num_dims = len(f.readline().split()) - 1
    return num_lines, num_dims

# Input: GloVe Model File
# More models can be downloaded from http://nlp.stanford.edu/projects/glove/
glove_file="glove.6B.300d.txt"

num_lines, dims = get_lines(glove_file)

# Output: Gensim Model text format.
gensim_file='glove_model2.txt'
gensim_first_line = "{} {}".format(num_lines, dims)

# Prepends the line.
# Both helpers are portable; the bulk copy is simply preferred on Linux.
if platform.startswith("linux"):
    prepend_line(glove_file, gensim_file, gensim_first_line)
else:
    prepend_slow(glove_file, gensim_file, gensim_first_line)

# Demo: Loads the newly created gensim_file (glove_model2.txt) into the gensim API.
model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False)  # GloVe Model

print(model.most_similar(positive=['australia'], topn=10))
print(model.similarity('woman', 'man'))
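
To make the file-format change concrete, here is a small standard-library sketch that applies the same header-prepending idea to a toy two-word GloVe-style file in a temporary directory. The helper name `glove_to_word2vec` and the toy vectors are made up for illustration. It reads the whole file into memory, which is acceptable only for small files.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, out_path):
    """Hypothetical helper: copy a GloVe text file, adding a 'count dims' header."""
    with open(glove_path, "r", encoding="utf-8") as fin:
        lines = fin.readlines()  # toy-sized input; stream large files instead
    # Each line is "<token> <v1> ... <vN>", so N = number of fields - 1.
    num_dims = len(lines[0].split()) - 1
    with open(out_path, "w", encoding="utf-8") as fout:
        fout.write("{} {}\n".format(len(lines), num_dims))
        fout.writelines(lines)
    return len(lines), num_dims


with tempfile.TemporaryDirectory() as tmp:
    glove_path = os.path.join(tmp, "toy_glove.txt")
    out_path = os.path.join(tmp, "toy_word2vec.txt")
    with open(glove_path, "w", encoding="utf-8") as f:
        f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    shape = glove_to_word2vec(glove_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        converted = f.read()

print(shape)      # number of vectors and dimensions
print(converted)  # header line followed by the original vectors
```

The first line of the converted file is the `2 3` header that `load_word2vec_format` expects. Recent gensim releases (4.0 and later) also accept `no_header=True` in `load_word2vec_format`, which may make the conversion step unnecessary; that is worth checking against the installed version.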