每日总结(2022/06/01)

文本分类

新闻语料中类别与目录的对应关系如下,共十大类别:

  • C000007 汽车
  • C000008 财经
  • C000010 IT
  • C000013 健康
  • C000014 体育
  • C000016 旅游
  • C000020 教育
  • C000022 招聘
  • C000023 文化
  • C000024 军事

在Data文件夹中有训练数据集(train)及测试数据集(test),其中train目录中是已经分类好的文档,每个类别中有6000个文档,而test目录中共包含20000个所有类别的文档,需要参赛者设计算法进行自动归类。

评分标准

F 1 = 2 P ∗ R R + P F1=\frac{2P*R}{R+P} F1=R+P2PR
其中 R R R准确率, P P P召回率。
对于每个类别分别单独计算其 F 1 F1 F1值,然后求10个类别的 F 1 F1 F1平均值作为最终评分结果。

word2vec模型

# -*- coding: utf-8 -*-
"""
word2vec embeddings start with a line with the number of lines (tokens?) and 
the number of dimensions of the file. This allows gensim to allocate memory 
accordingly for querying the model. Larger dimensions mean larger memory is 
held captive. Accordingly, this line has to be inserted into the GloVe 
embeddings file.
"""

import os
import shutil
import smart_open
from sys import platform

import gensim


def prepend_line(infile, outfile, line):
	""" 
	Function use to prepend lines using bash utilities in Linux. 
	(source: http://stackoverflow.com/a/10850588/610569)
	"""
	with open(infile, 'r', encoding='UTF-8') as old:
		with open(outfile, 'w', encoding='UTF-8') as new:
			new.write(str(line) + "\n")
			shutil.copyfileobj(old, new)

def prepend_slow(infile, outfile, line):
	"""
	Slower way to prepend the line by re-creating the inputfile.
	"""
	with open(infile, 'r', encoding='UTF-8') as fin:
		with open(outfile, 'w', encoding='UTF-8') as fout:
			fout.write(line + "\n")
			for line in fin:
				fout.write(line)

def get_lines(glove_file_name):
    """Return the number of vectors and dimensions in a file in GloVe format."""
    with smart_open.smart_open(glove_file_name, 'r', encoding='UTF-8') as f:
        num_lines = sum(1 for line in f)
    with smart_open.smart_open(glove_file_name, 'r', encoding='UTF-8') as f:
        num_dims = len(f.readline().split()) - 1
    return num_lines, num_dims
	
# Input: GloVe Model File
# More models can be downloaded from http://nlp.stanford.edu/projects/glove/
glove_file="glove.6B.300d.txt"

num_lines, dims = get_lines(glove_file)

# Output: Gensim Model text format.
gensim_file='glove_model2.txt'
gensim_first_line = "{} {}".format(num_lines, dims)

# Prepends the line.
if platform == "linux" or platform == "linux2":
	prepend_line(glove_file, gensim_file, gensim_first_line)
else:
	prepend_slow(glove_file, gensim_file, gensim_first_line)

# Demo: Loads the newly created glove_model.txt into gensim API.
model=gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False) #GloVe Model

print(model.most_similar(positive=['australia'], topn=10))
print(model.similarity('woman', 'man'))

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值