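Text Classification
The categories in the news corpus map to directories as follows, ten categories in total:
- C000007 Automobiles
- C000008 Finance
- C000010 IT
- C000013 Health
- C000014 Sports
- C000016 Travel
- C000020 Education
- C000022 Recruitment
- C000023 Culture
- C000024 Military
The Data folder contains a training set (train) and a test set (test). The train directory holds documents that are already classified, 6000 per category. The test directory holds 20000 documents drawn from all categories, which participants must classify automatically with an algorithm of their own design.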
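A minimal sketch of reading the training set described above. It assumes that each category code is a sub-directory of Data/train holding one text file per document; that layout, the file encoding, and the helper name `load_labeled_docs` are assumptions for illustration and are not stated in the source.

```python
import os

# Category codes and names taken from the list above.
CATEGORIES = {
    "C000007": "Automobiles",
    "C000008": "Finance",
    "C000010": "IT",
    "C000013": "Health",
    "C000014": "Sports",
    "C000016": "Travel",
    "C000020": "Education",
    "C000022": "Recruitment",
    "C000023": "Culture",
    "C000024": "Military",
}


def load_labeled_docs(root, encoding="utf-8"):
    """Return (texts, labels) read from root/<category_code>/<file>.

    Hypothetical helper: the directory layout and the default encoding
    are assumptions, adjust them to match the actual corpus.
    """
    texts, labels = [], []
    for code in sorted(CATEGORIES):
        cat_dir = os.path.join(root, code)
        if not os.path.isdir(cat_dir):
            continue  # skip categories that are missing on disk
        for name in sorted(os.listdir(cat_dir)):
            path = os.path.join(cat_dir, name)
            # errors="ignore" drops bytes that do not decode cleanly.
            with open(path, "r", encoding=encoding, errors="ignore") as f:
                texts.append(f.read())
            labels.append(code)
    return texts, labels
```

Calling `load_labeled_docs(os.path.join("Data", "train"))` would then return two parallel lists, one of document texts and one of category codes.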
Scoring criteria
$$F1=\frac{2P*R}{R+P}$$
where P is the precision and R is the recall.
The F1 value is computed separately for each category, and the mean of the 10 per-category F1 values is the final score.
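The scoring rule can be sketched as follows: count true positives, false positives, and false negatives per category, turn them into per-category precision and recall, apply the F1 formula, and average over the categories. The function name `macro_f1` is hypothetical, and scoring a category as 0 when its precision or recall has a zero denominator is an assumption; the source does not say how the organizers handle that case.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return (mean F1 over labels, dict of per-label F1)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        # Assumption: an undefined ratio is treated as 0.
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return sum(scores.values()) / len(labels), scores


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
mean_f1, per_label = macro_f1(y_true, y_pred, ["a", "b"])
```

In this toy case label `a` has precision 1 and recall 1/2, label `b` has precision 2/3 and recall 1, so the per-label F1 values are 2/3 and 4/5.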
word2vec model
# -*- coding: utf-8 -*-
"""
A word2vec text file starts with a header line giving the number of vectors
(one per token) and the number of dimensions. gensim reads this header to
allocate memory before loading, and larger dimensions need more memory.
GloVe files lack this header, so it has to be prepended before gensim can
load them as word2vec text.
"""
import shutil
from sys import platform

import gensim
import smart_open

def prepend_line(infile, outfile, line):
    """
    Write `line` to outfile, then copy infile after it in bulk
    with shutil.copyfileobj.
    (source: http://stackoverflow.com/a/10850588/610569)
    """
    with open(infile, 'r', encoding='UTF-8') as old:
        with open(outfile, 'w', encoding='UTF-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)

def prepend_slow(infile, outfile, line):
    """
    Slower way to prepend the line: copy the input file row by row.
    """
    with open(infile, 'r', encoding='UTF-8') as fin:
        with open(outfile, 'w', encoding='UTF-8') as fout:
            fout.write(str(line) + "\n")
            # Use a separate name so the `line` argument is not shadowed.
            for row in fin:
                fout.write(row)

def get_lines(glove_file_name):
    """Return the number of vectors and dimensions in a file in GloVe format."""
    # smart_open.open replaces the deprecated smart_open.smart_open.
    with smart_open.open(glove_file_name, 'r', encoding='UTF-8') as f:
        num_lines = sum(1 for _ in f)
    with smart_open.open(glove_file_name, 'r', encoding='UTF-8') as f:
        # Each row is a token followed by its vector components.
        num_dims = len(f.readline().split()) - 1
    return num_lines, num_dims

# Input: GloVe model file.
# More models can be downloaded from http://nlp.stanford.edu/projects/glove/
glove_file = "glove.6B.300d.txt"
num_lines, dims = get_lines(glove_file)

# Output: gensim model in word2vec text format.
gensim_file = 'glove_model2.txt'
gensim_first_line = "{} {}".format(num_lines, dims)

# Prepend the header line.
if platform == "linux" or platform == "linux2":
    prepend_line(glove_file, gensim_file, gensim_first_line)
else:
    prepend_slow(glove_file, gensim_file, gensim_first_line)

# Demo: load the newly created glove_model2.txt through the gensim API.
model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False)  # GloVe model
print(model.most_similar(positive=['australia'], topn=10))
print(model.similarity('woman', 'man'))
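To see what the conversion produces without downloading a GloVe model, the sketch below applies the same idea to a two-word toy file. The helper name `add_word2vec_header`, the file names, and the vectors are made up for illustration; the helper also reads the whole file into memory, which is fine for a demo but not for a multi-gigabyte embedding file.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe-format text file to out_path with a 'count dims' header.

    Hypothetical helper for illustration; returns (num_vectors, num_dims).
    """
    with open(glove_path, "r", encoding="utf-8") as f:
        rows = f.readlines()
    # Each row is a token followed by its vector components.
    num_dims = len(rows[0].split()) - 1 if rows else 0
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("{} {}\n".format(len(rows), num_dims))
        out.writelines(rows)
    return len(rows), num_dims


# Build a toy GloVe-format file: two tokens, three dimensions each.
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "tiny_glove.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\n")
    f.write("dog 0.4 0.5 0.6\n")

out_path = os.path.join(tmp_dir, "tiny_w2v.txt")
header = add_word2vec_header(glove_path, out_path)

with open(out_path, encoding="utf-8") as f:
    first_line = f.readline().strip()
print(first_line)  # 2 3
```

The resulting file is the kind of input `load_word2vec_format(..., binary=False)` expects. gensim 4.0 and later also accept `no_header=True` in `load_word2vec_format`, which may make the conversion step unnecessary; check the documentation for the installed version.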