[NLP Primer] Competition 1: News Text Classification - Task 5: Text Classification with Deep Learning, Part 2-1: Word2Vec

This task is scheduled over four days.

The competition task

Installing PyTorch

Running the code

  1. Import the packages
import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# set the seed for every RNG in play (Python, NumPy, PyTorch CPU and CUDA)
# so the run is reproducible
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
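
The seed block above is there for reproducibility: Python's random module, NumPy, and PyTorch (CPU and CUDA) each keep their own random-number generator, so each has to be seeded separately. A minimal sketch of the effect, using NumPy only:

import numpy as np

np.random.seed(666)
a = np.random.rand(3)   # first draws after seeding
np.random.seed(666)
b = np.random.rand(3)   # re-seeding replays the exact same draws
assert (a == b).all()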
  2. Read a small sample of the data
import pandas as pd

# the data will be split into 10 folds below
fold_num = 10
data_file = './data/1/train_set.csv'

train_set = pd.read_csv(data_file, sep='\t', nrows=1000)
train_set
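
Since the fold split below is stratified by label, it is worth checking the class distribution first. A quick look at the 1000-row sample (value_counts is standard pandas API):

# per-class counts of the sample; classes are typically imbalanced in this dataset
print(train_set['label'].value_counts())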
  3. Split the data into folds
def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]

    total = len(labels)

    index = list(range(total))
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    # group sample indices by label
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)
            
    # distribute each label's samples across the folds as evenly as possible
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        cur = 0
        for i in range(fold_num):
            # the first `other` folds each take one extra sample
            cur_batch_size = batch_size + 1 if i < other else batch_size
            all_index[i].extend(data[cur: cur + cur_batch_size])
            cur += cur_batch_size

    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        if num > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # shuffle
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)
        
    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

    return fold_data


fold_data = all_data2fold(10)
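
To confirm the stratification worked, you can compare label counts across folds; each fold should have roughly the same label mix. A minimal check against the fold_data structure returned above:

from collections import Counter

for i, fold in enumerate(fold_data):
    counts = Counter(fold['label'])          # label -> count within this fold
    print('fold %d:' % i, sorted(counts.items()))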
  4. Build the training data for Word2Vec
# build the training data for Word2Vec: folds 0-8 train the embeddings,
# fold 9 is held out
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
    
logging.info('Total %d docs.' % len(train_texts))
  5. Train and save the model
from gensim.models.word2vec import Word2Vec

logging.info('Start training...')

num_features = 100     # word vector dimensionality
num_workers = 8        # number of worker threads

train_texts = [text.split() for text in train_texts]
model = Word2Vec(train_texts, workers=num_workers, size=num_features)  # gensim < 4; use vector_size= in gensim >= 4
model.init_sims(replace=True)  # L2-normalize the vectors in place (deprecated in gensim >= 4)

# save model
model.save("./data/1/word2vec.bin")
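
A quick sanity check before moving on: tokens in this corpus are anonymized integer IDs, so pick any ID from the vocabulary and look at its nearest neighbors by cosine similarity ('3750' below is only an illustrative token, not guaranteed to be in your vocabulary):

# most_similar is standard gensim API; guard against a missing token
if '3750' in model.wv:
    print(model.wv.most_similar('3750', topn=5))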
  6. Load the model and convert the format
# load model
model = Word2Vec.load("./data/1/word2vec.bin")

# convert format
model.wv.save_word2vec_format('./data/1/word2vec.txt', binary=False)
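
Downstream code usually needs only the vectors, not the trainable model, and the text file written above can be reloaded directly with gensim's KeyedVectors:

from gensim.models import KeyedVectors

# load the exported vectors; no training state is kept, which saves memory
wv = KeyedVectors.load_word2vec_format('./data/1/word2vec.txt', binary=False)
print(wv.vector_size)   # 100, matching num_features above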
  7. Run on the full dataset
# run on the full dataset this time, not the 1000-row sample

train_set = pd.read_csv(data_file, sep='\t')
train_set

fold_data = all_data2fold(10, 250000)  # num exceeds the 200,000 training rows, so all of them are used


# build train data for word2vec
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
    
logging.info('Total %d docs.' % len(train_texts))

from gensim.models.word2vec import Word2Vec

logging.info('Start training...')

num_features = 100     # word vector dimensionality
num_workers = 16       # number of worker threads

train_texts = [text.split() for text in train_texts]
model = Word2Vec(train_texts, workers=num_workers, size=num_features)  # gensim < 4
model.init_sims(replace=True)  # L2-normalize the vectors in place

# save model
model.save("./data/1/word2vec_full.bin")

# load model
model = Word2Vec.load("./data/1/word2vec_full.bin")

# convert format
model.wv.save_word2vec_format('./data/1/word2vec_full.txt', binary=False)
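
These embeddings exist to feed the deep-learning classifiers in the later tasks. A minimal sketch of turning word2vec_full.txt into a frozen PyTorch embedding layer; reserving row 0 for padding/unknown tokens is a convention assumed here, not part of the original code:

import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('./data/1/word2vec_full.txt', binary=False)

# row 0 is reserved for padding/unknown tokens; rows 1.. follow gensim's vocab order
weights = np.zeros((len(wv.index2word) + 1, wv.vector_size), dtype=np.float32)
for i, word in enumerate(wv.index2word, start=1):   # gensim < 4; use index_to_key in gensim >= 4
    weights[i] = wv[word]

embedding = nn.Embedding.from_pretrained(torch.from_numpy(weights), padding_idx=0)
word2id = {w: i for i, w in enumerate(wv.index2word, start=1)}   # token -> embedding row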
# Log output
2020-07-31 13:20:18,167 INFO: Fold lens [20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000]
2020-07-31 13:20:18,324 INFO: Total 180000 docs.
2020-07-31 13:20:18,325 INFO: Start training...
2020-07-31 13:21:12,556 INFO: collecting all words and their counts
2020-07-31 13:21:12,583 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 13:21:14,992 INFO: PROGRESS: at sentence #10000, processed 9130106 words, keeping 5270 word types
...
2020-07-31 13:21:47,425 INFO: PROGRESS: at sentence #170000, processed 154309124 words, keeping 6797 word types
2020-07-31 13:21:49,944 INFO: collected 6819 word types from a corpus of 163437638 raw words and 180000 sentences
2020-07-31 13:21:49,952 INFO: Loading a fresh vocabulary
2020-07-31 13:21:50,036 INFO: effective_min_count=5 retains 5981 unique words (87% of original 6819, drops 838)
2020-07-31 13:21:50,036 INFO: effective_min_count=5 leaves 163435977 word corpus (99% of original 163437638, drops 1661)
2020-07-31 13:21:50,060 INFO: deleting the raw counts dictionary of 6819 items
2020-07-31 13:21:50,070 INFO: sample=0.001 downsamples 62 most-common words
2020-07-31 13:21:50,070 INFO: downsampling leaves estimated 141079171 word corpus (86.3% of prior 163435977)
2020-07-31 13:21:50,087 INFO: estimated required memory for 5981 words and 100 dimensions: 7775300 bytes
2020-07-31 13:21:50,088 INFO: resetting layer weights
2020-07-31 13:21:51,172 INFO: training model with 16 workers on 5981 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 13:21:52,213 INFO: EPOCH 1 - PROGRESS: at 1.15% examples, 1632418 words/s, in_qsize 31, out_qsize 0
2020-07-31 13:21:53,218 INFO: EPOCH 1 - PROGRESS: at 2.33% examples, 1632715 words/s, in_qsize 30, out_qsize 1
...