[NLP Primer] Competition 1: News Text Classification - Task 5: Text Classification with Deep Learning, Part 2-1: Word2Vec

This task is scheduled over four days.

The competition task

Installing PyTorch

Running the code

  1. Import the packages
import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# set the seed for every RNG in play (Python, NumPy, PyTorch CPU and CUDA)
# so the run is reproducible
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
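
The seed block above is there for reproducibility: Python's random module, NumPy, and PyTorch (CPU and CUDA) each keep their own random-number generator, so each has to be seeded separately. A minimal sketch of the effect, using NumPy only:

import numpy as np

np.random.seed(666)
a = np.random.rand(3)   # first draws after seeding
np.random.seed(666)
b = np.random.rand(3)   # re-seeding replays the exact same draws
assert (a == b).all()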
  2. Read a small sample of the data
import pandas as pd

# the data will be split into 10 folds below
fold_num = 10
data_file = './data/1/train_set.csv'

train_set = pd.read_csv(data_file, sep='\t', nrows=1000)
train_set
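
Since the fold split below is stratified by label, it is worth checking the class distribution first. A quick look at the 1000-row sample (value_counts is standard pandas API):

# per-class counts of the sample; classes are typically imbalanced in this dataset
print(train_set['label'].value_counts())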
  3. Split the data into folds
def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]

    total = len(labels)

    index = list(range(total))
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    # group sample indices by label
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)
            
    # distribute each label's samples across the folds as evenly as possible
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        cur = 0
        for i in range(fold_num):
            # the first `other` folds each take one extra sample
            cur_batch_size = batch_size + 1 if i < other else batch_size
            all_index[i].extend(data[cur: cur + cur_batch_size])
            cur += cur_batch_size

    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        if num > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # shuffle
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)
        
    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

    return fold_data


fold_data = all_data2fold(10)
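
To confirm the stratification worked, you can compare label counts across folds; each fold should have roughly the same label mix. A minimal check against the fold_data structure returned above:

from collections import Counter

for i, fold in enumerate(fold_data):
    counts = Counter(fold['label'])          # label -> count within this fold
    print('fold %d:' % i, sorted(counts.items()))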
  4. Build the training data for Word2Vec
# build the training data for Word2Vec: folds 0-8 train the embeddings,
# fold 9 is held out
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
    
logging.info('Total %d docs.' % len(train_texts))
  5. Train and save the model
from gensim.models.word2vec import Word2Vec

logging.info('Start training...')

num_features = 100     # word vector dimensionality
num_workers = 8        # number of worker threads

train_texts = [text.split() for text in train_texts]
model = Word2Vec(train_texts, workers=num_workers, size=num_features)  # gensim < 4; use vector_size= in gensim >= 4
model.init_sims(replace=True)  # L2-normalize the vectors in place (deprecated in gensim >= 4)

# save model
model.save("./data/1/word2vec.bin")
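
A quick sanity check before moving on: tokens in this corpus are anonymized integer IDs, so pick any ID from the vocabulary and look at its nearest neighbors by cosine similarity ('3750' below is only an illustrative token, not guaranteed to be in your vocabulary):

# most_similar is standard gensim API; guard against a missing token
if '3750' in model.wv:
    print(model.wv.most_similar('3750', topn=5))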
  6. Load the model and convert the format
# load model
model = Word2Vec.load("./data/1/word2vec.bin")

# convert format
model.wv.save_word2vec_format('./data/1/word2vec.txt', binary=False)
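
Downstream code usually needs only the vectors, not the trainable model, and the text file written above can be reloaded directly with gensim's KeyedVectors:

from gensim.models import KeyedVectors

# load the exported vectors; no training state is kept, which saves memory
wv = KeyedVectors.load_word2vec_format('./data/1/word2vec.txt', binary=False)
print(wv.vector_size)   # 100, matching num_features above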
  7. Run on the full dataset
# run on the full dataset this time, not the 1000-row sample

train_set = pd.read_csv(data_file, sep='\t')
train_set

fold_data = all_data2fold(10, 250000)  # num exceeds the 200,000 training rows, so all of them are used


# build train data for word2vec
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
    
logging.info('Total %d docs.' % len(train_texts))

from gensim.models.word2vec import Word2Vec

logging.info('Start training...')

num_features = 100     # word vector dimensionality
num_workers = 16       # number of worker threads

train_texts = [text.split() for text in train_texts]
model = Word2Vec(train_texts, workers=num_workers, size=num_features)  # gensim < 4
model.init_sims(replace=True)  # L2-normalize the vectors in place

# save model
model.save("./data/1/word2vec_full.bin")

# load model
model = Word2Vec.load("./data/1/word2vec_full.bin")

# convert format
model.wv.save_word2vec_format('./data/1/word2vec_full.txt', binary=False)
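
These embeddings exist to feed the deep-learning classifiers in the later tasks. A minimal sketch of turning word2vec_full.txt into a frozen PyTorch embedding layer; reserving row 0 for padding/unknown tokens is a convention assumed here, not part of the original code:

import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('./data/1/word2vec_full.txt', binary=False)

# row 0 is reserved for padding/unknown tokens; rows 1.. follow gensim's vocab order
weights = np.zeros((len(wv.index2word) + 1, wv.vector_size), dtype=np.float32)
for i, word in enumerate(wv.index2word, start=1):   # gensim < 4; use index_to_key in gensim >= 4
    weights[i] = wv[word]

embedding = nn.Embedding.from_pretrained(torch.from_numpy(weights), padding_idx=0)
word2id = {w: i for i, w in enumerate(wv.index2word, start=1)}   # token -> embedding row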
# Log output
2020-07-31 13:20:18,167 INFO: Fold lens [20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000]
2020-07-31 13:20:18,324 INFO: Total 180000 docs.
2020-07-31 13:20:18,325 INFO: Start training...
2020-07-31 13:21:12,556 INFO: collecting all words and their counts
2020-07-31 13:21:12,583 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 13:21:14,992 INFO: PROGRESS: at sentence #10000, processed 9130106 words, keeping 5270 word types
...
2020-07-31 13:21:47,425 INFO: PROGRESS: at sentence #170000, processed 154309124 words, keeping 6797 word types
2020-07-31 13:21:49,944 INFO: collected 6819 word types from a corpus of 163437638 raw words and 180000 sentences
2020-07-31 13:21:49,952 INFO: Loading a fresh vocabulary
2020-07-31 13:21:50,036 INFO: effective_min_count=5 retains 5981 unique words (87% of original 6819, drops 838)
2020-07-31 13:21:50,036 INFO: effective_min_count=5 leaves 163435977 word corpus (99% of original 163437638, drops 1661)
2020-07-31 13:21:50,060 INFO: deleting the raw counts dictionary of 6819 items
2020-07-31 13:21:50,070 INFO: sample=0.001 downsamples 62 most-common words
2020-07-31 13:21:50,070 INFO: downsampling leaves estimated 141079171 word corpus (86.3% of prior 163435977)
2020-07-31 13:21:50,087 INFO: estimated required memory for 5981 words and 100 dimensions: 7775300 bytes
2020-07-31 13:21:50,088 INFO: resetting layer weights
2020-07-31 13:21:51,172 INFO: training model with 16 workers on 5981 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 13:21:52,213 INFO: EPOCH 1 - PROGRESS: at 1.15% examples, 1632418 words/s, in_qsize 31, out_qsize 0
2020-07-31 13:21:53,218 INFO: EPOCH 1 - PROGRESS: at 2.33% examples, 1632715 words/s, in_qsize 30, out_qsize 1
...