Task05: Text Classification Based on Deep Learning, Part 2 (4 days)

183-NEOWISE(nlp)-tang

Training word2vec with gensim

import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# fix random seeds for reproducibility
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

import pandas as pd

# split the data into `fold_num` label-stratified folds
fold_num = 10
data_file = './data/train_set.csv'

def all_data2fold(fold_num, num=10000):
    """Split the first `num` training samples into `fold_num`
    label-stratified folds of equal size."""
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]
    
    total = len(labels)
    
    index = list(range(total))
    np.random.shuffle(index)
    
    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])
        
    # group the shuffled sample indices by label
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)
    
    # distribute each label's indices across the folds as evenly as possible;
    # a running cursor ensures no index is duplicated or skipped
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        cur = 0
        for i in range(fold_num):
            # the first `other` folds take one extra sample
            cur_batch_size = batch_size + 1 if i < other else batch_size
            all_index[i].extend(data[cur: cur + cur_batch_size])
            cur += cur_batch_size
    # rebalance: trim oversized folds and top up undersized ones so that
    # every fold holds exactly total / fold_num samples
    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]
        
        if num > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels
        
        assert batch_size == len(fold_labels)
        
        # shuffle
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)

    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

    return fold_data


fold_data = all_data2fold(10)
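
The per-label slicing above is essentially a stratified k-fold split. As a standalone sanity check, here is a minimal sketch of the same idea on a toy label set (`stratified_folds` is a hypothetical helper using round-robin assignment, not the original implementation):

```python
import random
from collections import Counter

def stratified_folds(labels, fold_num):
    """Assign each sample index to a fold so that every class is
    spread evenly across the folds (same goal as all_data2fold)."""
    folds = [[] for _ in range(fold_num)]
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)
    for lab, idxs in by_label.items():
        random.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % fold_num].append(idx)  # round-robin keeps classes balanced
    return folds

# toy corpus: 1000 docs, 4 classes of unequal size
labels = [0] * 400 + [1] * 300 + [2] * 200 + [3] * 100
folds = stratified_folds(labels, 10)
for f in folds:
    print(sorted(Counter(labels[i] for i in f).items()))
# every fold prints [(0, 40), (1, 30), (2, 20), (3, 10)]
```

Because every class count here divides evenly by 10, each fold reproduces the exact class proportions of the full set; with non-divisible counts the remainders land in the first folds, just as in `all_data2fold`.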


# build the word2vec training data: hold out fold 9, train on folds 0-8
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])

logging.info('Total %d docs.' % len(train_texts))

logging.info('Start training...')
from gensim.models.word2vec import Word2Vec

num_features = 100  # word vector dimensionality
num_workers = 8     # number of worker threads

train_texts = [text.split() for text in train_texts]
# gensim 3.x API: in gensim 4+ the `size` argument was renamed `vector_size`
model = Word2Vec(train_texts, workers=num_workers, size=num_features)
# L2-normalize the vectors in place (deprecated in gensim 4)
model.init_sims(replace=True)

# save model
model.save("./word2vec.bin")

Training output:
2020-08-01 22:50:23,295 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
2020-08-01 22:50:23,482 INFO: Total 9000 docs.
2020-08-01 22:50:23,482 INFO: Start training...
2020-08-01 22:50:38,480 INFO: 'pattern' package not found; tag filters are not available for English
2020-08-01 22:50:39,321 INFO: collecting all words and their counts
2020-08-01 22:50:39,321 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-08-01 22:50:41,681 INFO: collected 5295 word types from a corpus of 8191447 raw words and 9000 sentences
2020-08-01 22:50:41,681 INFO: Loading a fresh vocabulary
2020-08-01 22:50:41,915 INFO: effective_min_count=5 retains 4335 unique words (81% of original 5295, drops 960)
2020-08-01 22:50:41,915 INFO: effective_min_count=5 leaves 8189498 word corpus (99% of original 8191447, drops 1949)
2020-08-01 22:50:42,014 INFO: deleting the raw counts dictionary of 5295 items
2020-08-01 22:50:42,014 INFO: sample=0.001 downsamples 61 most-common words
2020-08-01 22:50:42,015 INFO: downsampling leaves estimated 7070438 word corpus (86.3% of prior 8189498)
2020-08-01 22:50:42,029 INFO: estimated required memory for 4335 words and 100 dimensions: 5635500 bytes
2020-08-01 22:50:42,030 INFO: resetting layer weights
2020-08-01 22:50:43,263 INFO: training model with 8 workers on 4335 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-08-01 22:50:44,314 INFO: EPOCH 1 - PROGRESS: at 19.77% examples, 1368768 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:45,317 INFO: EPOCH 1 - PROGRESS: at 41.43% examples, 1460910 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:46,329 INFO: EPOCH 1 - PROGRESS: at 68.01% examples, 1590679 words/s, in_qsize 12, out_qsize 3
2020-08-01 22:50:47,330 INFO: EPOCH 1 - PROGRESS: at 95.84% examples, 1677691 words/s, in_qsize 15, out_qsize 0
2020-08-01 22:50:47,456 INFO: worker thread finished; awaiting finish of 7 more threads
2020-08-01 22:50:47,458 INFO: worker thread finished; awaiting finish of 6 more threads
2020-08-01 22:50:47,458 INFO: worker thread finished; awaiting finish of 5 more threads
2020-08-01 22:50:47,459 INFO: worker thread finished; awaiting finish of 4 more threads
2020-08-01 22:50:47,465 INFO: worker thread finished; awaiting finish of 3 more threads
2020-08-01 22:50:47,466 INFO: worker thread finished; awaiting finish of 2 more threads
2020-08-01 22:50:47,475 INFO: worker thread finished; awaiting finish of 1 more threads
2020-08-01 22:50:47,479 INFO: worker thread finished; awaiting finish of 0 more threads
2020-08-01 22:50:47,479 INFO: EPOCH - 1 : training on 8191447 raw words (7022093 effective words) took 4.2s, 1685516 effective words/s
2020-08-01 22:50:48,507 INFO: EPOCH 2 - PROGRESS: at 23.98% examples, 1661406 words/s, in_qsize 15, out_qsize 0
2020-08-01 22:50:49,508 INFO: EPOCH 2 - PROGRESS: at 47.52% examples, 1649308 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:50,509 INFO: EPOCH 2 - PROGRESS: at 73.96% examples, 1725580 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:51,431 INFO: worker thread finished; awaiting finish of 7 more threads
2020-08-01 22:50:51,436 INFO: worker thread finished; awaiting finish of 6 more threads
2020-08-01 22:50:51,436 INFO: worker thread finished; awaiting finish of 5 more threads
2020-08-01 22:50:51,437 INFO: worker thread finished; awaiting finish of 4 more threads
2020-08-01 22:50:51,442 INFO: worker thread finished; awaiting finish of 3 more threads
2020-08-01 22:50:51,443 INFO: worker thread finished; awaiting finish of 2 more threads
2020-08-01 22:50:51,448 INFO: worker thread finished; awaiting finish of 1 more threads
2020-08-01 22:50:51,451 INFO: worker thread finished; awaiting finish of 0 more threads
2020-08-01 22:50:51,451 INFO: EPOCH - 2 : training on 8191447 raw words (7021549 effective words) took 4.0s, 1771778 effective words/s
2020-08-01 22:50:52,458 INFO: EPOCH 3 - PROGRESS: at 27.94% examples, 1955311 words/s, in_qsize 15, out_qsize 0
2020-08-01 22:50:53,465 INFO: EPOCH 3 - PROGRESS: at 57.50% examples, 1996619 words/s, in_qsize 13, out_qsize 2
2020-08-01 22:50:54,468 INFO: EPOCH 3 - PROGRESS: at 85.71% examples, 2001185 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:54,963 INFO: worker thread finished; awaiting finish of 7 more threads
2020-08-01 22:50:54,964 INFO: worker thread finished; awaiting finish of 6 more threads
2020-08-01 22:50:54,965 INFO: worker thread finished; awaiting finish of 5 more threads
2020-08-01 22:50:54,965 INFO: worker thread finished; awaiting finish of 4 more threads
2020-08-01 22:50:54,970 INFO: worker thread finished; awaiting finish of 3 more threads
2020-08-01 22:50:54,972 INFO: worker thread finished; awaiting finish of 2 more threads
2020-08-01 22:50:54,978 INFO: worker thread finished; awaiting finish of 1 more threads
2020-08-01 22:50:54,979 INFO: worker thread finished; awaiting finish of 0 more threads
2020-08-01 22:50:54,979 INFO: EPOCH - 3 : training on 8191447 raw words (7020390 effective words) took 3.5s, 1991853 effective words/s
2020-08-01 22:50:55,986 INFO: EPOCH 4 - PROGRESS: at 28.64% examples, 2007478 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:56,996 INFO: EPOCH 4 - PROGRESS: at 58.80% examples, 2038568 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:50:58,003 INFO: EPOCH 4 - PROGRESS: at 87.96% examples, 2053981 words/s, in_qsize 15, out_qsize 0
2020-08-01 22:50:58,378 INFO: worker thread finished; awaiting finish of 7 more threads
2020-08-01 22:50:58,381 INFO: worker thread finished; awaiting finish of 6 more threads
2020-08-01 22:50:58,381 INFO: worker thread finished; awaiting finish of 5 more threads
2020-08-01 22:50:58,381 INFO: worker thread finished; awaiting finish of 4 more threads
2020-08-01 22:50:58,386 INFO: worker thread finished; awaiting finish of 3 more threads
2020-08-01 22:50:58,387 INFO: worker thread finished; awaiting finish of 2 more threads
2020-08-01 22:50:58,393 INFO: worker thread finished; awaiting finish of 1 more threads
2020-08-01 22:50:58,395 INFO: worker thread finished; awaiting finish of 0 more threads
2020-08-01 22:50:58,395 INFO: EPOCH - 4 : training on 8191447 raw words (7022406 effective words) took 3.4s, 2057703 effective words/s
2020-08-01 22:50:59,404 INFO: EPOCH 5 - PROGRESS: at 28.84% examples, 2021668 words/s, in_qsize 15, out_qsize 0
2020-08-01 22:51:00,411 INFO: EPOCH 5 - PROGRESS: at 58.46% examples, 2026828 words/s, in_qsize 14, out_qsize 1
2020-08-01 22:51:01,412 INFO: EPOCH 5 - PROGRESS: at 87.00% examples, 2031432 words/s, in_qsize 13, out_qsize 2
2020-08-01 22:51:01,823 INFO: worker thread finished; awaiting finish of 7 more threads
2020-08-01 22:51:01,825 INFO: worker thread finished; awaiting finish of 6 more threads
2020-08-01 22:51:01,826 INFO: worker thread finished; awaiting finish of 5 more threads
2020-08-01 22:51:01,826 INFO: worker thread finished; awaiting finish of 4 more threads
2020-08-01 22:51:01,831 INFO: worker thread finished; awaiting finish of 3 more threads
2020-08-01 22:51:01,833 INFO: worker thread finished; awaiting finish of 2 more threads
2020-08-01 22:51:01,837 INFO: worker thread finished; awaiting finish of 1 more threads
2020-08-01 22:51:01,839 INFO: worker thread finished; awaiting finish of 0 more threads
2020-08-01 22:51:01,839 INFO: EPOCH - 5 : training on 8191447 raw words (7020599 effective words) took 3.4s, 2039707 effective words/s
2020-08-01 22:51:01,840 INFO: training on a 40957235 raw words (35107037 effective words) took 18.6s, 1889859 effective words/s
2020-08-01 22:51:01,840 INFO: precomputing L2-norms of word weight vectors
2020-08-01 22:51:01,899 INFO: saving Word2Vec object under ./word2vec.bin, separately None
2020-08-01 22:51:01,899 INFO: not storing attribute vectors_norm
2020-08-01 22:51:01,900 INFO: not storing attribute cum_table
2020-08-01 22:51:02,088 INFO: saved ./word2vec.bin
