[NLP] LDA2Vec Notes (Part 2)

Code Structure Analysis

  • Code source: GitHub

examples-hacker_news

Execution Order

Top-level directory

examples-hacker_news
  • Taking examples-hacker_news (news) as the example: from what I can tell, the first step is to run data-preprocess.py (this script also contains the code that downloads the dataset) to do the data preprocessing; once it finishes, the resulting artifacts are saved (see the figure below):
examples-hacker_news-data-preprocess.py

 

  • Then run examples\hacker_news\lda2vec-lda2vec_run.py to actually train the model. (This script consumes the preprocessing artifacts and produces lda2vec.hdf5; a sketch of how those artifacts might be loaded back follows below.)
examples\hacker_news\lda2vec-lda2vec_run.py
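Before moving on, here is a minimal loading sketch of my own (not the repo's actual lda2vec_run.py): the file names simply mirror the save calls at the end of preprocess.py shown further down, and the real training script may read them differently.

import cPickle as pickle  # Python 2, matching the repo's environment
import numpy as np

# Load the objects written by preprocess.py (illustration only)
corpus = pickle.load(open('corpus', 'rb'))  # Corpus object built from the token counts
vocab = pickle.load(open('vocab', 'rb'))    # vocabulary returned by preprocess.tokenize
data = np.load('data.npz')                  # flattened word ids plus aligned feature arrays
flattened = data['flattened']               # 1D array of compact word ids
story_id = data['story_id']                 # one story id per word position
print('n word positions: %d' % flattened.shape[0])
# lda2vec_run.py then trains on these arrays and writes its model to lda2vec.hdf5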

 

Notes

  • Use "print" sparingly

When debugging, it is common to add lots of print statements to monitor the output of each step. For large-scale data processing, however, printing on every step often increases the run time by tens of times and seriously hurts efficiency (see the timing sketch after the progress-bar snippet below).

  • Progress bar
# Reference: https://blog.csdn.net/The_Time_Runner/article/details/87735801
# A progress bar implemented with sys.stdout.write() (can be pasted straight into your code)

import time, sys
for i in range(100):
    percent = i / 100.0
    # '\r' moves the cursor back to the start of the line so the bar redraws in place
    sys.stdout.write("\r{0}{1}".format("|" * i, '%.2f%%' % (percent * 100)))
    sys.stdout.flush()
    time.sleep(1)  # stands in for the real per-iteration work
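To make the print overhead concrete, here is a small, self-contained timing sketch of my own (the loop body and iteration counts are arbitrary) that compares printing on every row with printing only occasionally:

import time

def run(report_every):
    # Dummy processing loop that prints a status line every `report_every` rows
    start = time.time()
    for i in range(100000):
        if i % report_every == 0:
            print('processed %d rows' % i)
    return time.time() - start

noisy = run(1)      # print on every single row
quiet = run(10000)  # print only a handful of times
print('every row: %.2fs  vs  every 10000 rows: %.2fs' % (noisy, quiet))

The exact numbers depend on the terminal and on the real work done per row, but printing every row is typically dramatically slower, which is the point of the warning above and why a single in-place progress line is preferable for long preprocessing jobs.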

 

 

preprocess.py

# Author: Chris Moody <chrisemoody@gmail.com>
# License: MIT

# This example loads a large 800MB Hacker News comments dataset
# and preprocesses it. This can take a few hours, and a lot of
# memory, so please be patient!

from lda2vec import preprocess, Corpus
import numpy as np
import pandas as pd
import logging
import cPickle as pickle
import os.path

logging.basicConfig()

max_length = 250   # Limit of 250 words per comment
min_author_comments = 50  # Exclude authors with fewer comments
nrows = None  # Number of rows of file to read; None reads in full file

fn = "hacker_news_comments.csv"
url = "https://zenodo.org/record/45901/files/hacker_news_comments.csv"
if not os.path.exists(fn):
    import requests
    response = requests.get(url, stream=True, timeout=2400)
    with open(fn, 'wb') as fh:
        # Iterate over 1MB chunks
        for data in response.iter_content(1024**2):
            fh.write(data)


# Convert to unicode (spaCy only works with unicode)
features = pd.read_csv(fn, encoding='utf8', nrows=nrows)
# Convert all integer arrays to int32
for col, dtype in zip(features.columns, features.dtypes):
    if dtype is np.dtype('int64'):
        features[col] = features[col].astype('int32')

# Tokenize the texts
# If this fails it's likely spacy. Install a recent spacy version.
# Only the most recent versions have tokenization of noun phrases
# I'm using SHA dfd1a1d3a24b4ef5904975268c1bbb13ae1a32ff
# Also try running python -m spacy.en.download all --force
texts = features.pop('comment_text').values
tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
                                    merge=True)
del texts

# Make a ranked list of rare vs frequent words
corpus = Corpus()
corpus.update_word_count(tokens)
corpus.finalize()

# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=10)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
print "n_words", np.unique(clean).max()

# Extract numpy arrays over the fields we want covered by topics
# Convert to categorical variables
author_counts = features['comment_author'].value_counts()
to_remove = author_counts[author_counts < min_author_comments].index
mask = features['comment_author'].isin(to_remove).values
author_name = features['comment_author'].values.copy()
author_name[mask] = 'infrequent_author'
features['comment_author'] = author_name
authors = pd.Categorical(features['comment_author'])
author_id = authors.codes
author_name = authors.categories
story_id = pd.Categorical(features['story_id']).codes
# Chop timestamps into days
story_time = pd.to_datetime(features['story_time'], unit='s')
days_since = (story_time - story_time.min()) / pd.Timedelta('1 day')
time_id = days_since.astype('int32')
features['story_id_codes'] = story_id
features['author_id_codes'] = author_id
features['time_id_codes'] = time_id

print "n_authors", author_id.max()
print "n_stories", story_id.max()
print "n_times", time_id.max()

# Extract outcome supervised features
ranking = features['comment_ranking'].values
score = features['story_comment_count'].values

# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
feature_arrs = (story_id, author_id, time_id, ranking, score)
flattened, features_flat = corpus.compact_to_flat(pruned, *feature_arrs)
# Flattened feature arrays
(story_id_f, author_id_f, time_id_f, ranking_f, score_f) = features_flat

# Save the data
pickle.dump(corpus, open('corpus', 'w'), protocol=2)
pickle.dump(vocab, open('vocab', 'w'), protocol=2)
features.to_pickle('features.pd')
data = dict(flattened=flattened, story_id=story_id_f, author_id=author_id_f,
            time_id=time_id_f, ranking=ranking_f, score=score_f,
            author_name=author_name, author_index=author_id)
np.savez('data', **data)
np.save(open('tokens', 'w'), tokens)

 

 
