NLP_TFIDF

Robin_just

Category: Big Data Development

Article link: https://blog.csdn.net/shaguabufadai/article/details/74781784

TF-IDF in practice on MapReduce

IDF stage

step1: Extract the articles about "it", 508 in total.
step2: Preprocess the 508 articles: each article becomes one line of data prefixed with an index (stored as a string), which produces tfidf_input.data.
# python first_convert.py input_tfidf_dir > tfidf_input.data
step3: In the map_idf phase, record which words each article contains. Concretely, for every article, each distinct word is emitted once as [word, "1"] (a word is recorded only once per article, meaning "this article contains the word"). The number of [word, "1"] records emitted for a word therefore equals the number of articles that contain it.
step4: Compute the IDF value of every word from the number of articles that contain it.
step5: Local test: cat tfidf_input.data | python map_idf.py | sort -k1 | python red_idf.py > idf_out.tmp
step6: Test on the cluster. The IDF stage is then complete.
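
The IDF steps above can be sketched in memory, without Hadoop, roughly as follows. This is a minimal Python 3 sketch under stated assumptions, not the MapReduce code from this post: the function name `compute_idf` and the toy documents are made up for illustration, and it splits on any whitespace rather than on single spaces. It uses the same smoothed formula as red_idf.py, idf = log(N / (df + 1)).

```python
import math
from collections import Counter


def compute_idf(docs):
    """Return {word: idf} for a list of whitespace-tokenized document strings.

    compute_idf is a hypothetical helper, not part of the original scripts.
    """
    doc_freq = Counter()
    for text in docs:
        # set() ensures each word is counted at most once per document
        doc_freq.update(set(text.split()))
    n_docs = len(docs)
    # Same smoothing as red_idf.py: log(N / (df + 1))
    return {w: math.log(n_docs / (df + 1.0)) for w, df in doc_freq.items()}


if __name__ == "__main__":
    toy_docs = ["a b b c", "a c", "a d"]  # toy data, assumed for illustration
    for word, score in sorted(compute_idf(toy_docs).items()):
        print("%s\t%.4f" % (word, score))
```

Note that with the +1 in the denominator, a word that appears in every document gets a slightly negative score, and a word that appears in all but one document gets exactly 0.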

TF-IDF stage

step7: Local test:
cat tfidf_input.data | python map_tfidf.py | python red_tfidf.py  reducer_func idf_mr_out_tmp
step8: Test on the cluster. The TF-IDF computation is then complete.
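
The per-article scoring done in this stage can be sketched as below. This is a hedged Python 3 illustration, not the reducer from this post: `top_tfidf`, the toy IDF table, and the sample sentence are assumptions. Like red_tfidf.py, it uses the raw term count as TF, skips words missing from the IDF table, and keeps the highest-scoring words; the alphabetical tie-break is an addition to make the output deterministic.

```python
from collections import Counter


def top_tfidf(text, idf_map, k=5):
    """Return the k highest-scoring (word, tf * idf) pairs for one document.

    top_tfidf is a hypothetical helper, not part of the original scripts.
    """
    tf = Counter(text.split())  # raw term counts
    scores = [(w, n * idf_map[w]) for w, n in tf.items() if w in idf_map]
    # Sort by score descending; ties are broken alphabetically by word
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


if __name__ == "__main__":
    toy_idf = {"hadoop": 2.0, "python": 1.0, "the": 0.1}  # assumed values
    doc = "the hadoop job runs the python mapper and the python reducer"
    print(top_tfidf(doc, toy_idf, k=2))
```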

Code

step1(first_convert.py )

#!/usr/bin/python
# -*- coding:utf-8 -*-
import os
import sys


test_dir = sys.argv[1]  # input directory, e.g. input_tfidf_dir

def get_file_handler(f):
    file_in = open(f, 'r')  # open one article file from the input directory
    return file_in

index = 0  # running index assigned to each article
for fd in os.listdir(test_dir):

    txt_list = []  # stripped lines of the current article

    file_fd = get_file_handler(os.path.join(test_dir, fd))  # open the article
    for line in file_fd:  # collect every line of the article
        txt_list.append(line.strip())
    file_fd.close()

    # emit one record per article: index <TAB> article text joined into one line
    print '\t'.join([str(index), ' '.join(txt_list)])

    index += 1

step2(shell command)

# python first_convert.py input_tfidf_dir > tfidf_input.data

step3(map_idf.py )

#!/usr/bin/python

import sys

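for line in sys.stdin:
    # each input line is: doc_index <TAB> doc_text
    ss = line.strip().split('\t', 1)
    if len(ss) != 2:
        continue  # skip malformed lines and empty articles
    doc_index = ss[0].strip()
    doc_context = ss[1].strip()

    word_list = doc_context.split(' ')

    # de-duplicate so that each word is emitted once per article
    word_set = set()
    for word in word_list:
        word_set.add(word)

    for word in word_set:
        print '\t'.join([word, "1"])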

step4(red_idf.py)

#!/usr/bin/python

import sys
import math

current_word = None
count_pool = []
sum = 0

docs_cnt = 508

for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue

    word, val = ss

    if current_word == None:
        current_word = word

    if current_word != word:
        for count in count_pool:
            sum += count
        idf_score = math.log(float(docs_cnt) / (float(sum) + 1))
        print "%s\t%s" % (current_word, idf_score)

        current_word = word
        count_pool = []
        sum = 0

    count_pool.append(int(val))


for count in count_pool:
    sum += count
idf_score = math.log(float(docs_cnt) / (float(sum) + 1))
print "%s\t%s" % (current_word, idf_score)
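
The manual current_word bookkeeping in red_idf.py can also be written with `itertools.groupby`, which groups consecutive records that share a key. The following is a minimal Python 3 sketch of that alternative, not the script used in this post; the function name `idf_lines` is an assumption. Like the original, it relies on the input being sorted by word, which `sort -k1` provides locally and the Hadoop shuffle provides on the cluster.

```python
import math
import sys
from itertools import groupby

DOCS_CNT = 508  # total number of articles, as in red_idf.py


def idf_lines(lines, docs_cnt=DOCS_CNT):
    """Yield 'word<TAB>idf' strings for sorted 'word<TAB>1' input lines.

    idf_lines is a hypothetical helper, not part of the original scripts.
    """
    pairs = (line.strip().split("\t") for line in lines)
    valid = (p for p in pairs if len(p) == 2)  # drop malformed records
    # groupby only merges adjacent equal keys, so the input must be sorted
    for word, group in groupby(valid, key=lambda p: p[0]):
        doc_freq = sum(int(val) for _, val in group)
        yield "%s\t%s" % (word, math.log(float(docs_cnt) / (doc_freq + 1)))


if __name__ == "__main__":
    for out in idf_lines(sys.stdin):
        print(out)
```

This version keeps no state between words, so there is no separate flush step for the last word after the loop.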

step5(local test)

cat tfidf_input.data | python map_idf.py | sort -k1 | python red_idf.py > idf_out.tmp

step6(shell command)

HADOOP_CMD="/usr/local/src/hadoop-1.2.1/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"

INPUT_FILE_PATH_1="/tfidf_input.data"
OUTPUT_PATH="/idf_output"
OUTPUT_PATH_ALL="/tfidf_output"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH
$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH_ALL
# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map_idf.py" \
    -reducer "python red_idf.py" \
    -file ./map_idf.py \
    -file ./red_idf.py

step7(local test)

cat tfidf_input.data | python map_tfidf.py | python red_tfidf.py  reducer_func idf_mr_out_tmp

step8(shell command)


HADOOP_CMD="/usr/local/src/hadoop-1.2.1/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"

INPUT_FILE_PATH_1="/tfidf_input.data"
OUTPUT_PATH="/idf_output"
OUTPUT_PATH_ALL="/tfidf_output"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH
$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH_ALL
# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map_idf.py" \
    -reducer "python red_idf.py" \
    -file ./map_idf.py \
    -file ./red_idf.py

# Step 2.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH_ALL \
    -mapper "python map_tfidf.py" \
    -reducer "python red_tfidf.py reducer_func idf_dict" \
    -cacheFile "hdfs://master:9000/idf_out.tmp#idf_dict" \
    -file ./map_tfidf.py \
    -file ./red_tfidf.py

step9(map_tfidf.py)

import sys

for line in sys.stdin:
    print line.strip()

step10(red_tfidf.py)

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys
import gzip

def read_local_file_func(f):
    idf_map = {}
    file_in = open(f,'r')
    for line in file_in:
        ss = line.strip().split('\t')
        if len(ss)!=2:
            continue
        word = ss[0].strip()
        idf =  ss[1].strip()
        idf_map[word] = float(idf)
    return idf_map

def reducer_func(idf_mr_out_tep):
    idf_map = read_local_file_func(idf_mr_out_tep)
    for line in sys.stdin:
        ss = line.strip().split('\t')
        if len(ss)!=2:
            continue

        docid = ss[0].strip()
        context = ss[1].strip()
        tf_map = {}
        for t_word in context.strip().split(' '):
            if t_word not in tf_map:  # first occurrence of the word in this article: initialize its count to 0
                tf_map[t_word] = 0
            tf_map[t_word] += 1  # accumulate the word's occurrence count

        tfidf_map = {}
        for w, tf in tf_map.items():
            if w not in idf_map:
                continue
            idf = idf_map[w]
            tfidf_score = tf * idf
            tfidf_map[w] = tfidf_score

        tmp_list = []
        for key, val in tfidf_map.items():
            tmp_list.append((key, val))
        final_list = sorted(tmp_list, key=lambda x : x[1], reverse=True)[:5]

        word_score_list = []
        for t in final_list:
            word_score_list.append(':'.join([t[0], str(t[1])]))

        print docid + '\t' + ','.join(word_score_list)


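if __name__ == "__main__":
    # dispatch to the function named in argv[1], passing the remaining arguments;
    # nothing else is printed here because stdout is the reducer's output
    module = sys.modules[__name__]
    func = getattr(module, sys.argv[1])
    args = []
    if len(sys.argv) > 2:
        args = sys.argv[2:]
    func(*args)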

Process Screenshot (image not included)

note


hadoop fs -text /lcs_output/part* > result.txt

wc -l result.txt

wc -l lcs_input.data

split('\t', 1) splits the string only at the first tab, so the result has at most two parts
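
A small Python 3 illustration of that note; the sample line is made up for the example:

```python
line = "12\tsome text\twith an embedded tab"  # hypothetical input record

# maxsplit=1: split at the first tab only, keeping later tabs in the body
doc_id, body = line.split("\t", 1)
print(repr(doc_id))
print(repr(body))

# without maxsplit, every tab produces a split
print(len(line.split("\t")))
```

This is why map_idf.py can safely separate the document index from the article text even if the text itself contains tabs.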

grep --color -nrw 'xxx' .  | wc -l
grep --color -nrcw 'xxx' .  | wc -l

cp -raf allfiles/*it* input_tfidf_dir/
