TF-IDF action on MapReduce

IDF action

step1: Extract articles about "it", 508 in total.
step2: The 508 articles are preprocessed: each article is turned into one line of data and given an index (as a string), which produces tfidf_input.data.
# python convert.py input_tfidf_dir > tfidf_input.data
step3: The map_idf phase counts how many articles each word appears in. (Concretely: for each article, every word it contains is recorded once per article, meaning "this article contains this word", and one [word, "1"] pair is emitted for it. The number of [word, "1"] pairs printed for a word is therefore the number of articles that contain it.)
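The post does not list map_idf.py itself, so the following is only a minimal sketch of what that mapper might look like, assuming the tab-separated "index<TAB>article" lines produced in step2:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Hypothetical sketch of map_idf.py (not shown in the post).
# Reads "index<TAB>article" lines and emits one "word<TAB>1" pair
# per distinct word per article.
import sys

for line in sys.stdin:
    parts = line.strip().split('\t', 1)
    if len(parts) != 2:
        continue
    text = parts[1]
    # set() records each word only once per article, so the number of
    # emitted pairs for a word equals the number of articles containing it
    for word in set(text.split()):
        print '\t'.join([word, '1'])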
step4: Compute the IDF value for every word (i.e., from the number of articles that contain it).
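red_idf.py is also not listed. Below is a sketch under the assumption that IDF is the common log(N / df) variant with N = 508 (the post does not state the exact formula), reading input already sorted by word:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Hypothetical sketch of red_idf.py (not shown in the post).
# Input is sorted by word, so all "word<TAB>1" pairs for a word are
# adjacent; df is their sum. idf = log(N / df) is an assumed formula,
# with N = 508 articles from step1.
import sys
import math

TOTAL_DOCS = 508

cur_word = None
doc_cnt = 0
for line in sys.stdin:
    word, cnt = line.strip().split('\t')
    if cur_word is not None and word != cur_word:
        print '\t'.join([cur_word, str(math.log(float(TOTAL_DOCS) / doc_cnt))])
        doc_cnt = 0
    cur_word = word
    doc_cnt += int(cnt)
if cur_word is not None and doc_cnt > 0:
    print '\t'.join([cur_word, str(math.log(float(TOTAL_DOCS) / doc_cnt))])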
step5: Local test:
cat tfidf_input.data | python map_idf.py | sort -k1 | python red_idf.py > idf_out.tmp
step6: Test on the cluster. That completes the IDF stage.
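The post gives no cluster command. On a Hadoop Streaming setup the IDF job could be submitted roughly like this (the jar path and HDFS directories are placeholders, not from the post):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/you/tfidf_input.data \
    -output /user/you/idf_mr_out \
    -mapper "python map_idf.py" \
    -reducer "python red_idf.py" \
    -file map_idf.py -file red_idf.py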
TF action
step7: Local test:
cat tfidf_input.data | python map_tfidf.py | python red_tfidf.py reducer_func idf_mr_out_tmp
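Neither map_tfidf.py nor red_tfidf.py appears in the post; the pair below is only a minimal sketch. It assumes TF is a word's in-article count divided by the article length (a common variant the post does not spell out), and that the file passed as the second argument (idf_mr_out_tmp) holds the "word<TAB>idf" lines produced by the IDF job.

map_tfidf.py sketch:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Hypothetical map_tfidf.py: emit "doc_index<TAB>word<TAB>tf" per word,
# where tf = count(word in article) / number of words in the article.
import sys

for line in sys.stdin:
    parts = line.strip().split('\t', 1)
    if len(parts) != 2:
        continue
    doc_index, text = parts
    words = text.split()
    total = float(len(words))
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    for w, c in counts.items():
        print '\t'.join([doc_index, w, str(c / total)])

red_tfidf.py sketch:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Hypothetical red_tfidf.py, invoked as in step7:
#   python red_tfidf.py reducer_func idf_mr_out_tmp
# argv[1] names the function to run, argv[2] the IDF result file,
# which is loaded into a dict and joined with the TF stream.
import sys

def load_idf(path):
    idf = {}
    for line in open(path):
        word, val = line.strip().split('\t')
        idf[word] = float(val)
    return idf

def reducer_func(idf_file):
    idf = load_idf(idf_file)
    for line in sys.stdin:
        doc_index, word, tf = line.strip().split('\t')
        if word in idf:
            # tf-idf = tf * idf
            print '\t'.join([doc_index, word, str(float(tf) * idf[word])])

if __name__ == '__main__':
    if sys.argv[1] == 'reducer_func':
        reducer_func(sys.argv[2])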
step8: Test on the cluster. That completes the TF-IDF computation.
Code
step1 (first_convert.py)
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys

test_dir = sys.argv[1]  # the input directory, input_tfidf_dir

def get_file_handler(f):
    file_in = open(f, 'r')  # open one article file under input_tfidf_dir
    return file_in

index = 0  # every article gets its own index
for fd in os.listdir(test_dir):
    txt_list = []  # holds every line of the current article
    file_fd = get_file_handler(test_dir + '/' + fd)
    for line in file_fd:  # collect each line of the article
        txt_list.append(line.strip())
    # print the whole article as one line, prefixed with its index
    print '\t'.join([str(index), ' '.join(txt_list)])
    index += 1