[Big Data] Implementing a Document Inverted Index with Term Frequency

1. Introduction to Inverted Indexes

An inverted index stores, for full-text search, a mapping from a word to its locations in a document or a set of documents; nearly every search engine that supports full-text retrieval relies on this data structure. Given a term, the index yields the list of documents containing that term.

Example:
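A small hypothetical illustration: suppose doc1 contains "fish in the sea" and doc2 contains "fresh fish". The resulting inverted index is:

    fish  -> doc1, doc2
    fresh -> doc2
    in    -> doc1
    sea   -> doc1
    the   -> doc1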

Map: besides the file name, the map output value also carries the offset of the line in which the word appears. Output format: filename#offset

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
	@Override
	protected void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		// default RecordReader: LineRecordReader; key: line offset; value: line contents
		FileSplit fileSplit = (FileSplit) context.getInputSplit();
		String fileName = fileSplit.getPath().getName();
		Text word = new Text();
		Text fileName_lineOffset = new Text(fileName + "#" + key.toString());
		StringTokenizer itr = new StringTokenizer(value.toString());
		while (itr.hasMoreTokens()) {
			word.set(itr.nextToken());
			context.write(word, fileName_lineOffset);
		}
	}
}

Reduce:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		Iterator<Text> it = values.iterator();
		StringBuilder all = new StringBuilder();
		if (it.hasNext()) all.append(it.next().toString());
		while (it.hasNext()) {
			all.append(";");
			all.append(it.next().toString());
		}
		context.write(key, new Text(all.toString()));
	} // sample final output pair: ("fish", "doc1#0;doc1#8;doc2#0;doc2#8")
}

Main:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexer {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "invert index");
            job.setJarByClass(InvertedIndexer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(InvertedIndexMapper.class);
            job.setReducerClass(InvertedIndexReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. The Inverted Index Algorithm with Term Frequency

If we also want to capture attributes such as a word's frequency in each document, its positions of occurrence, and the URL of the corresponding web document, the simple inverted index algorithm above no longer works well. These attributes (frequency, positions, and so on) are collectively called the payload.

The basic structure of an inverted index:

  • An inverted index consists of a large number of postings lists
  • A postings list consists of multiple postings, sorted by doc id
  • Each postings list is associated with one term
  • A posting contains a document id and a payload
  • The payload carries information about how the term occurs in the document (e.g. term frequency, positions, term properties)
  • In addition there is a doc_id -> URL mapping from each web document to its URL

Example:
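As a hypothetical illustration, a postings list for the term "fish" whose payload is just a term frequency might look like:

    fish -> <doc1, tf=2>; <doc2, tf=1>; <doc5, tf=3>

with a separate doc_id -> URL table resolving each document id.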

3. Stop Word List

Source: the Baidu Baike entry 停词表 (stop word list)

Human languages contain many function words. Compared with other words, function words carry little concrete meaning. The most common function words are determiners ("the", "a", "an", "that", and "those"), which help describe nouns in text and express concepts such as location or quantity. Prepositions such as "over", "under", and "above" express the relative position of two words.

Two characteristics of these function words lead search engines to treat them specially during text processing. First, they are extremely common: recording their counts in every document would require a great deal of disk space. Second, because of their ubiquity and their purely grammatical role, they rarely carry information about how relevant a document is. If retrieval considers individual words rather than phrases, these function words are of essentially no help.

In information retrieval these function words go by another name: stop words. They are so called because when text processing encounters them, it stops and throws them away immediately. Discarding them shrinks the index, speeds up retrieval, and usually even improves retrieval quality. Stop words mainly include English characters, digits, mathematical symbols, punctuation marks, and extremely frequent single Chinese characters.

The stop word list used in this experiment, stop_words_eng.txt, is given in the appendix at the end.

4. Experiment Requirements

Write, in local Eclipse, an inverted index program with term frequency for English documents. The program must remove stop-words (such as a, an, the, in, of) and count the frequency of each word in every document.

The experiment counts only digits and letters; special characters are ignored.

The output of the experiment looks like the following:

tonight <MACBETH.txt,10>;<OTHELLO.txt,24>;<total,34>.
took <MACBETH.txt,2>;<OTHELLO.txt,4>;<total,6>.
tooth <MACBETH.txt,2>;<OTHELLO.txt,1>;<total,3>.

5. Implementation Approach

1. The Map input <key, value> pair is <byte offset of the line start, line contents>, and its output is <word#filename, value>, where value is an occurrence count. Because the data volume is large, a Combine pass first merges each Map's output locally. Since the key is word#filename, handing it straight to Reduce could scatter one word's data across different Reduce nodes, so a custom Partitioner is applied first: the composite key word#filename is temporarily split apart, "tricking" the partitioner into choosing the Reducer by word rather than by word#filename. This guarantees that every key-value pair for a given word lands on the same Reducer (see the Partition class in Section 6).

2. Because the stop word list is needed, setup() loads its contents into a vector before Map runs; setup() is executed only once by the MapReduce framework, before the map tasks, and is the place for the centralized initialization of variables and resources.

3. In reduce(), five member variables are kept: lastword records the previous word, lastfile the previous filename, count the word's count within the current document, totalcount the word's overall count, and str the output string being assembled. The key is split on "#" into the current word and filename.

  1) lastword and lastfile are initialized to null, count and totalcount to 0, and str to the empty string ""

  2) If lastword and lastfile are null, they are assigned the current word and filename

  3) On each call, check: if the current word differs from lastword, a new word has arrived, so finish and emit the str belonging to the previous word, reset the related variables, and return; if the current filename differs from lastfile, a new document has arrived, so update str, reset the related variables, and return; in every other case simply update count and totalcount.

4. Because reduce() never emits the <key, value> pair for the last word, cleanup() takes care of it; cleanup() is executed only once by the MapReduce framework, after the task's reduce() calls have finished, and is the place to release variables and resources.

For more on setup() and cleanup(), see: "Using setup() and cleanup() in MapReduce".

Note: the code keeps only digits and letters from the input files, i.e. special characters are ignored and only alphanumeric text is analyzed. A regex-based variant of this filtering is sketched below.
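As a sketch, the character-by-character filter used in the code below could equivalently be written with a regular expression (this variant is an assumption, not part of the original program):

    // hypothetical drop-in for the filtering loop in map():
    // lowercase the line, then collapse every non-alphanumeric run into a single space
    String cleaned = value.toString().toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();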

 

6. Code

package test2;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import java.util.Vector;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
	
	public static class Map extends Mapper<Object, Text, Text, IntWritable> {
		/**
		 * setup(): read the stop word list into the vector stop_words
		 */
        Vector<String> stop_words;// the stop word list
        protected void setup(Context context) throws IOException {
        	stop_words = new Vector<String>();// initialize the stop word list
        	Configuration conf = context.getConfiguration();
        	// open the stop word file on HDFS
        	BufferedReader reader = new BufferedReader(new InputStreamReader(
        			FileSystem.get(conf).open(new Path("hdfs://localhost:9000/input/ex2/stop_words_eng.txt"))));
        	String line;
            while ((line = reader.readLine()) != null) {// process line by line
            	StringTokenizer itr = new StringTokenizer(line);
        		while(itr.hasMoreTokens()){// iterate over the tokens
        			stop_words.add(itr.nextToken());// store each one in the vector
        		}
            }
            reader.close();
        }
        
        /**
         * map():
         * split the input Text into words
         * input:  key: byte offset of the current line, value: line contents
         * output: key: word#filename, value: 1
         */
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        	FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String fileName = fileSplit.getPath().getName();// get the file name
            String line = value.toString().toLowerCase();// lowercase the whole line
            // keep only digits and letters
            StringBuilder new_line = new StringBuilder();
            for(int i = 0; i < line.length(); i ++) {
            	char c = line.charAt(i);
            	if((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z')) {
            		new_line.append(c);
            	} else {
            		new_line.append(' ');// replace every other character with a space
            	}
            }
            line = new_line.toString().trim();// drop leading and trailing spaces
            StringTokenizer strToken = new StringTokenizer(line);// split on whitespace
            while(strToken.hasMoreTokens()){
            	String str = strToken.nextToken();
            	if(!stop_words.contains(str)) {// emit a key-value pair unless it is a stop word
            		context.write(new Text(str+"#"+fileName), new IntWritable(1));
            	}
            }
        }
	}

    public static class Combine extends Reducer<Text, IntWritable, Text, IntWritable> {
        /**
         * Locally sum the values that share a key in the map output,
         * reducing the amount of data shipped to the Reduce nodes.
         * output: key: word#filename, value: partial sum
         */
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();// sum the values; a combiner may run on already-combined output, so merely counting them would be wrong
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static class Partition extends HashPartitioner<Text, IntWritable> {
    // hash-based partitioning
        /**
         * To send all key-value pairs for the same word to the same Reduce node,
         * temporarily take the key apart: split (word, filename) so that the
         * partitioner chooses the Reduce node by word alone.
         */
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // the third parameter, numReduceTasks, is the number of reducers
            String term = key.toString().split("#")[0];// extract word from word#filename
            return super.getPartition(new Text(term), value, numReduceTasks);// assign the reduce node by word
            /* this calls HashPartitioner's partitioning method:
            public int getPartition(K2 key, V2 value, int numReduceTasks) {
    			return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
			  }*/
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, Text> {
        private String lastfile = null;// the previous filename
        private String lastword = null;// the previous word
        private String str = "";// the value string being assembled for output
        private int count = 0;
        private int totalcount = 0;
        /**
         * Exploits the fact that the keys each Reducer receives arrive sorted by word.
         * Split word#filename apart and append the filename together with its sum to str.
         * On each call, compare the current word with the previous one: if they match,
         * append the filename and sum to str; otherwise emit key: word, value: str
         * and carry on with the new word as the key.
         * input:  key: word#filename, value: [NUM,NUM,...]
         * output: key: word, value: <filename,NUM>;<filename,NUM>;...
         */
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            String[] tokens = key.toString().split("#");// tokens[0] is the word, tokens[1] the filename
            if(lastword == null) {
            	lastword = tokens[0];
            }
            if(lastfile == null) {
            	lastfile = tokens[1];
            }
            if (!tokens[0].equals(lastword)) {// new word: finish and emit the previous one
                str += "<"+lastfile+","+count+">;<total,"+totalcount+">.";
                context.write(new Text(lastword), new Text(str));// emit the assembled value
                lastword = tokens[0];// switch to the new word
                lastfile = tokens[1];// switch to the new filename
                count = 0;
                str = "";
                for (IntWritable val : values) {// accumulate the counts for this word#filename
                	count += val.get();// convert to int
                }
                totalcount = count;
                return;
            }

            if(!tokens[1].equals(lastfile)) {// same word, new document
            	str += "<"+lastfile+","+count+">;";
            	lastfile = tokens[1];// switch to the new filename
            	count = 0;// reset the per-document count
            	for (IntWritable value : values){// accumulate the counts
            		count += value.get();// convert to int
                }
            	totalcount += count;
            	return;
            }

            // otherwise just keep accumulating the totals
            for (IntWritable val : values) {
            	count += val.get();
            	totalcount += val.get();
            }
        }

        /**
         * reduce() above only emits a word once it encounters the next one,
         * so the very last word still needs extra handling.
         * Override cleanup() to emit it.
         */
        public void cleanup(Context context) throws IOException, InterruptedException {
            if (lastword != null) {// guard against empty input
                str += "<"+lastfile+","+count+">;<total,"+totalcount+">.";
                context.write(new Text(lastword), new Text(str));
            }
            super.cleanup(context);
        }
    }
	
	public static void main(String args[]) throws Exception {
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://localhost:9000");
        //String[] otherArgs = new String[]{"/input/ex2/txt_input", "/output/ex2"}; // hard-coded run arguments, if desired
        //String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if(args.length != 2) {
            System.err.println("Usage: InvertedIndex <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "InvertedIndex");// set up the job
        job.setJarByClass(InvertedIndex.class);// main class of the program
        job.setMapperClass(Map.class);// Mapper class
        job.setCombinerClass(Combine.class);// Combiner class
        job.setPartitionerClass(Partition.class);// Partitioner class
        job.setReducerClass(Reduce.class);// Reducer class
        job.setMapOutputKeyClass(Text.class);// Mapper output key type
        job.setMapOutputValueClass(IntWritable.class);// Mapper output value type
        job.setOutputKeyClass(Text.class);// Reducer output key type
        job.setOutputValueClass(Text.class);// Reducer output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));// input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));// output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);// true: report job and task progress while running
	}
	
}
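A typical invocation might look like the following; the jar name is a placeholder, while the input and output paths match the ones commented out in main():

    hadoop jar invertedindex.jar test2.InvertedIndex /input/ex2/txt_input /output/ex2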

 

Appendix: the stop word list used (stop_words_eng.txt)

a
aaaaah	
aaahhh	
aaaoe	
aah	
about
above
ac
according
accordingly
across
actually
ad
adj
af
after
afterwards
again
against
al
albeit
all
almost
alone
along
already
als
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anybody
anyhow
anyone
anything
anyway
anywhere
ap
apart
apparently
are
aren
arise
around
as
aside
at
au
auf
aus
aux
av
avec
away
b
back
be
became
because
become
becomes
becoming
been
before
beforehand
began
begin
beginning
begins
behind
bei
being
below
beside
besides
best
better
between
beyond
bill
billion
both
bottom
briefly
but
by
c
call
came
can
cannot
canst
cant
caption
captions
certain
certainly
cf
choose
chooses
choosing
chose
chosen
clear
clearly
co
come
comes
computer
con
contrariwise
cos
could
couldn
couldnt
cry
cu
d
da
dans
das
day
de
degli
dei
del
della
delle
dem
den
der
deren
des
describe
detail
di
did
didn
die
different
din
do
does
doesn
doing
don
done
dos
dost
double
down
du
dual
due
durch
during
e
each
ed
eg
eight
eighty
either
el
eleven
else
elsewhere
em
empty
en
end
ended
ending
ends
enough
es
especially
et
etc
even
ever
every
everybody
everyone
everything
everywhere
except
excepted
excepting
exception
excepts
exclude
excluded
excludes
excluding
exclusive
f
fact
facts
far
farther
farthest
few
ff
fifteen
fifty
fify
fill
finally
find
fire
first
five
foer
follow
followed
following
follows
for
former
formerly
forth
forty
forward
found
four
fra
frequently
from
front
fuer
full
further
furthermore
furthest
g
gave
general
generally
get
gets
getting
give
given
gives
giving
go
going
gone
good
got
great
greater
h
had
haedly
half
halves
hardly
has
hasn
hasnt
hast
hath
have
haven
having
he
hence
henceforth
her
here
hereabouts
hereafter
hereby
herein
hereto
hereupon
hers
herself
het
high
higher
highest
him
himself
hindmost
his
hither
how
however
howsoever
hundred
hundreds
i
ie
if
ihre
ii
im
immediately
important
in
inasmuch
inc
include
included
includes
including
indeed
indoors
inside
insomuch
instead
interest
into
inward
is
isn
it
its
itself
j
ja
journal
journals
just
k
kai
keep
keeping
kept
kg
kind
kinds
km
l
la
large
largely
larger
largest
las
last
later
latter
latterly
le
least
les
less
lest
let
like
likely
little
ll
long
longer
los
low
lower
lowest
ltd
m
made
mainly
make
makes
making
many
may
maybe
me
meantime
meanwhile
med
might
mill
million
mine
miss
mit
more
moreover
most
mostly
move
mr
mrs
ms
much
mug
must
my
myself
n
na
nach
name
namely
nas
near
nearly
necessarily
necessary
need
needed
needing
needs
neither
nel
nella
never
nevertheless
new
next
nine
ninety
no
nobody
none
nonetheless
noone
nope
nor
nos
not
note
noted
notes
nothing
noting
notwithstanding
now
nowadays
nowhere
o
obtain
obtained
obtaining
obtains
och
of
off
often
og
ohne
ok
old
om
on
once
onceone
one
only
onto
or
ot
other
others
otherwise
ou
ought
our
ours
ourselves
out
outside
over
overall
owing
own
p
par
para
part
particular
particularly
past
per
perhaps
please
plenty
plus
por
possible
possibly
pour
poured
pouring
pours
predominantly
previously
pro
probably
prompt
promptly
provide
provided
provides
providing
put
q
quite
r
rather
re
ready
really
recent
recently
regardless
relatively
respectively
reuters
round
s
said
same
sang
save
saw
say
second
see
seeing
seem
seemed
seeming
seems
seen
sees
seldom
self
selves
send
sending
sends
sent
serious
ses
seven
seventy
several
shall
shalt
she
short
should
shouldn
show
showed
showing
shown
shows
si
side
sideways
significant
similar
similarly
simple
simply
since
sincere
sing
single
six
sixty
sleep
sleeping
sleeps
slept
slew
slightly
small
smote
so
sobre
some
somebody
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
spake
spat
speek
speeks
spit
spits
spitting
spoke
spoken
sprang
sprung
staves
still
stop
strongly
substantially
successfully
such
sui
sulla
sung
supposing
sur
system
t
take
taken
takes
taking
te
ten
tes
than
that
the
thee
their
theirs
them
themselves
then
thence
thenceforth
there
thereabout
thereabouts
thereafter
thereby
therefor
therefore
therein
thereof
thereon
thereto
thereupon
these
they
thick
thin
thing
things
third
thirty
this
those
thou
though
thousand
thousands
three
thrice
through
throughout
thru
thus
thy
thyself
til
till
time
times
tis
to
together
too
top
tot
tou
toward
towards
trillion
trillions
twelve
twenty
two
u
ueber
ugh
uit
un
unable
und
under
underneath
unless
unlike
unlikely
until
up
upon
upward
us
use
used
useful
usefully
user
users
uses
using
usually
v
van
various
ve
very
via
vom
von
voor
vs
w
want
was
wasn
way
ways
we
week
weeks
well
went
were
weren
what
whatever
whatsoever
when
whence
whenever
whensoever
where
whereabouts
whereafter
whereas
whereat
whereby
wherefore
wherefrom
wherein
whereinto
whereof
whereon
wheresoever
whereto
whereunto
whereupon
wherever
wherewith
whether
whew
which
whichever
whichsoever
while
whilst
whither
who
whoever
whole
whom
whomever
whomsoever
whose
whosoever
why
wide
widely
will
wilt
with
within
without
won
worse
worst
would
wouldn
wow
x
xauthor
xcal
xnote
xother
xsubj
y
ye
year
yes
yet
yipee
you
your
yours
yourself
yourselves
yu
z
za
ze
zu
zum

 
