Search Engine: The Search Functionality (Part 3)

SearchEngine · 王宇璇/submit - Gitee (gitee.com): https://gitee.com/yxuan-wang/submit/tree/master/SearchEngine

The search flow:

Tokenize whatever the user types (a word, a phrase, or a whole sentence), look up each token, collect the resulting hit arrays, merge them, sort the merged entries by weight, and return the results to the front end for display. That is the main idea.

Let me go straight to the problems I found after finishing the first version, and how I dealt with them:

Stop words

1. Because we search directly on the tokenized input, I noticed a bug in the results: whenever the input was a sentence or a phrase, every single document was returned. On closer inspection, the tokenizer also emits the spaces as terms, so every document matches them, which is clearly wrong. Searching for a fix, I learned that there are dedicated stop-word lists: files whose contents are everyday words that carry no real meaning for search. Since the list is fairly small, I have copied it below; you can save it into a new .txt file yourself. (PS: the first line is a single space, so don't delete it.)

 
'd
'll
'm
're
's
't
've
ZT
ZZ
a
a's
able
about
above
abst
accordance
according
accordingly
across
act
actually
added
adj
adopted
affected
affecting
affects
after
afterwards
again
against
ah
ain't
all
allow
allows
almost
alone
along
already
also
although
always
am
among
amongst
an
and
announce
another
any
anybody
anyhow
anymore
anyone
anything
anyway
anyways
anywhere
apart
apparently
appear
appreciate
appropriate
approximately
are
area
areas
aren
aren't
arent
arise
around
as
aside
ask
asked
asking
asks
associated
at
auth
available
away
awfully
b
back
backed
backing
backs
be
became
because
become
becomes
becoming
been
before
beforehand
began
begin
beginning
beginnings
begins
behind
being
beings
believe
below
beside
besides
best
better
between
beyond
big
biol
both
brief
briefly
but
by
c
c'mon
c's
ca
came
can
can't
cannot
cant
case
cases
cause
causes
certain
certainly
changes
clear
clearly
co
com
come
comes
concerning
consequently
consider
considering
contain
containing
contains
corresponding
could
couldn't
couldnt
course
currently
d
date
definitely
describe
described
despite
did
didn't
differ
different
differently
discuss
do
does
doesn't
doing
don't
done
down
downed
downing
downs
downwards
due
during
e
each
early
ed
edu
effect
eg
eight
eighty
either
else
elsewhere
end
ended
ending
ends
enough
entirely
especially
et
et-al
etc
even
evenly
ever
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
f
face
faces
fact
facts
far
felt
few
ff
fifth
find
finds
first
five
fix
followed
following
follows
for
former
formerly
forth
found
four
from
full
fully
further
furthered
furthering
furthermore
furthers
g
gave
general
generally
get
gets
getting
give
given
gives
giving
go
goes
going
gone
good
goods
got
gotten
great
greater
greatest
greetings
group
grouped
grouping
groups
h
had
hadn't
happens
hardly
has
hasn't
have
haven't
having
he
he's
hed
hello
help
hence
her
here
here's
hereafter
hereby
herein
heres
hereupon
hers
herself
hes
hi
hid
high
higher
highest
him
himself
his
hither
home
hopefully
how
howbeit
however
hundred
i
i'd
i'll
i'm
i've
id
ie
if
ignored
im
immediate
immediately
importance
important
in
inasmuch
inc
include
indeed
index
indicate
indicated
indicates
information
inner
insofar
instead
interest
interested
interesting
interests
into
invention
inward
is
isn't
it
it'd
it'll
it's
itd
its
itself
j
just
k
keep
keeps
kept
keys
kg
kind
km
knew
know
known
knows
l
large
largely
last
lately
later
latest
latter
latterly
least
less
lest
let
let's
lets
like
liked
likely
line
little
long
longer
longest
look
looking
looks
ltd
m
made
mainly
make
makes
making
man
many
may
maybe
me
mean
means
meantime
meanwhile
member
members
men
merely
mg
might
million
miss
ml
more
moreover
most
mostly
mr
mrs
much
mug
must
my
myself
n
n't
na
name
namely
nay
nd
near
nearly
necessarily
necessary
need
needed
needing
needs
neither
never
nevertheless
new
newer
newest
next
nine
ninety
no
nobody
non
none
nonetheless
noone
nor
normally
nos
not
noted
nothing
novel
now
nowhere
number
numbers
o
obtain
obtained
obviously
of
off
often
oh
ok
okay
old
older
oldest
omitted
on
once
one
ones
only
onto
open
opened
opening
opens
or
ord
order
ordered
ordering
orders
other
others
otherwise
ought
our
ours
ourselves
out
outside
over
overall
owing
own
p
page
pages
part
parted
particular
particularly
parting
parts
past
per
perhaps
place
placed
places
please
plus
point
pointed
pointing
points
poorly
possible
possibly
potentially
pp
predominantly
present
presented
presenting
presents
presumably
previously
primarily
probably
problem
problems
promptly
proud
provides
put
puts
q
que
quickly
quite
qv
r
ran
rather
rd
re
readily
really
reasonably
recent
recently
ref
refs
regarding
regardless
regards
related
relatively
research
respectively
resulted
resulting
results
right
room
rooms
run
s
said
same
saw
say
saying
says
sec
second
secondly
seconds
section
see
seeing
seem
seemed
seeming
seems
seen
sees
self
selves
sensible
sent
serious
seriously
seven
several
shall
she
she'll
shed
shes
should
shouldn't
show
showed
showing
shown
showns
shows
side
sides
significant
significantly
similar
similarly
since
six
slightly
small
smaller
smallest
so
some
somebody
somehow
someone
somethan
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specifically
specified
specify
specifying
state
states
still
stop
strongly
sub
substantially
successfully
such
sufficiently
suggest
sup
sure
t
t's
take
taken
taking
tell
tends
th
than
thank
thanks
thanx
that
that'll
that's
that've
thats
the
their
theirs
them
themselves
then
thence
there
there'll
there's
there've
thereafter
thereby
thered
therefore
therein
thereof
therere
theres
thereto
thereupon
these
they
they'd
they'll
they're
they've
theyd
theyre
thing
things
think
thinks
third
this
thorough
thoroughly
those
thou
though
thoughh
thought
thoughts
thousand
three
throug
through
throughout
thru
thus
til
tip
to
today
together
too
took
toward
towards
tried
tries
truly
try
trying
ts
turn
turned
turning
turns
twice
two
u
un
under
unfortunately
unless
unlike
unlikely
until
unto
up
upon
ups
us
use
used
useful
usefully
usefulness
uses
using
usually
uucp
v
value
various
very
via
viz
vol
vols
vs
w
want
wanted
wanting
wants
was
wasn't
way
ways
we
we'd
we'll
we're
we've
wed
welcome
well
wells
went
were
weren't
what
what'll
what's
whatever
whats
when
whence
whenever
where
where's
whereafter
whereas
whereby
wherein
wheres
whereupon
wherever
whether
which
while
whim
whither
who
who'll
who's
whod
whoever
whole
whom
whomever
whos
whose
why
widely
will
willing
wish
with
within
without
won't
wonder
words
work
worked
working
works
world
would
wouldn't
www
x
y
year
years
yes
yet
you
you'd
you'll
you're
you've
youd
young
younger
youngest
your
youre
yours
yourself
yourselves
z
zero
zt
zz

Duplicate documents:

For example: if the query is "array list", how should the results be displayed?

If a document contains both "array" and "list", the scheme described above would list that document twice, which is clearly wrong. Instead we should add the two weights together and rank the document once by the combined weight.
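The combining idea can be shown with a tiny standalone sketch. This is not the project's code (the real implementation below uses a k-way merge); the `int[]{docId, weight}` pairs here just stand in for the Weight class:

```java
import java.util.*;

public class WeightMergeDemo {
    // Sum the weights of entries that share a docId. Each hit is
    // represented as int[]{docId, weight} for brevity.
    static Map<Integer, Integer> combine(List<int[]> hits) {
        Map<Integer, Integer> merged = new HashMap<>();
        for (int[] hit : hits) {
            merged.merge(hit[0], hit[1], Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // doc 1 matches both "array" (weight 10) and "list" (weight 5)
        List<int[]> hits = Arrays.asList(
                new int[]{1, 10}, new int[]{2, 3}, new int[]{1, 5});
        System.out.println(combine(hits)); // doc 1 ends up with weight 15
    }
}
```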

As for the stop words themselves, I keep them in a HashSet, so contains() tells us whether a given token is a stop word.

private HashSet<String> stopWords = new HashSet<>();

// Read the stop-word file line by line; every line is one stop word
// (the single-space first line is what filters out the space terms).
public void loadStopWord() {
    try (BufferedReader bf = new BufferedReader(new FileReader(STOP_WORD_PATH))) {
        while (true) {
            String line = bf.readLine();
            if (line == null) {
                break;
            }
            stopWords.add(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

This method loads the stop words.

Don't forget that we also need the forward-index and inverted-index data saved in the previous post.

These prerequisites are essential. (My STOP_WORD_PATH points to a path on my cloud server; change it to wherever the stop-word file lives on your own machine.)

The search skeleton:

Sorting the results by weight:

public List<Result> search(String query) {
    // 1. [Tokenize] split the query and filter stop words out of the tokens
    List<Term> oldTerms = ToAnalysis.parse(query).getTerms();
    List<Term> terms = new ArrayList<>();
    for (Term term : oldTerms) {
        if (stopWords.contains(term.getName())) {
            continue;
        }
        terms.add(term);
        // print the surviving tokens for verification
        System.out.println(term.getName());
    }

    // 2. [Trigger] look each token up in the inverted index;
    // each token's hit list becomes one row of termResults
    List<List<Weight>> termResults = new ArrayList<>();
    for (Term term : terms) {
        String word = term.getName();
        // The inverted index only contains words that actually occurred
        // in some document; a null entry means no document matches.
        List<Weight> invertedList = index.getInverted(word);
        if (invertedList == null) {
            continue;
        }
        termResults.add(invertedList);
    }

    // 3. [Merge] k-way merge of the rows (each sorted by docId),
    // combining entries that share the same docId
    List<Weight> allTermResult = merger(termResults);

    // 4. [Sort] order the merged hits by weight, descending
    allTermResult.sort(new Comparator<Weight>() {
        @Override
        public int compare(Weight o1, Weight o2) {
            // ascending would be o1 - o2; descending is o2 - o1
            return o2.getWeight() - o1.getWeight();
        }
    });

    // 5. [Package] look each hit up in the forward index and build
    // the Result objects the front end expects
    List<Result> results = new ArrayList<>();
    for (Weight weight : allTermResult) {
        DocInfo docInfo = index.getDocInfo(weight.getDocId());
        Result result = new Result();
        result.setTitle(docInfo.getTitle());
        result.setUrl(docInfo.getUrl());
        result.setDesc(GenDesc(docInfo.getContent(), terms));
        results.add(result);
    }
    return results;
}

This method first tokenizes the query; any token found in the stop-word set is skipped, and the remaining tokens are collected into a list (and printed so we can verify the filtering).
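The filtering step can be sketched in isolation. Here a plain list of strings stands in for ansj's Term objects, and the stop-word set is hard-coded instead of loaded from the file:

```java
import java.util.*;

public class StopWordFilterDemo {
    // Keep only the tokens that are not in the stop-word set.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", " "));
        List<String> tokens = Arrays.asList("the", "array", "list");
        System.out.println(filter(tokens, stopWords)); // [array, list]
    }
}
```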

Each remaining token is then looked up in the inverted index, producing one list of Weight objects per token, where a Weight holds a DocId and a weight. If we simply flattened this 2-D structure into one array and sorted by weight, a document that matches several tokens would appear several times, which is exactly the duplicate-document problem described above.

So to merge the documents I use a k-way merge over the rows:

static class Pos {
    public int row;
    public int col;
    public Pos(int row, int col) {
        this.row = row;
        this.col = col;
    }
}

private List<Weight> merger(List<List<Weight>> source) {
    // Merging collapses the rows into a single row; during the merge we
    // address each element by its (row, col) position in the 2-D source.
    // 1. Sort every row by docId, ascending.
    for (List<Weight> curRow : source) {
        curRow.sort(new Comparator<Weight>() {
            @Override
            public int compare(Weight o1, Weight o2) {
                return o1.getDocId() - o2.getDocId();
            }
        });
    }
    // 2.1 Merge the rows with the help of a priority queue.
    // target holds the final merged result.
    List<Weight> target = new ArrayList<>();
    // The queue orders cursors by the docId of the element they point at,
    // so polling always yields the smallest remaining docId.
    PriorityQueue<Pos> queue = new PriorityQueue<>(new Comparator<Pos>() {
        @Override
        public int compare(Pos o1, Pos o2) {
            Weight w1 = source.get(o1.row).get(o1.col);
            Weight w2 = source.get(o2.row).get(o2.col);
            return w1.getDocId() - w2.getDocId();
        }
    });
    // 2.2 Seed the queue with the head element of every row
    // (skip empty rows so the comparator never indexes out of range).
    for (int row = 0; row < source.size(); row++) {
        if (!source.get(row).isEmpty()) {
            queue.offer(new Pos(row, 0));
        }
    }
    // 2.3 Repeatedly take the cursor with the smallest docId.
    while (!queue.isEmpty()) {
        Pos minPos = queue.poll();
        Weight curWeight = source.get(minPos.row).get(minPos.col);
        // 2.4 If this docId matches the last entry appended to target,
        // merge the two weights instead of adding a duplicate entry.
        if (!target.isEmpty()) {
            Weight lastWeight = target.get(target.size() - 1);
            if (lastWeight.getDocId() == curWeight.getDocId()) {
                lastWeight.setWeight(lastWeight.getWeight() + curWeight.getWeight());
            } else {
                target.add(curWeight);
            }
        } else {
            target.add(curWeight);
        }
        // Advance the cursor to the next element of the same row.
        Pos newPos = new Pos(minPos.row, minPos.col + 1);
        if (newPos.col >= source.get(newPos.row).size()) {
            // This row is exhausted; the cursors still in the queue
            // finish the merge on their own.
            continue;
        }
        queue.offer(newPos);
    }
    return target;
}

This code implements the idea described above.

First, an inner class Pos addresses an element of the 2-D array by row and column.

Each row is sorted ascending by DocId, and a priority queue is created whose ordering compares the DocIds of the elements the cursors point at, also ascending.

Then the head element of every row is offered to the queue. Polling yields the cursor with the smallest DocId; its element is either merged with the previous one (same DocId) or appended (different DocId). Finally the cursor is advanced one step, i.e. (row, col + 1), and offered back to the queue.
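The same mechanism, stripped to its essentials: a k-way merge of individually sorted integer lists, with `int[]{row, col}` cursors standing in for the Pos class. This is a sketch, not the project's code, but the queue discipline is identical:

```java
import java.util.*;

public class KWayMergeDemo {
    // Merge several individually-sorted lists into one sorted list,
    // using a priority queue of (row, col) cursors ordered by the
    // value each cursor currently points at.
    static List<Integer> merge(List<List<Integer>> source) {
        List<Integer> target = new ArrayList<>();
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                Comparator.comparingInt(p -> source.get(p[0]).get(p[1])));
        // Seed the queue with the head of every non-empty row.
        for (int row = 0; row < source.size(); row++) {
            if (!source.get(row).isEmpty()) {
                queue.offer(new int[]{row, 0});
            }
        }
        // Repeatedly take the smallest element and advance its cursor.
        while (!queue.isEmpty()) {
            int[] min = queue.poll();
            target.add(source.get(min[0]).get(min[1]));
            if (min[1] + 1 < source.get(min[0]).size()) {
                queue.offer(new int[]{min[0], min[1] + 1});
            }
        }
        return target;
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
                Arrays.asList(1, 4, 7), Arrays.asList(2, 3, 9));
        System.out.println(merge(rows)); // [1, 2, 3, 4, 7, 9]
    }
}
```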

allTermResult.sort(new Comparator<Weight>() {
    @Override
    public int compare(Weight o1, Weight o2) {
        // Descending order: o2 - o1 (ascending would be o1 - o2).
        return o2.getWeight() - o1.getWeight();
    }
});

The merged results are sorted by weight for display, and this part of the code is done.
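One caveat worth knowing: the `o2.getWeight() - o1.getWeight()` subtraction trick can overflow if the two ints are far apart in sign, which silently breaks the ordering; `Integer.compare` is the overflow-safe alternative. A small demonstration (assuming the weights are plain ints, as in the code above):

```java
import java.util.*;

public class SafeComparatorDemo {
    // Sort descending using Integer.compare, which never overflows,
    // unlike the "o2 - o1" subtraction trick.
    static List<Integer> sortDescending(List<Integer> weights) {
        List<Integer> copy = new ArrayList<>(weights);
        copy.sort((o1, o2) -> Integer.compare(o2, o1));
        return copy;
    }

    public static void main(String[] args) {
        // With subtraction, Integer.MIN_VALUE - 7 wraps around to a large
        // positive value and MIN_VALUE would sort into the wrong place.
        System.out.println(sortDescending(Arrays.asList(3, Integer.MIN_VALUE, 7)));
        // [7, 3, -2147483648]
    }
}
```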

Displaying title, URL, and description:

Most result pages consist of a title, a URL, and a description, so we create a Result class made of those same three parts (remember to generate the getters and setters). We collect one Result per search hit into a list and return that list to the front end.
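For reference, a minimal sketch of what such a Result class could look like; the field names are inferred from the packaging code, so adjust them to match your project:

```java
public class Result {
    private String title;
    private String url;
    private String desc;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getDesc() { return desc; }
    public void setDesc(String desc) { this.desc = desc; }

    @Override
    public String toString() {
        // A readable form for the verification main method below.
        return "Result{title='" + title + "', url='" + url + "', desc='" + desc + "'}";
    }
}
```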

List<Result> results = new ArrayList<>();
for (Weight weight : allTermResult) {
    DocInfo docInfo = index.getDocInfo(weight.getDocId());
    Result result = new Result();
    result.setTitle(docInfo.getTitle());
    result.setUrl(docInfo.getUrl());
    result.setDesc(GenDesc(docInfo.getContent(), terms));
    results.add(result);
}
return results;

Setting the title and URL is trivial: look up the DocInfo via the Weight's DocId and copy the fields.

Generating the description is the fiddly part.

Generating the description:

Clearly, we need to locate the search term in the body text and return its surrounding context.

My first idea: since we want a standalone occurrence of the keyword, isn't that just the word with a space on each side?

For example, when searching for list, the description should contain list as an independent word, which avoids matching inside words like ArrayList. But later testing revealed a problem: matching on " " + list + " " misses any occurrence adjacent to punctuation, such as (list), so for those documents no description could be generated at all.

At this point you are probably thinking of regular expressions, but there is an awkward catch: indexOf does not support regex matching. This stumped me for a while, until I found a simple fix online: use replaceAll (which does support regex) with \b around the keyword to normalize punctuation-adjacent occurrences into space-delimited ones, and then indexOf can locate the keyword as before.

About \b: it is a zero-width word-boundary assertion. It matches the position between a word character (0-9, a-z, A-Z, or _) and a non-word character, so it anchors the keyword against spaces and punctuation alike.
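A quick demonstration of why the \b normalization helps. This is a standalone sketch; `normalize` is a hypothetical helper that mirrors the replaceAll call used in GenDesc below:

```java
public class WordBoundaryDemo {
    // Turn punctuation-adjacent occurrences of word into space-delimited
    // ones, the same trick GenDesc uses before calling indexOf.
    static String normalize(String content, String word) {
        return content.replaceAll("\\b" + word + "\\b", " " + word + " ");
    }

    public static void main(String[] args) {
        String content = "see the (list) of items";
        // The naive search misses "list" because of the parentheses...
        System.out.println(content.indexOf(" list ")); // -1
        // ...but after \b-based normalization it is found:
        // "(list)" becomes "( list )".
        System.out.println(normalize(content, "list").indexOf(" list ") >= 0); // true
    }
}
```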

I then decided: if the search term occurs in the body, return roughly 160 characters around it as the description; otherwise return an empty string.

private String GenDesc(String content, List<Term> terms) {
    int firstPos = -1;
    // 1. Walk the tokens and find the first one that occurs in the body.
    for (Term term : terms) {
        // The tokenizer already lower-cases the tokens, so lower-case
        // the content before searching too.
        String word = term.getName();
        // A space on each side marks a standalone word, but
        // " " + word + " " alone misses occurrences adjacent to
        // punctuation, e.g. "(list)". \b matches the boundary between
        // word characters and spaces or punctuation; indexOf can't take
        // a regex, but replaceAll can, so normalize those occurrences
        // into space-delimited ones first.
        // Note: replaceAll returns a new string, so the result must be
        // assigned back (the original version discarded it).
        content = content.toLowerCase().replaceAll("\\b" + word + "\\b", " " + word + " ");
        firstPos = content.indexOf(" " + word + " ");
        if (firstPos >= 0) {
            // Found a position.
            break;
        }
    }
    if (firstPos == -1) {
        // None of the tokens occur in the body. The chance of this is
        // tiny, so just return an empty description.
        return "";
    }
    String desc = "";
    int descBeg = firstPos < 60 ? 0 : firstPos - 60;
    if (descBeg + 160 < content.length()) {
        // Snippet is cut off mid-body, so mark it with an ellipsis.
        desc = content.substring(descBeg, descBeg + 160) + "...";
    } else {
        desc = content.substring(descBeg);
    }
    // Wrap every token occurrence in the description with <i> tags so
    // the front end can highlight it; replaceAll handles this as well.
    for (Term term : terms) {
        String word = term.getName();
        desc = desc.replaceAll("(?i) " + word + " ", "<i> " + word + " </i>");
    }
    return desc;
}


Verification:

The back-end code is now complete; you can add a main method to try it out:

public static void main(String[] args) {
    DocSearcher searcher = new DocSearcher();
    Scanner sc = new Scanner(System.in);
    while (true) {
        System.out.println("Enter a query -> ");
        // nextLine (rather than next) keeps multi-word queries together
        String query = sc.nextLine();
        List<Result> results = searcher.search(query);
        for (Result result : results) {
            System.out.println("===========================");
            System.out.println(result);
        }
    }
}

If you hit a bug, you can compare against my source code on Gitee.
