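1. Understanding the inverted index
Consider two short documents:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
Split each document into words, then record which documents each word appears in: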
word     doc1  doc2
I        *     *
really   *
liked    *     *
my       *     *
small    *
dogs     *     *
and      *
think    *
mom      *     *
also     *
them     *
He             *
never          *
any            *
so             *
hope           *
that           *
will           *
not            *
expect         *
me             *
to             *
him            *
The process above is how an inverted index is built.
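The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the tokenize and build_index helpers are hypothetical names, and splitting on non-letter characters is an assumption made for this example, not how any particular search engine tokenizes text.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Assumption: a word is any run of ASCII letters; case is kept as written
    return re.findall(r"[A-Za-z]+", text)

def build_index(documents):
    # Map each word to the set of documents that contain it
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

index = build_index(docs)
for word in ("liked", "small", "him"):
    print(word, sorted(index[word]))
```

Each printed line corresponds to one row of the table: liked appears in both documents, small only in doc1, and him only in doc2.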
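Searching this index for mother like little dog returns nothing, because none of the four query words appears in it. That is certainly not the result we want:
mother and mom: synonyms
like and liked: different tenses
little and small: synonyms
dog and dogs: singular and plural
In each pair, the two words mean essentially the same thing.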
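2. Normalization
When the inverted index is built, each word produced by splitting goes through an extra processing step called normalization. Its purpose is to raise the chance that a later search finds the related documents.
Normalization converts synonyms, tenses, singular and plural forms, and so on, to a common form:
mom -> mother
liked -> like
small -> little
dogs -> dog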
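The four rewrites above can be expressed as a simple lookup. This is a minimal sketch: the NORMALIZATION table and the normalize function are hypothetical names for this example, and the table holds only the four rules listed above. A real search engine would typically rely on stemming and synonym filters rather than a hand-written table.

```python
# Hypothetical lookup table holding only the four rewrites listed above
NORMALIZATION = {
    "mom": "mother",
    "liked": "like",
    "small": "little",
    "dogs": "dog",
}

def normalize(word):
    # Words without an entry pass through unchanged
    return NORMALIZATION.get(word, word)

print([normalize(w) for w in ["my", "mom", "liked", "small", "dogs"]])
# prints ['my', 'mother', 'like', 'little', 'dog']
```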
Rebuild the inverted index with normalization applied, then search for mother like little dog again, and the documents are found.
Building the inverted index with normalization:
word     doc1  doc2  normalized to
I        *     *
really   *
liked    *     *     like
my       *     *
small    *           little
dogs     *     *     dog
and      *
think    *
mom      *     *     mother
also     *
them     *
He             *
never          *
any            *
so             *
hope           *
that           *
will           *
not            *
expect         *
me             *
to             *
him            *
This way, the search finds both doc1 and doc2.
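The difference normalization makes can be checked end to end. The sketch below builds the index twice, once without and once with normalization, and runs the same query against both. The analyze, build_index and search helpers are hypothetical, the lookup table holds only the four rules from these notes, and the search assumes a document matches when it contains at least one query term.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical table with only the four rewrites from these notes
NORMALIZATION = {"mom": "mother", "liked": "like", "small": "little", "dogs": "dog"}

def analyze(text, normalize=True):
    # Split into words, then optionally rewrite each word to its normalized form
    words = re.findall(r"[A-Za-z]+", text)
    if normalize:
        words = [NORMALIZATION.get(w, w) for w in words]
    return words

def build_index(documents, normalize=True):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text, normalize):
            index[term].add(doc_id)
    return index

def search(index, query, normalize=True):
    # Assumption: a document matches if it contains at least one query term
    hits = set()
    for term in analyze(query, normalize):
        hits |= index.get(term, set())
    return sorted(hits)

raw_index = build_index(docs, normalize=False)
normalized_index = build_index(docs, normalize=True)
query = "mother like little dog"

print(search(raw_index, query, normalize=False))  # prints []
print(search(normalized_index, query))            # prints ['doc1', 'doc2']
```

The query goes through the same analyze step as the documents, so the terms stored in the index and the terms looked up at search time stay in the same form.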