A small search engine built with the Vue front-end framework, the Scrapy crawler framework, and jieba word segmentation

Latest recommended article published on 2023-10-10 10:18:42

xujingguo58

Category: search engine. Tags: search engine, front-end framework, crawler, php, jieba word segmentation

Article link: https://blog.csdn.net/xu90868075/article/details/70332242

Tiny search engine (tinySearchEngine)

A tiny search engine with a vue.js front end, built on the scrapy crawler framework, jieba word segmentation, and php.

Build Setup

# install dependencies
npm install

# serve with hot reload at localhost:8080
npm run dev

# build for production with minification
npm run build

# build for production and view the bundle analyzer report
npm run build --report

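Overall implementation

The overall flow is as follows:

1. The crawler fetches web page data and saves it to a file.

2. Python reads the file and stores its contents in a database table. It then segments the page content with jieba, obtains TF-IDF values, and builds an inverted index that is also saved to the database.

3. The front end accepts the user's input and sends it to the backend with a POST request.

4. The backend segments the received query, looks up each term in the inverted index database, and takes the union of the results. It then uses those results to query the results database and returns the details of each matching page.

5. The front end renders the results once the response arrives.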
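As a rough illustration of steps 2 and 4, the sketch below builds a tiny in-memory inverted index and answers a query by taking the union of the postings for each query term. It is a minimal sketch, not the author's code: the `pages` data, the `build_index` and `search` helpers, and the whitespace `tokenize` function (standing in for jieba) are all assumptions made for illustration, and the weights are raw term counts rather than TF-IDF values.

```python
from collections import defaultdict

# Hypothetical crawled pages: page id -> page details (the "results" table).
pages = {
    1: {"url": "http://example.com/a", "title": "vue search demo"},
    2: {"url": "http://example.com/b", "title": "scrapy crawler notes"},
    3: {"url": "http://example.com/c", "title": "search engine with scrapy"},
}


def tokenize(text):
    # Stand-in for jieba: split on whitespace (assumption, not the real tokenizer).
    return text.lower().split()


def build_index(pages):
    # Inverted index: term -> {page id: weight}. The weight here is a raw count.
    index = defaultdict(dict)
    for page_id, page in pages.items():
        for term in tokenize(page["title"]):
            index[term][page_id] = index[term].get(page_id, 0) + 1
    return index


def search(query, index, pages):
    # Union of the postings of every query term, scored by summed weight.
    scores = defaultdict(int)
    for term in tokenize(query):
        for page_id, weight in index.get(term, {}).items():
            scores[page_id] += weight
    # Highest score first; ties are broken by page id.
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    # Look up the page details for each matching id.
    return [pages[pid] for pid in ranked]


index = build_index(pages)
results = search("scrapy search", index, pages)
```

Because the results are a union, a page that matches only one query term is still returned; pages matching more terms simply rank higher.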

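Implementation details

1. Crawler

The crawler uses scrapy, a Python crawling library that works after only a little configuration. For recursive crawling, the spider can subclass CrawlSpider, as in class DmozSpider(CrawlSpider).

The page data to collect is mainly url, title, description, and keywords. The extraction is configured as follows: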

# url is not set in the original snippet; response.url is assumed here
item['url'] = response.url
item['title'] = response.selector.xpath('//title/text()').extract()
item['keywords'] = response.selector.xpath('//meta[@name="keywords"]/@content').extract()
item['description'] = response.selector.xpath('//meta[@name="description"]/@content').extract()
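To see what these XPath expressions select without running a crawl, here is a minimal sketch that applies equivalent paths with the standard library's `xml.etree.ElementTree`. This is a stand-in, not scrapy: ElementTree supports only a subset of XPath and needs well-formed markup, whereas scrapy's selectors cope with real-world HTML. The sample page is hypothetical. As with `.extract()`, each field ends up as a list of strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this illustration.
html = """<html>
  <head>
    <title>Tiny search engine</title>
    <meta name="keywords" content="vue,scrapy,jieba" />
    <meta name="description" content="A small demo page" />
  </head>
  <body><p>hello</p></body>
</html>"""

root = ET.fromstring(html)

item = {
    # Same idea as //title/text()
    "title": [el.text for el in root.findall(".//title")],
    # Same idea as //meta[@name="keywords"]/@content
    "keywords": [
        el.get("content") for el in root.findall(".//meta[@name='keywords']")
    ],
    # Same idea as //meta[@name="description"]/@content
    "description": [
        el.get("content") for el in root.findall(".//meta[@name='description']")
    ],
}
```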

To hold the data, the item fields also need to be defined. Add the following to items.py:

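import scrapy

class DmozItem(scrapy.Item):  # class name assumed; the original shows only the fields
    url = scrapy.Field()
    title = scrapy.Field()
    keywords = scrapy.Field()
    description = scrapy.Field()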

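Running scrapy crawl dmoz -o items.json -t json in a terminal saves the data to items.json. Recent scrapy versions infer the format from the file extension, so -o items.json alone should be enough there.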
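The next step in the flow, reading the exported file into a database table, could look roughly like the sketch below. It makes several assumptions: SQLite (from the standard library) stands in for whatever database the project actually uses, the `pages` table layout and the `first` helper are hypothetical, and a small sample file is written first so the example runs on its own.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical sample of what the crawl might export; most fields are lists
# because .extract() returns a list of strings.
sample = [
    {"url": "http://example.com/a", "title": ["Page A"],
     "keywords": ["vue,search"], "description": ["First page"]},
    {"url": "http://example.com/b", "title": ["Page B"],
     "keywords": [], "description": ["Second page"]},
]
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)


def first(value):
    # Collapse a list field to its first entry, or "" when it is empty.
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""


conn = sqlite3.connect(":memory:")  # SQLite stands in for the real database
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, title TEXT,"
    " keywords TEXT, description TEXT)"
)

with open(path, encoding="utf-8") as f:
    items = json.load(f)

# Parameterized inserts keep page text from being interpreted as SQL.
conn.executemany(
    "INSERT INTO pages (url, title, keywords, description) VALUES (?, ?, ?, ?)",
    [
        (first(it.get("url")), first(it.get("title")),
         first(it.get("keywords")), first(it.get("description")))
        for it in items
    ],
)
conn.commit()
rows = conn.execute("SELECT url, title, keywords FROM pages ORDER BY id").fetchall()
```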

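2. Word segmentation

For word segmentation I used jieba in the Python environment. After considering several segmenters, I settled on jieba mainly because it is simple to install (directly via pip), easy to use, and has community-contributed ports to other languages (the backend uses the php port of jieba).

jieba directly provides keyword extraction based on the TF-IDF algorithm: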

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
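To make the meaning of a TF-IDF weight concrete, the sketch below computes one from scratch on a toy corpus. It is an illustration of the idea only, not jieba's implementation: jieba ships its own IDF table, while this example derives IDF from the three hypothetical documents and uses the plain `log(N / df)` variant. The `docs`, `tf_idf`, and `top_terms` names are assumptions for the example.

```python
import math
from collections import Counter

# Hypothetical, already-segmented documents (jieba would produce the tokens).
docs = {
    1: ["search", "engine", "vue"],
    2: ["search", "crawler", "scrapy", "scrapy"],
    3: ["crawler", "engine", "search"],
}


def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        # TF is the term's share of the document; IDF is log(N / df).
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def top_terms(weights, doc_id, top_k=2):
    # Highest-weighted terms first, ties broken alphabetically.
    ranked = sorted(weights[doc_id].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


weights = tf_idf(docs)
keywords = top_terms(weights, 2)
```

A term that appears in every document, such as "search" here, gets a weight of zero, which is why it never surfaces as a keyword.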
