Python News Crawler: A Python-Based Web News Crawler and Retrieval System

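Longyuan Journal Network (龙源期刊网)

http://www.qikan.com.cn

A Python-Based Web News Crawler and Retrieval System

Authors: Chen Huan (陈欢), Huang Bo (黄勃), Liu Wenzhu (刘文竹)

Source: Software Guide (《软件导刊》), 2019, Issue 05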

Abstract: The Internet hosts a large number of news portals carrying a vast amount of news, which causes serious news information overload. To address this problem, this paper designs a Python-based system for collecting and retrieving online news. The system uses the Scrapy web crawler framework to collect news, deduplicates news links and titles during collection, and finally uses the Solr search service to perform full-text retrieval over the crawled news data. Compared with traditional methods, the system's deduplication method not only guarantees that links are not repeated but also deduplicates titles, and the introduction of the Solr search service helps readers find the news they want to read more quickly.

Key words: web crawler; information retrieval; Scrapy; Solr; data deduplication

DOI: 10.11907/rjdk.191232

CLC number: TP393

Document code: A

Article ID: 1672-7800(2019)005-0168-04
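The two-level deduplication described in the abstract (links first, then titles) can be illustrated with a small sketch. This is only an assumption about how such a filter might look, not the authors' implementation: the `NewsDeduplicator` class, the `normalize_url` and `fingerprint` helpers, the SHA-1 fingerprints, and the example URLs are all hypothetical and use only the Python standard library rather than Scrapy itself.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case the scheme and host and drop the fragment (an assumed normalization)."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def fingerprint(text):
    """Return a fixed-length SHA-1 hex digest to use as a set key."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class NewsDeduplicator:
    """Hypothetical helper: rejects an item if its link OR its title was seen before."""

    def __init__(self):
        self.seen_links = set()
        self.seen_titles = set()

    def is_new(self, url, title):
        link_fp = fingerprint(normalize_url(url))
        # Collapse runs of whitespace so trivially reformatted titles compare equal.
        title_fp = fingerprint(" ".join(title.split()))
        if link_fp in self.seen_links or title_fp in self.seen_titles:
            return False
        self.seen_links.add(link_fp)
        self.seen_titles.add(title_fp)
        return True


dedup = NewsDeduplicator()
items = [
    ("http://news.example.com/a/1.html", "Flood warning issued"),
    # Same link as the first item once host case and fragment are normalized.
    ("http://NEWS.example.com/a/1.html#top", "Flood warning issued (updated)"),
    # Different link, but the same title apart from extra whitespace.
    ("http://mirror.example.org/x/9.html", "Flood  warning issued"),
    ("http://news.example.com/a/2.html", "Markets close higher"),
]
kept = [(url, title) for url, title in items if dedup.is_new(url, title)]
```

Here only the first and last items survive: the second is rejected by the link check and the third by the title check. Exact title matching is a deliberate simplification; a real system might prefer a fuzzier comparison, which this sketch does not attempt.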

0 Introduction

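As a record of social events, news is one of the most common forms of writing that reflect everyday life, and it is of great significance. In the Internet era, online news spreads quickly, covers many perspectives, and comes from many channels. While this gives the public rapid access to news, the sheer volume also causes information overload, which can actually prevent the public from gaining a full picture of the facts behind a news event. The problem this paper addresses is how to extract high-quality news from massive news data and help readers quickly find the online news that interests them.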

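A web crawler, also known as a web spider or web robot, is a program or script that automatically fetches information from the web according to a set of rules. Crawler technology makes it possible to obtain massive amounts of web data and store it in structured form [1-2]. Reference [3] crawls news sites with a focused (topic) crawler; unlike the approach in this paper, a focused crawler judges the value of a page by computing the relevance between the page and a topic. Reference [4], working from web crawler principles and crawling algorithms, covers basic information such as data storage in a comprehensive and detailed …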
