Now let's actually start scraping NetEase news
- Scrape NetEase's domestic, world, military, and government channels
- News is time-sensitive: within a given window the content changes little, so the pages should be re-crawled on a schedule rather than continuously
- To avoid re-fetching pages we have already scraped after the program restarts, we persist the set of scraped URLs to a pkl file while crawling
- We also want to extract keywords from the news content wherever possible
Let's build the program up step by step.
First, revise the spider file:
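The dedup-and-persist idea above can be sketched independently of Scrapy: keep a set of scraped URLs, restore it at startup, and dump it back to a pkl file when done. The file name `scraped_urls.pkl` below is just an illustration, not the project's actual setting:

```python
import os
import pickle

PKL_PATH = "scraped_urls.pkl"  # illustrative path, not from the project settings

def load_seen(path=PKL_PATH):
    """Restore the set of already-scraped URLs, or start empty."""
    if os.path.isfile(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return set()

def save_seen(seen, path=PKL_PATH):
    """Persist the set so a restarted crawler skips known pages."""
    with open(path, "wb") as handle:
        pickle.dump(seen, handle)

seen = load_seen()
url = "http://news.163.com/example.html"  # made-up URL for the demo
if url not in seen:
    seen.add(url)
save_seen(seen)
```

After a restart, `load_seen()` returns the same set, so the URL is skipped the second time around.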
#!/usr/bin/env python
# coding=utf-8
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from News_Scrapy.items import NewsScrapyItem
from scrapy.conf import settings
import os
import pickle
import signal
import sys

# Module-level set of URLs that have already been scraped,
# restored from the pkl file if a previous run saved one.
SAVED_URL = set()
if os.path.isfile(settings["SAVED_URL_PATH"]):
    with open(settings["SAVED_URL_PATH"], "rb") as handle:
        SAVED_URL = pickle.load(handle)

def save_url_pkl(sig, frame):
    """On Ctrl-C, persist the scraped-URL set before exiting."""
    with open(settings["SAVED_URL_PATH"], "wb") as handle:
        pickle.dump(SAVED_URL, handle)
    sys.exit(0)

signal.signal(signal.SIGINT, save_url_pkl)

class NetEaseSpider(CrawlSpider):
    name = "News_Scrapy"
    # Include all three hosts, otherwise the offsite middleware
    # drops links to war.163.com and gov.163.com.
    allowed_domains = ["news.163.com", "war.163.com", "gov.163.com"]
    start_urls = [
        "http://news.163.com/domestic/",
        "http://news.163.com/world/",
        "http://news.163.com/shehui/",
        "http://war.163.com/",
        "http://gov.163.com/",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'http://news.163.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')), callback="parse_item"),
        Rule(SgmlLinkExtractor(allow=(r'http://war.163.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')), callback="parse_item"),
        Rule(SgmlLinkExtractor(allow=(r'http://gov.163.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')), callback="parse_item"),
    ]

    def parse_item(self, response):
        # Skip pages that were already scraped in a previous run.
        if response.url not in SAVED_URL:
            SAVED_URL.add(response.url)
            sel_resp = Selector(response)
            news_item = NewsScrapyItem()
            news_item["news_title"] = sel_resp.xpath('//*[@id="h1title"]/text()').extract()
            news_item["news_date"] = sel_resp.xpath('//*[@id="epContentLeft"]/div[1]/div[1]/text()').re(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
            news_item["news_source"] = sel_resp.xpath('//*[@id="ne_article_source"]/text()').extract()
            news_item["news_content"] = sel_resp.xpath('//*[@id="endText"]').extract()
            return news_item
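The article-URL pattern in the rules and the timestamp pattern in parse_item can be sanity-checked with plain `re`; the sample URL and byline string below are made up for illustration:

```python
import re

# Same patterns as in the spider, with the dots escaped.
URL_RE = re.compile(r'http://news\.163\.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')
DATE_RE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')

sample_url = "http://news.163.com/15/0820/10/B1CS9TBC0001124J.html"  # made-up article URL
sample_byline = "2015-08-20 10:32:04 来源: 新华网"                     # made-up byline text

assert URL_RE.match(sample_url)
assert DATE_RE.search(sample_byline).group() == "2015-08-20 10:32:04"
```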
Extracting keywords from the news content
- We use jieba, an open-source Chinese word-segmentation library
If jieba cannot be installed directly with pip or easy_install, we install it manually:
- First, locate the site-packages directory of the local Python installation
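Before jieba is installed, the idea behind keyword extraction can be sketched with a plain frequency count over already-segmented tokens. This toy stand-in is stdlib-only; jieba's `extract_tags` does far better by segmenting the raw text itself and ranking terms with TF-IDF:

```python
from collections import Counter

def naive_keywords(words, topk=3):
    """Toy stand-in for jieba-style keyword extraction:
    rank tokens by raw frequency. Assumes the text is already
    split into words, which jieba would do for us."""
    return [w for w, _ in Counter(words).most_common(topk)]

# Made-up token stream for the demo.
tokens = ["新闻", "内容", "新闻", "军事", "新闻", "军事", "国际"]
print(naive_keywords(tokens))  # most frequent tokens first
```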
fighter@pc:~$ ipython  # launch ipython
In [1]: import site; site.getsitepackages()
Out[1]: ['/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']
- Change into the dist-packages directory under /usr/local
fighter@pc:/usr/local/lib/python2.7/dist-packages$ sudo git clone https://github.com/fxsjy/jieba.git
- Alternatively, clone jieba into any directory and install it from there
fighter@pc:~/Downloads$ git clone https://github.com/fxsjy/jieba.git
Cloning into 'jieba'...
remote: Counting objects: 2287, done.
remote: Total 2287 (delta 0), reused 0 (delta