Now let's actually start scraping NetEase news
- Scrape NetEase's domestic, world, military, and government channels
- News is time-sensitive: within a given window the content changes little, so the pages should be re-crawled on a schedule rather than continuously
- To avoid re-fetching pages we have already scraped after the program restarts, we persist the set of scraped URLs to a pkl file while crawling
- We also want to extract keywords from the news content wherever possible
Let's build the program up step by step.
First, revise the spider file:
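The dedup-and-persist idea above can be sketched independently of Scrapy: keep a set of scraped URLs, restore it at startup, and dump it back to a pkl file when done. The file name `scraped_urls.pkl` below is just an illustration, not the project's actual setting:

```python
import os
import pickle

PKL_PATH = "scraped_urls.pkl"  # illustrative path, not from the project settings

def load_seen(path=PKL_PATH):
    """Restore the set of already-scraped URLs, or start empty."""
    if os.path.isfile(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return set()

def save_seen(seen, path=PKL_PATH):
    """Persist the set so a restarted crawler skips known pages."""
    with open(path, "wb") as handle:
        pickle.dump(seen, handle)

seen = load_seen()
url = "http://news.163.com/example.html"  # made-up URL for the demo
if url not in seen:
    seen.add(url)
save_seen(seen)
```

After a restart, `load_seen()` returns the same set, so the URL is skipped the second time around.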
#!/usr/bin/env python
# coding=utf-8
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from News_Scrapy.items import NewsScrapyItem
from scrapy.conf import settings
import os
import pickle
import signal
import sys

# Module-level set of URLs that have already been scraped,
# restored from the pkl file if a previous run saved one.
SAVED_URL = set()
if os.path.isfile(settings["SAVED_URL_PATH"]):
    with open(settings["SAVED_URL_PATH"], "rb") as handle:
        SAVED_URL = pickle.load(handle)

def save_url_pkl(sig, frame):
    """On Ctrl-C, persist the scraped-URL set before exiting."""
    with open(settings["SAVED_URL_PATH"], "wb") as handle:
        pickle.dump(SAVED_URL, handle)
    sys.exit(0)

signal.signal(signal.SIGINT, save_url_pkl)

class NetEaseSpider(CrawlSpider):
    name = "News_Scrapy"
    # Include all three hosts, otherwise the offsite middleware
    # drops links to war.163.com and gov.163.com.
    allowed_domains = ["news.163.com", "war.163.com", "gov.163.com"]
    start_urls = [
        "http://news.163.com/domestic/",
        "http://news.163.com/world/",
        "http://news.163.com/shehui/",
        "http://war.163.com/",
        "http://gov.163.com/",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'http://news.163.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')), callback="parse_item"),
        Rule(SgmlLinkExtractor(allow=(r'http://war.163.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')), callback="parse_item"),
        Rule(SgmlLinkExtractor(allow=(r'http://gov.163.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')), callback="parse_item"),
    ]

    def parse_item(self, response):
        # Skip pages that were already scraped in a previous run.
        if response.url not in SAVED_URL:
            SAVED_URL.add(response.url)
            sel_resp = Selector(response)
            news_item = NewsScrapyItem()
            news_item["news_title"] = sel_resp.xpath('//*[@id="h1title"]/text()').extract()
            news_item["news_date"] = sel_resp.xpath('//*[@id="epContentLeft"]/div[1]/div[1]/text()').re(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
            news_item["news_source"] = sel_resp.xpath('//*[@id="ne_article_source"]/text()').extract()
            news_item["news_content"] = sel_resp.xpath('//*[@id="endText"]').extract()
            return news_item
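The article-URL pattern in the rules and the timestamp pattern in parse_item can be sanity-checked with plain `re`; the sample URL and byline string below are made up for illustration:

```python
import re

# Same patterns as in the spider, with the dots escaped.
URL_RE = re.compile(r'http://news\.163\.com/[0-9]{2}/[0-9]{3,4}/[0-9]{1,2}/[a-zA-Z0-9]+\.html')
DATE_RE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')

sample_url = "http://news.163.com/15/0820/10/B1CS9TBC0001124J.html"  # made-up article URL
sample_byline = "2015-08-20 10:32:04 来源: 新华网"                     # made-up byline text

assert URL_RE.match(sample_url)
assert DATE_RE.search(sample_byline).group() == "2015-08-20 10:32:04"
```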
Extracting keywords from the news content
- We use jieba, an open-source Chinese word-segmentation library
If jieba cannot be installed directly with pip or easy_install, we install it manually:
- First, locate the site-packages directory of the local Python installation
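Before jieba is installed, the idea behind keyword extraction can be sketched with a plain frequency count over already-segmented tokens. This toy stand-in is stdlib-only; jieba's `extract_tags` does far better by segmenting the raw text itself and ranking terms with TF-IDF:

```python
from collections import Counter

def naive_keywords(words, topk=3):
    """Toy stand-in for jieba-style keyword extraction:
    rank tokens by raw frequency. Assumes the text is already
    split into words, which jieba would do for us."""
    return [w for w, _ in Counter(words).most_common(topk)]

# Made-up token stream for the demo.
tokens = ["新闻", "内容", "新闻", "军事", "新闻", "军事", "国际"]
print(naive_keywords(tokens))  # most frequent tokens first
```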
fighter@pc:~$ ipython  # launch ipython
In [1]: import site; site.getsitepackages()
Out[1]: ['/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']
- Change into the dist-packages directory under /usr/local
fighter@pc:/usr/local/lib/python2.7/dist-packages$ sudo git clone https://github.com/fxsjy/jieba.git
- Alternatively, clone jieba into any directory and install it from there
fighter@pc:~/Downloads$ git clone https://github.com/fxsjy/jieba.git
Cloning into 'jieba'...
remote: Counting objects: 2287, done.
remote: Total 2287 (delta 0), reused 0 (delta