Crawling NetEase News with Scrapy
This post uses Scrapy to crawl the day's NetEase news, extracting each article's title, body, source, and related fields and storing them in a CSV file. The steps are as follows.
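A note on the CSV step before we start: Scrapy can write items to CSV through its built-in feed exports, so no custom pipeline code is strictly required. Assuming the spider name news2019 defined below, the whole job can be run with:

    scrapy crawl news2019 -o news.csv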
Crawling
- First, declare the fields for the content to be scraped in items.py:
import scrapy


class NewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    news_thread = scrapy.Field()   # unique article id (taken from the URL)
    news_title = scrapy.Field()    # headline
    news_url = scrapy.Field()      # URL of the article page
    news_time = scrapy.Field()     # publication time
    news_source = scrapy.Field()   # name of the news source
    source_url = scrapy.Field()    # URL of the source site
    news_body = scrapy.Field()     # article text
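A scrapy.Item behaves like a dictionary restricted to the declared fields, which is what makes the declaration above useful: assigning to a key that was never declared as a Field raises a KeyError, so typos are caught early. For example:

    item = NewsItem()
    item['news_title'] = 'some headline'   # fine: field declared above
    item['news_titel'] = 'oops'            # KeyError: misspelled field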
- Next, open the spider file you created under the spiders directory and fill in the crawling logic below. If the site refuses to be crawled, the earlier posts in this series describe workarounds.
# -*- coding: utf-8 -*-
import scrapy
from news.items import NewsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Sample article URLs, used to derive the link pattern:
# https://news.163.com/20/0412/12/FA107S71000189FH.html
# https://news.163.com/20/0412/09/FA0KR27300019K82.html
# https://news.163.com/20/0503/15/FBNBQHAR000189FH.html
# i.e. /20/<MMDD>/<HH>/<id>.html, matched by the regex /20/\d+/\d+/.*?html


class News2019Spider(CrawlSpider):
    name = 'news2019'
    allowed_domains = ['news.163.com']
    start_urls = ['http://news.163.com/']
    rules = (
        # Follow every link matching the article pattern above and pass the
        # page to parse_news. (The original snippet breaks off mid-Rule; this
        # Rule is reconstructed from the commented example URLs.)
        Rule(LinkExtractor(allow=r'/20/\d+/\d+/.*?html'),
             callback='parse_news', follow=True),
    )
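The original post continues with the parse_news callback named in the Rule. As a placeholder, here is a minimal sketch of what such a callback could look like; the CSS selectors (h1.post_title, .post_time_source, #ne_article_source, .post_body) are assumptions about the NetEase article layout at the time, not copied from the original.

    # Inside News2019Spider -- a minimal sketch of the callback referenced by
    # the Rule above. All selectors are assumptions about the page layout.
    def parse_news(self, response):
        item = NewsItem()
        # e.g. .../20/0412/12/FA107S71000189FH.html -> "FA107S71000189FH"
        item['news_thread'] = response.url.split('/')[-1].split('.')[0]
        item['news_url'] = response.url
        item['news_title'] = response.css('h1.post_title::text').get(default='').strip()
        item['news_time'] = response.css('.post_time_source::text').get(default='').strip()
        item['news_source'] = response.css('#ne_article_source::text').get(default='')
        item['source_url'] = response.css('#ne_article_source::attr(href)').get(default='')
        item['news_body'] = ''.join(response.css('.post_body p::text').getall()).strip()
        return item

With a callback like this in place, the scrapy crawl news2019 -o news.csv command from the note above writes one CSV row per scraped article.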