Scraping 36Kr News with Python + Scrapy
一. Preparation:
① Install Python 3.
② Install Scrapy.
③ Install Docker to run Splash, which provides the JavaScript rendering service (you also need to install the scrapy-splash library with pip).
Note: on Windows 10 you need the Docker client (https://www.docker.com/); on Linux installation is simpler (look up the steps for your distro). With Docker installed, pull the image (docker pull scrapinghub/splash) and run it (docker run -p 8050:8050 scrapinghub/splash).
二. Analyzing the 36Kr site (https://36kr.com/)
① The 36Kr home page presents a list of articles, sorted by time so that the newest articles come first. What we want to crawl is the content of those articles. My approach is to first collect the URLs of all articles in the list, then fetch each article's full content from its URL.
② Inspect the home page's network traffic in Chrome (press F12 to open DevTools, then F5 to reload). Changing the page parameter returns each page of results, so looping over page values keeps pulling data from the site. The request URL for the article list is therefore: https://36kr.com/api/search-column/mainsite?per_page=20&page=(page number)
③ Look at what https://36kr.com/api/search-column/mainsite?per_page=20&page= returns: it contains the article list information we want. All we need to do is take data from the response, then read each entry's id and _type (post means article, video means video); from those two fields we can build the detail URL, e.g. an article (https://36kr.com/p/5158958.html) or a video (https://36kr.com/video/21852). In other words, article URLs have the form https://36kr.com/p/{id}.html and video URLs the form https://36kr.com/video/{id}. With the concrete URL of each article we can then crawl its content. The code for collecting the article URLs is given below.
# Imports needed by this snippet
from urllib import request
import json

# page is the loop counter over result pages
target_url = 'https://36kr.com/api/search-column/mainsite?per_page=20&page=' + str(page)
target_request = request.Request(url=target_url)
target_response = request.urlopen(target_request).read().decode('utf-8')
jsonData = json.loads(target_response)
# Extract the information we need
newsInfo = jsonData['data']['items']
article_Head = 'https://36kr.com/p/'
video_Head = 'https://36kr.com/video/'
for info in newsInfo:
    if 'title' in info:
        # Other article metadata
        title = info['title']
        id = info['id']
        type = info['_type']
        summary = info['summary']
        published_at = info['published_at']
        extraction_tags = info['extraction_tags']
        column_name = info['column_name']
        cover_pic = info['cover']
        if type == "video":
            # video URL
            url_id = video_Head + str(id)
        else:
            # article URL
            url_id = article_Head + str(id) + ".html"
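The id/type branch above can be factored into a small helper, which makes the URL-building rule easy to test in isolation (a sketch; the function name is mine, not from the original code):

```python
ARTICLE_HEAD = 'https://36kr.com/p/'
VIDEO_HEAD = 'https://36kr.com/video/'

def build_item_url(item_id, item_type):
    """Build the detail-page URL from an item's id and _type field."""
    if item_type == 'video':
        return VIDEO_HEAD + str(item_id)
    # everything else (_type == 'post') is an article
    return ARTICLE_HEAD + str(item_id) + '.html'
```

For the two example ids from the analysis above, this yields https://36kr.com/p/5158958.html and https://36kr.com/video/21852.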
④ I store the crawled article URLs in Redis; to crawl the article bodies, I pop each URL from Redis and fetch the page. Crawling the body requires the JavaScript rendering service provided by scrapy-splash, because the page content is rendered by JS; this makes extraction much easier. Finally the content is pulled out with XPath; look up how XPath extraction works and then decide which tags you need to extract. The code is as follows:
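Independently of the scrapy-splash middleware, Splash also exposes a plain HTTP render endpoint, which is handy as a sanity check that the Docker container is up. A minimal sketch of building such a request URL (the host/port assume the docker run mapping above):

```python
from urllib.parse import urlencode

SPLASH_HOST = 'http://localhost:8050'  # matches docker run -p 8050:8050

def splash_render_url(page_url, wait=2):
    """Build a Splash render.html URL that returns the JS-rendered HTML
    of page_url, waiting `wait` seconds for scripts to finish."""
    query = urlencode({'url': page_url, 'wait': wait})
    return '{}/render.html?{}'.format(SPLASH_HOST, query)
```

Opening the resulting URL in a browser (or with requests.get) should return the fully rendered page instead of the empty JS shell.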
def parse(self, response):
    p_info = ""
    image = ""
    article_content = ""
    item = MyspiderItem()
    article_url = response.url
    # Extract the article body, images, etc.; adjust to your needs
    if "video" in article_url:
        type = "video"
        author = response.xpath('//div[@class="author am-fl"]/div[@class="am-fl"]/a[@class="user"]/text()').extract()
    else:
        type = "article"
        author = response.xpath('//div[@class="content-font"]//div[@class="author am-fl"]'
                                '//span[@class="name"]/text()').extract()
    content = response.xpath('//div[@class="content-font"]/div/section[@class="textblock"]').extract()
    infos = response.xpath('//div[@class="content-font"]/div/section[@class="textblock"]//p/span/text()|'
                           '//div[@class="content-font"]/div/section[@class="textblock"]//p/text()').extract()
    imgs = response.xpath('//div[@class="content-font"]/div/section[@class="textblock"]//img/@data-src').extract()
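The infos list comes back as one string per matched text node, often padded with whitespace. If you just want clean article text, the fragments can be normalized and joined like this (a sketch of my own, not from the original post):

```python
def join_paragraphs(fragments):
    """Strip whitespace from each extracted text node and join the
    non-empty ones with newlines into a single article text."""
    return '\n'.join(f.strip() for f in fragments if f.strip())
```

This is the kind of post-processing you would apply to `infos` before storing it in the item.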
三. Configuration:
If you are not familiar with Scrapy's architecture, read the overview first (https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html)
① Below is the settings.py configuration, which covers the database, proxies, custom classes, and so on; you may not need all of it.
BOT_NAME = 'ke'
SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'
# Item pipelines
ITEM_PIPELINES = {
'myspider.pipelines.MyspiderPipeline': 1,
}
MYSQL_HOST = 'x.x.x.x'
MYSQL_DBNAME = 'x'   # database name, change as needed
MYSQL_USER = 'x'     # database user, change as needed
MYSQL_PASSWD = 'x'   # database password, change as needed
MYSQL_PORT = 3306    # database port
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://x.x.x.x:8050'
# Enable the two splash downloader middlewares and adjust HttpCompressionMiddleware's order
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'myspider.middlewares.MyproxiesSpiderMiddleware': None,
    'myspider.middlewares.MyspiderResponseMiddleware': None,
}
# Deduplication filter that is aware of splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Supports cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
② items.py
import scrapy
class MyspiderItem(scrapy.Item):
    title = scrapy.Field()
    info = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()
    type = scrapy.Field()
    info_type = scrapy.Field()
    info_platform = scrapy.Field()
    article_url = scrapy.Field()
    author = scrapy.Field()
    abstract = scrapy.Field()
    article_time = scrapy.Field()
    tags = scrapy.Field()
    tag = scrapy.Field()
    cover_pic = scrapy.Field()
③ pipelines.py
import pymysql
import pymysql.cursors
import requests
import os.path as Path
from twisted.enterprise import adbapi
class MyspiderPipeline(object):
    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            port=settings['MYSQL_PORT'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('pymysql', **dbargs)
        return cls(dbpool)

    def __init__(self, dbpool):
        self.dbpool = dbpool

    # Called by scrapy for every item
    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._conditional_insert, item, spider)  # run the insert
        d.addErrback(self._handle_error, item, spider)  # handle failures
        d.addBoth(lambda _: item)
        return d

    def _conditional_insert(self, conn, item, spider):
        # Save the images locally
        print(item['img'])
        for img in item['img'].splitlines():
            ir = requests.get(img)
            if ir.status_code == 200:
                fileName = Path.join("d:/image",
                                     "%s%s" % (img.replace("/", "_").replace(".", "_").replace(":", "_"),
                                               ".jpg"))
                with open(fileName, 'wb') as f:
                    f.write(ir.content)
        # Insert into the database
        conn.execute(
            'insert into dingding(info_type,info_platform,title,article_url,info,content,cover_pic,pic,abstract,author,article_time,tag,tags,type) '
            'values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)',
            (item['info_type'], item['info_platform'], item['title'], item['article_url'], item['info'], item['content'], item['cover_pic'],
             item['img'], item["abstract"], item["author"], item['article_time'], item['tag'], item['tags'], item['type']))

    def _handle_error(self, failure, item, spider):
        print(failure)
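The filename mangling inside _conditional_insert can be pulled out into a small helper, which also makes the naming rule testable on its own (a sketch; the helper name is mine):

```python
import os.path as Path

def image_filename(img_url, base_dir='d:/image'):
    """Derive a local .jpg filename from an image URL, mirroring the
    replace() chain used in the pipeline: '/', '.' and ':' become '_'."""
    safe = img_url.replace('/', '_').replace('.', '_').replace(':', '_')
    return Path.join(base_dir, safe + '.jpg')
```

Note this scheme keeps the whole URL in the filename, so two different images never collide, at the cost of long names.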
④ The core crawling code has already been given in section 二 (the site analysis). Putting the extracted data into the item is simple enough to pick up on your own. The code above was written around my own needs; the crawling approach itself should be clear by now, and how you process the scraped data is up to you.
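For completeness, copying the extracted fields into the item looks roughly like this (a sketch under my own assumptions: a plain dict stands in for MyspiderItem here, but scrapy.Item supports the same [] access, and joining imgs with newlines matches the item['img'].splitlines() call in the pipeline):

```python
def fill_item(article_url, page_type, author, infos, imgs, content):
    """Copy extracted fields into an item dict.

    author/infos/imgs/content are the lists returned by the xpath
    .extract() calls in parse(); lists are flattened to strings so the
    pipeline can store them in single database columns.
    """
    item = {}
    item['article_url'] = article_url
    item['type'] = page_type
    item['author'] = author[0] if author else ''
    item['info'] = '\n'.join(infos)
    item['img'] = '\n'.join(imgs)      # one URL per line, split again in the pipeline
    item['content'] = ''.join(content)
    return item
```

In the real spider you would build a MyspiderItem the same way and yield it, letting scrapy hand it to MyspiderPipeline.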