Goal: use Scrapy to crawl the summary information of the articles in every hot collection on Jianshu (the first ten pages of each collection).
1. Scrape the href of each collection's detail page from the main listing page and join it into a full URL.
2. Visit each detail page and extract the article information.
Note that both the main listing page and the detail pages are dynamic, loaded via Ajax, but the pattern is easy to spot: watch the request headers in the Chrome DevTools Network panel and the paginated URLs reveal themselves. This makes it a slightly more advanced practice project.
The main listing loads at most 36 pages, so the URLs can simply be constructed, as the sketch below shows.
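As a minimal sketch (the pattern and the 36-page limit come straight from the spider below), the two kinds of Ajax URLs look like this:

# Listing pages observed in DevTools: ?page=N&order_by=hot, N = 1..36
listing = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i)
           for i in range(1, 37)]
# Within one collection, articles are paged as ?order_by=added_at&page=K, K = 1..10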
items.py
import scrapy


class JianshuHotIssueItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    summary = scrapy.Field()
    author = scrapy.Field()
    comments = scrapy.Field()
    likes = scrapy.Field()
    money = scrapy.Field()
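For reference, a scrapy.Item behaves like a dict but only accepts the fields declared above; a quick sketch:

item = JianshuHotIssueItem(title='demo')
item['likes'] = '99'    # fine: 'likes' is a declared field
# item['views'] = '1'   # would raise KeyError: undeclared field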
jianshu_spider.py
A new method worth noting is response.urljoin(href): it automatically joins a relative href against the URL of the page being crawled, which is very convenient.
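It is essentially a wrapper around urllib.parse.urljoin; a minimal sketch with a hypothetical href:

from urllib.parse import urljoin  # response.urljoin(href) behaves like urljoin(response.url, href)

page_url = 'https://www.jianshu.com/recommendations/collections?page=1&order_by=hot'
href = '/c/V2CqjW'  # hypothetical collection href scraped from the listing
print(urljoin(page_url, href))  # https://www.jianshu.com/c/V2CqjW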
Also, .extract_first() is not always the best choice: when extracting comments, the substring I need is the second text node, so .extract()[1] is what's required.
Another thing to watch: money may be missing entirely, so it needs a None check.
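A minimal reproduction of both gotchas on a stripped-down article card (the markup is assumed from the XPaths used in parse_detail; the real page differs in detail):

from scrapy.selector import Selector

html = '''
<div class="meta">
  <a class="nickname" href="#">author</a>
  <a href="#">
    <i class="iconfont ic-list-comments"></i> 12
  </a>
  <span><i class="iconfont ic-list-like"></i> 99</span>
</div>
'''
meta = Selector(text=html).xpath('//div[@class="meta"]')[0]
print(repr(meta.xpath('a[2]/text()').extract_first()))  # whitespace-only first text node
print(meta.xpath('a[2]/text()').extract()[1].strip())   # '12' -> the comment count
print(meta.xpath('span[2]/text()').extract_first())     # None: no reward span on this card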
import scrapy
from scrapy.selector import Selector

from jianshu_hot_issue.items import JianshuHotIssueItem


class JianshuSpiderSpider(scrapy.Spider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/recommendations/collections?page=1&order_by=hot']

    def parse(self, response):
        '''Parse the outer listing page.'''
        selector = Selector(response)
        partial_urls = selector.re(r'<div class="count"><a target="_blank" href="(.*?)">')
        for url in partial_urls:
            right_url = response.urljoin(url)
            # print(right_url)  # tested: the joined URL is correct
            # Crawl the first 10 pages of articles in each collection
            parts = ['?order_by=added_at&page={0}'.format(k) for k in range(1, 11)]
            for part in parts:
                real_url = right_url + part
                yield scrapy.Request(real_url, callback=self.parse_detail)
        # The listing stops loading after page 36, so request pages 2-36 as well
        links = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i) for i in range(2, 37)]
        for link in links:
            yield scrapy.Request(link, callback=self.parse)

    def parse_detail(self, response):
        selector = Selector(response)
        content = selector.xpath('//div[@class="content"]')
        for detail in content:
            try:
                title = detail.xpath('a[1]/text()').extract_first()
                summary = detail.xpath('p/text()').extract_first().strip()
                author = detail.xpath('div/a[1]/text()').extract_first()
                # extract_first() won't do here: the first text node is
                # whitespace, the comment count is the second one
                comments = detail.xpath('div/a[2]/text()').extract()[1].strip()
                likes = detail.xpath('div/span[1]/text()').extract_first().strip()
                money = detail.xpath('div/span[2]/text()').extract_first()

                item = JianshuHotIssueItem()
                item['title'] = title
                item['summary'] = summary
                item['author'] = author
                item['comments'] = comments
                item['likes'] = likes
                if money is not None:
                    # The reward amount is missing on many articles
                    item['money'] = money.strip()
                print(item)
                yield item
            except Exception:
                # Cards lacking one of the fields are skipped
                pass
pipelines.py
The items are stored locally as a JSON file, one object per line.
import json


class JianshuHotIssuePipeline(object):
    def __init__(self):
        self.file = open('D://jianshu_hot_issues.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line (str(item) would be a Python repr, not JSON)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
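Each article then lands as one line in D://jianshu_hot_issues.json, along these lines (values illustrative):

{"title": "...", "summary": "...", "author": "...", "comments": "12", "likes": "99", "money": "1"}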
settings.py
Setting LOG_LEVEL = 'WARNING' suppresses the noisy startup output and shows only the extracted content you actually want to see.
BOT_NAME = 'jianshu_hot_issue'
SPIDER_MODULES = ['jianshu_hot_issue.spiders']
NEWSPIDER_MODULE = 'jianshu_hot_issue.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
    'jianshu_hot_issue.pipelines.JianshuHotIssuePipeline': 300,
}
LOG_LEVEL = 'WARNING'
Then run scrapy crawl jianshu_spider and you're done.