Scrapy beginner example 2: scraping Jianshu's hot collections (dynamic pages, two layers of Ajax endpoints)

Goal: use Scrapy to scrape summary information for the first ten articles of each hot collection.

1. First, grab every collection's detail-page href from the main page and join it into a full URL.

2. Then visit each detail page and extract the information.

Note that both the main page and the detail pages are dynamic: everything is loaded via Ajax. The pattern is easy to discover, though; a quick look at the request headers in Chrome DevTools reveals it. That makes this a slightly more advanced practice exercise.

It turns out the main listing loads at most 36 pages, so we just construct the URLs ourselves.
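
Concretely, both Ajax layers reduce to two URL templates. A minimal sketch of how they are built (the collection path '/c/xxxxxx' is a made-up placeholder for an href scraped from the listing):

# layer 1: the hot-collections listing, pages 1..36
listing_urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i)
                for i in range(1, 37)]

# layer 2: inside one collection, the first 10 pages of its article feed
collection_url = 'https://www.jianshu.com/c/xxxxxx'  # placeholder collection href
article_pages = [collection_url + '?order_by=added_at&page={}'.format(k)
                 for k in range(1, 11)]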

items.py

import scrapy


class JianshuHotIssueItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()     # article title
    summary = scrapy.Field()   # article abstract
    author = scrapy.Field()    # author nickname
    comments = scrapy.Field()  # comment count
    likes = scrapy.Field()     # like count
    money = scrapy.Field()     # reward amount (may be missing)

jianshu_spider.py

A handy new method here is response.urljoin(href): it automatically joins a relative href against the current page's URL to give you the absolute URL to crawl, which is very convenient.
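
For example (the collection path here is a made-up placeholder):

# response.url == 'https://www.jianshu.com/recommendations/collections?page=1&order_by=hot'
real_url = response.urljoin('/c/xxxxxx')
# -> 'https://www.jianshu.com/c/xxxxxx'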

Also, .extract_first() is not always the best choice: when I extract comments, the substring I actually want is the second match, so .extract()[1] is what's needed.

One more caveat: extracting money can come back empty, so it needs a None check.
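
A toy illustration of both pitfalls, run against a hypothetical fragment shaped like Jianshu's markup (the HTML is a stand-in, not the real page):

from scrapy.selector import Selector

# an icon tag sits before the number, so the FIRST text node is whitespace
sel = Selector(text='<a class="meta">\n<i class="iconfont"></i> 12\n</a>')

sel.xpath('//a/text()').extract_first()  # '\n'  -- just the whitespace node
sel.xpath('//a/text()').extract()[1]     # ' 12\n' -> .strip() gives '12'

# an absent field comes back as None, so guard before calling .strip()
money = sel.xpath('//a/span/text()').extract_first()  # None here
if money is not None:
    money = money.strip()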

import scrapy
from jianshu_hot_issue.items import JianshuHotIssueItem
from scrapy.selector import Selector


class JianshuSpiderSpider(scrapy.Spider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/recommendations/collections?page=1&order_by=hot']

    def parse(self, response):
        '''Parse the outer listing page and queue each collection's article pages.'''
        selector = Selector(response)
        partial_urls = selector.re(r'<div class="count"><a target="_blank" href="(.*?)">')
        for url in partial_urls:
            right_url = response.urljoin(url)
            # print(right_url)  # tested OK
            # crawl the first 10 pages of articles in each collection
            parts = ['?order_by=added_at&page={0}'.format(k) for k in range(1, 11)]
            for part in parts:
                real_url = right_url + part
                yield scrapy.Request(real_url, callback=self.parse_detail)

        # the listing loads at most 36 pages, so queue pages 2..36 as well
        links = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i) for i in range(2, 37)]
        for link in links:
            yield scrapy.Request(link, callback=self.parse)

    def parse_detail(self, response):
        '''Parse one page of a collection's article list.'''
        selector = Selector(response)
        content = selector.xpath('//div[@class="content"]')
        for detail in content:
            try:
                title = detail.xpath('a[1]/text()').extract_first()
                summary = detail.xpath('p/text()').extract_first().strip()
                author = detail.xpath('div/a[1]/text()').extract_first()
                # the first text node is whitespace around the icon;
                # the comment count is the second one, hence .extract()[1]
                comments = detail.xpath('div/a[2]/text()').extract()[1].strip()
                likes = detail.xpath('div/span[1]/text()').extract_first().strip()
                money = detail.xpath('div/span[2]/text()').extract_first()

                item = JianshuHotIssueItem()
                item['title'] = title
                item['summary'] = summary
                item['author'] = author
                item['comments'] = comments
                item['likes'] = likes
                # money is missing when an article has received no rewards
                if money is not None:
                    item['money'] = money.strip()
                print(item)
                yield item
            except Exception:
                # skip any card whose fields fail to parse
                pass

pipelines.py

Here the items are written to a local JSON file, one JSON object per line.

import json


class JianshuHotIssuePipeline(object):
    def __init__(self):
        # open the output file once when the pipeline is created
        self.file = open('D://jianshu_hot_issues.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line (JSON Lines)
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
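
Writing one JSON object per line (JSON Lines) keeps the output parseable even though items are appended one at a time. If you don't need a custom pipeline at all, Scrapy's built-in feed export does the same job from the command line: scrapy crawl jianshu_spider -o jianshu_hot_issues.jl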

settings.py

Setting LOG_LEVEL = 'WARNING' hides all the noisy log output up front, so only the extracted content you actually want to see gets printed.

BOT_NAME = 'jianshu_hot_issue'

SPIDER_MODULES = ['jianshu_hot_issue.spiders']
NEWSPIDER_MODULE = 'jianshu_hot_issue.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
   'jianshu_hot_issue.pipelines.JianshuHotIssuePipeline': 300,
}
LOG_LEVEL = 'WARNING'

Then run scrapy crawl jianshu_spider and you're done.
