Scraping CSDN Blog Homepage Articles with Scrapy

__HelloWorld__

Categories: Python, Front End. Tags: Scrapy, Python, CSDN blog

Article link: https://blog.csdn.net/kangkanglou/article/details/78856906

Scrapy, a fast high-level web crawling & scraping framework for Python

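The CSDN blog homepage is shown below. It includes sections such as Recommended, News, and Artificial Intelligence.

[Image: screenshot of the CSDN blog homepage]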

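Each section has its own list of recommended articles. We use Scrapy to read the recommended article list of every section.

[Image: screenshot of a section's recommended article list]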

The spider is defined as follows:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "csdn"

    def start_requests(self):
        urls = [
            'http://blog.csdn.net'
        ]
        # Optional spider argument (e.g. passed with -a tag=...); only logged here.
        tag = getattr(self, 'tag', None)
        self.logger.info("tag is %s", tag)

        for url in urls:
            self.logger.info("do request on %s", url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each <li> in the navigation bar corresponds to one homepage section.
        for li in response.css(".nav_com ul li"):
            nav_href = li.css("a::attr(href)").get()
            yield {
                'nav_text': li.css("a::text").get(),
                'nav_href': nav_href,
            }
            # Skip entries without a link: response.follow() cannot take None.
            if nav_href:
                yield response.follow(nav_href, callback=self.get_article_list)

    def get_article_list(self, response):
        # Each .list_con block is one recommended article within the section.
        for a in response.css(".list_con"):
            yield {
                'article_title': a.css("a::text").get(),
                'article_href': a.css("a::attr(href)").get()
            }
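
The spider passes each navigation `href` straight to `response.follow`, which accepts relative links and resolves them against the page being parsed. As a rough stdlib-only illustration of that resolution step (not Scrapy's actual code, which also takes an HTML `<base>` tag into account), the sketch below uses `urljoin`. The helper name `resolve` and the `/nav/...` paths are hypothetical and chosen only for the example.

```python
from urllib.parse import urljoin

# The start URL used by the spider; the hrefs below are made-up examples.
PAGE_URL = "http://blog.csdn.net"


def resolve(href, base=PAGE_URL):
    """Resolve an href against the page URL, roughly as response.follow would."""
    return urljoin(base, href)


relative = resolve("/nav/ai")                        # hypothetical relative link
absolute = resolve("http://blog.csdn.net/nav/news")  # already absolute, unchanged

print(relative)
print(absolute)
```

Because the resolution happens inside `response.follow`, the spider does not need to build absolute URLs itself before scheduling the section requests.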

The crawl results are shown below:

[Image: screenshot of the crawl output]
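
The spider yields two kinds of records into a single stream: section entries (`nav_text`, `nav_href`) and article entries (`article_title`, `article_href`). If the output is exported as JSON Lines, for instance with Scrapy's `-o` option and a `.jl` file, a small sketch like the following could separate the two kinds afterwards. The function name `split_items` is hypothetical, and the sample records are placeholders rather than real crawl output.

```python
import io
import json


def split_items(lines):
    """Separate section records from article records by the keys they carry."""
    sections, articles = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the export
        item = json.loads(line)
        if "nav_text" in item:
            sections.append(item)
        elif "article_title" in item:
            articles.append(item)
    return sections, articles


# Placeholder records standing in for an exported file; not actual results.
sample = io.StringIO(
    '{"nav_text": "Section A", "nav_href": "/nav/a"}\n'
    '{"article_title": "Example title", "article_href": "http://example.com/1"}\n'
    '{"article_title": "Another title", "article_href": "http://example.com/2"}\n'
)
sections, articles = split_items(sample)
print(len(sections), len(articles))
```

With a real export, the `io.StringIO` object would be replaced by an open file handle, since `split_items` only needs an iterable of lines.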
