Innovation Training (5) - cnblogs (博客园) Home Page Crawler (1)

Latest recommended article published on 2021-03-14 11:26:56

ttxs69

Category: Innovation Training. Tags: html, xpath

Article link: https://blog.csdn.net/qq_34842847/article/details/106916978

1. Defining the Item

We need to collect each post's title, URL, body, tags, and update time.

import scrapy

class CnblogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Fields to be saved for each post
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()
    tags = scrapy.Field()
    update_time = scrapy.Field()

2. Analyzing the cnblogs pages and writing the spider

2.1 Home page URL

https://www.cnblogs.com/sitehome/p/1

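As the figure above shows, the cnblogs home page offers 200 pages of posts, and every page URL has the form https://www.cnblogs.com/sitehome/p/ followed by a page number.

Counting the entries shows that each page lists 20 posts, so the home page listing gives us 200 × 20 = 4000 posts in total.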
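
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the constant names are made up for illustration, and the page and post counts are simply the figures quoted in this article, not values read from the site.

```python
# Base URL taken from the article; page numbers run from 1 to 200.
BASE_URL = "https://www.cnblogs.com/sitehome/p/"
TOTAL_PAGES = 200      # pages offered by the home page listing (from the article)
POSTS_PER_PAGE = 20    # posts counted on each page (from the article)

# One listing URL per page: .../p/1, .../p/2, ..., .../p/200
page_urls = [f"{BASE_URL}{page}" for page in range(1, TOTAL_PAGES + 1)]

# Upper bound on the number of posts reachable from the listing
total_posts = TOTAL_PAGES * POSTS_PER_PAGE

print(page_urls[0])
print(page_urls[-1])
print(total_posts)
```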

2.2 Post URLs

[Figure: post URL]

As the figure above shows, each post link is an a tag with class="titlelnk". The XPath that selects it is:

//a[@class="titlelnk"]

Only the URL is needed here, so the spider code is:

# Collect the links to all posts on the current page
urls = response.xpath('//a[@class="titlelnk"]/@href').extract()
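
To try the same query without running a crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors. The sample markup is hypothetical and heavily simplified, and ElementTree supports only a subset of XPath, so the attribute is read with .get("href") rather than /@href.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup; the real page is far larger.
SAMPLE_LISTING = """
<div id="post_list">
  <div class="post_item">
    <a class="titlelnk" href="https://www.cnblogs.com/alice/p/1001.html">First post</a>
  </div>
  <div class="post_item">
    <a class="titlelnk" href="https://www.cnblogs.com/bob/p/1002.html">Second post</a>
    <a class="lightblue" href="https://www.cnblogs.com/bob/">bob</a>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE_LISTING.strip())

# ElementTree cannot select attributes with /@href, so select the matching
# <a> elements first and read the href attribute from each one.
urls = [a.get("href") for a in root.findall(".//a[@class='titlelnk']")]

print(urls)
```

Only the two links carrying class="titlelnk" are returned; the author link with a different class is skipped.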

Then request each URL to fetch the article content:

for url in urls:
    yield scrapy.Request(url=url, callback=self.parse_content)

Then move on to the next listing page:

if self.offset < 200:        # current page number; stop after page 200
    self.offset += 1
    url2 = self.url + str(self.offset)    # build the next page URL
    yield scrapy.Request(url=url2, callback=self.parse)
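
The page-stepping logic can also be looked at on its own, outside Scrapy. The sketch below is a plain-Python stand-in: next_page_url is a hypothetical helper, not part of the spider, and it assumes the listing starts at page 1 and ends at page 200 as described earlier.

```python
BASE_URL = "https://www.cnblogs.com/sitehome/p/"
LAST_PAGE = 200  # last listing page, as stated in the article


def next_page_url(offset, last_page=LAST_PAGE):
    """Return (new_offset, url) for the page after `offset`, or None when done."""
    if offset < last_page:
        offset += 1
        return offset, BASE_URL + str(offset)
    return None


# Walk the listing from page 1, the way repeated parse() calls would.
offset = 1
visited = [BASE_URL + str(offset)]
step = next_page_url(offset)
while step is not None:
    offset, url = step
    visited.append(url)
    step = next_page_url(offset)

print(len(visited))
print(visited[-1])
```

Returning None once the last page is reached makes the stopping condition explicit, so no further request is built after page 200.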

2.3 Post title

[Figure: post title]

As the figure above shows, the post title is an a tag with id=cb_post_title_url. Extract it with XPath:

# Title
item['title'] = response.xpath('//a[@id="cb_post_title_url"]/text()').extract()
# URL
item['url'] = response.xpath('//a[@id="cb_post_title_url"]/@href').extract()

2.4 Post body

[Figure: post body]

As the figure above shows, the post body is a div tag with id=cnblogs_post_body. Extract it with XPath:

# Body
item['content'] = response.xpath('//div[@id="cnblogs_post_body"]').extract()
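
Note that this XPath selects the div element itself, so the extracted value is HTML markup rather than plain text. If plain text is wanted later, one possible approach is sketched below using the standard library's html.parser. TextExtractor and html_to_text are hypothetical helpers and the sample fragment is invented; this is not something the spider in this article does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text chunks, without surrounding whitespace.
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(fragment):
    """Return the text content of `fragment`, chunks joined by single spaces."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.parts)


# Invented fragment shaped like an extracted post body.
sample_body = (
    '<div id="cnblogs_post_body">'
    "<p>Hello <b>Scrapy</b></p><p>Second paragraph.</p>"
    "</div>"
)
print(html_to_text(sample_body))
```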

2.5 Post publication time

The publication time is in a span tag with id=post-date:

# Publication time
item['update_time'] = response.xpath('//span[@id="post-date"]/text()').extract()
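
Because .extract() returns a list of strings, each of these fields holds a list even when only one value matches. A small post-processing step such as the sketch below could reduce them to single values. The helper names are hypothetical, the sample values are invented, and the timestamp format "%Y-%m-%d %H:%M" is an assumption that should be checked against the real page.

```python
from datetime import datetime


def first_text(values):
    """Return the first non-empty string in `values`, stripped, or None."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return None


def parse_post_date(values, fmt="%Y-%m-%d %H:%M"):
    """Parse the first timestamp in `values`; return None if absent or malformed."""
    text = first_text(values)
    if text is None:
        return None
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


# Invented values shaped like what .extract() returns (lists of strings).
raw_title = ["\n  An example post title\n"]
raw_time = ["2020-06-23 10:15"]

print(first_text(raw_title))
print(parse_post_date(raw_time))
```

Returning None for a missing or malformed value keeps one odd page from aborting the whole crawl; whether that is the right policy depends on how the items are stored afterwards.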
