Python Scrapy: Multithreaded Crawling of NetEase Cloud Music Hot Playlist Information (Step-by-Step Tutorial)

同稚君

Last modified 2024-03-14 20:41:13

Tags: python, programming languages, data mining

First published 2022-01-02 19:09:51

Article link: https://blog.csdn.net/qq_52181283/article/details/122277915

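Below I walk through using a Scrapy spider to collect information about the hot playlists on NetEase Cloud Music.

This is the NetEase Cloud Music playlist page. The playlist information is highly structured, which makes it well suited to crawling.

URL: All Playlists - Playlists - NetEase Cloud Music (163.com)

Preview of the crawl results (the crawl ran about a week before this article was written, so some of the playlist information has since changed):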

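I. First, a look at how Scrapy is put together:

The Scrapy framework consists of five main components: the Scheduler, the Downloader, the Spider, the Item Pipeline, and the Scrapy Engine. Each is described below.

(1) Scheduler:

The scheduler can be thought of as a priority queue of URLs (the addresses of the pages to fetch). It decides which URL to fetch next and drops duplicate URLs so that no work is wasted. Users can customize the scheduler to suit their own needs.

(2) Downloader:

The downloader carries the heaviest load of all the components: it downloads resources from the network at high speed. Its code is not very complex, yet it is efficient, mainly because it is built on Twisted, an efficient asynchronous networking model (in fact the whole framework is built on it).

(3) Spider:

The spider is the part users care about most. Users write their own spiders (using XPath, regular expressions, and similar syntax) to extract the information they need from specific pages; the extracted records are called items. A spider can also extract links so that Scrapy goes on to fetch the next page.

(4) Item Pipeline:

The item pipeline processes the items extracted by the spider. Its main jobs are persisting items, validating them, and removing unneeded information.

(5) Scrapy Engine:

The Scrapy engine is the core of the whole framework. It controls the scheduler, the downloader, and the spider. In effect the engine is like a computer's CPU: it controls the entire flow.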
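
The way these five components hand data to one another can be sketched with a toy model. This is not Scrapy's implementation: the site is a hypothetical in-memory dictionary, the scheduler is a plain FIFO queue rather than a priority queue, and every name below is made up for illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
FAKE_SITE = {
    "/page1": ("first", ["/page2", "/page3"]),
    "/page2": ("second", ["/page3", "/page1"]),
    "/page3": ("third", []),
}


class ToyScheduler:
    """FIFO queue of URLs that silently drops duplicates."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def toy_download(url):
    """Downloader: return the 'response' for a URL."""
    return FAKE_SITE[url]


def toy_parse(url, response):
    """Spider: yield one item, then the links to follow."""
    title, links = response
    yield {"url": url, "title": title}
    for link in links:
        yield link


def toy_engine(start_url):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = ToyScheduler()
    stored = []  # stands in for the item pipeline
    scheduler.enqueue(start_url)
    while (url := scheduler.next_url()) is not None:
        for result in toy_parse(url, toy_download(url)):
            if isinstance(result, dict):
                stored.append(result)      # items go to the pipeline
            else:
                scheduler.enqueue(result)  # links go back to the scheduler
    return stored


items = toy_engine("/page1")
print([item["title"] for item in items])  # ['first', 'second', 'third']
```

Each page is visited once even though /page1 and /page3 are linked more than once, which is the de-duplication role described for the scheduler.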

Key point: the file and directory structure of a Scrapy project is as follows:

In general we only need to edit the spiders package, items.py, pipelines.py, and settings.py.
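
For reference, this is the layout that scrapy startproject typically generates, shown here for the project name wyyMusic used in this article. It assumes a recent Scrapy release; the exact set of files can differ slightly between versions.

```
wyyMusic/
    scrapy.cfg            # deploy/configuration file
    wyyMusic/             # the project's Python package
        __init__.py
        items.py          # item (field) definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the spiders live here
            __init__.py
```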

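Create a project folder on the desktop, open it in PyCharm, and enter the following in the Terminal:

scrapy startproject <project_name>   # create a Scrapy project

cd <project_name>   # enter the project directory

For this article that is:

scrapy startproject wyyMusic

cd wyyMusic

This creates the NetEase Cloud Music crawler project.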

II. Writing the spider code

1. Configure settings.py

Add the following to settings.py (it sets some global configuration for the crawler):

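# Suppress the other descriptive log messages and print only what we need
LOG_LEVEL = "WARNING"

USER_AGENT = 'your own browser user agent string'

# The generated project template sets this to True; False means robots.txt is not obeyed
ROBOTSTXT_OBEY = False

# Download delay: pause about 2 seconds between downloads so the crawler is not blocked for going too fast
DOWNLOAD_DELAY = 2


DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',    # leave this line commented out
}

# Enable the pipeline written in step 4; Scrapy only runs pipelines that are listed here
ITEM_PIPELINES = {
    'wyyMusic.pipelines.WyymusicPipeline': 300,
}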
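
To get a feel for what DOWNLOAD_DELAY = 2 costs, here is a rough lower bound on the crawl time. It assumes the figures used later in this article (37 list pages, 35 playlists per page, one detail request per playlist) and ignores download time itself. By default Scrapy also randomizes each wait between 0.5 and 1.5 times DOWNLOAD_DELAY, so the real figure will vary.

```python
# Assumptions taken from the article: 37 list pages, 35 playlists per page,
# and one detail-page request per playlist.
LIST_PAGES = 37
PLAYLISTS_PER_PAGE = 35
DOWNLOAD_DELAY = 2  # seconds, as in settings.py

total_requests = LIST_PAGES + LIST_PAGES * PLAYLISTS_PER_PAGE
min_seconds = total_requests * DOWNLOAD_DELAY

print(total_requests)               # 1332
print(round(min_seconds / 60, 1))   # 44.4 (minutes)
```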

2. Set up items.py (defines the fields to scrape):

import scrapy


class MusicListItem(scrapy.Item):
    SongsListID = scrapy.Field()   # playlist id
    SongListName = scrapy.Field()  # playlist name
    AmountOfPlay = scrapy.Field()  # play count
    Labels = scrapy.Field()        # tag names
    Url = scrapy.Field()           # playlist URL, kept for a later, more detailed crawl
    Collection = scrapy.Field()    # number of times the playlist was favorited
    Forwarding = scrapy.Field()    # share count
    Comment = scrapy.Field()       # comment count
    NumberOfSongs = scrapy.Field() # number of songs
    CreationDate = scrapy.Field()  # playlist creation date
    AuthorID = scrapy.Field()      # author id

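3. Create the playlist spider MusicList.py:

Create a new file MusicList.py in the spiders package. The directory structure afterwards is as follows:

MusicList.py fetches the playlist information: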

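import scrapy   # import the scrapy package

# Relative import of the MusicListItem class from the items.py we just wrote
from ..items import MusicListItem

# deepcopy gives each pending request its own snapshot of the item, so that when many
# pages are crawled the playlists reaching the pipeline are not mixed up or duplicated
from copy import deepcopy


class MusicListSpider(scrapy.Spider):
    name = "MusicList"      # name is required; scrapy crawl uses it and pipelines.py checks it
    allowed_domains = ["music.163.com"]   # restrict the crawl to this domain
    start_urls = ["https://music.163.com/discover/playlist"]  # starting page, i.e. the first page of playlists
    offset = 0  # our own counter that records the page currently being crawled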

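    def parse(self, response):
        # Use XPath to extract the needed information from the HTML page.
        # Collect every playlist on the page into liList.
        liList = response.xpath("//div[@id='m-disc-pl-c']/div/ul[@id='m-pl-container']/li")

        # Go through the playlists in liList one by one and request each detail page.
        for li in liList:
            itemML = MusicListItem()
            a_href = li.xpath("./div/a[@class = 'msk']/@href").extract_first()
            if a_href is None:
                continue  # skip an entry without a link instead of failing on the slice
            itemML["SongsListID"] = a_href[13:]  # drop the first 13 characters to keep only the id

            # Build the URL of the playlist detail page.
            Url = "https://music.163.com" + a_href
            itemML["Url"] = Url
            # SongsListPageParse fills in the remaining fields from the detail page.
            yield scrapy.Request(Url, callback=self.SongsListPageParse, meta={"itemML": deepcopy(itemML)})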


        # Crawl the next page. With 37 pages the offsets run 0, 35, ..., 36 * 35.
        if self.offset < 36:
            self.offset += 1
            # Build the URL of the next page.
            nextpage_a_url = "https://music.163.com/discover/playlist/?order=hot&cat=%E5%85%A8%E9%83%A8&limit=35&offset=" + str(self.offset * 35)
            print(self.offset, nextpage_a_url)
            yield scrapy.Request(nextpage_a_url, callback=self.parse)
            print("Fetching the next page")


    # Scrapes the detail page of a single playlist.
    def SongsListPageParse(self, response):
        cntc = response.xpath("//div[@class='cntc']")
        itemML = response.meta["itemML"]

        SongListName = cntc.xpath("./div[@class='hd f-cb']/div/h2//text()").extract_first()
        itemML["SongListName"] = SongListName  # playlist name

        user_url = cntc.xpath("./div[@class='user f-cb']/span[@class='name']/a/@href").extract_first()
        user_id = user_url[14:]
        itemML["AuthorID"] = user_id           # id of the playlist's creator

        time = cntc.xpath("./div[@class='user f-cb']/span[@class='time s-fc4']/text()").extract_first()
        itemML["CreationDate"] = time[0:10]     # creation date (first 10 characters)

        aList = cntc.xpath("./div[@id='content-operation']/a")
        Collection = aList[2].xpath("./@data-count").extract_first()
        itemML["Collection"] = Collection  # favorite count
        Forwarding = aList[3].xpath("./@data-count").extract_first()
        itemML["Forwarding"] = Forwarding  # share count
        Comment = aList[5].xpath("./i/span[@id='cnt_comment_count']/text()").extract_first()
        itemML["Comment"] = Comment        # comment count

        # Join all tag names into one space-separated string.
        tags = ""
        tagList = cntc.xpath("./div[@class='tags f-cb']/a")
        for a in tagList:
            tags = tags + a.xpath("./i/text()").extract_first() + " "
        itemML["Labels"] = tags

        # The song count and play count sit in the header of the song table.
        songtbList = response.xpath("//div[@class='n-songtb']/div")
        NumberOfSongs = songtbList[0].xpath("./span[@class='sub s-fc3']/span[@id='playlist-track-count']/text()").extract_first()
        itemML["NumberOfSongs"] = NumberOfSongs
        AmountOfPlay = songtbList[0].xpath("./div[@class='more s-fc3']/strong[@id='play-count']/text()").extract_first()
        itemML["AmountOfPlay"] = AmountOfPlay
        yield itemML  # hand the scraped item over to pipelines.py
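
The reason for passing deepcopy(itemML) through meta can be seen in a small stand-alone sketch. It does not use Scrapy: the dictionaries below are hypothetical stand-ins for the meta of requests that are still waiting to be downloaded, and the ids are made up. In the spider above a fresh item is created on every loop iteration, so the copy acts as a safeguard; the sketch shows the failure it protects against when one object is reused.

```python
from copy import deepcopy

# Hypothetical stand-in for Request.meta: each entry is what a pending
# request would hand to its callback later on.
shared = {}
pending_shared = []
pending_copied = []

for playlist_id in ["111", "222", "333"]:  # made-up ids
    shared["SongsListID"] = playlist_id
    pending_shared.append({"itemML": shared})            # same object every time
    pending_copied.append({"itemML": deepcopy(shared)})  # independent snapshot

print([m["itemML"]["SongsListID"] for m in pending_shared])  # ['333', '333', '333']
print([m["itemML"]["SongsListID"] for m in pending_copied])  # ['111', '222', '333']
```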

Every playlist on every page corresponds to one li tag, and the a tag inside that li holds the address of the playlist's detail page.
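
The spider takes the id out of that address with the fixed slice a_href[13:]. A slightly more defensive alternative, sketched here with the standard library only, reads the id query parameter instead. The helper name playlist_id_from_href and the sample id are made up for illustration, and the href shape is assumed from the slice used in the spider.

```python
from urllib.parse import urlsplit, parse_qs


def playlist_id_from_href(href):
    """Return the value of the id query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(href).query)
    values = query.get("id")
    return values[0] if values else None


href = "/playlist?id=123456"  # made-up id, in the shape the spider slices

print(href[13:])                            # 123456
print(playlist_id_from_href(href))          # 123456
print(playlist_id_from_href("/playlist"))   # None
```

The slice depends on the prefix being exactly 13 characters long, while the query-string version keeps working if the path changes and returns None when no id is present.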

Open the detail page of a playlist:

The information we scrape is what is boxed in red in the figure above. The corresponding field names are:

SongsListID = scrapy.Field()   # playlist id
SongListName = scrapy.Field()  # playlist name
AmountOfPlay = scrapy.Field()  # play count
Labels = scrapy.Field()        # tag names
Url = scrapy.Field()           # playlist URL, kept for a later, more detailed crawl
Collection = scrapy.Field()    # number of times the playlist was favorited
Forwarding = scrapy.Field()    # share count
Comment = scrapy.Field()       # comment count
NumberOfSongs = scrapy.Field() # number of songs
CreationDate = scrapy.Field()  # playlist creation date
AuthorID = scrapy.Field()      # author id

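Apart from SongsListID and Url, which parse sets from the list page, all of them are obtained in the SongsListPageParse function by parsing the playlist detail page.

Crawling the next page:

There are two ways to get the next page:

The first is to read the next page's URL from the "next page" a tag on each page.

The second relies on the pagination pattern: the offset parameter in the URL grows by 35 from one page to the next (each page holds 35 playlists), so adding 35 to offset at every step keeps moving to the next page. The site showed 37 pages at the time of the crawl, so the last page is at offset 35 * 36 = 1260. In the spider, self.offset counts pages and is multiplied by 35 when the URL is built.

Each call to parse handles exactly one page, so the next-page step uses an if to check offset rather than a for loop.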

yield scrapy.Request(nextpage_a_url, callback=self.parse)

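This works like recursion: parse names itself as the callback, and Scrapy calls it again once the next page has been downloaded.

Because the second approach is simpler, it is the one used here to crawl the next page.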
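
The offset arithmetic of the second approach can be checked without running the crawler. The sketch below builds every list-page URL with the standard library, assuming the 37 pages and 35 playlists per page stated above; the names BASE, PAGE_SIZE, PAGE_COUNT and page_url are made up for illustration.

```python
from urllib.parse import urlencode

BASE = "https://music.163.com/discover/playlist/"
PAGE_SIZE = 35
PAGE_COUNT = 37  # as stated in the article at the time of the crawl


def page_url(page_index):
    """Build the list-page URL for a 0-based page index."""
    params = {
        "order": "hot",
        "cat": "全部",  # urlencode percent-encodes this as UTF-8
        "limit": PAGE_SIZE,
        "offset": page_index * PAGE_SIZE,
    }
    return BASE + "?" + urlencode(params)


urls = [page_url(i) for i in range(PAGE_COUNT)]
print(len(urls))   # 37
print(urls[1])     # ...&limit=35&offset=35
print(urls[-1])    # ...&limit=35&offset=1260
```

The second URL is identical to the one the spider builds by string concatenation for its first next-page request.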

4. Set up pipelines.py to save the collected information (the items)

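from scrapy.exporters import CsvItemExporter


class WyymusicPipeline:
    def __init__(self):
        # CsvItemExporter writes bytes, so the file is opened in binary mode.
        self.MusicListFile = open("MusicList.csv", "wb+")   # save in CSV format
        self.MusicListExporter = CsvItemExporter(self.MusicListFile, encoding='utf8')
        self.MusicListExporter.start_exporting()

    def process_item(self, item, spider):
        if spider.name == 'MusicList':
            self.MusicListExporter.export_item(item)
        # Return the item for every spider so that later pipelines still receive it.
        return item

    def close_spider(self, spider):
        # Finish the export and close the file when the crawl ends.
        self.MusicListExporter.finish_exporting()
        self.MusicListFile.close()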
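
The shape of the resulting file, one header row followed by one row per item, can be previewed with the standard csv module. This is only an illustration of the output format, not how CsvItemExporter is implemented; it uses a subset of the item fields, and the rows are made-up sample values.

```python
import csv
import io

FIELDS = ["SongsListID", "SongListName", "AmountOfPlay"]  # subset of the item fields

# Made-up rows, for illustration only.
rows = [
    {"SongsListID": "111", "SongListName": "Example A", "AmountOfPlay": "1000"},
    {"SongsListID": "222", "SongListName": "Example B", "AmountOfPlay": "2000"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()       # header row, written once
for row in rows:
    writer.writerow(row)   # one line per item

print(buffer.getvalue())
```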

5. The exciting moment at last: start the crawler

In the Terminal, enter:

scrapy crawl MusicList

(Note: make sure you are in the wyyMusic project directory first. If you are not, run cd wyyMusic to enter it.)