Practice mini-project: a music scraper


今天周几

Category: Crawler Learning. Tags: python, http, experience sharing, javascript, xpath

Article link: https://blog.csdn.net/riiki/article/details/105982066


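I've been learning web scraping for a while now (though I'm still a rookie). Lately I've been job hunting, sending applications through all the major sites, and every one has sunk without a trace. With nothing better to do, I figured I'd write another small project for practice. Let's go <(￣︶￣)↗[GO!]
Self-defined requirements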

Target site: Baidu Music
Tools: requests, xpath, re
Goal: enter an artist's name, then scrape and download all of that artist's songs

Let's begin

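The first step, as always, is analyzing the site's structure. Personally I find this the hard part, the part that costs you hair, but it is also the soul of scraping. I'll skip anything that sits directly in the HTML; anyone a little familiar with front-end work, or who has spent a few days with requests, can parse that out. The first tricky spot is pagination on the artist's song list page.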

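After clicking through a few pages, I found that, unlike most sites, the page number is not passed in the URL, so the list is clearly loaded dynamically. OK: open the browser's developer tools, refresh the page, turn the page, and the data shows up in a single response (the getsongs request used in the source code below).
That response carries ready-rendered HTML, so we can request it directly, pull the HTML out of its data field, and parse what we need with a regular expression or similar. The code only needs to extract each song's href and title.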
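
To make the parsing step concrete, here is a minimal sketch of pulling the HTML fragment out of the JSON body and extracting (href id, title) pairs with a regular expression. The sample payload is made up for illustration and only mimics the shape described above (a data.html field holding anchor tags); the helper name extract_songs is also hypothetical.

```python
import json
import re

# Hypothetical sample shaped like the response described above:
# a JSON object whose data.html field holds a rendered HTML fragment.
SAMPLE_PAYLOAD = json.dumps({
    "data": {
        "html": (
            '<a href="/song/111" target="_blank" class="namelink " title="Song A">Song A</a>'
            '<a href="/song/222" target="_blank" class="namelink " title="Song B">Song B</a>'
        )
    }
})

# Captures the id after /song/ and the title attribute of each song link.
SONG_LINK = re.compile(
    r'<a href="/song/(.*?)" target="_blank" class="namelink " title="(.*?)"'
)


def extract_songs(payload_text):
    """Return (song_id, title) pairs found in the data.html fragment."""
    html = json.loads(payload_text)["data"]["html"]
    return SONG_LINK.findall(html)


songs = extract_songs(SAMPLE_PAYLOAD)
print(songs)
```

A regex this strict depends on the exact attribute order and spacing in the markup, so it will silently return nothing if the site changes its template.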

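Next, open a song's playback page, play the song, and look at the Media filter in the developer tools. The audio request looks like this: http://audio04.dmhmusic.com/71_53_T10038986645_128_4_1_0_sdk-cpm/cn/0208/M00/E5/61/ChR46119DqeAJGANAD3PrR3qZCk162.mp3?xcode=4455e866b557087b64c3a4a34bfcc1036e7e2ee
I was completely baffled. A rookie like me has no hope of spotting a pattern in how that path is put together. No matter: click through the XHR and JS requests as well, because they often hold a pleasant surprise. And by chance, I came across a JS response that contains the song's details.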

Its file_link field is exactly the song URL we want to request. OK, analysis done.
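
As a rough sketch of reading file_link out of such a response: the request URL quoted in this post asks for format=jsonp, so the body may arrive wrapped in a callback call. The helper below (parse_jsonp, a hypothetical name) strips such a wrapper if present and otherwise parses plain JSON. The sample body is invented and only assumes the bitrate.file_link nesting that the source code relies on.

```python
import json
import re


def parse_jsonp(text):
    """Parse a JSON or JSONP body, stripping a callback(...) wrapper if present."""
    text = text.strip()
    match = re.match(r'^[\w$.]+\((.*)\)\s*;?$', text, re.S)
    if match:
        text = match.group(1)
    return json.loads(text)


# Invented sample body; the real response has many more fields.
sample = 'jQuery123_456({"bitrate": {"file_link": "http://example.com/a.mp3"}});'
file_link = parse_jsonp(sample)["bitrate"]["file_link"]
print(file_link)
```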
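With the approach clear, writing the code is easy. One last recommendation for fellow scraping beginners: give Postman a try. It makes it very easy to work out which params in a URL are required and which are optional, so you can quickly stop worrying about a pile of useless parameters. In this project, for example, the URL of the JS that holds the download link is quite long: http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.song.playAAC&format=jsonp&callback=jQuery172012729838958251238_1588855404448&songid=87967607&from=web&_=1588855423310
It has a lot of odd IDs mixed in. After filtering them out in Postman, it turns out that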

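four parameters are enough: method, format, songid, and from (the ones get_mp3 sends in the source code below). Their meanings are easy to understand.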
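
The same trimming can be sketched in code with the standard library, assuming the four parameters to keep are the ones get_mp3 sends in the source code (method, format, songid, from). The helper name keep_params is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

FULL_URL = (
    'http://musicapi.taihe.com/v1/restserver/ting'
    '?method=baidu.ting.song.playAAC&format=jsonp'
    '&callback=jQuery172012729838958251238_1588855404448'
    '&songid=87967607&from=web&_=1588855423310'
)

# Assumed to be the required parameters, based on get_mp3 in the source code.
KEEP = ('method', 'format', 'songid', 'from')


def keep_params(url, keep):
    """Return url with every query parameter not named in keep removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return urlunsplit(parts._replace(query=urlencode(kept)))


short_url = keep_params(FULL_URL, KEEP)
print(short_url)
```

This only rewrites the URL. Whether the server still answers without callback and _ is something to confirm by actually sending the request, which is what Postman is handy for.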
Finally, here is the source code. It's fairly rough, so please point out anything I've missed.

import requests
from fake_useragent import UserAgent
from lxml import etree
import re
import os

artist_url = 'http://music.taihe.com/artist'
song_url = 'http://music.taihe.com/data/user/getsongs'
json_url = 'http://musicapi.taihe.com/v1/restserver/ting'

headers = {
    'User-Agent': UserAgent().random
}

session = requests.session()


# Get the artist list
def get_artist():
    response = session.get(url=artist_url, headers=headers)
    response.encoding = response.apparent_encoding
    text = response.text
    tree = etree.HTML(text)
    dl_list = tree.xpath('//div[@class="hot-head clearfix"]/dl')
    artist_list = []
    for dl in dl_list:
        title = dl.xpath('./dd/a/@title')[0]
        href = dl.xpath('./dd/a/@href')[0].split('/')[-1]
        artist_list.append((title, href))
    li_list = tree.xpath('//ul[@class="container"]/li[1]/ul[@class="clearfix"]/li')
    for li in li_list:
        try:
            title = li.xpath('./a/@title')[0]
            href = li.xpath('./a/@href')[0].split('/')[-1]
            artist_list.append((title, href))
        except IndexError:
            # Skip entries that lack a title or href attribute
            continue
    return artist_list


# Get the song list
def get_song(artist: tuple, count=15):
    # Fetch up to `count` (song_id, title) pairs, 15 per request, in page order.
    # Looping over a range avoids an endless loop when count is not a multiple of 15.
    song_list = []
    for start in range(0, count, 15):
        params = {
            'start': start,
            'size': 15,
            'ting_uid': artist[1],
        }
        response = session.get(url=song_url, headers=headers, params=params)
        text = response.json()['data']['html']
        datas = re.findall(r'<a href="/song/(.*?)" target="_blank" class="namelink " title="(.*?)"', text)
        if not datas:
            # No more songs for this artist
            break
        song_list += datas
    return song_list[:count]


# Get the download link
def get_mp3(song: tuple):
    params = {
        'method': 'baidu.ting.song.playAAC',
        'format': 'jsonp',
        'songid': song[0],
        'from': 'web',
    }
    response = session.get(url=json_url, headers=headers, params=params)
    download_url = response.json()['bitrate']['file_link']
    return download_url, song[1]


# Save to disk
def save(download_url: str, artist: str, title: str):
    print(f'{artist}/{title} \t downloading ...')
    response = session.get(url=download_url, headers=headers)
    # Create the artist folder if it does not exist yet
    os.makedirs('./' + artist, exist_ok=True)
    with open('./' + artist + '/' + title + '.mp3', 'wb') as file:
        file.write(response.content)


def main():
    while True:
        artist_list = get_artist()
        index = 0
        for artist in artist_list:
            index += 1
            if index % 8 == 0:
                print(artist[0])
            else:
                print(artist[0] + '\t', end='')
        singer = input('Enter artist>>>')
        for artist in artist_list:
            if singer == artist[0]:
                count = input('Enter number of songs to download>>>')
                # int() instead of eval(): never evaluate raw user input
                song_list = get_song(artist, int(count))
                for song in song_list:
                    download_url, title = get_mp3(song)
                    save(download_url, singer, title)
                print('Download complete'.center(100, '-'))
                break
                break


if __name__ == '__main__':
    main()
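
One rough edge worth noting in save: the song title goes straight into the file path, so a title containing a character such as / or ? could make open fail or write to an unexpected location. A minimal sketch of a sanitizing helper follows. The name safe_filename and the choice of replacement character are assumptions, not part of the original code.

```python
import re


def safe_filename(title, replacement='_'):
    """Replace characters that are not allowed in common file systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title).strip()
    # Fall back to a placeholder if nothing usable is left
    return cleaned or 'untitled'


print(safe_filename('AC/DC: Live?'))
```

It could be applied to title (and artist) before the path is built in save.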