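I. Page Analysis
This walkthrough scrapes an album of English children's songs from Ximalaya (喜马拉雅). The album page is https://www.ximalaya.com/album/55952392
1.1 Analyzing the first-page data
Open the browser developer tools (F12) and inspect the response of each request. The request below returns the data we need: the ID and title of every song on the first page.
The code to get the title and ID of each song on the first page is as follows: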
import requests

url = 'https://www.ximalaya.com/revision/album/v1/getTracksList?albumId=55952392&pageNum=1&pageSize=30'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
}
res = requests.get(url=url, headers=headers)
tracks = res.json()['data']['tracks']
# Song titles (a list, one entry per track)
title = [i.get('title') for i in tracks]
# Song IDs (a list, in the same order as the titles)
trackId = [i.get('trackId') for i in tracks]
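The request above hard-codes page 1. As a minimal sketch, the URL building and the JSON parsing could be split into two small helpers so that other pages can be requested the same way. The helper names `build_tracks_url` and `extract_tracks` and the sample payload are illustrative assumptions; only the endpoint, the query parameters, and the `data` / `tracks` / `trackId` / `title` keys come from the request shown above.

```python
from urllib.parse import urlencode

# Endpoint taken from the request shown above
TRACKS_API = 'https://www.ximalaya.com/revision/album/v1/getTracksList'


def build_tracks_url(album_id, page_num, page_size=30):
    """Build the track-list URL for one page of an album (hypothetical helper)."""
    query = urlencode({'albumId': album_id, 'pageNum': page_num, 'pageSize': page_size})
    return f'{TRACKS_API}?{query}'


def extract_tracks(payload):
    """Return (trackId, title) pairs from a decoded track-list response.

    Assumes payload['data']['tracks'] is a list of dicts with 'trackId' and
    'title' keys, as in the response inspected above. If those keys are
    missing, an empty list is returned instead of raising KeyError.
    """
    tracks = (payload.get('data') or {}).get('tracks') or []
    return [(t.get('trackId'), t.get('title')) for t in tracks]


# Made-up sample payload, trimmed to the two fields used here
sample = {'data': {'tracks': [{'trackId': 1, 'title': 'Song A'},
                              {'trackId': 2, 'title': 'Song B'}]}}
print(build_tracks_url(55952392, 1))
print(extract_tracks(sample))
```

Keeping each ID paired with its title avoids relying on two parallel lists staying in the same order.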
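1.2 Getting the audio URL
Click any song in the list to open its detail page, then search that page's network requests for m4a to locate the audio file.
The search turns up a response that lists the m4a files for all the songs shown on the current page, so we only need to pick out the m4a URL that belongs to the current song.
The code to get the m4a URL of the current track is below. Here trackId is a single track ID, not the whole list built in 1.1: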
audio_url = f'https://www.ximalaya.com/m-revision/page/track/queryRelativeTracksById?trackId={trackId}&preOffset=1&nextOffset=8&countKeys=play&order=2'
audio_res = requests.get(url=audio_url, headers=headers)
m4a_list = audio_res.json()['data']
for j in m4a_list:
    # Keep only the entry whose id matches the requested track
    if j['id'] == trackId:
        m4a_url = j['trackInfo']['playPath']
        break
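The loop above leaves `m4a_url` undefined when no entry matches. A small sketch of the same lookup as a function that returns `None` in that case is shown here. The function name `find_play_path` and the sample entries (including the example.com URLs) are made up for illustration; the `id`, `trackInfo`, and `playPath` keys are the ones used in the code above.

```python
def find_play_path(items, track_id):
    """Return the playPath of the entry whose id equals track_id, or None.

    `items` is assumed to be the list found under 'data' in the
    queryRelativeTracksById response, as used in the loop above.
    """
    for item in items:
        if item.get('id') == track_id:
            # trackInfo may be absent or None; fall back to an empty dict
            return (item.get('trackInfo') or {}).get('playPath')
    return None


# Made-up entries that mimic the shape used in the article
related = [
    {'id': 101, 'trackInfo': {'playPath': 'https://example.com/audio/101.m4a'}},
    {'id': 102, 'trackInfo': {'playPath': 'https://example.com/audio/102.m4a'}},
]
print(find_play_path(related, 102))
print(find_play_path(related, 999))
```

The caller can then skip a track when the result is `None` instead of failing with a `NameError`.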
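II. Saving the Files
Once we have the m4a URL, send a GET request to it and write the binary response body to a file with an .mp3 name. The extension is only a filename choice here: the bytes are written exactly as the server returned them, with no conversion.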
audio_info = requests.get(url=m4a_url, headers=headers)
# The with block closes the file even if the write fails
with open(f'{title}.mp3', 'wb') as fp:
    fp.write(audio_info.content)
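Using the song title directly as a filename can fail when the title contains characters such as `/` or `?`. The sketch below shows one possible way to clean the title before writing. The helpers `safe_filename` and `save_audio`, the replacement character, and the fallback name are assumptions made for illustration and are not part of the original script.

```python
import re
from pathlib import Path

# Characters rejected by Windows file systems, plus ASCII control characters
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, extension='.mp3', fallback='untitled'):
    """Turn a song title into a filename (hypothetical helper)."""
    cleaned = INVALID_CHARS.sub('_', title).strip(' .')
    # An empty result (for example a title made only of dots) gets a fallback
    return (cleaned or fallback) + extension


def save_audio(content, title, directory='.'):
    """Write the downloaded bytes under a cleaned filename and return the path."""
    path = Path(directory) / safe_filename(title)
    path.write_bytes(content)
    return path


print(safe_filename('A/B: C?'))
print(safe_filename('Twinkle Twinkle Little Star'))
```

With these helpers, the write step would become `save_audio(audio_info.content, title)`.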
III. Results
The complete code is as follows:
# Scrape the English children's songs album
import requests
import time


# Fetch one page of tracks and download each of them
def get_one_page(page_num):
    url = 'https://www.ximalaya.com/revision/album/v1/getTracksList?albumId=55952392&pageNum=' + str(
        page_num) + '&pageSize=30'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    }
    try:
        res = requests.get(url=url, headers=headers)
    except Exception as e:
        # If the request fails, wait 3 seconds and retry once
        print(e)
        time.sleep(3)
        res = requests.get(url=url, headers=headers)
    tracks = res.json()['data']['tracks']
    for i in tracks:
        # Song title and song ID
        title = i['title']
        trackId = i['trackId']
        audio_url = f'https://www.ximalaya.com/m-revision/page/track/queryRelativeTracksById?trackId={trackId}&preOffset=1&nextOffset=8&countKeys=play&order=2'
        audio_res = requests.get(url=audio_url, headers=headers)
        m4a_list = audio_res.json()['data']
        for j in m4a_list:
            # Download only the entry that matches the current track
            if j['id'] == trackId:
                m4a_url = j['trackInfo']['playPath']
                audio_info = requests.get(url=m4a_url, headers=headers)
                print(f'Downloading song {title}')
                with open(f'{title}.mp3', 'wb') as fp:
                    fp.write(audio_info.content)
                break


# Fetch every page and download its tracks
def get_all_page(page_counts):
    for i in range(page_counts):
        print(f'Fetching songs on page {i + 1}')
        get_one_page(page_num=i + 1)


get_all_page(3)
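The script retries the track-list request once after a 3-second pause, while the other requests are not retried at all. As a hedged sketch, that pattern could be generalized into a small helper that wraps any call. The name `with_retry` and its parameters are hypothetical, and the flaky function below only simulates a failing request.

```python
import time


def with_retry(func, attempts=3, delay=3.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out (hypothetical helper).

    The last exception is re-raised if every attempt fails. `sleep` is
    injectable so the pause can be skipped in a demo or test.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f'Attempt {attempt} failed: {exc}; retrying in {delay}s')
            sleep(delay)


# Simulated request that fails twice before succeeding
state = {'calls': 0}


def flaky_request():
    state['calls'] += 1
    if state['calls'] < 3:
        raise ConnectionError('simulated network error')
    return 'ok'


print(with_retry(flaky_request, attempts=3, delay=0, sleep=lambda seconds: None))
```

In the script, a request could then be wrapped as `with_retry(lambda: requests.get(url, headers=headers, timeout=10))`; the `timeout` value is an assumption, chosen so that a stalled connection raises instead of hanging.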
The output of the run is as follows: