Python Web Scraper Example: Top 50 Popular Songs of a NetEase Cloud Music Artist

镜花照无眠

Last modified: 2024-09-26 20:19:07

First published: 2024-09-26 20:17:44

Article link: https://blog.csdn.net/weixin_45693567/article/details/142573719

Target URL:

https://music.163.com/#/artist?id=××××××××

The approach is the same as for the chart pages; only the URL needs to change.
For example, chart page: url='.../discover/toplist?id=××××××××'
Artist popular songs: url='.../artist?id=××××××××'

Related article: Python Web Scraper Example: NetEase Cloud Music Charts

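Prevent crashes caused by invalid characters when the file name is built from title
Name the output folder after the artist: the folder name is generated dynamically from the content of the <h2> tag
Download the music for a list of URLs with multiple threads, using Python's concurrent.futures module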

1. Preventing Crashes Caused by Invalid Characters

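The error:

FileNotFoundError: [Errno 2] No such file or directory: 'music\霄玉若惜【仙剑奇侠传四（玄霄/夙玉）】.mp3'

Cause:
The title contains the character /, which cannot be written as part of a file name. Windows reads it as a path separator, so it looks for a folder named music\霄玉若惜【仙剑奇侠传四（玄霄, which does not exist.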

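The characters \ / : * ? " < > | have special meanings or uses in Windows. Allowing them in file names could keep the system from handling files correctly or cause errors. Specifically:

- Backslash (\) and forward slash (/): used in Windows paths to separate directories and file names. Using them in a file name can cause path-parsing errors.
- Colon (:): used as the drive separator (for example C:), so it cannot be used in a file name.
- Asterisk (*) and question mark (?): used in Windows as wildcards for matching file name patterns. Using them in a file name can cause unexpected matches or filtering errors.
- Double quote ("): used on the command line to delimit arguments. A file name containing one can cause command-parsing errors.
- Angle brackets (< >) and pipe (|): used on the command line for input/output redirection and pipes. Using them in a file name can interfere with those operations.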
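To see why a slash in the title leads to FileNotFoundError, here is a minimal sketch using the standard-library ntpath module, which applies Windows path rules and can be imported on any operating system. The short title is a made-up stand-in for the real song title.

```python
import ntpath  # Windows path rules, importable on any platform

# Hypothetical short title containing a slash, standing in for the real song title
title = 'song(part1/part2)'
path = ntpath.join('music', title + '.mp3')

# Windows treats both \ and / as separators, so the title is split in two
folder, name = ntpath.split(path)
print(folder)  # music\song(part1
print(name)    # part2).mp3
```

The first half of the title ends up in the directory part of the path. That directory was never created, so open() has nowhere to write the file.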

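Solution:
Before building the file name, use a regular expression to replace the characters that are not allowed.
For example, to replace /:
whenever title contains any of the characters \/:*?"<>|, replace that character with a space.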
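The replacement rule can be tried on its own before it goes into the scraper. This is a minimal sketch: the name sanitize_title follows the helper used later in this article, and the second test string is made up.

```python
import re

# Characters Windows rejects in file names: \ / : * ? " < > |
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')

def sanitize_title(title):
    """Replace every invalid character with a space and trim the ends."""
    return INVALID_CHARS.sub(' ', title).strip()

print(sanitize_title('霄玉若惜【仙剑奇侠传四（玄霄/夙玉）】'))
# 霄玉若惜【仙剑奇侠传四（玄霄 夙玉）】
print(sanitize_title('a\\b:c?'))
# a b c
```

Inside a character class the backslash has to be written as \\. The pattern [\/...] only escapes the slash and leaves the backslash unmatched.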

Implementation:

import os
import re
import requests

filename = 'music'  # folder where the songs are saved
os.makedirs(filename, exist_ok=True)
url = 'https://music.163.com/artist?id=××××××××'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# Match each song's ID and title with a regular expression
html_data = re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text)
# Characters that are not allowed in file names (\\ matches the backslash itself)
invalid_chars_pattern = r'[\\/:*?"<>|]'
for num_id, title in html_data:
    title = re.sub(invalid_chars_pattern, ' ', title)  # replace invalid characters with a space
    music_url = f'https://music.163.com/song/media/outer/url?id={num_id}.mp3'  # URL of the audio file
    music_content = requests.get(url=music_url, headers=headers).content
    with open(os.path.join(filename, title.strip() + '.mp3'), mode='wb') as f:
        f.write(music_content)
    print(num_id, title)

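Replace the url with your own target and remove the #/ in the middle of it: the browser shows https://music.163.com/#/artist?id=××××××××, while the script requests https://music.163.com/artist?id=××××××××.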
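The #/ has to go because everything after # is a URL fragment, which HTTP clients do not send to the server. A small sketch of the conversion follows; the helper name to_request_url and the id 12345 are made up for illustration.

```python
def to_request_url(browser_url):
    """Drop the '#/' fragment marker so the server receives the real path."""
    return browser_url.replace('/#/', '/', 1)

print(to_request_url('https://music.163.com/#/artist?id=12345'))
# https://music.163.com/artist?id=12345
```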

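2. Code Optimization

Reducing redundancy, keeping the structure clear, and giving each part a well-defined job improves readability and makes the code easier to maintain and extend.

Function split: dividing the script into functions makes the job of each part explicit.
- create_music_folder(): creates the folder where the music is saved.
- get_music_data(): fetches the music data and returns a list of song IDs and titles.
- sanitize_title(): cleans a title by replacing invalid characters.
- download_music(): downloads the given song and saves it.
- main(): the program entry point.
Fewer global variables: avoid relying on many globals and pass the required information through parameters instead.
String formatting: use f-strings (such as f"{value}") to improve readability.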

import os
import re
import requests

# Create the folder where the music is saved
def create_music_folder(folder_name='music1'):
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
# Fetch the page and return a list of (song ID, title) pairs
def get_music_data(url, headers):
    response = requests.get(url, headers=headers)
    return re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text)
# Clean a title by replacing invalid characters (\\ matches the backslash itself)
def sanitize_title(title):
    return re.sub(r'[\\/:*?"<>|]', ' ', title)
# Download one song and save it
def download_music(num_id, title, folder_name='music1'):
    music_url = f'https://music.163.com/song/media/outer/url?id={num_id}.mp3'
    music_content = requests.get(music_url).content
    file_path = os.path.join(folder_name, f"{sanitize_title(title).strip()}.mp3")
    with open(file_path, 'wb') as f:
        f.write(music_content)
# Program entry point
def main():
    create_music_folder()
    url = 'https://music.163.com/artist?id=××××××××'  # e.g. 'https://music.163.com/artist?id=12480034'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
    music_data = get_music_data(url, headers)
    for num_id, title in music_data:
        download_music(num_id, title)

if __name__ == "__main__":
    main()
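download_music() writes whatever bytes come back, so a failed request would still leave a file with an .mp3 extension on disk. One possible guard, sketched here, is to check the status code and Content-Type before saving. It assumes a failed download can be recognized by a non-200 status or a non-audio Content-Type, which is an assumption about the server's behavior and has not been verified here. The helper name save_if_audio and the sample responses are made up.

```python
import os
import tempfile

def save_if_audio(status_code, content_type, content, file_path):
    """Write content to file_path only when the response looks like audio.

    Returns True if the file was written, False if the response was skipped.
    """
    if status_code != 200 or not content_type.startswith('audio/'):
        return False
    with open(file_path, 'wb') as f:
        f.write(content)
    return True

# Possible use with requests (not executed here):
#   r = requests.get(music_url, headers=headers, timeout=10)
#   save_if_audio(r.status_code, r.headers.get('Content-Type', ''), r.content, file_path)

# Offline demonstration with made-up responses
with tempfile.TemporaryDirectory() as tmp:
    ok = save_if_audio(200, 'audio/mpeg', b'\x00\x01', os.path.join(tmp, 'a.mp3'))
    bad = save_if_audio(200, 'text/html; charset=utf-8', b'<html>', os.path.join(tmp, 'b.mp3'))
    written = sorted(os.listdir(tmp))
print(ok, bad, written)  # True False ['a.mp3']
```

Taking the response fields as plain arguments keeps the check testable without network access.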

3. Naming the Folder After the Artist

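The function get_artist_name() extracts the content of the <h2> tag from the page HTML and uses it as the folder name.

To download a different artist, just change the id in the URL.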
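The <h2> extraction can be tried on a small HTML fragment first. In this sketch the fragment, its attributes, and the helper name sanitize_name are made up, and the real page markup may differ. It also shows why the extracted name is worth cleaning before it is used as a folder name.

```python
import re

def get_artist_name(html):
    """Return the text of the first <h2> tag, or a fallback name."""
    match = re.search(r'<h2[^>]*>(.*?)</h2>', html)
    return match.group(1).strip() if match else 'Unknown Artist'

def sanitize_name(name):
    """Replace characters that Windows forbids in file and folder names."""
    return re.sub(r'[\\/:*?"<>|]', ' ', name).strip()

# Made-up HTML fragment; the real page markup may differ
sample = '<div><h2 id="artist-name" class="sname">AC/DC</h2></div>'
print(get_artist_name(sample))                 # AC/DC
print(sanitize_name(get_artist_name(sample)))  # AC DC
print(get_artist_name('<p>no heading</p>'))    # Unknown Artist
```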

import os
import re
import requests
# create a folder to save the music
def create_music_folder(folder_name):
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
# extracts a valid artist name or list name from the tag <h2>
def get_artist_name(html):
    match = re.search(r'<h2[^>]*>(.*?)</h2>', html)
    if match:
        return match.group(1).strip()
    return "Unknown Artist"
# get data, return a list of song ids and titles
def get_music_data(url, headers):
    response = requests.get(url, headers=headers)
    return re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text), response.text
# title cleanup, replaces illegal characters
def sanitize_title(title):
    return re.sub(r'[\\/:*?"<>|]', ' ', title)  # \\ matches the backslash itself
# download the song and save it
def download_music(num_id, title, folder_name):
    music_url = f'https://music.163.com/song/media/outer/url?id={num_id}.mp3'
    music_content = requests.get(music_url).content
    file_path = os.path.join(folder_name, f"{sanitize_title(title).strip()}.mp3")
    with open(file_path, 'wb') as f:
        f.write(music_content)
# main entry point
def main():
    url = 'https://music.163.com/artist?id=××××' # can be changed to the appropriate URL
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
    # get music data and html content
    music_data, html = get_music_data(url, headers)
    # extract the artist name or list name and clean it for use as the folder name
    artist_name = sanitize_title(get_artist_name(html)).strip()
    create_music_folder(artist_name)
    # download
    for num_id, title in music_data:
        download_music(num_id, title, artist_name)

if __name__ == "__main__":
    main()

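4. Multithreaded Download for a List of URLs

This can be done with Python's concurrent.futures module: pass in a list of URLs and use ThreadPoolExecutor to download with multiple threads. Each URL (one artist page) is handled as its own task, while the songs within one page are still downloaded one after another.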

import os
import re
import requests
from concurrent.futures import ThreadPoolExecutor

# create a folder to save the music
def create_music_folder(folder_name):
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)

# extracts a valid artist name or list name from the tag <h2>
def get_artist_name(html):
    match = re.search(r'<h2[^>]*>(.*?)</h2>', html)
    if match:
        return match.group(1).strip()
    return "Unknown Artist"

# get data, return a list of song ids and titles
def get_music_data(url, headers):
    response = requests.get(url, headers=headers)
    return re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text), response.text

# title cleanup, replaces illegal characters
def sanitize_title(title):
    return re.sub(r'[\\/:*?"<>|]', ' ', title)  # \\ matches the backslash itself

# download the song and save it
def download_music(num_id, title, folder_name):
    music_url = f'https://music.163.com/song/media/outer/url?id={num_id}.mp3'
    music_content = requests.get(music_url).content
    file_path = os.path.join(folder_name, f"{sanitize_title(title).strip()}.mp3")
    with open(file_path, 'wb') as f:
        f.write(music_content)

# download songs
def download_artist_music(url, headers):
    # get music data and html content
    music_data, html = get_music_data(url, headers)
    # extract the artist name or list name and clean it for use as the folder name
    artist_name = sanitize_title(get_artist_name(html)).strip()
    create_music_folder(artist_name)
    # download
    for num_id, title in music_data:
        download_music(num_id, title, artist_name)

def main(urls):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
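    # download in multiple threads, one task per URL; list() consumes the
    # results so that an exception raised in a worker is not silently dropped
    with ThreadPoolExecutor() as executor:
        list(executor.map(lambda url: download_artist_music(url, headers), urls))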

if __name__ == "__main__":
    urls = [
        'https://music.163.com/artist?id=××××××××',
        'https://music.163.com/artist?id=××××'  # add more URLs
    ]
    main(urls)
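The version above starts one task per artist URL, so the songs of a single artist are still fetched sequentially. A possible variation, sketched here, submits one task per song and collects the results with as_completed, so one failed download does not stop the others. The function fake_download is a stand-in for download_music(), the ids and titles are made up, and max_workers=4 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(num_id, title):
    """Stand-in for download_music(); fails for one id to show error handling."""
    if num_id == '2':
        raise IOError(f'could not fetch {title}')
    return f'{title}.mp3'

def download_all(music_data, download, max_workers=4):
    """Run download(num_id, title) for every song; return (saved, failed) lists."""
    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(download, num_id, title): title
                   for num_id, title in music_data}
        for future in as_completed(futures):
            title = futures[future]
            try:
                saved.append(future.result())
            except Exception as exc:  # keep going when one song fails
                failed.append((title, str(exc)))
    return sorted(saved), failed

saved, failed = download_all([('1', 'song A'), ('2', 'song B'), ('3', 'song C')], fake_download)
print(saved)   # ['song A.mp3', 'song C.mp3']
print(failed)  # [('song B', 'could not fetch song B')]
```

To use it with the real downloader, one option is to pass a small wrapper that fixes the folder name, for example lambda num_id, title: download_music(num_id, title, artist_name).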