Scraping KuGou Music Albums with Python and Selenium (Source Code Included)

LIY若依

Last modified: 2024-07-31 23:43:47

Tags: python, programming languages

First published: 2024-07-31 23:21:48

Article link: https://blog.csdn.net/m0_74972192/article/details/140834632

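In this post I show how to use Python and Selenium to scrape song information from the KuGou Music website. Selenium loads the search page, BeautifulSoup parses the resulting HTML, and we then extract the song and album information.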

Dependencies

requests
beautifulsoup4
selenium

Preparation

First, install the required libraries:

pip install requests beautifulsoup4 selenium

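Steps

Step 1: Initialize parameters

We use Options to run Chrome in headless mode and set a few extra flags (--disable-gpu, --no-sandbox, --disable-dev-shm-usage) so that the browser works reliably in a server environment.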

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

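Step 2: URL-encode the search keyword

We use urllib.parse.quote to URL-encode the artist name entered by the user so that it can be placed safely in the search URL.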

import urllib.parse

keyword = input('Enter an artist name: ')
search_url = f'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord={urllib.parse.quote(keyword)}'
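
To make the encoding step concrete, here is a minimal sketch that wraps it in a helper. The function name build_search_url and the sample keyword are assumptions for illustration and are not part of the original script; the URL template is the one used above.

```python
import urllib.parse

SEARCH_TEMPLATE = (
    'https://www.kugou.com/yy/html/search.html'
    '#searchType=song&searchKeyWord={}'
)

def build_search_url(keyword: str) -> str:
    """Return the search URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes spaces and the UTF-8 bytes of non-ASCII characters
    return SEARCH_TEMPLATE.format(urllib.parse.quote(keyword.strip()))

print(build_search_url('周杰伦'))
# prints: https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord=%E5%91%A8%E6%9D%B0%E4%BC%A6
```

Each Chinese character becomes three percent-encoded bytes because quote() encodes the text as UTF-8 by default.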

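Step 3: Open the page with Selenium

We open the KuGou Music search page with Selenium and wait for the search results to render before reading the page source. Note that implicitly_wait() only affects element lookups and does not delay page_source, so an explicit wait is used here.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome(options=chrome_options)
driver.get(search_url)
# Wait up to 10 seconds for at least one song link to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.song_name')))
html_content = driver.page_source
driver.quit()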

Step 4: Parse the HTML

We parse the page source with BeautifulSoup and extract the song and album links by their CSS classes.

from bs4 import BeautifulSoup as be

soup = be(html_content, 'html.parser')
albums = soup.find_all('a', class_='album_name')
songs = soup.find_all('a', class_='song_name')
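
When the selectors need checking without launching a browser, the same class-based lookup can be tried on a static snippet. The sketch below is a dependency-free stand-in that uses the standard library's html.parser instead of BeautifulSoup. The AnchorCollector class, the collect helper, and the sample HTML are made up for illustration, and the real page markup may differ.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (title, href) pairs from <a> tags carrying a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names
        classes = (attr_map.get('class') or '').split()
        if self.css_class in classes:
            self.found.append((attr_map.get('title'), attr_map.get('href')))

def collect(html: str, css_class: str):
    """Return all (title, href) pairs for <a> tags with the given class."""
    parser = AnchorCollector(css_class)
    parser.feed(html)
    return parser.found

# Hypothetical markup, only loosely modeled on a search result list
SAMPLE_HTML = '''
<ul>
  <li><a class="song_name" title="Song A" href="/song/1">Song A</a>
      <a class="album_name" title="Album X" href="/album/9">Album X</a></li>
  <li><a class="song_name" title="Song B" href="/song/2">Song B</a>
      <a class="album_name" title="" href="">(none)</a></li>
</ul>
'''

print(collect(SAMPLE_HTML, 'song_name'))
print(collect(SAMPLE_HTML, 'album_name'))
```

The second album entry comes back with an empty title and link, which is the case the "no album" fallback in the next step is meant to handle.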

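Step 5: Print the results

We iterate over the extracted songs and albums together and print each song's title, album, and album link. For each album we then request the album page and print the src of any <audio> elements found there.

import requests

# zip() pairs items by position, so the two lists must have the same length
assert len(songs) == len(albums), 'song and album counts differ'

for song, album in zip(songs, albums):
    song_title = song.get('title')
    album_title = album.get('title')
    album_url = album.get('href')

    # Fall back to a placeholder when the album title is empty
    if not album_title:
        album_title = "No album"

    print(f'Song: {song_title}, Album: {album_title}, url: {album_url}')

    # Skip entries without a link; requests.get(None) would raise an exception
    if not album_url:
        continue

    album_response = requests.get(album_url, timeout=10)
    album_soup = be(album_response.text, 'html.parser')
    audio_elements = album_soup.find_all('audio')

    for audio in audio_elements:
        mp3_url = audio.get('src')
        if mp3_url:
            print(f'Audio link: {mp3_url}')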
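
The loop in this step passes each href straight to requests.get, which fails if the attribute is missing or is not an absolute URL. A small guard, sketched below, can normalize links first. The helper name resolve_link and the BASE_URL value are assumptions for illustration; whether KuGou actually emits relative or protocol-relative links is not confirmed here.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://www.kugou.com/'  # assumed base for resolving relative links

def resolve_link(href: Optional[str], base: str = BASE_URL) -> Optional[str]:
    """Return an absolute http(s) URL, or None if the link is unusable."""
    if not href or not href.strip():
        return None
    absolute = urljoin(base, href.strip())
    # Reject javascript:, mailto: and other non-HTTP schemes
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute

for raw in ['/album/123.html', '//www.kugou.com/album/5.html',
            'https://www.kugou.com/album/7.html', 'javascript:;', '', None]:
    print(repr(raw), '->', resolve_link(raw))
```

In the loop, the result could be checked once (for example, skipping the request when it is None) before calling requests.get.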

Full code

The complete script:

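import requests
from bs4 import BeautifulSoup as be
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import urllib.parse

# Step 1: initialize parameters (headless Chrome for server environments)
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Step 2: URL-encode the search keyword
keyword = input('Enter an artist name: ')
search_url = f'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord={urllib.parse.quote(keyword)}'

# Step 3: open the page with Selenium
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(search_url)

    # Wait up to 10 seconds for at least one song link to render.
    # implicitly_wait() would not help here: it only affects element lookups.
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.song_name')))

    # Grab the rendered page source
    html_content = driver.page_source
finally:
    # Always close the browser, even if the wait times out
    driver.quit()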

# Step 4: parse the HTML to extract the song information
soup = be(html_content, 'html.parser')
albums = soup.find_all('a', class_='album_name')

songs = soup.find_all('a', class_='song_name')

# zip() pairs items by position, so the two lists must have the same length
assert len(songs) == len(albums), 'song and album counts differ'

# Step 5: iterate over songs and albums together
for song, album in zip(songs, albums):
    song_title = song.get('title')
    album_title = album.get('title')
    album_url = album.get('href')

    # Fall back to a placeholder when the album title is empty
    if not album_title:
        album_title = "No album"

    print(f'Song: {song_title}, Album: {album_title}, url: {album_url}')

    # Skip entries without a link; requests.get(None) would raise an exception
    if not album_url:
        continue

    # Request the album page
    album_response = requests.get(album_url, timeout=10)
    album_soup = be(album_response.text, 'html.parser')

    # Look for audio file links on the album page
    audio_elements = album_soup.find_all('audio')

    for audio in audio_elements:
        mp3_url = audio.get('src')
        if mp3_url:
            print(f'Audio link: {mp3_url}')

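Code walkthrough:

Initialize parameters: we use Options to run Chrome in headless mode and set extra flags so that the browser works reliably in a server environment.
URL-encode the keyword: we use urllib.parse.quote to URL-encode the artist name so that it can be placed in the search URL.
Open the page with Selenium: we open the KuGou Music search page with Selenium and wait for the page content to load before reading the page source.
Parse the HTML: we parse the page source with BeautifulSoup and extract the song and album information.
Print the results: we iterate over the extracted songs and albums and print each song's title, album, and link.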

Extension example 1: Save the song information to a CSV file

The scraped song information can be saved to a CSV file for later analysis and processing.

import csv

# newline='' lets the csv module control line endings; utf-8 keeps Chinese titles intact
with open('songs.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Song', 'Album', 'Link']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for song, album in zip(songs, albums):
        song_title = song.get('title')
        album_title = album.get('title')
        song_url = song.get('href')

        if not album_title:
            album_title = "No album"

        writer.writerow({'Song': song_title, 'Album': album_title, 'Link': song_url})
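
To check what was written, the file can be read back with csv.DictReader. The sketch below writes two made-up rows to a temporary file and reads them back; the helper names write_rows and read_rows, the column names, and the sample data are assumptions for illustration. It uses the utf-8-sig encoding, which adds a byte-order mark that helps spreadsheet programs such as Excel detect UTF-8.

```python
import csv
import os
import tempfile

FIELDNAMES = ['Song', 'Album', 'Link']

def write_rows(path, rows):
    """Write a list of dicts to a CSV file with a header row."""
    # newline='' lets the csv module control line endings
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)

def read_rows(path):
    """Read the CSV file back into a list of dicts keyed by the header."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))

# Made-up sample rows; the comma in the second title exercises CSV quoting
sample = [
    {'Song': '歌曲 A', 'Album': '专辑 X', 'Link': 'https://example.com/1'},
    {'Song': 'Song B, live', 'Album': 'No album', 'Link': 'https://example.com/2'},
]
path = os.path.join(tempfile.mkdtemp(), 'songs.csv')
write_rows(path, sample)
print(read_rows(path))
```

Reading with the same utf-8-sig encoding strips the byte-order mark again, so the first column name round-trips unchanged.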

Extension example 2: Multithreaded scraping

To speed up scraping, we can fetch the song pages in parallel using threads. Output from different threads may interleave.

import threading

def fetch_song_info(song, album):
    song_title = song.get('title')
    album_title = album.get('title')
    song_url = song.get('href')

    if not album_title:
        album_title = "No album"

    print(f'Song: {song_title}, Album: {album_title}, url: {song_url}')

    # Nothing to fetch if the link is missing
    if not song_url:
        return

    song_response = requests.get(song_url, timeout=10)
    song_soup = be(song_response.text, 'html.parser')
    lyrics = song_soup.find('div', class_='lyrics')

    if lyrics:
        print(f'Lyrics: {lyrics.text}')

# Start one thread per song, then wait for all of them to finish
threads = []
for song, album in zip(songs, albums):
    thread = threading.Thread(target=fetch_song_info, args=(song, album))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
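
Starting one thread per song can open many connections at once. As an alternative sketch, concurrent.futures.ThreadPoolExecutor caps the number of worker threads and returns results in input order. The fetch function here is a stand-in that sleeps instead of calling requests.get, and fetch_all and MAX_WORKERS are assumed names and values, so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # assumed cap; tune to what the target site tolerates

def fetch(url: str) -> dict:
    """Stand-in for a real download; replace the sleep with requests.get()."""
    time.sleep(0.05)
    return {'url': url, 'length': len(url)}

def fetch_all(urls, max_workers: int = MAX_WORKERS):
    """Fetch all URLs with a bounded thread pool, preserving input order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs
        for url, outcome in zip(urls, pool.map(fetch, urls)):
            results.append((url, outcome))
    return results

urls = [f'https://example.com/song/{i}' for i in range(8)]
for url, outcome in fetch_all(urls):
    print(url, outcome['length'])
```

With pool.map, an exception raised inside fetch is re-raised when its result is consumed, so a real version would likely wrap the request in try/except.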

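Extension example 3: Using a proxy

To reduce the chance of being blocked by the site, requests can be routed through a proxy.

# Replace your_proxy and port with real values.
# Most HTTP proxies are reached over http://, even when the target URL is https://.
proxies = {
    'http': 'http://your_proxy:port',
    'https': 'http://your_proxy:port',
}

for song, album in zip(songs, albums):
    song_title = song.get('title')
    album_title = album.get('title')
    song_url = song.get('href')

    if not album_title:
        album_title = "No album"

    print(f'Song: {song_title}, Album: {album_title}, url: {song_url}')

    # Nothing to fetch if the link is missing
    if not song_url:
        continue

    song_response = requests.get(song_url, proxies=proxies, timeout=10)
    song_soup = be(song_response.text, 'html.parser')
    lyrics = song_soup.find('div', class_='lyrics')

    if lyrics:
        print(f'Lyrics: {lyrics.text}')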
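
Proxy settings are easy to mistype, so it can help to build the mapping in one place. The sketch below assembles a proxies dictionary in the shape that requests accepts. The helper name build_proxies, the host, the port, and the credentials are placeholders for illustration, not a real proxy.

```python
from typing import Optional
from urllib.parse import quote

def build_proxies(host: str, port: int,
                  user: Optional[str] = None,
                  password: Optional[str] = None) -> dict:
    """Return a proxies mapping for an HTTP proxy (hypothetical helper)."""
    auth = ''
    if user is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    proxy_url = f'http://{auth}{host}:{port}'
    # The same proxy URL is used for both plain HTTP and HTTPS traffic
    return {'http': proxy_url, 'https': proxy_url}

proxies = build_proxies('127.0.0.1', 8080)
print(proxies)
# Usage with requests: requests.get(url, proxies=proxies, timeout=10)
```

Keeping the construction in one function makes it straightforward to switch proxies between runs without editing each request call.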