How to Scrape Song Information from the Kugou Music Website with Python and Selenium

LIY若依

Last modified 2024-07-29 20:41:24

Tags: selenium, web crawler, chrome

First published 2024-07-29 20:33:21

Article link: https://blog.csdn.net/m0_74972192/article/details/140780589


In this article, we walk through web scraping with Python, Selenium, and BeautifulSoup by building a simple scraper that collects song information from the Kugou Music website.

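Dependencies

We will use the following tools and libraries:

Python: the programming language
Selenium: a browser automation tool that can simulate a user's browsing behavior
BeautifulSoup: a Python library for parsing HTML and XML documents
urllib: a Python standard library package for working with URLs

The complete code is as follows: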

import urllib.parse
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headers that mimic a regular desktop browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Selenium does not read the headers dict by itself,
# so pass the User-Agent to Chrome explicitly
chrome_options.add_argument(f"--user-agent={headers['User-Agent']}")

# Create the browser object
driver = webdriver.Chrome(options=chrome_options)

# URL-encode the search keyword
keyword = input('Enter an artist name: ')
search_url = f'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord={urllib.parse.quote(keyword)}'

# Step 1: open the page with Selenium
driver.get(search_url)

# Get the page source
html_content = driver.page_source

# Close the browser
driver.quit()

# Step 2: parse the HTML to extract the song information
soup = BeautifulSoup(html_content, 'html.parser')
songs = soup.find_all('a', class_='song_name')

# Print the title of every song found
for song in songs:
    title = song.get('title')
    print(f'Title: {title}')

Output:

When run, the script searches the Kugou Music website for songs matching the artist name entered by the user and prints the song titles.

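Code walkthrough

Setting Chrome options

First, we configure a few Chrome options:

--headless: runs Chrome in headless mode, that is, in the background without a visible window.
--disable-gpu: disables GPU hardware acceleration.
--no-sandbox: disables Chrome's sandbox. This is often needed when Chrome runs as root, for example inside a container, at the cost of weaker isolation.
--disable-dev-shm-usage: stops Chrome from using /dev/shm for shared memory, which helps in environments such as containers where /dev/shm is small.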

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=chrome_options)

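Adding headers

We define headers that mimic a regular browser, because some websites block requests that lack them. Note that Selenium drives a real browser and does not read this dictionary on its own; for the value to take effect it has to be handed to Chrome, for example through the --user-agent command-line argument.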

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

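URL parameter encoding

The search keyword has to be encoded before it can be placed in the URL. We use the quote function from urllib.parse, which percent-encodes characters that are not safe in a URL, such as Chinese characters and spaces.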

import urllib.parse

keyword = input('Enter an artist name: ')
search_url = f'https://www.kugou.com/yy/html/search.html#searchType=song&searchKeyWord={urllib.parse.quote(keyword)}'
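
To see what quote actually does, here is a small standalone sketch that needs only the standard library. The helper build_search_url and the sample keywords are illustrative assumptions; the URL pattern is the one used in this article.

```python
import urllib.parse

SEARCH_PAGE = 'https://www.kugou.com/yy/html/search.html'


def build_search_url(keyword: str) -> str:
    """Build the search URL with the keyword percent-encoded (hypothetical helper)."""
    encoded = urllib.parse.quote(keyword)
    return f'{SEARCH_PAGE}#searchType=song&searchKeyWord={encoded}'


# Spaces and reserved characters are escaped so they cannot break the URL
print(urllib.parse.quote('a b&c'))  # a%20b%26c

# Non-ASCII text is encoded as UTF-8 bytes, each written as %XX
url = build_search_url('周杰伦')
print(url)

# Decoding the encoded part gives back the original keyword
encoded_part = url.split('searchKeyWord=')[1]
print(urllib.parse.unquote(encoded_part))  # 周杰伦
```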

Opening the page with Selenium

We use Selenium's get method to open the page.

driver.get(search_url)

Getting the page source

We use Selenium's page_source property to get the page source.

html_content = driver.page_source

Closing the browser

Once all browser operations are done, we close the browser.

driver.quit()
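
If an exception is raised before driver.quit() is reached, the Chrome process can be left running in the background. A minimal sketch of one way to guard against that is a try/finally wrapper; the helper name run_with_driver is hypothetical and not part of the original script.

```python
def run_with_driver(create_driver, action):
    """Create a driver, run action(driver), and always quit the driver."""
    driver = create_driver()
    try:
        return action(driver)
    finally:
        # Runs on success and on error, so no browser process is left behind
        driver.quit()


# Usage with the objects from this article (needs Chrome installed):
#   def fetch(driver):
#       driver.get(search_url)
#       return driver.page_source
#
#   html_content = run_with_driver(
#       lambda: webdriver.Chrome(options=chrome_options), fetch)
```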

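Parsing the HTML to extract the song information

We use BeautifulSoup to parse the HTML and extract the song information we need: every a tag with the class song_name, whose title attribute holds the song name.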

soup = BeautifulSoup(html_content, 'html.parser')
songs = soup.find_all('a', class_='song_name')

for song in songs:
    title = song.get('title')
    print(f'Title: {title}')
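
The parsing step can be tried offline, without a browser. The sketch below runs the same find_all call on a small hand-written HTML sample; the sample markup and the helper extract_titles are assumptions for illustration and are not copied from the real site. Unlike the loop above, it skips links that have no title attribute, which would otherwise print as None.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real page)
SAMPLE_HTML = """
<ul>
  <li><a class="song_name" title="Artist A - Song One" href="#">Song One</a></li>
  <li><a class="song_name" href="#">No title attribute</a></li>
  <li><a class="other" title="Not a song link" href="#">Ignore me</a></li>
  <li><a class="song_name" title="Artist A - Song Two" href="#">Song Two</a></li>
</ul>
"""


def extract_titles(html: str) -> list[str]:
    """Return the title attribute of every a.song_name link that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', class_='song_name')
    return [link['title'] for link in links if link.get('title')]


print(extract_titles(SAMPLE_HTML))
# ['Artist A - Song One', 'Artist A - Song Two']
```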

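Summary

That covers how to use Python, Selenium, and BeautifulSoup for web scraping. I hope you enjoyed the article and learned something new. If you have any questions or suggestions, feel free to leave a comment below. Thanks for reading!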

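Conclusion

Python, Selenium, and BeautifulSoup make a powerful combination for automatically extracting information from web pages. The example in this article showed how to scrape song information from the Kugou Music website, with the full code and a step-by-step walkthrough, and I hope it serves as a useful reference for your own scraping projects. Keep exploring and learning, and best of luck on your web scraping journey! 🚀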
