Web scraping in practice: Wang Feng lyrics
Install the required packages
pip install lxml
pip install bs4
Create a BeautifulSoup instance
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "a website which you want to get some datasets"
html = urlopen(url).read().decode('GB2312', errors="ignore")
soup = BeautifulSoup(html, features='lxml')
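The instance can be tried without any network access by parsing an inline HTML string (the markup below is made up for illustration). BeautifulSoup exposes tags as attributes of the parsed tree:

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, just to exercise the parser offline
html = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"
soup = BeautifulSoup(html, features='lxml')

print(soup.title.text)  # text of the <title> tag
print(soup.p.text)      # text of the first <p> tag
```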
Fixing garbled output
The codec you pass to decode() must match the actual encoding of the response. If they differ, check the real encoding first and then make the two consistent:
import requests

req = requests.get(url)
print(req.status_code)
print(req.encoding)           # encoding guessed from the response headers
print(req.apparent_encoding)  # encoding detected from the response body
print(requests.utils.get_encodings_from_content(req.text))
req.encoding = 'GB2312'
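The mismatch is easy to reproduce at the bytes level, with no network involved. Decoding GB2312 bytes with the wrong codec yields mojibake; decoding with the matching codec recovers the text:

```python
# Encode a Chinese string as GB2312 bytes
raw = '汪峰'.encode('gb2312')

wrong = raw.decode('latin-1')  # wrong codec: every byte maps to *some* char, so no error, just garbage
right = raw.decode('gb2312')   # matching codec: the original text comes back

print(wrong)
print(right)  # 汪峰
```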
Example
TASK: given the Wang Feng lyrics listing pages, how do we fetch all of his lyrics?
Extracting tag contents from the page
We want each song's title (the text inside the <a> tag) and its href. Inspecting the page shows both sit inside an <h2> carrying a class attribute, and there is more than one such title/link per page.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.geci345.com/tag/wf/"
html = urlopen(url).read().decode('utf-8', errors="ignore")
soup = BeautifulSoup(html, features='lxml')
item = soup.find_all('h2', {'class': 'entry-title'})
for i in item:
    print(f'title:{i.a.text}\tlink:{i.a["href"]}')
The result is not quite what we want: each title carries extra text, so use split to trim it off.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.geci345.com/tag/wf/"
html = urlopen(url).read().decode('utf-8', errors="ignore")
soup = BeautifulSoup(html, features='lxml')
item = soup.find_all('h2', {'class': 'entry-title'})
for i in item:
    print(f'title:{i.a.text.split("歌词")[0]}\tlink:{i.a["href"]}')
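The split-based cleanup (and the "汪峰" prefix removal used in the full code below) can be checked on plain strings; the sample titles here are illustrative, not taken from the live site:

```python
# Sample raw titles, in the shape they might appear on the page
raw_titles = ['汪峰 春天里歌词', '存在歌词-汪峰']

cleaned = []
for t in raw_titles:
    name = t.split('歌词')[0]  # drop "歌词" and everything after it
    name = name.replace('汪峰 ', '').replace('汪峰', '')  # drop the artist prefix
    cleaned.append(name)

print(cleaned)  # ['春天里', '存在']
```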
At this point another tricky problem shows up: some songs are loaded dynamically, i.e. the rest of the list only appears after you manually scroll down the page.
!!! I haven't learned how to handle that part yet, so I'll just skip it~
Full code
'''
Description: scrape Wang Feng's lyrics
Author: 365JHWZGo
Date: 2022-05-05 19:45:31
LastEditors: 365JHWZGo
LastEditTime: 2022-05-06 19:13:28
'''
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://www.geci345.com/tag/wf"
html = urlopen(url).read().decode('utf-8', errors="ignore")
soup = BeautifulSoup(html, features='lxml')

# collect all listing pages: the first page plus every numbered pagination link
pagenumber = [url]
temp = soup.find_all(lambda tag: tag.name == 'a' and tag.get('class') == ['page-numbers'])
for t in temp:
    pagenumber.append(t['href'])

# song titles
titles = []
# links to the lyric pages
content_link = []

# walk every listing page and gather title/link pairs
for page in pagenumber:
    html = urlopen(page).read().decode('utf-8', errors="ignore")
    soup = BeautifulSoup(html, features='lxml')
    item = soup.find_all('h2', {'class': 'entry-title'})
    for i in item:
        # trim the "歌词" suffix and the "汪峰" artist prefix from the title
        name = i.a.text.split("歌词")[0]
        name = name.replace('汪峰 ', '')
        name = name.replace('汪峰', '')
        # keep each song only once
        if name not in titles:
            titles.append(name)
            content_link.append(i.a["href"])

# fetch each lyric page and append its text to the output file
with open('./wangfenglyrics.txt', 'w', encoding='utf-8') as f:
    for link in content_link:
        html = urlopen(link).read().decode('utf-8', errors="ignore")
        soup = BeautifulSoup(html, features='lxml')
        lyrics = soup.find_all('div', {'class': 'single-content'})
        for ly in lyrics:
            f.write(ly.text)
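The trickiest selector in the full code is the lambda passed to find_all: tag.get('class') returns a list, so comparing it to ['page-numbers'] matches only plain page links and skips links that carry extra classes (e.g. a "next" button). This can be verified on an inline snippet; the markup here is invented for illustration, not copied from the live site:

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup: two plain page links and one "next" button
html = '''
<a class="page-numbers" href="/tag/wf/page/2">2</a>
<a class="page-numbers" href="/tag/wf/page/3">3</a>
<a class="next page-numbers" href="/tag/wf/page/2">下一页</a>
'''
soup = BeautifulSoup(html, features='lxml')

# tag.get('class') is a list of classes; an exact comparison with
# ['page-numbers'] excludes the "next" link, whose list has two entries
links = soup.find_all(lambda tag: tag.name == 'a' and tag.get('class') == ['page-numbers'])

print([a['href'] for a in links])  # ['/tag/wf/page/2', '/tag/wf/page/3']
```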