Web scraping in practice: Wang Feng lyrics
Install the required packages
pip install lxml
pip install bs4
Create a BeautifulSoup instance
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "a website which you want to get some datasets"
html = urlopen(url).read().decode('GB2312', errors="ignore")
soup = BeautifulSoup(html, features='lxml')
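The instance can be tried without any network access by parsing an inline HTML string (the markup below is made up for illustration). BeautifulSoup exposes tags as attributes of the parsed tree:

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, just to exercise the parser offline
html = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"
soup = BeautifulSoup(html, features='lxml')

print(soup.title.text)  # text of the <title> tag
print(soup.p.text)      # text of the first <p> tag
```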
Fixing garbled output
The codec you pass to decode() must match the actual encoding of the response. If they differ, check the real encoding first and then make the two consistent:
import requests

req = requests.get(url)
print(req.status_code)
print(req.encoding)           # encoding guessed from the response headers
print(req.apparent_encoding)  # encoding detected from the response body
print(requests.utils.get_encodings_from_content(req.text))
req.encoding = 'GB2312'
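The mismatch is easy to reproduce at the bytes level, with no network involved. Decoding GB2312 bytes with the wrong codec yields mojibake; decoding with the matching codec recovers the text:

```python
# Encode a Chinese string as GB2312 bytes
raw = '汪峰'.encode('gb2312')

wrong = raw.decode('latin-1')  # wrong codec: every byte maps to *some* char, so no error, just garbage
right = raw.decode('gb2312')   # matching codec: the original text comes back

print(wrong)
print(right)  # 汪峰
```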
Example
TASK: given the Wang Feng lyrics listing pages, how do we fetch all of his lyrics?
Extracting tag contents from the page
We want each song's title (the text inside the <a> tag) and its href. Inspecting the page shows both sit inside an <h2> carrying a class attribute, and there is more than one such title/link per page.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.geci345.com/tag/wf/"
html = urlopen(url).read().decode('utf-8', errors="ignore")
soup = BeautifulSoup(html, features='lxml')
item = soup.find_all('h2', {'class': 'entry-title'})
for i in item:
    print(f'title:{i.a.text}\tlink:{i.a["href"]}')
The result is not quite what we want: each title carries extra text, so use split to trim it off.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.geci345.com/tag/wf/"
html = urlopen(url).read().decode('utf-8', errors="ignore")
soup = BeautifulSoup(html, features='lxml')
item = soup.find_all('h2', {'class': 'entry-title'})
for i in item:
    print(f'title:{i.a.text.split("歌词")[0]}\tlink:{i.a["href"]}')
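The split-based cleanup (and the "汪峰" prefix removal used in the full code below) can be checked on plain strings; the sample titles here are illustrative, not taken from the live site:

```python
# Sample raw titles, in the shape they might appear on the page
raw_titles = ['汪峰 春天里歌词', '存在歌词-汪峰']

cleaned = []
for t in raw_titles:
    name = t.split('歌词')[0]  # drop "歌词" and everything after it
    name = name.replace('汪峰 ', '').replace('汪峰', '')  # drop the artist prefix
    cleaned.append(name)

print(cleaned)  # ['春天里', '存在']
```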
At this point another tricky problem shows up: some songs are loaded dynamically, i.e. the rest of the list only appears after you manually scroll down the page.
!!! I haven't learned how to handle that part yet, so I'll just skip it~
Full code
'''
Description: scrape Wang Feng's lyrics
Author: 365JHWZGo
Date: 2022-05-05 19:45:31
LastEditors: 365JHWZGo
LastEditTime: 2022-05-06 19:13:28
'''
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://www.geci345.com/tag/wf"
html = urlopen(url).read().decode('utf-8', errors="ignore")
soup = BeautifulSoup(html, features='lxml')

# collect all listing pages: the first page plus every numbered pagination link
pagenumber = [url]
temp = soup.find_all(lambda tag: tag.name == 'a' and tag.get('class') == ['page-numbers'])
for t in temp:
    pagenumber.append(t['href'])

# song titles
titles = []
# links to the lyric pages
content_link = []

# walk every listing page and gather title/link pairs
for page in pagenumber:
    html = urlopen(page).read().decode('utf-8', errors="ignore")
    soup = BeautifulSoup(html, features='lxml')
    item = soup.find_all('h2', {'class': 'entry-title'})
    for i in item:
        # trim the "歌词" suffix and the "汪峰" artist prefix from the title
        name = i.a.text.split("歌词")[0]
        name = name.replace('汪峰 ', '')
        name = name.replace('汪峰', '')
        # keep each song only once
        if name not in titles:
            titles.append(name)
            content_link.append(i.a["href"])

# fetch each lyric page and append its text to the output file
with open('./wangfenglyrics.txt', 'w', encoding='utf-8') as f:
    for link in content_link:
        html = urlopen(link).read().decode('utf-8', errors="ignore")
        soup = BeautifulSoup(html, features='lxml')
        lyrics = soup.find_all('div', {'class': 'single-content'})
        for ly in lyrics:
            f.write(ly.text)
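The trickiest selector in the full code is the lambda passed to find_all: tag.get('class') returns a list, so comparing it to ['page-numbers'] matches only plain page links and skips links that carry extra classes (e.g. a "next" button). This can be verified on an inline snippet; the markup here is invented for illustration, not copied from the live site:

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup: two plain page links and one "next" button
html = '''
<a class="page-numbers" href="/tag/wf/page/2">2</a>
<a class="page-numbers" href="/tag/wf/page/3">3</a>
<a class="next page-numbers" href="/tag/wf/page/2">下一页</a>
'''
soup = BeautifulSoup(html, features='lxml')

# tag.get('class') is a list of classes; an exact comparison with
# ['page-numbers'] excludes the "next" link, whose list has two entries
links = soup.find_all(lambda tag: tag.name == 'a' and tag.get('class') == ['page-numbers'])

print([a['href'] for a in links])  # ['/tag/wf/page/2', '/tag/wf/page/3']
```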