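Learning always leaves you with something rewarding and fun, so let's take a look at today's project: scraping the KuGou music chart!
First, looking at the page source of the KuGou chart, we can see the following:
The parts marked in red are the information we need for this scrape. By inspecting the source we can find which tags they sit in, and then extract them with BeautifulSoup's select method.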
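To make the select step concrete, here is a minimal sketch run on a hand-written HTML fragment. The class names and selectors are the ones the script in this post uses, but the fragment itself and the sample entries are made up for illustration, and the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; the live page may differ.
html = """
<div class="pc_temp_songlist">
  <ul>
    <li>
      <span class="pc_temp_num">1</span>
      <a href="#">歌手甲 - 歌曲甲</a>
      <span class="pc_temp_time">3:45</span>
    </li>
    <li>
      <span class="pc_temp_num">2</span>
      <a href="#">Singer B - Song-B</a>
      <span class="pc_temp_time">4:02</span>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
links = soup.select("div.pc_temp_songlist > ul > li > a")
nums = soup.select("span.pc_temp_num")

rows = []
for num, link in zip(nums, links):
    # The link text is assumed to read "singer - song"; split on the
    # first hyphen only, so a hyphen inside the title is preserved.
    singer, _, title = link.get_text().partition("-")
    rows.append((num.get_text().strip(), title.strip(), singer.strip()))

print(rows)
```

Splitting with partition rather than split("-")[1] is a choice made for this sketch: it does not raise an error when the text contains no hyphen, and it keeps the full title when the title itself contains one.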
Here is the code:
import requests
from bs4 import BeautifulSoup
import time
def get_HTML(url):
    # Send browser-like headers with every request
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36",
        "referer": "https://www.kugou.com/yy/rank/home/1-6666.html?from=rank"
    }
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Return an empty page on any network or HTTP error
        return ""
def explain_HTML(mylist, html):
    soup = BeautifulSoup(html, 'html.parser')
    songs = soup.select('div.pc_temp_songlist > ul > li > a')
    ranks = soup.select('span.pc_temp_num')
    times = soup.select('span.pc_temp_time')
    # The loop variable is named "duration" so it does not shadow the time module
    for rank, song, duration in zip(ranks, songs, times):
        # The link text reads "singer - song"; split on the first hyphen only
        singer, _, title = song.get_text().partition("-")
        data = [
            rank.get_text().strip(),
            title.strip(),
            singer.strip(),
            duration.get_text().strip()
        ]
        mylist.append(data)
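def print_HTML(mylist):
    # Open the file once and write at most the first 500 rows
    with open("F:/KuGou.text", 'a', encoding='UTF-8') as f:
        for x in mylist[:500]:
            # Writing through format drops the list's brackets and quotes, so the output is cleaner
            # chr(12288) is the full-width (Chinese) space; padding with it aligns Chinese text better
            # Some songs contain half-width digits and letters, though, so the columns are still not perfectly aligned
            f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format(x[0], x[1], x[2], x[3], chr(12288)))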
if __name__ == '__main__':
    url_0 = 'http://www.kugou.com/yy/rank/home/'
    url_1 = '-8888.html'
    mylist = []
    # Header row: rank, song, singer, duration
    with open("F:/KuGou.text", 'a', encoding="UTF-8") as f:
        f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format("排名", "歌曲", "歌手", "时间", chr(12288)))
    for j in range(1, 24):
        url = url_0 + str(j) + url_1
        html = get_HTML(url)
        explain_HTML(mylist, html)
        time.sleep(1)  # Throttle the crawler so requests are not sent too quickly and fail
    # Write the rows only after every page has been collected
    print_HTML(mylist)
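Because of the issue noted in the comments above, the formatting of the output is not perfect, but if the scraped information were all Chinese, the columns should line up neatly.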
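One possible way around the mixed-width problem, which is not what the script above does, is to pad by display width instead of by character count. The sketch below uses the standard library's unicodedata.east_asian_width and assumes a monospaced font in which wide and full-width characters take two columns. The helper names display_width and pad_to_width and the sample rows are hypothetical.

```python
import unicodedata


def display_width(text):
    """Count columns: wide ("W") and full-width ("F") characters take 2, others 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_to_width(text, width):
    """Right-pad text with ASCII spaces until it fills `width` columns."""
    return text + " " * max(0, width - display_width(text))


# Made-up sample rows: rank, song, singer, duration.
rows = [
    ["1", "歌曲甲", "歌手甲", "3:45"],
    ["2", "Song B", "Singer B", "4:02"],
    ["3", "歌曲C remix", "歌手丙", "2:58"],
]

# Column widths in display columns, chosen arbitrarily for this sketch.
widths = [6, 24, 20, 6]

lines = [
    "".join(pad_to_width(cell, w) for cell, w in zip(row, widths))
    for row in rows
]
for line in lines:
    print(line)
```

With this approach each padded cell occupies the same number of columns whether it holds Chinese characters, Latin letters, or a mix, so chr(12288) is no longer needed as the fill character.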