Scraping the Douban Music Top 250 with XPath


灯下夜无眠

Category: Web scraping

Article link: https://blog.csdn.net/LLMUZI123456789/article/details/111775546


This scraper is just a quick refresher on how to use requests and XPath, written down for future reference.

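# Import the required libraries
import requests
from lxml import etree
import time

# Build the page URLs: 10 pages of 25 entries each (start=0, 25, ..., 225)
urls = ["https://music.douban.com/top250?start={}".format(str(i)) for i in range(0, 250, 25)]
# Send a browser-like User-Agent request header
headers = {'user-agent':'Mozilla/5.0'}
# Create an empty list to collect the records
lit = []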
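# Iterate over every page URL
for url in urls:
    # Send the request
    res = requests.get(url=url, headers=headers)
    # Parse the HTML page
    selector = etree.HTML(res.text)
    # Select every table inside the "indent" div (one table per entry)
    info_list = selector.xpath('//div[@class="indent"]/table')
    # Extract the target fields from each table
    for i in info_list:
        # Create an empty dict to hold one record
        dic = {}
        # Get the title (surrounding whitespace is cleaned up later)
        dic['name'] = i.xpath('.//div/a/text()')[0]
        # Get the info line, whose fields are separated by "/"
        info = i.xpath('.//div/p/text()')[0]
        # Get the artist name (first field)
        dic['songer'] = info.split('/')[0].strip()
        # Get the release date (second field)
        dic['date'] = info.split('/')[1].strip()
        # Get the genre (last field)
        dic['song_type'] = info.split('/')[-1].strip()
        # Append the record to the list
        lit.append(dic)
    # Pause for 3 seconds after each page
    time.sleep(3)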
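
The relative XPath lookups in the loop can be tried offline on a small inline snippet. The sketch below is only an illustration under several assumptions: it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) as a stand-in for lxml, the `SAMPLE` markup is a hypothetical, simplified imitation of the page rather than Douban's real HTML, and `parse_info` and `extract_rows` are helper names introduced here, not part of the original script. Unlike the loop above, `parse_info` falls back to an empty string when the info line has fewer fields than expected.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup imitating two entries of one chart page.
SAMPLE = """
<html><body>
<div class="indent">
  <table><tr><td>
    <div class="pl2">
      <a href="#">
        Viva La Vida
      </a>
      <p class="pl">Coldplay / 2008-06-17 / Import / Audio CD / 摇滚</p>
    </div>
  </td></tr></table>
  <table><tr><td>
    <div class="pl2">
      <a href="#">
        范特西
      </a>
      <p class="pl">周杰伦 / 2001-09-14 / 流行</p>
    </div>
  </td></tr></table>
</div>
</body></html>
"""


def parse_info(info):
    """Split an 'artist / date / ... / genre' line into stripped fields."""
    parts = [p.strip() for p in info.split('/')]
    return {
        'songer': parts[0],
        # Fall back to '' instead of raising IndexError on short lines.
        'date': parts[1] if len(parts) > 1 else '',
        'song_type': parts[-1] if len(parts) > 2 else '',
    }


def extract_rows(markup):
    """Return one dict per table found under the div with class "indent"."""
    root = ET.fromstring(markup.strip())
    rows = []
    for table in root.findall('.//div[@class="indent"]/table'):
        # './/' searches relative to the current table, not the whole page.
        name = table.find('.//div/a').text
        info = table.find('.//div/p').text
        row = {'name': name.strip()}
        row.update(parse_info(info))
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

ElementTree needs well-formed markup and has no `text()` step, so the text is read through the `.text` attribute. For real pages, lxml's `etree.HTML` as used in the script remains the more forgiving choice.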

# Convert the data to a DataFrame
import pandas as pd
df = pd.DataFrame(lit)
df.head()

	name	songer	date	song_type
0	\n We Sing. We Dance. We Steal Thin...	Jason Mraz	2008-05-13	民谣
1	\n Viva La Vida\n	Coldplay	2008-06-17	摇滚
2	\n 华丽的冒险\n	陈绮贞	2005-09-23	流行
3	\n 范特西\n	周杰伦	2001-09-14	流行
4	\n 後。青春期的詩\n	五月天	2008-10-23	摇滚

# Light data cleaning: strip leading and trailing whitespace from the titles
df['name'] = df['name'].str.strip()
df.head()

	name	songer	date	song_type
0	We Sing. We Dance. We Steal Things.	Jason Mraz	2008-05-13	民谣
1	Viva La Vida	Coldplay	2008-06-17	摇滚
2	华丽的冒险	陈绮贞	2005-09-23	流行
3	范特西	周杰伦	2001-09-14	流行
4	後。青春期的詩	五月天	2008-10-23	摇滚
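
The same cleanup can also be done on the list of dicts before it becomes a DataFrame. The following is a minimal sketch in plain Python, assuming records shaped like the ones collected above. `normalize` and `clean_records` are hypothetical helper names, and the `raw` record is a made-up sample modeled on the uncleaned output. Unlike `strip()`, it also collapses runs of whitespace inside a value.

```python
def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return ' '.join(text.split())


def clean_records(records):
    """Return new records with every string value normalized."""
    return [{key: normalize(value) for key, value in rec.items()}
            for rec in records]


# Made-up sample record with the kind of stray whitespace seen in the raw titles.
raw = [{
    'name': '\n            Viva La Vida\n          ',
    'songer': 'Coldplay ',
    'date': ' 2008-06-17 ',
    'song_type': ' 摇滚',
}]

cleaned = clean_records(raw)
print(cleaned[0])
```

Passing `cleaned` to `pd.DataFrame` would then give tidy columns without a separate cleaning step.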