Python Web Scraping in Practice (3): Scraping the Douban Music Top 250 Data (Very Detailed)

weixin_39630498

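Preface

Let's first recap the previous two articles in this scraping series:

Part 1: covered requests, bs4, and some basic web page operations.

Part 2: used regular expressions via the re module.

Today we will scrape with the lxml library and XPath syntax.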

1. Install the lxml library

Windows: install it directly with pip. Make sure pip is on your PATH, or run the command from the directory where pip is installed.

pip install lxml

2. XPath syntax

If you are not familiar with XPath syntax, see the tutorial at:

http://www.w3school.com.cn/xpath/index.asp
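As a quick warm-up, here is a minimal sketch of running XPath queries with lxml on an inline HTML string. The markup in SAMPLE and the helper name extract_rows are hypothetical, written only for this illustration; they loosely imitate a table of rows and are not the real Douban page.

```python
from lxml import etree

# Hypothetical markup, made up for this illustration only.
SAMPLE = """
<table>
  <tr class="item">
    <td>cover</td>
    <td><div><a href="#">Album A</a><p>Artist A / 2001</p></div></td>
  </tr>
  <tr class="item">
    <td>cover</td>
    <td><div><a href="#">Album B</a><p>Artist B / 2002</p></div></td>
  </tr>
  <tr class="other"><td>ignored</td><td>ignored</td></tr>
</table>
"""


def extract_rows(markup):
    """Return (title, info) pairs for every <tr class="item"> row."""
    tree = etree.HTML(markup)
    # Absolute query: select every matching row anywhere in the document.
    rows = tree.xpath("//tr[@class='item']")
    results = []
    for row in rows:
        # Relative queries: paths start from the current row element.
        titles = row.xpath("td[2]/div/a/text()")
        infos = row.xpath("td[2]/div/p/text()")
        results.append((titles[0], infos[0]))
    return results


print(extract_rows(SAMPLE))
```

Note that text() queries return a list of strings, which is why the sketch indexes with [0].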

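Scraping in practice

First, a partial look at the results:

[Screenshot: partial output of the scraper]

Today we will scrape the Douban Music Top 250 data.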

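1. Observe how the URL changes between pages

https://music.douban.com/top250?start=0

https://music.douban.com/top250?start=25

https://music.douban.com/top250?start=50

The pattern is clear: the start parameter grows by 25 with each page, so the 250 entries span 10 pages, from start=0 to start=225.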
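The paging pattern can be sketched in a few lines. The names BASE_URL, PAGE_SIZE, TOTAL, and build_urls are chosen here for illustration; only the URL template itself comes from the pages listed above.

```python
# URL template taken from the page addresses above; the constant and
# function names are illustrative.
BASE_URL = "https://music.douban.com/top250?start={}"
PAGE_SIZE = 25   # entries per page
TOTAL = 250      # entries in the whole chart


def build_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page, stepping the start offset by page_size."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


urls = build_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Because range(0, 250, 25) stops before 250, the last offset generated is 225, which is the tenth and final page.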

2. Scrape the title, summary, and rating of each entry on Douban Music. The complete scraper code is as follows:

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
}


def getResult():
    # Build the 10 page URLs: start=0, 25, ..., 225
    urls = ["https://music.douban.com/top250?start={}".format(i) for i in range(0, 250, 25)]
    for url in urls:
        data = requests.get(url, headers=headers)
        html = etree.HTML(data.text)
        # Each entry sits in its own <tr class="item"> row
        rows = html.xpath("//tr[@class='item']")
        for info in rows:
            # normalize-space() returns one string (not a list) with whitespace collapsed,
            # so the title can be used directly without wrapping it in a list
            title = info.xpath("normalize-space(td[2]/div/a/text())")  # title
            star = info.xpath("td[2]/div/div/span[2]/text()")  # rating
            brief_introduction = info.xpath("td[2]/div/p//text()")  # summary
            # The last two queries return lists; skip the row if either is empty
            if not star or not brief_introduction:
                continue
            # Build the result record
            result = {
                "title": title,
                "star": star[0],
                "brief_introduction": brief_introduction[0],
            }
            print(result)


if __name__ == '__main__':
    getResult()

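Analysis:

The urls list in the code is built so that we can loop over every page URL.

If XPath is unfamiliar, take a look at the language itself; it is fairly simple and clear, and very convenient to use.

normalize-space normalizes whitespace by stripping leading and trailing whitespace and replacing each run of whitespace characters with a single space. If the argument is omitted, the string value of the context node is normalized and returned.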
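To make the normalize-space behavior concrete, here is a small sketch on a made-up HTML fragment (the markup and variable names are assumptions for illustration). It also shows one practical consequence: the function returns a single string rather than a list, so passing it straight to zip() would iterate it character by character.

```python
from lxml import etree

# Hypothetical fragment with messy whitespace inside the link text.
tree = etree.HTML("<div><a>\n   Some   Album \n  Title  </a></div>")

# text() returns a list of raw text nodes, whitespace included.
raw = tree.xpath("//a/text()")

# normalize-space() returns ONE string: trimmed, with inner runs of
# whitespace collapsed to single spaces.
clean = tree.xpath("normalize-space(//a/text())")

print(raw)
print(clean)

# zip() over a bare string pairs items with its characters ...
per_char = list(zip(["9.0"], clean))
# ... while wrapping the string in a list pairs them with the whole title.
whole = list(zip(["9.0"], [clean]))
print(per_char)
print(whole)
```

When a single string is all that is needed, it is usually simpler to use the value directly instead of zipping it with other lists.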

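Those are basically the tricky parts. If anything is unclear, feel free to ask me directly. You can also try scraping other data; the more you type and practice, the better!

I hope this helps those who are just getting started!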
