Python: Scraping MP3 Files from a Web Page

yuerwen_python

Category: Notes. Tags: python, web scraping, regular expressions

Article link: https://blog.csdn.net/weixin_41887201/article/details/121133292

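Target page: http://www.listeningexpress.com/studioclassroom/ad/

Goal: download the MP3 files linked from the target page.

Approach:
1. Fetch the page source with the requests library.
1.1 requests.get(scr) fetches the page and returns a response object.
1.2 requests.get(scr).text gives the HTML source as a string.
2. Use a regular expression to extract the MP3 file names from the source.
3. Concatenate the base URL and each file name into a full download URL.
4. Call the third-party wget tool to download each file: os.system(f'{wget_scr} {fullAddr}')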
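Step 2 can be tried offline before touching the network. The sketch below runs the same pattern on a small HTML fragment. The fragment and its second file name are made up for illustration; the real page markup is assumed to look roughly like this, based only on the regular expression used in this post.

```python
import re

# Hypothetical HTML fragment, invented for illustration only.
# It assumes each MP3 link on the page is written as javascript:p('<file name>').
sample_html = """
<a href="javascript:p('2019-04-13 NEWSworthy Clips.mp3')">Clip 1</a>
<a href="javascript:p('2019-04-15 Example.mp3')">Clip 2</a>
"""

# Same pattern as in the script: the non-greedy group (.*?) captures
# everything between p(' and the next single quote.
pattern = re.compile(r"javascript:p\('(.*?)'")

# findall returns only the captured group, one string per match.
names = pattern.findall(sample_html)
print(names)
```

Because the pattern has exactly one capture group, findall returns a flat list of file names rather than full matches.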

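import os
import re
import requests
from urllib.parse import quote

wget_scr = r'D:\tools\wget'
target_scr = r'http://www.listeningexpress.com/studioclassroom/ad/'

# 1. Fetch the HTML page with the requests library
ret = requests.get(target_scr)
# ret.text is the response body decoded to a string
content = ret.text

# 2. Regular expression that captures the MP3 file name inside javascript:p('...')
p = re.compile(r"javascript:p\('(.*?)'")
# findall returns a list of the captured file names
MP3_list = p.findall(content)

for scr in MP3_list:
    # 3. Build the full URL; quote() URL-encodes spaces in the file name
    fullAddr = target_scr + quote(scr)
    # 4. Run wget to download the file
    os.system(f'{wget_scr} {fullAddr}')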
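The loop above hands wget a single command string via os.system. One possible variation, not the method used in this post, is to pass the command as an argument list to subprocess.run, so no shell is involved in splitting the line. The helper names build_wget_command and download_all are hypothetical and exist only for this sketch; the wget path and base URL are the ones from the script.

```python
import subprocess
from urllib.parse import quote


def build_wget_command(wget_path, base_url, file_name):
    """Return the argument list for one wget download (hypothetical helper)."""
    # Same URL construction as the script: base URL + URL-encoded file name.
    url = base_url + quote(file_name)
    return [wget_path, url]


def download_all(wget_path, base_url, file_names):
    """Run wget once per file name (hypothetical helper, not called here)."""
    for name in file_names:
        # check=False: a failed download does not stop the remaining ones.
        subprocess.run(build_wget_command(wget_path, base_url, name), check=False)


# Build (but do not run) the command for one file name.
cmd = build_wget_command(
    r'D:\tools\wget',
    'http://www.listeningexpress.com/studioclassroom/ad/',
    '2019-04-13 NEWSworthy Clips.mp3',
)
print(cmd)
```

With the list form, each element reaches wget as one argument, which is a second safeguard in addition to URL encoding.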

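Some MP3 file names contain spaces. When building the download URL, the file name must be URL-encoded; otherwise the shell treats each space as a command-line argument separator and wget receives a broken URL. For example: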

>>> from urllib.parse import quote
>>> quote('2019-04-13 NEWSworthy Clips.mp3')
'2019-04-13%20NEWSworthy%20Clips.mp3'
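Two further properties of quote may be worth checking when file names are less regular: the encoding is reversible with unquote, and '/' is left unencoded by default. The path 'ad/My Clip.mp3' below is a made-up example, not a file from the target page.

```python
from urllib.parse import quote, unquote

name = '2019-04-13 NEWSworthy Clips.mp3'
encoded = quote(name)

# unquote reverses the encoding and gives back the original file name.
decoded = unquote(encoded)
print(encoded, decoded)

# By default quote treats '/' as safe, so path separators are kept.
with_slash = quote('ad/My Clip.mp3')

# Passing safe='' encodes '/' as well, which suits a single path segment.
no_safe = quote('ad/My Clip.mp3', safe='')
print(with_slash, no_safe)
```

For this post's use case the default is enough, since each captured name is appended to a base URL that already ends in '/'.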
