Batch-downloading MP3 audio files directly, without requesting the web page

爱吃零食的水泥大仙

Category: Notes. Tags: python, data mining, web crawler

Article link: https://blog.csdn.net/qq_37268093/article/details/109626776

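Requirement:

Using the list of character readings stored in a dictionary data table, download the pronunciation audio file for each individual Chinese character.

Target site: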

https://hanyu.baidu.com/

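Page analysis:

Press F12 to open the developer tools.
Since the target is an audio file, look for it directly under the Media filter. If the Media list is empty, click the small speaker icon on the page and the file will show up. Open the request URL listed in the request's header information, and it turns out to be exactly the audio file we need.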
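The manual check in the browser can also be scripted. The following is only a minimal sketch, assuming the server answers a HEAD request with status 200 and an audio/* Content-Type (the article does not say which headers are actually returned). It uses the standard library's urllib rather than requests so it runs without extra installs, and `looks_like_audio` and `check_url` are hypothetical helper names.

```python
from urllib.request import Request, urlopen


def looks_like_audio(status_code, content_type):
    """Return True if a response looks like an audio file.

    Assumption: the server labels MP3 files with an audio/* Content-Type.
    """
    return status_code == 200 and (content_type or "").lower().startswith("audio/")


def check_url(url, timeout=10):
    """Send a HEAD request and inspect the headers without downloading the body.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return looks_like_audio(resp.status, resp.headers.get("Content-Type"))
```

Calling `check_url` with a media URL copied from the developer tools needs network access. If the server rejects HEAD requests, a normal GET would be needed instead.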

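Look up a few more characters and compare the file addresses to find the pattern.
云: https://fanyiapp.cdn.bcebos.com/zhdict/mp3/yun2.mp3
牛: https://fanyiapp.cdn.bcebos.com/zhdict/mp3/niu2.mp3
...
The last part of each URL is the character's pinyin (yun, niu) followed by a digit for its tone: the digit n stands for the n-th tone.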
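The pattern can be written down as a small helper. This is a sketch based only on the two URLs observed above; `build_mp3_url` is a hypothetical name, and how the site encodes a neutral tone or the letter ü is not covered by the article.

```python
# Prefix shared by the two observed URLs
BASE_URL = "https://fanyiapp.cdn.bcebos.com/zhdict/mp3/"


def build_mp3_url(pinyin, tone):
    """Join a pinyin syllable and a tone number into the observed URL shape."""
    return f"{BASE_URL}{pinyin}{tone}.mp3"


print(build_mp3_url("yun", 2))  # https://fanyiapp.cdn.bcebos.com/zhdict/mp3/yun2.mp3
print(build_mp3_url("niu", 2))  # https://fanyiapp.cdn.bcebos.com/zhdict/mp3/niu2.mp3
```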

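Approach:

1. Connect to the database and fetch the pinyin names.
2. Build each file URL directly and download the files in a loop. There is no need to request the main page with spoofed headers.
3. Save the files.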

Implementation:

1. Connect to the database and fetch the pinyin names
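The database table stores the dictionary entries. Downloaded files have a naming problem: Chinese has many homophones, so each file name must also contain the character itself to tell the files apart. What we need are the character and its pinyin, that is, the chinese and nicksounds columns, so a multi-table join has to be set up (the query shown here reads both columns from the single table cr_dictcnold).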

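import pymysql

# Connect to the database. Keyword arguments are used because recent
# PyMySQL releases no longer accept positional ones.
conn = pymysql.connect(host="localhost", user="root", password="123456", database="sys")
cursor = conn.cursor()
# Fetch every character together with its reading(s)
sql = "SELECT chinese, nicksounds FROM cr_dictcnold"
cursor.execute(sql)
results = cursor.fetchall()
print(results)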

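2. Iterate over the query results and split the readings. Some characters are polyphonic: according to the table, the string returned for a single character can contain more than one reading. Concatenating the query result straight into a URL would therefore fail for polyphonic characters.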

for row in results:
    chinese = row[0]  # each row is a tuple; its first element is the character
    print(chinese)
    # The second element holds the pinyin with tone. For a polyphonic
    # character the readings are separated by commas, so split on the comma.
    yin = row[1].split(',')
    print(yin)
    for intonation in yin:  # build one file URL per reading
        mp3_url = "https://fanyiapp.cdn.bcebos.com/zhdict/mp3/" + intonation + ".mp3"

The output of a run shows that each reading has been extracted separately.
3. Save the file
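To make the splitting step concrete without a database, here is a small sketch. It assumes the nicksounds column separates readings with an ASCII comma, possibly with stray spaces; the sample rows are hypothetical and only mimic the (chinese, nicksounds) shape of the query result, and `split_readings` is a hypothetical helper name.

```python
def split_readings(nicksounds):
    """Split a comma-separated reading string, dropping blanks and stray spaces."""
    return [part.strip() for part in nicksounds.split(",") if part.strip()]


# Hypothetical rows in the (chinese, nicksounds) shape returned by the query
sample_rows = [("云", "yun2"), ("长", "chang2, zhang3")]

for chinese, nicksounds in sample_rows:
    for reading in split_readings(nicksounds):
        # One line per (character, reading) pair
        print(chinese, reading)
```

Stripping whitespace matters because a reading such as " zhang3" would otherwise end up inside the URL and the file name.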

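import os
import requests

def Saving(url, name, intonation):
    # Naming a polyphonic character's files by the character alone would give
    # duplicate names, and the later file would be skipped. The file name
    # therefore combines the character and the reading.
    root = 'E:\\Intonation'  # storage location
    path = os.path.join(root, name + str(intonation) + '.mp3')
    try:
        os.makedirs(root, exist_ok=True)
        if os.path.exists(path):
            print("File already exists")
            return
        print(url)
        r = requests.get(url, timeout=10)
        print(r)
        r.raise_for_status()  # do not save an error page as an .mp3
        with open(path, 'wb') as f:  # the with block closes the file itself
            f.write(r.content)
        print(name + " saved successfully")
    except (requests.RequestException, OSError) as e:
        print("Download failed:", e)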
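The naming rule can be checked in isolation. The sketch below only builds paths and downloads nothing; `target_path` is a hypothetical helper, "Intonation" is a hypothetical relative folder, and the two readings of 长 are used as sample input.

```python
import os


def target_path(root, name, intonation):
    """Build the output path from the character and one of its readings."""
    return os.path.join(root, f"{name}{intonation}.mp3")


# "Intonation" is a hypothetical relative folder used only for this demo
paths = [target_path("Intonation", "长", r) for r in ("chang2", "zhang3")]
print(paths)
```

Because the reading is part of the file name, the two files of a polyphonic character no longer collide.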

That completes all the code sections.

Full code:

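import os

import pymysql
import requests

# Connect to the database. Keyword arguments are used because recent
# PyMySQL releases no longer accept positional ones.
conn = pymysql.connect(host="localhost", user="root", password="123456", database="sys")
cursor = conn.cursor()
sql = "SELECT chinese, nicksounds FROM cr_dictcnold"
cursor.execute(sql)
results = cursor.fetchall()
print(results)
conn.close()


def Saving(url, name, intonation):
    # The file name combines the character and the reading so that the
    # files of a polyphonic character do not overwrite each other.
    root = 'E:\\Intonation'  # storage location
    path = os.path.join(root, name + str(intonation) + '.mp3')
    try:
        os.makedirs(root, exist_ok=True)
        if os.path.exists(path):
            print("File already exists")
            return
        print(url)
        r = requests.get(url, timeout=10)
        print(r)
        r.raise_for_status()  # do not save an error page as an .mp3
        with open(path, 'wb') as f:  # the with block closes the file itself
            f.write(r.content)
        print(name + " saved successfully")
    except (requests.RequestException, OSError) as e:
        print("Download failed:", e)


for row in results:
    chinese = row[0]  # the character
    print(chinese)
    yin = row[1].split(',')  # one entry per reading
    print(yin)
    for intonation in yin:
        mp3_url = "https://fanyiapp.cdn.bcebos.com/zhdict/mp3/" + intonation + ".mp3"
        print(mp3_url)
        Saving(mp3_url, chinese, intonation)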