[Project Walkthrough] WaveNet Code Analysis: audio_reader.py [Work in Progress]

FallenDarkStar

Article link: https://blog.csdn.net/weixin_42721167/article/details/113112622

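Introduction

This project is a speech synthesis project based on the WaveNet generative neural network architecture, implemented in TensorFlow (project repository).

The WaveNet architecture generates raw audio waveforms directly and has shown excellent results in text-to-speech and general audio generation (see the detailed introduction to WaveNet).

Because the WaveNet project is fairly large and contains a lot of code, this series walks through it file by file, following the structure of the project, to make it easier to study and organize.

This article covers the audio_reader.py file: the audio reading script.

Code walkthrough

Global variables

The following variable is a module-level global in audio_reader.py.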

		FILE_PATTERN = r'p([0-9]+)_([0-9]+)\.wav'
		# r'...' is a raw string: backslashes are not treated as escape characters
		# Matches strings of the form: "p" + digits + "_" + digits + ".wav"
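
As a quick check of what the pattern captures, the sketch below applies it to two made-up paths. The file names are illustrative assumptions, not taken from the project.

```python
import re

FILE_PATTERN = r'p([0-9]+)_([0-9]+)\.wav'
pattern = re.compile(FILE_PATTERN)

# Hypothetical paths, used only for illustration.
matching = 'corpus/p225/p225_001.wav'
non_matching = 'corpus/misc/intro.wav'

# findall returns one tuple of captured groups per match:
# (speaker id, recording id), both still strings.
print(pattern.findall(matching))      # [('225', '001')]
print(pattern.findall(non_matching))  # []
```

The first group is what the rest of the script treats as the speaker id; an empty list signals a file name without an id.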

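Functions

find_files(directory, pattern='*.wav')

The following code finds all files whose names end in ".wav".
os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up.

	def find_files(directory, pattern='*.wav'):
	    ''' Recursively find all files matching the pattern '''
	    files = []
	    
	    # root: path of the directory currently being visited
	    # dirnames: names of the subdirectories in that directory
	    # filenames: names of the files in that directory
	    for root, dirnames, filenames in os.walk(directory):
	        # Keep only the file names that match the shell-style pattern
	        for filename in fnmatch.filter(filenames, pattern):
	            # Build the full file path
	            files.append(os.path.join(root, filename))
	    
	    # Return the list of file paths
	    return files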
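
To see the recursive search in action, here is a minimal self-contained sketch that runs the same function on a temporary directory. The directory layout and file names are invented for the example.

```python
import fnmatch
import os
import tempfile

def find_files(directory, pattern='*.wav'):
    files = []
    for root, _dirnames, filenames in os.walk(directory):
        for filename in fnmatch.filter(filenames, pattern):
            files.append(os.path.join(root, filename))
    return files

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one .wav in a subdirectory, one at the top
    # level, and one file that should be ignored.
    os.makedirs(os.path.join(tmp, 'p225'))
    for rel in ('p225/p225_001.wav', 'p225/notes.txt', 'top.wav'):
        open(os.path.join(tmp, *rel.split('/')), 'w').close()
    # Report paths relative to the temp dir, with '/' separators.
    found = sorted(os.path.relpath(p, tmp).replace(os.sep, '/')
                   for p in find_files(tmp))

print(found)  # ['p225/p225_001.wav', 'top.wav']
```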

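get_category_cardinality(files)

The following code iterates over all the given files and finds the smallest and largest speaker id.

	def get_category_cardinality(files):
	    # Matches strings of the form: "p" + digits + "_" + digits + ".wav"
	    id_reg_expression = re.compile(FILE_PATTERN)
	    
	    # Initialize the minimum and maximum id
	    min_id = None
	    max_id = None
	    
	    # Iterate over all the given files
	    for filename in files:
	        # Match the file name and take the first match
	        # (raises IndexError if the name does not match the pattern)
	        matches = id_reg_expression.findall(filename)[0]
	        # id is the first number and identifies the speaker;
	        # recording_id is the second number and identifies the utterance
	        id, recording_id = [int(id_) for id_ in matches]
	        # Track the smallest and largest speaker id
	        if min_id is None or id < min_id:
	            min_id = id
	        if max_id is None or id > max_id:
	            max_id = id
	
	    # Return the smallest and largest speaker id
	    return min_id, max_id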
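
The same result can be sketched more compactly with min() and max(). The function name speaker_id_range and the file names below are hypothetical. Unlike the original, this version raises ValueError on an empty list instead of returning (None, None). Like the original, it raises IndexError for a name that does not match the pattern.

```python
import re

FILE_PATTERN = r'p([0-9]+)_([0-9]+)\.wav'

def speaker_id_range(files):
    """Compact variant of get_category_cardinality (hypothetical name)."""
    pattern = re.compile(FILE_PATTERN)
    # findall(...)[0] is the (speaker, recording) tuple; [0] is the speaker.
    ids = [int(pattern.findall(f)[0][0]) for f in files]
    return min(ids), max(ids)

# Illustrative file names only.
files = ['p225_001.wav', 'p376_012.wav', 'p230_003.wav']
print(speaker_id_range(files))  # (225, 376)
```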

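randomize_files(files)

The following code builds a generator that yields randomly chosen files from the given file list. Because each index is drawn independently, files are sampled with replacement: one pass yields len(files) paths, but some files may appear more than once and others not at all.

	def randomize_files(files):
	    # Loop once per file in the list
	    for file in files:
	        # Draw a random integer to use as the file index
	        file_index = random.randint(0, (len(files) - 1))
	        # Yield the file at the chosen position
	        yield files[file_index]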
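
The sketch below contrasts this sampling-with-replacement behavior with a shuffle-based generator. The shuffled_files function is a hypothetical alternative shown only for comparison; it is not part of the project.

```python
import random

def randomize_files(files):
    # Same logic as above: one independent random draw per iteration.
    for _ in files:
        yield files[random.randint(0, len(files) - 1)]

def shuffled_files(files):
    """Hypothetical alternative: every file exactly once per pass."""
    shuffled = list(files)
    random.shuffle(shuffled)
    yield from shuffled

files = ['a.wav', 'b.wav', 'c.wav', 'd.wav']
with_replacement = list(randomize_files(files))
without_replacement = list(shuffled_files(files))

# Both yield len(files) items, but only the shuffle is guaranteed
# to cover every file in a single pass.
print(len(with_replacement), len(set(with_replacement)))
print(sorted(without_replacement) == files)  # True
```

Since thread_main restarts the generator on every pass over the data set, sampling with replacement still covers the files over time; it just does not guarantee coverage within one pass.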

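load_generic_audio(directory, sample_rate)

The following code builds a generator that loads the audio files found in a directory and yields their waveforms one by one, in random order.
librosa.load() reads an audio file into a NumPy array and returns the audio signal together with its sampling rate.

	def load_generic_audio(directory, sample_rate):
	    ''' Generator that yields audio waveforms from the directory '''
	    
	    # Find the audio files with the ".wav" suffix
	    files = find_files(directory)
	    
	    # Compile the file name pattern
	    id_reg_exp = re.compile(FILE_PATTERN)
	    
	    # Print the number of files, then draw files at random
	    print("files length: {}".format(len(files)))
	    randomized_files = randomize_files(files)
	    
	    # Iterate over the randomly drawn files
	    for filename in randomized_files:
	        # Look for an id in the file name
	        ids = id_reg_exp.findall(filename)
	        if not ids:
	            # The file name does not match the pattern containing ids, so there is no id.
	            category_id = None
	        else:
	            # The file name matches the pattern containing ids.
	            category_id = int(ids[0][0])
	        
	        # Read the audio file, resampled to sample_rate and mixed down to mono
	        audio, _ = librosa.load(filename, sr=sample_rate, mono=True)
	        # Reshape into a single column
	        audio = audio.reshape(-1, 1)
	        
	        # Yield the audio signal, the file name, and the category id
	        yield audio, filename, category_id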

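trim_silence(audio, threshold, frame_length=2048)

The following code removes the silence at the beginning and end of an audio sample.
librosa.feature.rmse() computes the root-mean-square (RMS) energy of each frame, either from the audio samples y or from a spectrogram S. Computing the energy from audio samples is faster because it does not require an STFT. Using a spectrogram, however, gives a more accurate representation of energy over time because its frames can be windowed, so prefer S if it is already available. In newer librosa releases this function is named librosa.feature.rms.

	def trim_silence(audio, threshold, frame_length=2048):
	    ''' Remove silence at the beginning and end of a sample '''
	    if audio.size < frame_length:
	        frame_length = audio.size
	    
	    # Compute the RMS energy of each frame
	    energy = librosa.feature.rmse(audio, frame_length=frame_length)
	    # Indices of the elements above the threshold; energy is 2-D,
	    # so this is a tuple of (row indices, frame indices)
	    frames = np.nonzero(energy > threshold)
	    # Convert frame indices to sample indices and keep the second array;
	    # the first array (row indices) is all zeros
	    indices = librosa.core.frames_to_samples(frames)[1]
	
	    # Note: indices can be an empty array if the whole audio is silent
	    return audio[indices[0]:indices[-1]] if indices.size else audio[0:0]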
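
To make the thresholding idea concrete without depending on librosa, here is a simplified pure-Python sketch. It is an assumption-laden stand-in: frames start at multiples of hop_length with no centering or padding, so the exact indices differ from what librosa computes. The name trim_silence_sketch and the tiny frame sizes are chosen for illustration.

```python
import math

def trim_silence_sketch(samples, threshold, frame_length=4, hop_length=2):
    """Simplified stand-in for trim_silence (no centering or padding)."""
    if len(samples) < frame_length:
        frame_length = len(samples)
    if frame_length == 0:
        return []
    # RMS energy of each frame, one frame every hop_length samples.
    starts = range(0, len(samples) - frame_length + 1, hop_length)
    energy = [
        math.sqrt(sum(x * x for x in samples[s:s + frame_length]) / frame_length)
        for s in starts
    ]
    # Start positions of the frames whose energy exceeds the threshold.
    loud = [s for s, e in zip(starts, energy) if e > threshold]
    if not loud:
        return []  # the whole signal is silent
    # Like the original, slice from the first to the last loud frame position.
    return samples[loud[0]:loud[-1]]

samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
print(trim_silence_sketch(samples, threshold=0.1))    # [0.0, 0.0, 0.5, -0.5]
print(trim_silence_sketch([0.0] * 8, threshold=0.1))  # []
```

Note that the slice ends at the position of the last loud frame rather than at its end, so a little of the trailing signal is dropped along with the silence.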

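not_all_have_id(files)

The following code checks whether any file name fails to match the pattern needed to determine its category id.

	def not_all_have_id(files):
	    ''' Return True if any file name does not match the pattern we need to determine the category id '''
	    # Compile the file name pattern
	    id_reg_exp = re.compile(FILE_PATTERN)
	    
	    # Iterate over all the given files
	    for file in files:
	        # Look for a match and check whether one exists
	        ids = id_reg_exp.findall(file)
	        if not ids:
	            return True
	    return False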

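The AudioReader class

AudioReader member variables

The following are the member variables of the AudioReader class.

		audio_dir					# directory containing the audio sample files
		coord						# thread coordinator
		receptive_field				# receptive field
		sample_rate					# sampling rate
		sample_size					# sample size
		silence_threshold			# silence threshold
		
		sample_placeholder			# float32 placeholder
		queue						# queue of float32 samples, default capacity 32
		enqueue						# enqueue op for the float32 placeholder
		
		gc_enabled					# global conditioning flag
		id_placeholder				# int32 placeholder
		gc_queue					# queue of int32 ids, default capacity 32
		gc_enqueue					# enqueue op for the int32 placeholder
		
		gc_category_cardinality		# category cardinality
		
		threads						# list of threads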

AudioReader member functions

__init__(self, audio_dir, coord, sample_rate, gc_enabled, receptive_field, sample_size=None, silence_threshold=None, queue_size=32)

Initializes the AudioReader class and assigns its member variables.

    def __init__(self, audio_dir, coord, sample_rate, gc_enabled,
                 receptive_field, sample_size=None, 
                 silence_threshold=None, queue_size=32):
        self.audio_dir = audio_dir				# audio sample directory
        self.sample_rate = sample_rate			# sampling rate
        self.coord = coord						# thread coordinator
        self.sample_size = sample_size			# sample size
        self.receptive_field = receptive_field	# receptive field
        self.silence_threshold = silence_threshold	# silence threshold
        self.gc_enabled = gc_enabled			# global conditioning flag
        self.threads = []						# list of threads
        
        # Create a float32 placeholder
        self.sample_placeholder = tf.placeholder(dtype=tf.float32, shape=None)
        # Create a queue whose elements can have a dynamic first dimension
        self.queue = tf.PaddingFIFOQueue(queue_size,
                                         ['float32'],
                                         shapes=[(None, 1)])
        # Create the enqueue op for this queue
        self.enqueue = self.queue.enqueue([self.sample_placeholder])

        # If global conditioning is enabled
        if self.gc_enabled:
            # Create a scalar int32 placeholder
            self.id_placeholder = tf.placeholder(dtype=tf.int32, shape=())
            # Create a queue of scalar ids
            self.gc_queue = tf.PaddingFIFOQueue(queue_size, ['int32'],
                                                shapes=[()])
            # Create the enqueue op for this queue
            self.gc_enqueue = self.gc_queue.enqueue([self.id_placeholder])

        # Checking this inside the AudioReader's thread makes it hard to
        # terminate the script, so for now it is done in the constructor
        # Look for files in the given directory
        files = find_files(audio_dir)
        # Raise an error if no files were found
        if not files:
            raise ValueError("No audio files found in '{}'.".format(audio_dir))
        
        # Raise an error if global conditioning is enabled but some file names carry no id
        if self.gc_enabled and not_all_have_id(files):
            raise ValueError("Global conditioning is enabled, but file names "
                             "do not conform to pattern having id.")
        
        # Determine the number of mutually exclusive categories we will
        # accommodate in the embedding table
        if self.gc_enabled:
            # Take the largest speaker id among the files
            _, self.gc_category_cardinality = get_category_cardinality(files)
            # Add 1 to the largest id to get the number of categories
            self.gc_category_cardinality += 1
            print("Detected --gc_cardinality={}".format(
                  self.gc_category_cardinality))
        else:
            self.gc_category_cardinality = None
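
The "+ 1" is easier to see with a toy table. The sketch below assumes the speaker id is used directly as a row index into the embedding table, which is what the max + 1 sizing suggests. The ids and the embedding size are made up for illustration.

```python
# Hypothetical speaker ids extracted from file names.
speaker_ids = [225, 226, 230]

# If the raw id is the row index, the table needs max_id + 1 rows
# (indices 0..max_id), not len(set(speaker_ids)) rows.
gc_category_cardinality = max(speaker_ids) + 1

embedding_dim = 2  # assumed size, for illustration only
embedding_table = [[0.0] * embedding_dim
                   for _ in range(gc_category_cardinality)]

row = embedding_table[230]  # valid because 230 < 231
print(gc_category_cardinality, len(row))  # 231 2
```

The trade-off is that rows for unused ids are allocated but never looked up, in exchange for not having to remap ids to a dense range.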

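dequeue(self, num_elements)

The following code dequeues a batch of audio samples.

	    def dequeue(self, num_elements):
	        # Dequeue the requested number of processed audio samples in one batch
	        output = self.queue.dequeue_many(num_elements)
	        # Return the dequeued batch of audio samples
	        return output

dequeue_gc(self, num_elements)

The following code dequeues a batch of global conditioning ids, that is, the category ids enqueued alongside the audio samples.

	    def dequeue_gc(self, num_elements):
	        # Dequeue the requested number of category ids in one batch
	        return self.gc_queue.dequeue_many(num_elements)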

thread_main(self, sess)

The following code is the audio processing function run by each reader thread.

	    def thread_main(self, sess):
	        stop = False
	        # Go through the data set repeatedly
	        while not stop:
	            # Load the audio files
	            iterator = load_generic_audio(self.audio_dir, self.sample_rate)
	            for audio, filename, category_id in iterator:
	                # Leave the loop if the coordinator asks the thread to stop
	                if self.coord.should_stop():
	                    stop = True
	                    break
	                
	                # If a silence threshold is set
	                if self.silence_threshold is not None:
	                    # Trim the leading and trailing silence
	                    audio = trim_silence(audio[:, 0], self.silence_threshold)
	                    # Reshape into a single column
	                    audio = audio.reshape(-1, 1)
	                    # If nothing is left, warn that the file contains only silence at this threshold
	                    if audio.size == 0:
	                        print("Warning: {} was ignored as it contains only "
	                              "silence. Consider decreasing trim_silence "
	                              "threshold, or adjust volume of the audio."
	                              .format(filename))
	
	                # Left-pad the audio with receptive_field zeros
	                audio = np.pad(audio, [[self.receptive_field, 0], [0, 0]],
	                               'constant')
	
	                if self.sample_size:
	                    # Cut the sample into pieces of size receptive_field + sample_size
	                    # that overlap by receptive_field samples
	                    while len(audio) > self.receptive_field:
	                        # Take the next piece of audio
	                        piece = audio[:(self.receptive_field +
	                                        self.sample_size), :]
	                        # Enqueue the piece
	                        sess.run(self.enqueue,
	                                 feed_dict={self.sample_placeholder: piece})
	                        
	                        # Advance the audio by sample_size samples
	                        audio = audio[self.sample_size:, :]
	                        
	                        # If global conditioning is enabled, enqueue the category id as well
	                        if self.gc_enabled:
	                            sess.run(self.gc_enqueue, feed_dict={
	                                self.id_placeholder: category_id})
	                else:
	                    # If sample_size is not set (None or 0), enqueue the whole audio
	                    sess.run(self.enqueue,
	                             feed_dict={self.sample_placeholder: audio})
	                    # If global conditioning is enabled, enqueue the category id as well
	                    if self.gc_enabled:
	                        sess.run(self.gc_enqueue,
	                                 feed_dict={self.id_placeholder: category_id})
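
The padding and slicing logic can be traced on a tiny input. The sketch below uses plain Python lists in place of NumPy arrays; the function name slice_audio and the sizes are chosen for illustration.

```python
def slice_audio(samples, receptive_field, sample_size):
    """List-based sketch of the slicing loop in thread_main."""
    # Left-pad with receptive_field zeros, like np.pad(..., 'constant').
    audio = [0.0] * receptive_field + list(samples)
    pieces = []
    while len(audio) > receptive_field:
        # Each piece holds up to receptive_field + sample_size samples.
        pieces.append(audio[:receptive_field + sample_size])
        # Advance by sample_size, so consecutive pieces overlap by
        # receptive_field samples.
        audio = audio[sample_size:]
    return pieces

pieces = slice_audio([1, 2, 3, 4, 5], receptive_field=2, sample_size=2)
for piece in pieces:
    print(piece)
# [0.0, 0.0, 1, 2]
# [1, 2, 3, 4]
# [3, 4, 5]
```

Each piece starts with the last receptive_field samples of the previous one, and the final piece can be shorter than receptive_field + sample_size.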

start_threads(self, sess, n_threads=1)

The following code creates and starts the audio reader threads.

	    def start_threads(self, sess, n_threads=1):
	        for _ in range(n_threads):
	            # Create a thread that runs thread_main with sess as its argument
	            thread = threading.Thread(target=self.thread_main, args=(sess,))
	            thread.daemon = True	# the thread is killed when the main program exits
	            thread.start()			# start the thread
	            self.threads.append(thread)		# add the thread to the thread list
	        
	        # Return the thread list
	        return self.threads
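
The producer pattern behind start_threads and thread_main can be tried without TensorFlow. In the sketch below, queue.Queue stands in for the TensorFlow queue and threading.Event stands in for the coordinator. Both substitutions, and the ReaderSketch class itself, are assumptions made for illustration.

```python
import queue
import threading

class ReaderSketch:
    """Stand-in for AudioReader using only the standard library."""

    def __init__(self, items, queue_size=4):
        self.items = items
        self.queue = queue.Queue(maxsize=queue_size)
        self.stop_event = threading.Event()
        self.threads = []

    def _put(self, item):
        # Retry with a timeout so a full queue cannot block shutdown.
        while not self.stop_event.is_set():
            try:
                self.queue.put(item, timeout=0.05)
                return True
            except queue.Full:
                pass
        return False

    def thread_main(self):
        # Go through the data set repeatedly until asked to stop.
        while not self.stop_event.is_set():
            for item in self.items:
                if not self._put(item):
                    return

    def start_threads(self, n_threads=1):
        for _ in range(n_threads):
            thread = threading.Thread(target=self.thread_main)
            thread.daemon = True
            thread.start()
            self.threads.append(thread)
        return self.threads

reader = ReaderSketch(['a', 'b', 'c'])
threads = reader.start_threads(n_threads=1)
batch = [reader.queue.get(timeout=1) for _ in range(5)]
reader.stop_event.set()
for t in threads:
    t.join(timeout=1)
print(batch)  # ['a', 'b', 'c', 'a', 'b']
```

With a single producer the order is deterministic; with n_threads greater than 1, items from different threads interleave.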