>- **🍨 This post is a learning-log entry for the [🔗365天深度学习训练营](https://mp.weixin.qq.com/s/rbOOmire8OocQ90QM78DRA)**
>- **🍖 Original author: [K同学啊 | tutoring & custom projects](https://mtyjkh.blog.csdn.net/)**
My environment:
- OS: Ubuntu 22.04
- Language: Python 3.9.18
- Editor: VS Code + Jupyter Notebook
- Deep learning framework: TensorFlow 2.15.0
- GPU: NVIDIA GeForce RTX 2080
This week I picked up an NVIDIA RTX 2080, and installing the OS and setting up the environment ate a lot of time.
The remaining T-series lessons don't look like anything special, so I'll just work through the whole T series in one go.
T7: Coffee Bean Classification
🍺 Requirements:
- Build the VGG-16 network yourself (done)
- Call the official VGG-16 implementation (done)
🍻 Stretch goals (optional):
- Reach 100% validation accuracy
- Draw the VGG-16 architecture diagram in PPT (a skill you'll need when publishing papers)
🔎 Exploration (fairly hard):
- Slim down the model without hurting accuracy
○ VGG-16's Total params is currently 134,276,932
The code itself is simple,
but validation accuracy sits at 0.98–0.99 and never reaches 1.0:
Epoch 43/100
30/30 [==============================] - ETA: 0s - loss: 1.3314e-05 - accuracy: 1.0000
Epoch 43: val_accuracy did not improve from 0.99167
30/30 [==============================] - 8s 275ms/step - loss: 1.3314e-05 - accuracy: 1.0000 - val_loss: 0.1212 - val_accuracy: 0.9875
Epoch 44/100
30/30 [==============================] - ETA: 0s - loss: 1.0177e-05 - accuracy: 1.0000
Epoch 44: val_accuracy did not improve from 0.99167
30/30 [==============================] - 8s 268ms/step - loss: 1.0177e-05 - accuracy: 1.0000 - val_loss: 0.1221 - val_accuracy: 0.9875
Epoch 45/100
30/30 [==============================] - ETA: 0s - loss: 7.2904e-06 - accuracy: 1.0000
Epoch 45: val_accuracy did not improve from 0.99167
30/30 [==============================] - 8s 272ms/step - loss: 7.2904e-06 - accuracy: 1.0000 - val_loss: 0.1292 - val_accuracy: 0.9875
Epoch 45: early stopping
I'm not sure how to push it further.
I've never done model compression; I only know it can be approached with pruning, channel shuffling, knowledge distillation, and similar techniques.
I'll come back to it later.
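Before slimming anything, it helps to see where those 134,276,932 parameters actually live. A quick back-of-the-envelope count (pure Python, no framework needed; layer sizes are the standard VGG-16 configuration with a 4-class head) shows the three fully connected layers hold roughly 89% of the total, which is why most VGG slimming starts by replacing them, e.g. with global average pooling:

```python
# Parameter count of VGG-16 with a 4-class head, computed by hand.
# Conv layers: 3x3 kernels, params = 3*3*c_in*c_out + c_out (bias).
# FC layers:   params = n_in*n_out + n_out.

def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# Channel plan of the 13 conv layers in VGG-16
channels = [(3, 64), (64, 64),
            (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]
conv_total = sum(conv_params(a, b) for a, b in channels)

# After the last pool: 7*7*512 = 25088 features, then two 4096-wide
# FC layers and a 4-class output layer.
fc_total = (dense_params(7 * 7 * 512, 4096)
            + dense_params(4096, 4096)
            + dense_params(4096, 4))

total = conv_total + fc_total
print(conv_total, fc_total, total)   # 14714688 119562244 134276932
print(round(fc_total / total, 3))    # 0.89: the FC layers dominate
```

The grand total matches the 134,276,932 reported by `model.summary()`, so even an aggressive conv-layer pruning could never remove more than about 11% of the parameters on its own.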
T8: Cat vs. Dog Classification
🍺 Requirements:
- Understand model.train_on_batch() and put it to use (done)
- Understand tqdm and use it to show a progress bar (done)
🍻 Stretch goal (optional):
- The code in this lesson contains a serious BUG; find it and explain it in writing
🔎 Exploration (fairly hard):
- Modify the code to fix the BUG
This task uses a new way of training the model:
model.train_on_batch(image, label)
Unlike the model.fit() used previously, each call performs a single gradient update on one batch.
The upside is that you can inspect things and make whatever adjustments you like between steps.
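The resulting loop looks roughly like this. This is a structural sketch only: `DummyModel` stands in for a compiled Keras model (its `train_on_batch` returns `[loss, accuracy]`, as Keras does when compiled with `metrics=['accuracy']`), and `dataset` stands in for your `tf.data` batches — swap in the real objects in practice:

```python
# Manual training loop around model.train_on_batch with a tqdm bar.
try:
    from tqdm import tqdm
except ImportError:          # fall back to a plain iterator
    def tqdm(it, **kwargs):
        return it

class DummyModel:
    """Stand-in for a compiled Keras model."""
    def train_on_batch(self, image, label):
        return [0.5, 0.9]    # pretend [loss, accuracy]

model = DummyModel()
dataset = [([0], [0])] * 8   # stand-in for (image, label) batches

train_loss, train_accuracy = [], []
for epoch in range(2):
    for image, label in tqdm(dataset, desc=f"Epoch {epoch + 1}"):
        # one gradient update on one batch
        loss, acc = model.train_on_batch(image, label)
        train_loss.append(loss)
        train_accuracy.append(acc)

print(len(train_loss))       # 16 updates = 2 epochs x 8 batches
```

Because every update runs through your own Python code, you can change the learning rate, log extra metrics, or early-stop on any condition between batches — the flexibility model.fit() hides behind callbacks.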
I never found the BUG.
T9: Cat vs. Dog Classification 2
Requirements:
- Find and fix the problem in week 8's program (this lesson gives the answer)
🍻 Stretch goals (optional):
- Try adding data augmentation to improve accuracy
- Which techniques can be used for data augmentation? (answered next week)
🔎 Exploration (fairly hard):
- The code in this lesson is quite redundant; trim it down
I hit two errors.
The first was a folder name: the code says "365-9" but the directory is actually "365-7".
The second:
"name": "ResourceExhaustedError",
"message": "Graph execution error:
Detected at node Adam/StatefulPartitionedCall_26 defined at (most recent call last):
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/runpy.py\", line 197, in _run_module_as_main
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/runpy.py\", line 87, in _run_code
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel_launcher.py\", line 17, in <module>
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/traitlets/config/application.py\", line 992, in launch_instance
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/kernelapp.py\", line 701, in start
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tornado/platform/asyncio.py\", line 195, in start
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/asyncio/events.py\", line 80, in _run
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/kernelbase.py\", line 534, in dispatch_queue
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/kernelbase.py\", line 523, in process_one
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/kernelbase.py\", line 429, in dispatch_shell
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/kernelbase.py\", line 767, in execute_request
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/ipkernel.py\", line 429, in do_execute
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/ipykernel/zmqshell.py\", line 549, in run_cell
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/IPython/core/interactiveshell.py\", line 3048, in run_cell
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/IPython/core/interactiveshell.py\", line 3103, in _run_cell
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/IPython/core/async_helpers.py\", line 129, in _pseudo_sync_runner
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/IPython/core/interactiveshell.py\", line 3308, in run_cell_async
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/IPython/core/interactiveshell.py\", line 3490, in run_ast_nodes
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/IPython/core/interactiveshell.py\", line 3550, in run_code
File \"/tmp/ipykernel_148167/135579695.py\", line 37, in <module>
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/engine/training.py\", line 2787, in train_on_batch
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/engine/training.py\", line 1401, in train_function
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/engine/training.py\", line 1384, in step_function
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/engine/training.py\", line 1373, in run_step
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/engine/training.py\", line 1154, in train_step
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py\", line 544, in minimize
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py\", line 1223, in apply_gradients
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py\", line 652, in apply_gradients
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py\", line 1253, in _internal_apply_gradients
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py\", line 1345, in _distributed_apply_gradients_fn
File \"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py\", line 1340, in apply_grad_to_update_var
Out of memory while trying to allocate 822083716 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 1.53GiB
constant allocation: 8B
maybe_live_out allocation: 1.15GiB
preallocated temp allocation: 784.00MiB
preallocated temp fragmentation: 124B (0.00%)
total allocation: 2.30GiB
Peak buffers:
\tBuffer 1:
\t\tSize: 392.00MiB
\t\tXLA Label: fusion
\t\tShape: f32[25088,4096]
\t\t==========================
\tBuffer 2:
\t\tSize: 392.00MiB
\t\tXLA Label: fusion
\t\tShape: f32[25088,4096]
\t\t==========================
\tBuffer 3:
\t\tSize: 392.00MiB
\t\tOperator: op_name=\"XLA_Args\"
\t\tEntry Parameter Subshape: f32[25088,4096]
\t\t==========================
\tBuffer 4:
\t\tSize: 392.00MiB
\t\tOperator: op_name=\"XLA_Args\"
\t\tEntry Parameter Subshape: f32[25088,4096]
\t\t==========================
\tBuffer 5:
\t\tSize: 392.00MiB
\t\tOperator: op_name=\"XLA_Args\"
\t\tEntry Parameter Subshape: f32[25088,4096]
\t\t==========================
\tBuffer 6:
\t\tSize: 392.00MiB
\t\tOperator: op_name=\"XLA_Args\"
\t\tEntry Parameter Subshape: f32[25088,4096]
\t\t==========================
\tBuffer 7:
\t\tSize: 24B
\t\tOperator: op_type=\"AssignSubVariableOp\" op_name=\"AssignSubVariableOp\" source_file=\"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/framework/ops.py\" source_line=1160
\t\tXLA Label: fusion
\t\tShape: (f32[25088,4096], f32[25088,4096], f32[25088,4096])
\t\t==========================
\tBuffer 8:
\t\tSize: 8B
\t\tOperator: op_name=\"XLA_Args\"
\t\tEntry Parameter Subshape: s64[]
\t\t==========================
\tBuffer 9:
\t\tSize: 4B
\t\tOperator: op_type=\"Pow\" op_name=\"Pow_1\" source_file=\"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/framework/ops.py\" source_line=1160 deduplicated_name=\"fusion.4\"
\t\tXLA Label: fusion
\t\tShape: f32[]
\t\t==========================
\tBuffer 10:
\t\tSize: 4B
\t\tOperator: op_type=\"Pow\" op_name=\"Pow\" source_file=\"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/framework/ops.py\" source_line=1160 deduplicated_name=\"fusion.4\"
\t\tXLA Label: fusion
\t\tShape: f32[]
\t\t==========================
\tBuffer 11:
\t\tSize: 4B
\t\tOperator: op_type=\"Pow\" op_name=\"Pow_1\" source_file=\"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/framework/ops.py\" source_line=1160
\t\tXLA Label: constant
\t\tShape: f32[]
\t\t==========================
\tBuffer 12:
\t\tSize: 4B
\t\tOperator: op_name=\"XLA_Args\"
\t\tEntry Parameter Subshape: f32[]
\t\t==========================
\tBuffer 13:
\t\tSize: 4B
\t\tOperator: op_type=\"Pow\" op_name=\"Pow\" source_file=\"/home/wjh/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/framework/ops.py\" source_line=1160
\t\tXLA Label: constant
\t\tShape: f32[]
\t\t==========================
\t [[{{node Adam/StatefulPartitionedCall_26}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_3565]",
"stack": "---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
Cell In[12], line 37
30 \"\"\"
31 训练模型,简单理解train_on_batch就是:它是比model.fit()更高级的一个用法
32
33 想详细了解 train_on_batch 的同学,
34 可以看看我的这篇文章:https://www.yuque.com/mingtian-fkmxf/hv4lcq/ztt4gy
35 \"\"\"
36 # 这里生成的是每一个batch的acc与loss
---> 37 history = model.train_on_batch(image,label)
39 train_loss.append(history[0])
40 train_accuracy.append(history[1])
File ~/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/src/engine/training.py:2787, in Model.train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics, return_dict)
2783 iterator = data_adapter.single_batch_iterator(
2784 self.distribute_strategy, x, y, sample_weight, class_weight
2785 )
2786 self.train_function = self.make_train_function()
-> 2787 logs = self.train_function(iterator)
2789 logs = tf_utils.sync_to_numpy_or_python_type(logs)
2790 if return_dict:
File ~/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/anaconda3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
51 try:
52 ctx.ensure_initialized()
---> 53 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
54 inputs, attrs, num_outputs)
55 except core._NotOkStatusException as e:
56 if name is not None:
ResourceExhaustedError: Graph execution error: (traceback and OOM report identical to the message above)"
}
The cause was running out of GPU memory; dropping the batch size to 16 fixed it.
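The numbers in the log line up with VGG-16's first FC layer. A quick sanity check (pure Python, exact arithmetic) shows why Adam in particular is so hungry here — it keeps two extra slot variables (`m` and `v`) shaped like each weight tensor:

```python
# The f32[25088,4096] buffers in the OOM report are VGG-16's first
# FC layer weight matrix, in float32 (4 bytes per element).
fc1_bytes = 25088 * 4096 * 4
print(fc1_bytes)              # 411041792
print(fc1_bytes / 2**20)      # 392.0 MiB, matching "Size: 392.00MiB"

# Adam stores weights + m + v, i.e. three full copies per tensor:
print(3 * fc1_bytes / 2**20)  # 1176.0 MiB for this one layer
```

The failed allocation of 822,083,716 bytes is almost exactly two such 392 MiB buffers, so on an 8 GB RTX 2080 the optimizer state plus activations at batch size 32 simply doesn't fit, while batch size 16 leaves enough headroom.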
T10: Data Augmentation
Requirements:
- Learn to use data augmentation in code to improve accuracy
- Explore more augmentation techniques and write them down
These are fairly basic augmentation techniques; nothing much to say about them.
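For the record, the geometric ones boil down to simple index manipulation. Here is a minimal framework-free illustration of a horizontal flip and a 90° rotation on a raw H×W pixel grid (in a real Keras pipeline you would instead use preprocessing layers such as `tf.keras.layers.RandomFlip` and `RandomRotation`):

```python
# Two basic geometric augmentations on a nested-list "image".

def hflip(img):
    """Horizontal flip: reverse each row."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate 90 degrees clockwise: reverse rows, then transpose."""
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
print(hflip(img))   # [[2, 1], [4, 3]]
print(rot90(img))   # [[3, 1], [4, 2]]
```

Color-space augmentations (brightness, contrast, hue jitter) work the same way conceptually: a deterministic transform plus a random parameter, applied only to the training split.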
T11: Optimizer Comparison Experiment
This time I used a pre-trained model originally built for face recognition.
It doesn't seem to behave much differently from the hand-built VGG-16, though; validation accuracy still hovers around 0.6.