A Programming Guide for NLP Researchers

Latest recommended article published on 2024-02-17 11:54:24

潜心修行的研究者

Category columns: Natural Language Processing, NLP with DL

Article link: https://blog.csdn.net/h2026966427/article/details/84442299


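Basics:

What you will learn: how to write code in a way that makes your life easier!

There are two main modes of writing research code: prototyping and writing reusable components.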

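Use a framework

Core: the training loop
Find a good starting point
- a baseline
- someone else's readable code
- reproduce it to understand the whole process, which makes it easier to see where better decisions can be made
Copy first, refactor later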

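When prototyping, get a quick implementation working first, then refactor the code.
Use good coding style
- meaningful variable names
- shape comments on important tensors
- descriptions of non-obvious logic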
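The three style points above can be sketched in one small function. This is only an illustration: `masked_mean_pool` and its arguments are hypothetical names, and nested Python lists stand in for tensors so the example has no dependencies.

```python
from typing import List


def masked_mean_pool(
    token_vectors: List[List[List[float]]],  # shape: (batch_size, seq_len, embedding_dim)
    mask: List[List[int]],                   # shape: (batch_size, seq_len); 1 = real token, 0 = padding
) -> List[List[float]]:                      # shape: (batch_size, embedding_dim)
    """Average the vectors of the real (unpadded) tokens in each sequence."""
    pooled = []
    for sequence, sequence_mask in zip(token_vectors, mask):
        embedding_dim = len(sequence[0])
        num_real_tokens = sum(sequence_mask)
        # Non-obvious logic: an all-padding sequence has zero real tokens,
        # so clamp the denominator to 1 to avoid dividing by zero.
        denominator = max(num_real_tokens, 1)
        summed = [0.0] * embedding_dim  # shape: (embedding_dim,)
        for vector, keep in zip(sequence, sequence_mask):
            if keep:
                summed = [total + value for total, value in zip(summed, vector)]
        pooled.append([total / denominator for total in summed])
    return pooled


# One sequence of three tokens; the last one is padding and is ignored.
pooled = masked_mean_pool([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]], [[1, 1, 0]])
```

The shape comments make it possible to check each line against the expected dimensions without running the code.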
Keep testing minimal
- use a small test dataset
- check that data preprocessing is consistent
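A minimal sketch of what such a test might look like, assuming a hypothetical `preprocess` function and a tiny hand-checked fixture in place of the real dataset:

```python
import re
from typing import List


def preprocess(text: str) -> List[str]:
    """Hypothetical preprocessing: lowercase, then split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# A tiny fixture: a few hand-checked examples instead of the full dataset.
TINY_FIXTURE = [
    ("Hello, World!", ["hello", ",", "world", "!"]),
    ("NLP is fun", ["nlp", "is", "fun"]),
]


def test_preprocess_matches_fixture() -> None:
    for text, expected_tokens in TINY_FIXTURE:
        assert preprocess(text) == expected_tokens


def test_preprocess_is_consistent() -> None:
    # The same text must always produce the same tokens.
    for text, _ in TINY_FIXTURE:
        assert preprocess(text) == preprocess(text)
    # Differently cased copies of a sentence (e.g. in train vs. test data)
    # should map to identical tokens.
    assert preprocess("HELLO, world!") == preprocess("Hello, World!")


test_preprocess_matches_fixture()
test_preprocess_is_consistent()
```

Tests like these run in well under a second, so they can be rerun after every change to the prototype.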
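How much to hard-code
- For a prototype, do not worry much about shareable elements; more hard-coding is fine, since the goal is a fast implementation.
- Do not over-abstract: abstraction serves reuse and sharing, so it fits the case of writing components.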

Record the results of what you have run
Controlled experiments
- Important for putting your work in context and for knowing what caused a difference
- Very controlled experiments, varying one thing at a time: these let us make causal claims
Record parameter settings through a configuration class, configuration file, or configuration script.
- Not good: modifying code to run different variants; hard to keep track of what you ran
- Better: configuration files or separate scripts
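One possible way to combine these ideas is sketched below. The `ExperimentConfig` class, its field names, and its default values are all hypothetical; the point is only that each variant is a config object rather than a code edit, and that each result is stored together with the settings that produced it.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    # Hypothetical hyperparameters; the names and defaults are placeholders.
    embedding_dim: int = 100
    hidden_dim: int = 200
    learning_rate: float = 0.001
    dropout: float = 0.5


def load_config(json_text: str) -> ExperimentConfig:
    """Build a config from JSON text; unspecified fields keep their defaults."""
    return ExperimentConfig(**json.loads(json_text))


def changed_fields(a: ExperimentConfig, b: ExperimentConfig) -> set:
    """Names of the fields whose values differ between two configs."""
    a_dict, b_dict = asdict(a), asdict(b)
    return {name for name in a_dict if a_dict[name] != b_dict[name]}


def make_run_record(run_name: str, config: ExperimentConfig, metrics: dict) -> str:
    """One JSON line per run, so every result stays attached to its settings."""
    return json.dumps(
        {"run": run_name, "config": asdict(config), "metrics": metrics},
        sort_keys=True,
    )


baseline = load_config('{"embedding_dim": 50, "learning_rate": 0.01}')
# Controlled variant: exactly one field differs from the baseline.
variant = replace(baseline, dropout=0.2)
assert changed_fields(baseline, variant) == {"dropout"}

# The metrics dict is left empty here; it would be filled in after the run.
record = make_run_record("variant_dropout_0.2", variant, metrics={})
```

Checking `changed_fields` before launching a run is a cheap guard against accidentally varying two things at once.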

Crucial tool for understanding model behavior during training

TensorBoard

Embeddings have sparse gradients (only some embeddings are updated), but the moment estimates in Adam are recomputed for the whole embedding matrix at every step.

Solution:
```
from allennlp.training.optimizers import DenseSparseAdam
```
It uses sparse accumulators for the gradient moments.
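To see why this matters, here is a simplified pure-Python sketch of a single Adam step with and without sparse moment updates. It is not the AllenNLP `DenseSparseAdam` implementation; the function `adam_step`, the constants, and the toy gradients are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

BETA1, BETA2, LR, EPS = 0.9, 0.999, 0.1, 1e-8


def adam_step(
    rows: List[List[float]],              # shape: (num_embeddings, embedding_dim)
    m: List[List[float]],                 # first-moment estimates, same shape as rows
    v: List[List[float]],                 # second-moment estimates, same shape as rows
    sparse_grad: Dict[int, List[float]],  # row index -> gradient for that row
    step: int,                            # 1-based step count
    sparse_moments: bool,
) -> None:
    """Apply one Adam step in place, updating either all rows or only rows with gradients."""
    row_indices = list(sparse_grad) if sparse_moments else range(len(rows))
    for i in row_indices:
        grad = sparse_grad.get(i, [0.0] * len(rows[i]))
        for j, g in enumerate(grad):
            m[i][j] = BETA1 * m[i][j] + (1 - BETA1) * g
            v[i][j] = BETA2 * v[i][j] + (1 - BETA2) * g * g
            m_hat = m[i][j] / (1 - BETA1 ** step)
            v_hat = v[i][j] / (1 - BETA2 ** step)
            rows[i][j] -= LR * m_hat / (math.sqrt(v_hat) + EPS)


def run(sparse_moments: bool) -> Tuple[List[float], List[float]]:
    """Return row 0 after step 1 and after step 2; only step 1 has a gradient for row 0."""
    rows = [[1.0, 1.0], [1.0, 1.0]]
    m = [[0.0, 0.0] for _ in rows]
    v = [[0.0, 0.0] for _ in rows]
    adam_step(rows, m, v, {0: [0.5, -0.5]}, 1, sparse_moments)
    row0_after_step1 = list(rows[0])
    adam_step(rows, m, v, {1: [0.5, -0.5]}, 2, sparse_moments)
    return row0_after_step1, list(rows[0])


dense_step1, dense_step2 = run(sparse_moments=False)
sparse_step1, sparse_step2 = run(sparse_moments=True)
```

In the dense variant, row 0 keeps moving at step 2 because its leftover first moment is still applied, even though its gradient is zero. In the sparse variant, rows without a gradient are left untouched, which also avoids a full pass over the embedding matrix.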
Look at your data
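- process the data in a separate step, so that you can inspect and understand it
- the model needs to be able to run without labels and without computing a loss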
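A minimal sketch of the second point, using a hypothetical toy classifier: the `forward` method always returns predictions and adds a loss only when a gold label is passed in, so the same code path can be used to inspect model outputs on unlabeled data.

```python
import math
from typing import Any, Dict, List, Optional


class TinySentimentModel:
    """Hypothetical toy classifier that scores a sentence by counting cue words."""

    POSITIVE_WORDS = {"good", "great"}
    NEGATIVE_WORDS = {"bad", "awful"}

    def forward(self, tokens: List[str], label: Optional[int] = None) -> Dict[str, Any]:
        positive_hits = sum(token in self.POSITIVE_WORDS for token in tokens)
        negative_hits = sum(token in self.NEGATIVE_WORDS for token in tokens)
        score = positive_hits - negative_hits
        prob_positive = 1.0 / (1.0 + math.exp(-score))
        output: Dict[str, Any] = {
            "prob_positive": prob_positive,
            "prediction": int(prob_positive >= 0.5),
        }
        # The loss is computed only when a gold label is supplied,
        # so the model also runs on unlabeled data.
        if label is not None:
            prob_of_gold = prob_positive if label == 1 else 1.0 - prob_positive
            output["loss"] = -math.log(prob_of_gold)
        return output


model = TinySentimentModel()
unlabeled_output = model.forward(["a", "great", "movie"])
labeled_output = model.forward(["a", "great", "movie"], label=1)
```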

Key point during prototyping: The components that you use matter. A lot.

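Source control: e.g. GitHub

Advantages:
- you can go back to and revisit older versions of your code
- makes collaboration easier
- code review
Continuous integration
Unit tests: automatically check that a small part of your code works correctly, for example with assert.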
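As a sketch of what such a unit test could look like with Python's standard `unittest` module, assuming a hypothetical helper `pad_sequences`:

```python
import unittest
from typing import List


def pad_sequences(sequences: List[List[int]], padding_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_length = max((len(sequence) for sequence in sequences), default=0)
    return [
        sequence + [padding_value] * (max_length - len(sequence))
        for sequence in sequences
    ]


class PadSequencesTest(unittest.TestCase):
    def test_pads_to_longest(self) -> None:
        self.assertEqual(pad_sequences([[1, 2, 3], [4]]), [[1, 2, 3], [4, 0, 0]])

    def test_empty_batch(self) -> None:
        self.assertEqual(pad_sequences([]), [])

    def test_does_not_modify_input(self) -> None:
        batch = [[1], [2, 3]]
        pad_sequences(batch)
        self.assertEqual(batch, [[1], [2, 3]])


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically, as a continuous integration job might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PadSequencesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

A continuous integration service would typically run the whole test suite on every push and report failures before the change is merged.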

Characteristics:

Things That We Use A Lot
- training a model
- mapping words (or characters, or labels) to indexes
- summarizing a sequence of tensors with a single tensor
Things That Require a Fair Amount of Code
- training a model
- (some ways of) summarizing a sequence of tensors with a single tensor
- some neural network modules
Things That Have Many Variations
- word embedding layers
Things That Reflect Our Higher-Level Thinking
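Mapping words to indexes is a typical example of something used often enough to deserve its own component. The sketch below is a hypothetical minimal `Vocabulary` class written for illustration; it is not the AllenNLP class of the same name, and the special tokens are placeholders.

```python
from collections import Counter
from typing import Iterable, List


class Vocabulary:
    """Hypothetical minimal token <-> index mapping with padding and unknown tokens."""

    PADDING = "<pad>"
    UNKNOWN = "<unk>"

    def __init__(self, token_sequences: Iterable[List[str]], min_count: int = 1) -> None:
        counts = Counter(token for sequence in token_sequences for token in sequence)
        self.index_to_token: List[str] = [self.PADDING, self.UNKNOWN]
        # Sort by descending count, then alphabetically, so indexes are deterministic.
        for token, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
            if count >= min_count:
                self.index_to_token.append(token)
        self.token_to_index = {
            token: index for index, token in enumerate(self.index_to_token)
        }

    def encode(self, tokens: List[str]) -> List[int]:
        """Map tokens to indexes; unseen tokens map to the UNKNOWN index."""
        unknown_index = self.token_to_index[self.UNKNOWN]
        return [self.token_to_index.get(token, unknown_index) for token in tokens]

    def decode(self, indexes: List[int]) -> List[str]:
        """Map indexes back to tokens."""
        return [self.index_to_token[index] for index in indexes]


vocab = Vocabulary([["the", "cat"], ["the", "dog"]])
encoded = vocab.encode(["the", "bird"])
decoded = vocab.decode(encoded)
```

Because the same mapping is needed for words, characters, and labels, keeping it in one tested component avoids re-implementing it in every prototype.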