OneFlow Version Update - Changelog 0.3.2

OneFlow has released version 0.3.2. The 0.3 series adds many new features, delivers better performance and a friendlier user experience, and is among the first frameworks to support CUDA 11.1.

Sub-linear memory optimization saves even more memory on top of the previous savings, cutting memory usage substantially while preserving training speed. The new Checkpoint lets you initialize weights with NumPy: if you want to experiment with a new initialization method but cannot develop a custom op, you can now compose it yourself in NumPy. The new Python Kernel lets you implement custom OneFlow ops in Python.

Ready to scroll down for the details?

Overview of Major New Features

  • Support for sub-linear memory optimization

    Enabled via flow.experimental.scope.config(checkpointing=self.checkpoint_activations), it substantially reduces memory usage. For example:

    def transformer_layer(self, name, x, *, past):
        # ...
        with flow.scope.namespace(name):
            x = flow.identity(x)
            with flow.experimental.scope.config(
                checkpointing=self.checkpoint_activations
            ):
                norm1 = norm(x, name="layernorm_1")
                # ...
  • New Checkpoint

    The new Checkpoint is far more flexible. It supports partial loading/saving, reading weight values (useful for printing and inspection), and assigning NumPy arrays to weights. Docs: https://docs.oneflow.org/basics_topics/model_load_save.html#variable

    • Load weights:

      flow.load_variables(flow.checkpoint.get(path))
      
    • Save weights:

      flow.checkpoint.save(path)
      
    • Get the value of the variable named x:

      flow.get_all_variables()['x'].numpy()
      
    • Set the value of the variable named x to the NumPy array np_arr:

      flow.load_variables({'x': np_arr})
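
    Putting these together, a minimal end-to-end sketch (the checkpoint path and the variable name "x" are assumptions for illustration; the calls are the APIs listed above):

      import numpy as np
      import oneflow as flow

      path = "./model_checkpoint"  # assumed path, for illustration

      # Load every weight stored under `path` into the current job.
      flow.load_variables(flow.checkpoint.get(path))

      # Read the variable named "x", rebuild it in NumPy, and assign it back.
      np_arr = flow.get_all_variables()["x"].numpy()
      flow.load_variables({"x": np.zeros_like(np_arr)})

      # Save all weights back to `path`.
      flow.checkpoint.save(path)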
      
  • Support for dynamic loss scale schedule

    Dynamic loss scaling automatically adjusts the loss scale factor during mixed-precision training. To enable it:

    loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
    optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
  • Support for the latest CUDA 11.1

    Install with:

    python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user
    
  • Pre-built packages with the XLA tensor compiler (supporting CUDA 10.0, 10.1, 10.2, and 11.0)

    Install with:

    python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
    

Full Changelog v0.3.0 ~ v0.3.2 (16/12/2020)

Op Fixes and Optimizations

Optimized ops and op combinations such as scalar mul by tensor, cast scale, prelu, and fused_scale_tril.

  • Dev sx xla clip #3656

  • Add UserOp::InferSbpSignature #3699

  • Fix fuse scalar mul by tensor sbp #3692

  • fix softmax condition #3675

  • slice_update op #3544

  • optimize rmsprop and lars optimizers #3809

  • add oneflow_range #3725

  • torch.gather #3602

  • skip conv2d padding dynamic test case #3813

  • Fix __hne in BinaryFuncFloorMod #3788

  • Fix bn[add]relu test case #3767

  • Make class Tensor abstract #3757

  • Add user_op::KernelCreateContext #3739

  • fix warning #3732

  • User op registry attr #3716

  • Dev refactor user op registry attr #3714

  • fix argwhere format #4010

  • Argwhere support empty blob #4009

  • Fuse cast scale #3999

  • layer_norm_grad_add_to_output #3998

  • Dev optimize prelu #3987

  • Switch identity to user op and add it to auto mixed precision clear list #3992

  • Optimize slice kernel #3989

  • Hotfix: add parallel cast to amp clear list  #3988

  • fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980

  • add combined margin cpu and fix bug #3961

  • fix pad op #3971

  • Fix constant init value #3947

  • indexed_slices_model_update handle empty tensor #3933

  • fix distribute_clone sbp #3803

  • Reshape backward issue with distribute split #3915

  • Remove NormalModelUpdateOpConf #3917

  • Dev unsorted segment sum #3731

  • Dev split like add backward #3901

  • distribute concat out dynamic false #3899

  • UserOpWrapper add HasGradTensor4OpOutput #3904

  • Unpack/Pack user op #3727

  • adam_bias_correction_learning_rate #3763

  • add flatten op implementation #3789

  • Dev enhance sort ops #3828

  • Optimize softmax cuda kernel block size #3853

  • SplitLikeOp  prefix support #3866

  • fix gather set_is_dynamic #3900

  • fix unsorted segment sum like #3898

New Ops and New Features for Existing Ops

Added ops such as polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss.
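
As one example, mish(x) = x * tanh(softplus(x)). Below is a minimal sketch in the 0.x lazy-mode API that composes this computation from existing math ops; whether the new op is also exposed directly (e.g. as flow.math.mish) should be checked against the API docs, so treat that name as an assumption:

    import numpy as np
    import oneflow as flow
    import oneflow.typing as tp

    @flow.global_function()
    def mish_job(x: tp.Numpy.Placeholder((4,))) -> tp.Numpy:
        # mish(x) = x * tanh(softplus(x))
        return flow.math.multiply(x, flow.math.tanh(flow.math.softplus(x)))

    print(mish_job(np.array([-1.0, 0.0, 1.0, 2.0], dtype=np.float32)))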

  • Add polyval op #3541

  • Add broadcast like backward #3665

  • Add cuda_pseudo_half.h #3669

  • add swish activation #3970

  • add mish activation #3972

  • Add multi_square_sum op #3977

  • TripOp add fill value #3960

  • add combined margin loss #3819

  • dynamic loss scale schedule op #3885

  • add mseloss #3893

  • LAMB support #3620

  • logical slice_assign and slice op #3647

  • Add Repeat/Acc user op #3707

  • Ssp variable proxy #3715

  • multi_count_not_finite op #3879

  • model update op add skip if #3883

  • Add triplet loss #3864

System Components

OneFlow Collective Boxing now supports NCCL All2All; Ampere-architecture CUDA devices are supported.

  • Add Nccl All2All #3538

  • Add attribute "batch_axis_non_change" to oneflow.transpose #3685

  • fix memcopy #3687

  • change url link of api docs #3677

  • Op collection #3833

  • fix pybind11 include #3876

  • Dev replace str to cfg obj in python callback #3832

  • Dev cpp instructions builder #3829

  • Dev forward declare cfg #3808

  • Fix CUDA 11.1 compiler crashes #3795

  • Bakcport bug fixes for distributed run from multi node ci #3765

  • Fix handle remote regst #3761

  • Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744

  • Dev scope attr value #3756

  • rename UserOpAttrVal to AttrValue #3752

  • refactor OpGraphPass to JobPass #3745

  • RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737

  • Log WARNING to stderr #3713

  • Use cudaMemcpyDefault #3700

  • Migrate foreigns to pybind11 #3939

  • Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997

  • OptimizerPlacementOptimization #3944

  • New checkpoint #3540

  • Sublinear memory cost by checkpointing #3976

  • Add gradients stats aggregation #3979

  • nccl enable mixed fusion #3981

  • remove serialized in python callback #3891

  • Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946

  • Add NaiveB2PSubTskGphBuilder #3942

  • disable new checkpoint by default temporarily #3943

  • Explicitly specify the SBP in NonDistributedOptimizerPass #3937

  • Add ssp variable proxy #3859

  • Dev switch error proto with cfg error proto #3858

  • New Chain #3874

  • DynamicLossScale #3886

  • Remove CheckNoCycle in chain graph #3693

  • Memory Reuse support time shape > meta shape #3796

  • OneFlow support tensor shape max dim size up to 6 #3802

  • Support Ampere devices #3806

  • Simple kernel memory bandwidth profiler #3855

Eager Mode

Fixed a series of bugs.

  • Use universal start global device id for all streams #3701

  • Ci add eager #3672

  • Fix eager mode bug #3681

  • Eager transport #3598

  • rm scope_proto symbol_id #3865

  • Replace py instruction to CFG Instruction #3773

  • refactor ParallelDescSymbol #3774

  • use proxy blob_object for boxing, add some inter-node boxing #3711

  • fix unpacked mirrored blob object shape #3703

  • Fix eager memory leak and re-enable new checkpoint #4008

  • barrier for multi node eager #3748

Python Frontend

Added support for implementing kernels in Python + NumPy, also usable in multi-node distributed settings. Docs: https://docs.oneflow.org/extended_topics/python_kernel_op.html
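
Conceptually, the compute body of such a kernel is an ordinary Python function over NumPy arrays; the sketch below shows what that looks like for a hypothetical LeakyReLU forward. The step that registers the function as a OneFlow user op is omitted here; see the linked docs for the actual registration interface:

    import numpy as np

    # Compute body of a hypothetical Python kernel: NumPy in, NumPy out.
    # Registering it as a OneFlow user op goes through the API documented
    # at the link above and is omitted from this sketch.
    def leaky_relu_forward(x: np.ndarray, alpha: float = 0.1) -> np.ndarray:
        return np.where(x > 0, x, alpha * x)

    print(leaky_relu_forward(np.array([-2.0, 0.0, 3.0], dtype=np.float32)))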

  • Dev add api rst #3695

  • add check in deconv #3835

  • fix stirng format in py35 #3878

  • fix exception in BlobObject del #3742

  • make float/double as aliases of float32/float64 #3740

  • Fix placement api doc #3638

  • Dev replace py job conf proto to cfg #3856

  • add bceloss #3804

  • add l1 loss op in python #3793

  • Py kernel2 #3686

Toolchain

More SWIG interfaces were replaced with pybind11.

  • Add api docs zzk #3680

  • Add api docs zzk #3587

  • Cfg template operator reform #3861

  • Dev use union instead of struct for oneof #3870

  • Sort cfg obj forward declare #3844

  • Dev move run instruction to pybind #3775

  • fix cfg module load error bug #3815

  • Fix oneflow worker launch in py35 #3778

  • Fix cfg sub proto mudule process bug #3729

  • Dev data onerec #3104

  • Dev compare cfg file #3717

  • remove proton not related to Instruction #3708

  • Dev switch instruction to cfg instruction #3702

  • replace ScopeProto to cfg #3816

  • Refine custom op build #3925

  • default show cpp error stack frame #3948

  • Dev replace py parallel conf proto to cfg #3810

  • optimize cfg generator to save time #3906

Build

Fixed NVCC flags; fixed CMake setting the wrong environment variable for the C++11 ABI under Red Hat GCC; fixed make -j issues that could occur during compilation; fixed the include directory going missing in manual builds.

  • fix readme #3694

  • fix missing symbol when load so #3676

  • Fix CUDA_NVCC_GENCODES #3869

  • Add info in readme about how to build oneflow in docker #3781

  • Add bazel_cache dir for XLA build #3766

  • fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754

  • Refactor build script #3698

  • fix make -j in grpc and openssl #3724

  • detect cxx11 abi availibility in cmake #3709

  • fix include files not copied #3907

CI

Improved speed and stability; added support for distributed environments.

  • test use uuid log dir #3689

  • Run check_license_and_format in every branch #3683

  • Parallel run op cases #3670

  • Run xla and pure cpu only when cuda test succeeds #3679

  • add requirements.txt for api-docs #3671

  • ci add label check workflow #3664

  • CI merge all jobs into one #3868

  • Check label every push #3863

  • Update hard coded host affiliations #3847

  • External PR skip oss steps #3843

  • ci use pull_request ev #3842

  • ci only use pull_request_target #3840

  • Add pull_request_target to allow forks access secrets when CI triggerd #3837

  • CI run when bot is requested review #3831

  • Prevent CI failure #3830

  • ci dont test 2n8c #3786

  • upload bin to oss #4000

  • larger tol for bn #3965

  • fix oss list file 100 limit #3935

  • Refine release oss url #3924

  • Build master whl once a day #3894

  • Multi node support in CI #3735

Test

Fixed the image resize test cases.

  • Fix image_test_util #3690

  • Fix image resize test #3666

  • import tensorflow in RunTensorFlowOp #3682

END

Want to join the OneFlow community discussion? Come say hi to the OneFlow assistant:

QQ: 3119703778

WeChat: OneFlowXZS

