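OneFlow has released version 0.3.2. The 0.3 series adds many features, improves performance, makes the user experience friendlier, and is among the first frameworks to support CUDA 11.1.
Sublinear memory optimization cuts memory use even further while keeping training speed unchanged. The new Checkpoint lets users initialize weights with NumPy, so anyone who wants to try a new initialization method but does not know how to develop an operator can now assemble one in NumPy. The new Python Kernel lets users implement custom operators for OneFlow in Python.
Ready to scroll down for the details?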
Overview of major new features
Sublinear memory optimization
Enable it with oneflow.experimental.scope.config(checkpointing=self.checkpoint_activations) to save a large amount of memory. For example:
def transformer_layer(self, name, x, *, past):
    # ...
    with flow.scope.namespace(name):
        x = flow.identity(x)
        with flow.experimental.scope.config(
            checkpointing=self.checkpoint_activations
        ):
            norm1 = norm(x, name="layernorm_1")
            # ...
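To give a feel for why checkpointing can make activation memory sublinear, here is a minimal pure-Python sketch. It is not OneFlow's implementation: it assumes a chain of equally sized layers split into segments of a hypothetical length `segment`, where only one activation per segment is kept during the forward pass and a single segment is recomputed at a time during the backward pass.

```python
import math

def peak_activations(num_layers, segment=None):
    """Peak number of layer activations held in memory at once.

    Without checkpointing, every layer's activation is kept for backward.
    With checkpointing, one activation per segment is kept, plus the
    activations of the single segment being recomputed during backward.
    """
    if segment is None:
        return num_layers
    num_checkpoints = math.ceil(num_layers / segment)
    return num_checkpoints + segment

n = 100
baseline = peak_activations(n)
# Segments of about sqrt(n) layers balance the two terms above.
best_segment = round(math.sqrt(n))
checkpointed = peak_activations(n, best_segment)
print(baseline, checkpointed)
```

Under these assumptions the peak grows roughly like 2 * sqrt(n) instead of n, and the price is about one extra forward pass for the recomputation.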
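The new Checkpoint
The new Checkpoint is much more flexible. It supports partial loading and saving, reading weight values (for printing and similar operations), and assigning NumPy arrays to weights. Documentation: https://docs.oneflow.org/basics_topics/model_load_save.html#variable
Load weights: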
flow.load_variables(flow.checkpoint.get(path))
Save weights:
flow.checkpoint.save(path)
Get the value of the weight whose name is x: flow.get_all_variables()['x'].numpy()
Set the value of the weight whose name is x to the NumPy array np_arr: flow.load_variables({'x': np_arr})
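As an illustration of assembling an initializer in NumPy, the following is a minimal sketch of an orthogonal initialization. The helper name `orthogonal_init`, the shape, and the seed are hypothetical choices for this example. Only the final `flow.load_variables({'x': np_arr})` call comes from the text above, and it is left as a comment so the snippet runs without OneFlow installed.

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    """Return a (rows, cols) float32 matrix with orthonormal rows or columns."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)           # q has orthonormal columns
    q = q * np.sign(np.diag(r))      # fix column signs so the result is unique
    if rows < cols:
        q = q.T
    return (gain * q).astype(np.float32)

np_arr = orthogonal_init(64, 32)

# In a OneFlow job that defines a variable named 'x' with this shape,
# the array could then be assigned as described above:
# flow.load_variables({'x': np_arr})
```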
Dynamic loss scale schedule
To enable it:
loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
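For readers unfamiliar with dynamic loss scaling, the general idea can be sketched in plain Python as below. This is a toy model of the technique, not OneFlow's code: the class name, the initial scale, the multiplier of 2, and the floor of 1.0 are all assumptions for illustration. Only the `increment_period` parameter name appears in the snippet above.

```python
import math

class DynamicLossScale:
    """Toy dynamic loss scaler: grow after a run of finite steps, shrink on overflow."""

    def __init__(self, initial_scale=2.0 ** 16, increment_period=2000, multiplier=2.0):
        self.scale = initial_scale
        self.increment_period = increment_period
        self.multiplier = multiplier
        self.good_steps = 0

    def update(self, grads):
        """Return True if the step should be applied, False if it must be skipped."""
        if all(math.isfinite(g) for g in grads):
            self.good_steps += 1
            if self.good_steps >= self.increment_period:
                self.scale *= self.multiplier
                self.good_steps = 0
            return True
        # Overflow (inf or nan): skip the update and back off the scale.
        self.scale = max(self.scale / self.multiplier, 1.0)
        self.good_steps = 0
        return False

scaler = DynamicLossScale(initial_scale=1024.0, increment_period=3)
for _ in range(3):
    scaler.update([0.1, -0.2])
grown = scaler.scale
scaler.update([float("inf")])
print(grown, scaler.scale)
```

A longer `increment_period` makes the scale grow more cautiously, which means fewer skipped steps but a slower return to a large scale after an overflow.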
Support for the latest CUDA 11.1
Install it with the following command:
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user
Prebuilt packages with the XLA tensor compiler (supporting CUDA 10, 10.1, 10.2, and 11.0)
Install one with the following command:
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
Full Changelog v0.3.0 ~ v0.3.2 (16/12/2020)
Op fixes and optimizations
Optimized Ops and Op combinations such as scalar mul by tensor, cast scale, prelu, and fused_scale_tril
Dev sx xla clip #3656
Add UserOp::InferSbpSignature #3699
Fix fuse scalar mul by tensor sbp #3692
fix softmax condition #3675
slice_update op #3544
optimize rmsprop and lars optimizers #3809
add oneflow_range #3725
torch.gather #3602
skip conv2d padding dynamic test case #3813
Fix __hne in BinaryFuncFloorMod #3788
Fix bn[add]relu test case #3767
Make class Tensor abstract #3757
Add user_op::KernelCreateContext #3739
fix warning #3732
User op registry attr #3716
Dev refactor user op registry attr #3714
fix argwhere format #4010
Argwhere support empty blob #4009
Fuse cast scale #3999
layer_norm_grad_add_to_output #3998
Dev optimize prelu #3987
Switch identity to user op and add it to auto mixed precision clear list #3992
Optimize slice kernel #3989
Hotfix: add parallel cast to amp clear list #3988
fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
add combined margin cpu and fix bug #3961
fix pad op #3971
Fix constant init value #3947
indexed_slices_model_update handle empty tensor #3933
fix distribute_clone sbp #3803
Reshape backward issue with distribute split #3915
Remove NormalModelUpdateOpConf #3917
Dev unsorted segment sum #3731
Dev split like add backward #3901
distribute concat out dynamic false #3899
UserOpWrapper add HasGradTensor4OpOutput #3904
Unpack/Pack user op #3727
adam_bias_correction_learning_rate #3763
add flatten op implementation #3789
Dev enhance sort ops #3828
Optimize softmax cuda kernel block size #3853
SplitLikeOp prefix support #3866
fix gather set_is_dynamic #3900
fix unsorted segment sum like #3898
New Ops and new features for existing Ops
Added Ops such as polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss
Add polyval op #3541
Add broadcast like backward #3665
Add cuda_pseudo_half.h #3669
add swish activation #3970
add mish activation #3972
Add multi_square_sum op #3977
TripOp add fill value #3960
add combined margin loss #3819
dynamic loss scale schedule op #3885
add mseloss #3893
LAMB support #3620
logical slice_assign and slice op #3647
Add Repeat/Acc user op #3707
Ssp variable proxy #3715
multi_count_not_finite op #3879
model update op add skip if #3883
Add triplet loss #3864
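As a reference for two of the activations listed above, here is a small NumPy sketch of the standard swish and mish definitions. It shows the mathematical functions only and says nothing about the names or signatures of the OneFlow Ops. The `beta` parameter is part of the general swish definition and is an assumption here, since the changelog does not say whether OneFlow exposes it.

```python
import numpy as np

def swish(x, beta=1.0):
    """swish(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def mish(x):
    """mish(x) = x * tanh(softplus(x)), where softplus(x) = log(1 + exp(x))."""
    # logaddexp(0, x) computes log(exp(0) + exp(x)) in a numerically stable way.
    return x * np.tanh(np.logaddexp(0.0, x))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x))
print(mish(x))
```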
System components
OneFlow Collective Boxing now supports NCCL All2All, and CUDA devices with the Ampere architecture are supported
Add Nccl All2All #3538
Add attribute "batch_axis_non_change" to oneflow.transpose #3685
fix memcopy #3687
change url link of api docs #3677
Op collection #3833
fix pybind11 include #3876
Dev replace str to cfg obj in python callback #3832
Dev cpp instructions builder #3829
Dev forward declare cfg #3808
Fix CUDA 11.1 compiler crashes #3795
Backport bug fixes for distributed run from multi node ci #3765
Fix handle remote regst #3761
Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744
Dev scope attr value #3756
rename UserOpAttrVal to AttrValue #3752
refactor OpGraphPass to JobPass #3745
RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737
Log WARNING to stderr #3713
Use cudaMemcpyDefault #3700
Migrate foreigns to pybind11 #3939
Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
OptimizerPlacementOptimization #3944
New checkpoint #3540
Sublinear memory cost by checkpointing #3976
Add gradients stats aggregation #3979
nccl enable mixed fusion #3981
remove serialized in python callback #3891
Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
Add NaiveB2PSubTskGphBuilder #3942
disable new checkpoint by default temporarily #3943
Explicitly specify the SBP in NonDistributedOptimizerPass #3937
Add ssp variable proxy #3859
Dev switch error proto with cfg error proto #3858
New Chain #3874
DynamicLossScale #3886
Remove CheckNoCycle in chain graph #3693
Memory Reuse support time shape > meta shape #3796
OneFlow support tensor shape max dim size up to 6 #3802
Support Ampere devices #3806
Simple kernel memory bandwidth profiler #3855
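For readers new to the All2All collective mentioned at the top of this section, its data movement can be simulated in a few lines of pure Python. This is only a sketch of the semantics, with a hypothetical `all_to_all` helper and lists standing in for device buffers. It does not use NCCL or OneFlow.

```python
def all_to_all(send_buffers):
    """Simulate an all-to-all exchange.

    send_buffers[i][j] is the chunk that rank i sends to rank j.
    In the result, recv[j][i] is the chunk that rank j received from rank i.
    """
    world_size = len(send_buffers)
    assert all(len(row) == world_size for row in send_buffers)
    return [[send_buffers[src][dst] for src in range(world_size)]
            for dst in range(world_size)]

send = [["a0", "a1", "a2"],
        ["b0", "b1", "b2"],
        ["c0", "c1", "c2"]]
recv = all_to_all(send)
print(recv[0])
```

Seen this way, the exchange is a transpose of the rank-by-chunk matrix.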
Eager mode
Fixed a series of bugs
Use universal start global device id for all streams #3701
Ci add eager #3672
Fix eager mode bug #3681
Eager transport #3598
rm scope_proto symbol_id #3865
Replace py instruction to CFG Instruction #3773
refactor ParallelDescSymbol #3774
use proxy blob_object for boxing, add some inter-node boxing #3711
fix unpacked mirrored blob object shape #3703
Fix eager memory leak and re-enable new checkpoint #4008
barrier for multi node eager #3748
Python frontend
Kernels can now be implemented with Python + NumPy, including in multi-node distributed scenarios. Documentation: https://docs.oneflow.org/extended_topics/python_kernel_op.html
Dev add api rst #3695
add check in deconv #3835
fix string format in py35 #3878
fix exception in BlobObject del #3742
make float/double as aliases of float32/float64 #3740
Fix placement api doc #3638
Dev replace py job conf proto to cfg #3856
add bceloss #3804
add l1 loss op in python #3793
Py kernel2 #3686
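To show the kind of computation a Python + NumPy kernel would contain, here is a minimal sketch of the forward and backward math of a sigmoid operator. The function names and argument order are hypothetical. How such functions are registered with OneFlow as an Op is described in the documentation linked at the top of this section and is not reproduced here.

```python
import numpy as np

def sigmoid_forward(x):
    """Forward pass: y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(y, dy):
    """Backward pass: dx = dy * y * (1 - y), reusing the forward output y."""
    return dy * y * (1.0 - y)

x = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
y = sigmoid_forward(x)
dx = sigmoid_backward(y, np.ones_like(y))
print(y, dx)
```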
Toolchain
More SWIG interfaces have been replaced with Pybind11
Add api docs zzk #3680
Add api docs zzk #3587
Cfg template operator reform #3861
Dev use union instead of struct for oneof #3870
Sort cfg obj forward declare #3844
Dev move run instruction to pybind #3775
fix cfg module load error bug #3815
Fix oneflow worker launch in py35 #3778
Fix cfg sub proto module process bug #3729
Dev data onerec #3104
Dev compare cfg file #3717
remove proton not related to Instruction #3708
Dev switch instruction to cfg instruction #3702
replace ScopeProto to cfg #3816
Refine custom op build #3925
default show cpp error stack frame #3948
Dev replace py parallel conf proto to cfg #3810
optimize cfg generator to save time #3906
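Build
Fixed NVCC flags; fixed CMake setting the wrong environment variable for the C++11 ABI under RedHat GCC; fixed make -j problems that could occur during compilation; fixed the include directory going missing in manual builds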
fix readme #3694
fix missing symbol when load so #3676
Fix CUDA_NVCC_GENCODES #3869
Add info in readme about how to build oneflow in docker #3781
Add bazel_cache dir for XLA build #3766
fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754
Refactor build script #3698
fix make -j in grpc and openssl #3724
detect cxx11 abi availability in cmake #3709
fix include files not copied #3907
CI
Improved speed and stability; added support for distributed environments
test use uuid log dir #3689
Run check_license_and_format in every branch #3683
Parallel run op cases #3670
Run xla and pure cpu only when cuda test succeeds #3679
add requirements.txt for api-docs #3671
ci add label check workflow #3664
CI merge all jobs into one #3868
Check label every push #3863
Update hard coded host affiliations #3847
External PR skip oss steps #3843
ci use pull_request ev #3842
ci only use pull_request_target #3840
Add pull_request_target to allow forks access secrets when CI triggered #3837
CI run when bot is requested review #3831
Prevent CI failure #3830
ci dont test 2n8c #3786
upload bin to oss #4000
larger tol for bn #3965
fix oss list file 100 limit #3935
Refine release oss url #3924
Build master whl once a day #3894
Multi node support in CI #3735
Test
Fixed the image resize test cases
Fix image_test_util #3690
Fix image resize test #3666
import tensorflow in RunTensorFlowOp #3682
Want to join the OneFlow community discussion? Reach out to the OneFlow assistant:
QQ: 3119703778
WeChat: OneFlowXZS