OneFlow Version Update - Changelog 0.3.2

OneFlow has released version 0.3.2. The 0.3 series adds many new features, delivers better performance and a friendlier user experience, and is among the first frameworks to support CUDA 11.1.

Sub-linear memory optimization saves even more memory on top of the previous savings, cutting memory usage substantially while preserving training speed. The new Checkpoint lets you initialize weights with NumPy: if you want to experiment with a new initialization method but cannot develop a custom op, you can now compose it yourself in NumPy. The new Python Kernel lets you implement custom OneFlow ops in Python.

Ready to scroll down for the details?

Overview of Major New Features

  • Support for sub-linear memory optimization

    Enabled via flow.experimental.scope.config(checkpointing=self.checkpoint_activations), it substantially reduces memory usage. For example:

    def transformer_layer(self, name, x, *, past):
        # ...
        with flow.scope.namespace(name):
            x = flow.identity(x)
            with flow.experimental.scope.config(
                checkpointing=self.checkpoint_activations
            ):
                norm1 = norm(x, name="layernorm_1")
                # ...
  • New Checkpoint

    The new Checkpoint is far more flexible. It supports partial loading/saving, reading weight values (useful for printing and inspection), and assigning NumPy arrays to weights. Docs: https://docs.oneflow.org/basics_topics/model_load_save.html#variable

    • Load weights:

      flow.load_variables(flow.checkpoint.get(path))
      
    • Save weights:

      flow.checkpoint.save(path)
      
    • Get the value of the variable named x:

      flow.get_all_variables()['x'].numpy()
      
    • Set the value of the variable named x to the NumPy array np_arr:

      flow.load_variables({'x': np_arr})
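
    Putting these together, a minimal end-to-end sketch (the checkpoint path and the variable name "x" are assumptions for illustration; the calls are the APIs listed above):

      import numpy as np
      import oneflow as flow

      path = "./model_checkpoint"  # assumed path, for illustration

      # Load every weight stored under `path` into the current job.
      flow.load_variables(flow.checkpoint.get(path))

      # Read the variable named "x", rebuild it in NumPy, and assign it back.
      np_arr = flow.get_all_variables()["x"].numpy()
      flow.load_variables({"x": np.zeros_like(np_arr)})

      # Save all weights back to `path`.
      flow.checkpoint.save(path)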
      
  • Support for dynamic loss scale schedule

    Dynamic loss scaling automatically adjusts the loss scale factor during mixed-precision training. To enable it:

    loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
    optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
  • Support for the latest CUDA 11.1

    Install with:

    python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user
    
  • Pre-built packages with the XLA tensor compiler (supporting CUDA 10.0, 10.1, 10.2, and 11.0)

    Install with:

    python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
    

Full Changelog v0.3.0 ~ v0.3.2 (16/12/2020)

Op Fixes and Optimizations

Optimized ops and op combinations such as scalar mul by tensor, cast scale, prelu, and fused_scale_tril.

  • Dev sx xla clip #3656

  • Add UserOp::InferSbpSignature #3699

  • Fix fuse scalar mul by tensor sbp #3692

  • fix softmax condition #3675

  • slice_update op #3544

  • optimize rmsprop and lars optimizers #3809

  • add oneflow_range #3725

  • torch.gather #3602

  • skip conv2d padding dynamic test case #3813

  • Fix __hne in BinaryFuncFloorMod #3788

  • Fix bn[add]relu test case #3767

  • Make class Tensor abstract #3757

  • Add user_op::KernelCreateContext #3739

  • fix warning #3732

  • User op registry attr #3716

  • Dev refactor user op registry attr #3714

  • fix argwhere format #4010

  • Argwhere support empty blob #4009

  • Fuse cast scale #3999

  • layer_norm_grad_add_to_output #3998

  • Dev optimize prelu #3987

  • Switch identity to user op and add it to auto mixed precision clear list #3992

  • Optimize slice kernel #3989

  • Hotfix: add parallel cast to amp clear list  #3988

  • fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980

  • add combined margin cpu and fix bug #3961

  • fix pad op #3971

  • Fix constant init value #3947

  • indexed_slices_model_update handle empty tensor #3933

  • fix distribute_clone sbp #3803

  • Reshape backward issue with distribute split #3915

  • Remove NormalModelUpdateOpConf #3917

  • Dev unsorted segment sum #3731

  • Dev split like add backward #3901

  • distribute concat out dynamic false #3899

  • UserOpWrapper add HasGradTensor4OpOutput #3904

  • Unpack/Pack user op #3727

  • adam_bias_correction_learning_rate #3763

  • add flatten op implementation #3789

  • Dev enhance sort ops #3828

  • Optimize softmax cuda kernel block size #3853

  • SplitLikeOp  prefix support #3866

  • fix gather set_is_dynamic #3900

  • fix unsorted segment sum like #3898

New Ops and New Features for Existing Ops

Added ops such as polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss.
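
As one example, mish(x) = x * tanh(softplus(x)). Below is a minimal sketch in the 0.x lazy-mode API that composes this computation from existing math ops; whether the new op is also exposed directly (e.g. as flow.math.mish) should be checked against the API docs, so treat that name as an assumption:

    import numpy as np
    import oneflow as flow
    import oneflow.typing as tp

    @flow.global_function()
    def mish_job(x: tp.Numpy.Placeholder((4,))) -> tp.Numpy:
        # mish(x) = x * tanh(softplus(x))
        return flow.math.multiply(x, flow.math.tanh(flow.math.softplus(x)))

    print(mish_job(np.array([-1.0, 0.0, 1.0, 2.0], dtype=np.float32)))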

  • Add polyval op #3541

  • Add broadcast like backward #3665

  • Add cuda_pseudo_half.h #3669

  • add swish activation #3970

  • add mish activation #3972

  • Add multi_square_sum op #3977

  • TripOp add fill value #3960

  • add combined margin loss #3819

  • dynamic loss scale schedule op #3885

  • add mseloss #3893

  • LAMB support #3620

  • logical slice_assign and slice op #3647

  • Add Repeat/Acc user op #3707

  • Ssp variable proxy #3715

  • multi_count_not_finite op #3879

  • model update op add skip if #3883

  • Add triplet loss #3864

System Components

OneFlow Collective Boxing now supports NCCL All2All; Ampere-architecture CUDA devices are supported.

  • Add Nccl All2All #3538

  • Add attribute "batch_axis_non_change" to oneflow.transpose #3685

  • fix memcopy #3687

  • change url link of api docs #3677

  • Op collection #3833

  • fix pybind11 include #3876

  • Dev replace str to cfg obj in python callback #3832

  • Dev cpp instructions builder #3829

  • Dev forward declare cfg #3808

  • Fix CUDA 11.1 compiler crashes #3795

  • Bakcport bug fixes for distributed run from multi node ci #3765

  • Fix handle remote regst #3761

  • Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744

  • Dev scope attr value #3756

  • rename UserOpAttrVal to AttrValue #3752

  • refactor OpGraphPass to JobPass #3745

  • RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737

  • Log WARNING to stderr #3713

  • Use cudaMemcpyDefault #3700

  • Migrate foreigns to pybind11 #3939

  • Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997

  • OptimizerPlacementOptimization #3944

  • New checkpoint #3540

  • Sublinear memory cost by checkpointing #3976

  • Add gradients stats aggregation #3979

  • nccl enable mixed fusion #3981

  • remove serialized in python callback #3891

  • Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946

  • Add NaiveB2PSubTskGphBuilder #3942

  • disable new checkpoint by default temporarily #3943

  • Explicitly specify the SBP in NonDistributedOptimizerPass #3937

  • Add ssp variable proxy #3859

  • Dev switch error proto with cfg error proto #3858

  • New Chain #3874

  • DynamicLossScale #3886

  • Remove CheckNoCycle in chain graph #3693

  • Memory Reuse support time shape > meta shape #3796

  • OneFlow support tensor shape max dim size up to 6 #3802

  • Support Ampere devices #3806

  • Simple kernel memory bandwidth profiler #3855

Eager Mode

Fixed a series of bugs.

  • Use universal start global device id for all streams #3701

  • Ci add eager #3672

  • Fix eager mode bug #3681

  • Eager transport #3598

  • rm scope_proto symbol_id #3865

  • Replace py instruction to CFG Instruction #3773

  • refactor ParallelDescSymbol #3774

  • use proxy blob_object for boxing, add some inter-node boxing #3711

  • fix unpacked mirrored blob object shape #3703

  • Fix eager memory leak and re-enable new checkpoint #4008

  • barrier for multi node eager #3748

Python Frontend

Added support for implementing kernels in Python + NumPy, also usable in multi-node distributed settings. Docs: https://docs.oneflow.org/extended_topics/python_kernel_op.html
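
Conceptually, the compute body of such a kernel is an ordinary Python function over NumPy arrays; the sketch below shows what that looks like for a hypothetical LeakyReLU forward. The step that registers the function as a OneFlow user op is omitted here; see the linked docs for the actual registration interface:

    import numpy as np

    # Compute body of a hypothetical Python kernel: NumPy in, NumPy out.
    # Registering it as a OneFlow user op goes through the API documented
    # at the link above and is omitted from this sketch.
    def leaky_relu_forward(x: np.ndarray, alpha: float = 0.1) -> np.ndarray:
        return np.where(x > 0, x, alpha * x)

    print(leaky_relu_forward(np.array([-2.0, 0.0, 3.0], dtype=np.float32)))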

  • Dev add api rst #3695

  • add check in deconv #3835

  • fix stirng format in py35 #3878

  • fix exception in BlobObject del #3742

  • make float/double as aliases of float32/float64 #3740

  • Fix placement api doc #3638

  • Dev replace py job conf proto to cfg #3856

  • add bceloss #3804

  • add l1 loss op in python #3793

  • Py kernel2 #3686

Toolchain

More SWIG interfaces were replaced with pybind11.

  • Add api docs zzk #3680

  • Add api docs zzk #3587

  • Cfg template operator reform #3861

  • Dev use union instead of struct for oneof #3870

  • Sort cfg obj forward declare #3844

  • Dev move run instruction to pybind #3775

  • fix cfg module load error bug #3815

  • Fix oneflow worker launch in py35 #3778

  • Fix cfg sub proto mudule process bug #3729

  • Dev data onerec #3104

  • Dev compare cfg file #3717

  • remove proton not related to Instruction #3708

  • Dev switch instruction to cfg instruction #3702

  • replace ScopeProto to cfg #3816

  • Refine custom op build #3925

  • default show cpp error stack frame #3948

  • Dev replace py parallel conf proto to cfg #3810

  • optimize cfg generator to save time #3906

Build

Fixed NVCC flags; fixed CMake setting the wrong environment variable for the C++11 ABI under Red Hat GCC; fixed make -j issues that could occur during compilation; fixed the include directory going missing in manual builds.

  • fix readme #3694

  • fix missing symbol when load so #3676

  • Fix CUDA_NVCC_GENCODES #3869

  • Add info in readme about how to build oneflow in docker #3781

  • Add bazel_cache dir for XLA build #3766

  • fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754

  • Refactor build script #3698

  • fix make -j in grpc and openssl #3724

  • detect cxx11 abi availibility in cmake #3709

  • fix include files not copied #3907

CI

Improved speed and stability; added support for distributed environments.

  • test use uuid log dir #3689

  • Run check_license_and_format in every branch #3683

  • Parallel run op cases #3670

  • Run xla and pure cpu only when cuda test succeeds #3679

  • add requirements.txt for api-docs #3671

  • ci add label check workflow #3664

  • CI merge all jobs into one #3868

  • Check label every push #3863

  • Update hard coded host affiliations #3847

  • External PR skip oss steps #3843

  • ci use pull_request ev #3842

  • ci only use pull_request_target #3840

  • Add pull_request_target to allow forks access secrets when CI triggerd #3837

  • CI run when bot is requested review #3831

  • Prevent CI failure #3830

  • ci dont test 2n8c #3786

  • upload bin to oss #4000

  • larger tol for bn #3965

  • fix oss list file 100 limit #3935

  • Refine release oss url #3924

  • Build master whl once a day #3894

  • Multi node support in CI #3735

Test

Fixed the image resize test cases.

  • Fix image_test_util #3690

  • Fix image resize test #3666

  • import tensorflow in RunTensorFlowOp #3682

END

Want to join the OneFlow community discussion? Come say hi to the OneFlow assistant:

QQ: 3119703778

WeChat: OneFlowXZS

