Weekly Report, 2024.7.21

Table of Contents

ABSTRACT

I. Literature Reading

1. Title

2. Abstract

3. Innovations

4. Paper Analysis

(1) Introduction

(2) MLP vs KAN

(3) KAN vs MLP Experiments

(4) KANs Are Interpretable

5. Conclusion

II. KAN Model Structure Code


ABSTRACT

This week I read the paper "KAN: Kolmogorov–Arnold Networks". Inspired by the Kolmogorov–Arnold representation theorem, the paper proposes Kolmogorov–Arnold Networks (KANs) as an alternative to multilayer perceptrons (MLPs): whereas an MLP places fixed activation functions on its neurons, a KAN places learnable activation functions on its edges ("weights"). I also reproduced the KAN model architecture code, which deepened my understanding.

I. Literature Reading

1. Title

Title: KAN: Kolmogorov–Arnold Networks

Link: 2404.19756 (arxiv.org)

2. Abstract

Inspired by the Kolmogorov–Arnold representation theorem, the paper proposes Kolmogorov–Arnold Networks (KANs) as an alternative to multilayer perceptrons (MLPs). While MLPs have fixed activation functions on their neurons ("nodes"), KANs have learnable activation functions on their edges ("weights"). KANs have no linear weight matrices at all: every weight parameter is replaced by a learnable univariate function parameterized as a spline. This change makes KANs superior to MLPs in accuracy and interpretability on small-scale AI + Science tasks. For interpretability, KANs can be intuitively visualized and can easily interact with human users.

3. Innovations

1. The original Kolmogorov–Arnold representation is generalized to arbitrary widths and depths.

2. Extensive empirical experiments are used to highlight the potential of KANs for AI + Science, owing to their accuracy and interpretability.

4. Paper Analysis

(1) Introduction

Like an MLP, a KAN is fully connected. However, while an MLP places fixed activation functions on its nodes ("neurons"), a KAN places learnable activation functions on its edges ("weights"), as illustrated in the paper. Consequently, a KAN has no linear weight matrices at all: each weight parameter is replaced by a learnable 1D function parameterized as a spline. The nodes of a KAN simply sum their incoming signals without applying any nonlinearity. One might worry that KANs are prohibitively expensive, since every weight parameter of an MLP becomes a spline function in a KAN. Fortunately, KANs usually admit much smaller computation graphs than MLPs. A KAN is essentially a combination of splines and MLPs that exploits the strengths of both while avoiding their respective weaknesses: splines are accurate for low-dimensional functions, easy to adjust locally, and able to switch between different resolutions. KANs have MLPs on the outside and splines on the inside, so they can not only learn features (thanks to their outer similarity to MLPs) but also optimize those learned features to high accuracy (thanks to their inner similarity to splines).

(2) MLP vs KAN

MLP model description:

Universal approximation theorem: the theoretical basis of the multilayer perceptron. It states that a neural network with at least one hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units.

Shallow model formula: f(x) \approx \sum_{i=1}^{N(\epsilon)} a_i \, \sigma(w_i \cdot x + b_i), where \sigma is a fixed activation function and a_i, w_i, b_i are learnable coefficients, weights, and biases.

Shallow model: the paper's figure shows a shallow network with fixed activation functions on the nodes and learnable weights on the edges.

Deep model formula: MLP(x) = (W_3 \circ \sigma_2 \circ W_2 \circ \sigma_1 \circ W_1)(x), a composition of several weight matrices and nonlinear activation functions.

Deep model: the figure shows a deep structure consisting of multiple weight layers and activation functions.
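As a point of reference for the comparison that follows, here is a minimal PyTorch sketch of the deep MLP form MLP(x) = (W_3 \circ \sigma_2 \circ W_2 \circ \sigma_1 \circ W_1)(x); the layer widths are illustrative placeholders, not values from the paper.

import torch
import torch.nn as nn

# Three weight layers (W1, W2, W3) with fixed nonlinearities (sigma) on the nodes.
# The widths 4 -> 16 -> 16 -> 1 are illustrative only.
mlp = nn.Sequential(
    nn.Linear(4, 16),   # W1
    nn.ReLU(),          # sigma_1 (fixed activation)
    nn.Linear(16, 16),  # W2
    nn.ReLU(),          # sigma_2
    nn.Linear(16, 1),   # W3
)

x = torch.randn(8, 4)   # a batch of 8 samples with 4 input features
y = mlp(x)              # shape (8, 1)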

KAN model description:

Kolmogorov–Arnold representation theorem: the theoretical basis of KANs. It states that any multivariate continuous function can be written as a finite composition of univariate continuous functions and addition.

Model formula: f(x) = \sum_{q=1}^{2n+1} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right), where in a KAN the outer functions \Phi_q and the inner functions \phi_{q,p} are learnable activation functions.

Model structure: the paper's figure shows a KAN with learnable activation functions on the edges and simple summation at the nodes, a structure richer than that of an MLP.

Deep structure: the figure shows a deep KAN consisting of multiple layers, each made of nonlinear, learnable activation functions.
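To make "a learnable activation function on an edge" concrete, below is a minimal sketch of a single KAN edge in PyTorch, written the way the paper describes it: a residual basis function b(x) = silu(x) plus a B-spline whose coefficients are learnable. The B-spline basis is evaluated with the standard Cox–de Boor recursion; the grid size, grid range, and scale initializations are illustrative assumptions, and the real implementation (the KANLayer code in Section II) vectorizes this over all in_dim × out_dim edges at once.

import torch
import torch.nn as nn

def bspline_basis(x, grid, k):
    # Cox-de Boor recursion: evaluate B-spline bases of order k on an extended knot vector.
    # x: (batch,), grid: (G + 2k + 1,)  ->  returns (batch, G + k)
    x = x[:, None]
    B = ((x >= grid[:-1]) & (x < grid[1:])).to(x.dtype)      # order-0 bases (interval indicators)
    for p in range(1, k + 1):
        left = (x - grid[:-(p + 1)]) / (grid[p:-1] - grid[:-(p + 1)])
        right = (grid[p + 1:] - x) / (grid[p + 1:] - grid[1:-p])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

class EdgeActivation(nn.Module):
    """One learnable activation phi(x) on a single edge:
    phi(x) = w_b * silu(x) + w_s * sum_i c_i * B_i(x)."""
    def __init__(self, num=5, k=3, grid_range=(-1.0, 1.0)):
        super().__init__()
        h = (grid_range[1] - grid_range[0]) / num
        # uniform grid with `num` intervals, extended by k knots on each side (cf. extend_grid)
        grid = torch.arange(-k, num + k + 1) * h + grid_range[0]
        self.register_buffer("grid", grid)
        self.k = k
        self.coef = nn.Parameter(torch.zeros(num + k))   # spline coefficients c_i (real code adds small noise)
        self.w_b = nn.Parameter(torch.ones(()))          # scale of the residual silu(x)
        self.w_s = nn.Parameter(torch.ones(()))          # scale of the spline part

    def forward(self, x):                                # x: (batch,)
        spline = bspline_basis(x, self.grid, self.k) @ self.coef
        return self.w_b * torch.nn.functional.silu(x) + self.w_s * spline

phi = EdgeActivation()
out = phi(torch.linspace(-0.99, 0.99, 5))   # one learnable 1D function, evaluated at 5 points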

(3) KAN vs MLP Experiments

Five examples known to admit smooth KA representations are constructed (a code sketch defining them follows the list):

(1) f(x) = J_0(20x): a univariate function, so it can be represented by a single spline, i.e. a [1, 1] KAN.

(2) f(x, y) = \exp(\sin(\pi x) + y^2): known to be exactly representable by a [2, 1, 1] KAN.

(3) f(x, y) = xy: known to be exactly representable by a [2, 2, 1] KAN.

(4) f(x_1, \ldots, x_{100}) = \exp\left(\frac{1}{100}\sum_{i=1}^{100}\sin^2\left(\frac{\pi x_i}{2}\right)\right): a high-dimensional function, exactly representable by a [100, 1, 1] KAN.

(5) f(x_1, x_2, x_3, x_4) = \exp\left(\frac{1}{2}\left(\sin(\pi(x_1^2 + x_2^2)) + \sin(\pi(x_3^2 + x_4^2))\right)\right): representable by a [4, 4, 2, 1] KAN.
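For concreteness, the sketch below writes the five target functions in PyTorch so synthetic training data can be generated. The sampling range and sample count are illustrative assumptions, and it assumes a recent PyTorch where torch.special.bessel_j0 is available (otherwise scipy.special.j0 can be substituted for the Bessel function).

import math
import torch

# The five target functions with known smooth KA representations (see the list above).
f1 = lambda x: torch.special.bessel_j0(20 * x[:, 0])                                # [1, 1] KAN
f2 = lambda x: torch.exp(torch.sin(math.pi * x[:, 0]) + x[:, 1] ** 2)               # [2, 1, 1] KAN
f3 = lambda x: x[:, 0] * x[:, 1]                                                    # [2, 2, 1] KAN
f4 = lambda x: torch.exp(torch.sin(math.pi * x / 2).pow(2).mean(dim=1))             # [100, 1, 1] KAN
f5 = lambda x: torch.exp(0.5 * (torch.sin(math.pi * (x[:, 0]**2 + x[:, 1]**2))
                                + torch.sin(math.pi * (x[:, 2]**2 + x[:, 3]**2))))  # [4, 4, 2, 1] KAN

def make_dataset(f, in_dim, n_samples=1000):
    """Illustrative synthetic data: inputs sampled uniformly from [-1, 1]."""
    x = torch.rand(n_samples, in_dim) * 2 - 1
    return x, f(x)

x_train, y_train = make_dataset(f2, in_dim=2)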

The paper trains Kolmogorov–Arnold networks (KANs) with a schedule of progressively finer grids (3, 5, 10, 20, 50, 100, 200, 500, 1000 grid points), refining the grid every 200 steps. Multilayer perceptrons (MLPs) of various depths and widths are trained as baselines. Both model families are trained with the LBFGS optimizer for 1800 steps in total. Plotting test RMSE against the number of model parameters (Figure 3.1 of the paper) shows that KANs scale better with parameter count than MLPs, especially on the high-dimensional examples. For comparison, the figure also plots the red dashed lines predicted by the KAN theory (\alpha = k + 1 = 4) and the black dashed lines predicted by Sharma & Kaplan's theory (\alpha = (k+1)/d = 4/d). KANs nearly saturate the steeper red lines, whereas MLPs struggle even to reach the slower black lines and quickly plateau. In the last example, the two-layer KAN of shape [4, 9, 1] performs much worse than the three-layer KAN of shape [4, 2, 2, 1], highlighting the greater expressive power of deeper KANs; the same holds for MLPs, where deeper networks are more expressive than shallower ones. The experiments use a basic training setup: both KANs and MLPs are trained with LBFGS, without advanced techniques such as switching between Adam and LBFGS or using boosting methods. Comparing KANs and MLPs under more advanced settings is left to future work.
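A rough sketch of this training schedule is given below, reusing x_train and y_train from the data sketch above. Every 200 steps the grid is refined to the next size in the list, and the LBFGS optimizer is rebuilt after each refinement because the spline coefficient tensors change shape. build_kan and init_from_parent are hypothetical placeholders for "create a KAN with the given grid size" and "initialize it from the coarser model" (compare initialize_grid_from_parent in the KANLayer code of Section II); they are not the paper's actual code.

import torch

grids = [3, 5, 10, 20, 50, 100, 200, 500, 1000]   # grid sizes used in the paper's schedule
steps_per_grid = 200                              # 9 grids x 200 steps = 1800 LBFGS steps in total

model = None
for g in grids:
    finer = build_kan(grid_size=g)                  # hypothetical: build a KAN with a finer grid
    if model is not None:
        init_from_parent(finer, model, x_train)     # hypothetical: warm-start from the coarser KAN
    model = finer

    # torch.optim.LBFGS requires a closure that re-evaluates the loss at each step.
    opt = torch.optim.LBFGS(model.parameters(), lr=1.0, line_search_fn="strong_wolfe")
    for _ in range(steps_per_grid):
        def closure():
            opt.zero_grad()
            loss = torch.mean((model(x_train) - y_train) ** 2)   # MSE; the paper reports test RMSE
            loss.backward()
            return loss
        opt.step(closure)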

(4) KANs Are Interpretable

1. KANs are based on the Kolmogorov–Arnold representation theorem, which states that any multivariate continuous function can be expressed as a composition of a finite number of univariate functions. Concretely: (1) univariate decomposition: by breaking a high-dimensional function into compositions of univariate functions, KANs make the contribution of every step and every function transparent and understandable; (2) each node performs only a simple summation, so every step of the computation graph is easy to follow and explain.

2. KANs use spline functions as activation functions; these are learnable and easy to interpret, since the shape and behaviour of each function can be visualized directly (the single-edge sketch after the KAN model description above shows this parameterization). Unlike traditional neural networks with fixed activation functions, KANs learn their activation functions, so the specific role and influence of each connection can be described precisely.

3. Through automatic pruning during training, KANs can remove unimportant nodes and edges and thereby simplify the computation graph. The resulting model is more compact and easier to understand: by scoring the importance of each node and edge, researchers can see directly which parts matter most for the model's output.

4. Application to knot theory: KANs can identify invariants of knots and, in an unsupervised setting, rediscover known mathematical relations. This suggests that KANs can automatically discover and explain complex mathematical structure without relying on large amounts of prior knowledge.

Using KANs, the paper rediscovers three mathematical relations in the knot dataset.

Anderson localization: KANs are used to extract mobility edges in physical models, generating symbolic formulas from numerical data. These symbolic results make the underlying physics more intuitive and easier to interpret.

5. Symbolic regression and formula discovery: when handling complex datasets and special functions, KANs can automatically produce highly accurate symbolic formulas. This not only improves model accuracy but also strengthens the interpretability of the results: through training and simplification, KANs can rediscover complex mathematical and physical laws and express them in readable symbolic form (a workflow sketch follows below).
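As an illustration of this simplify-then-symbolize workflow, here is a hedged sketch using the paper's pykan package. The calls below (KAN, create_dataset, train, prune, auto_symbolic, symbolic_formula) follow the project's early tutorials and should be treated as assumptions; later releases may rename them (for example, train became fit), so consult the documentation of the installed version.

import math
import torch
from kan import *   # the tutorials expose KAN and create_dataset this way

# Target function for a toy symbolic-regression run: exp(sin(pi*x1) + x2^2)
f = lambda x: torch.exp(torch.sin(math.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

model = KAN(width=[2, 5, 1], grid=5, k=3)                 # start from a deliberately oversized KAN
model.train(dataset, opt="LBFGS", steps=50, lamb=0.01)    # sparsity penalty (lamb) encourages pruning
model = model.prune()                                     # remove unimportant nodes and edges
model.train(dataset, opt="LBFGS", steps=50)               # retrain the smaller graph
model.auto_symbolic()                                     # snap each learned spline to a symbolic function
print(model.symbolic_formula())                           # recover a closed-form expression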

5. Conclusion

KANs are grounded in the Kolmogorov–Arnold representation theorem, which has been studied extensively but previously only for a specific shape and depth. KANs currently train slowly, mainly because the different activation functions cannot exploit batched computation. Although KANs are typically about 10x slower to train than MLPs, this is a problem that future work can address; for tasks where fast training matters most, MLPs remain the better choice. Where interpretability and accuracy are the priorities, however, KANs show strong potential, especially for small-scale AI + Science problems.

II. KAN Model Structure Code

The following is my reproduction of the KANLayer class, the basic building block of a KAN: each (input, output) pair of the layer carries a learnable activation made of a residual function b(x) plus a spline.

import torch
import torch.nn as nn
import numpy as np

# Helper functions (coef2curve, curve2coef, extend_grid, sparse_mask) come from the
# pykan package's spline/utils modules; this file mirrors its KANLayer implementation.
from .spline import *
from .utils import sparse_mask


class KANLayer(nn.Module):
    """
    KANLayer class
    

    Attributes:
    -----------
        in_dim: int
            input dimension
        out_dim: int
            output dimension
        k: int
            the piecewise polynomial order of splines
        grid: 2D torch.float
            grid points
        noises: 2D torch.float
            injected noises to splines at initialization (to break degeneracy)
        coef: 2D torch.tensor
            coefficients of B-spline bases
        scale_base: 1D torch.float
            magnitude of the residual function b(x)
        scale_sp: 1D torch.float
            magnitude of the spline function spline(x)
        base_fun: fun
            residual function b(x)
        mask: 1D torch.float
            mask of spline functions. setting some element of the mask to zero means setting the corresponding activation to zero function.
        grid_eps: float in [0,1]
            a hyperparameter used in update_grid_from_samples. When grid_eps = 0, the grid is uniform; when grid_eps = 1, the grid is partitioned using percentiles of samples. 0 < grid_eps < 1 interpolates between the two extremes.
        device: str
            device
    
    Methods:
    --------
        __init__():
            initialize a KANLayer
        forward():
            forward 
        update_grid_from_samples():
            update grids based on samples' incoming activations
        initialize_grid_from_parent():
            initialize grids from another model
        get_subset():
            get subset of the KANLayer (used for pruning)
    """

    def __init__(self, in_dim=3, out_dim=2, num=5, k=3, noise_scale=0.1, scale_base=1.0, scale_sp=1.0, base_fun=torch.nn.SiLU(), grid_eps=0.02, grid_range=[-1, 1], sp_trainable=True, sb_trainable=True, save_plot_data = True, device='cpu', sparse_init=False):
        '''
        initialize a KANLayer
        
        Args:
        -----
            in_dim : int
                input dimension. Default: 3.
            out_dim : int
                output dimension. Default: 2.
            num : int
                the number of grid intervals = G. Default: 5.
            k : int
                the order of piecewise polynomial. Default: 3.
            noise_scale : float
                the scale of noise injected at initialization. Default: 0.1.
            scale_base : float
                the scale of the residual function b(x). Default: 1.0.
            scale_sp : float
                the scale of the spline function spline(x). Default: 1.0.
            base_fun : function
                residual function b(x). Default: torch.nn.SiLU()
            grid_eps : float
                When grid_eps = 0, the grid is uniform; when grid_eps = 1, the grid is partitioned using percentiles of samples. 0 < grid_eps < 1 interpolates between the two extremes. Default: 0.02.
            grid_range : list/np.array of shape (2,)
                setting the range of grids. Default: [-1,1].
            sp_trainable : bool
                If true, scale_sp is trainable. Default: True.
            sb_trainable : bool
                If true, scale_base is trainable. Default: True.
            device : str
                device
            
        Returns:
        --------
            self
            
        Example
        -------
        >>> model = KANLayer(in_dim=3, out_dim=5)
        >>> (model.in_dim, model.out_dim)
        (3, 5)
        '''
        super(KANLayer, self).__init__()
        # size 
        self.out_dim = out_dim
        self.in_dim = in_dim
        self.num = num
        self.k = k

        # grid shape after extension: (in_dim, G + 2k + 1)
        
        grid = torch.linspace(grid_range[0], grid_range[1], steps=num + 1)[None,:].expand(self.in_dim, num+1)
        grid = extend_grid(grid, k_extend=k)
        self.grid = torch.nn.Parameter(grid).requires_grad_(False)
        noises = (torch.rand(self.num+1, self.in_dim, self.out_dim) - 1 / 2) * noise_scale / num
        noises = noises.to(device)
        # coef shape: (in_dim, out_dim, G + k)
        self.coef = torch.nn.Parameter(curve2coef(self.grid[:,k:-k].permute(1,0), noises, self.grid, k, device))
        if sparse_init:
            mask = sparse_mask(in_dim, out_dim)
        else:
            mask = 1.

        self.scale_base = torch.nn.Parameter(torch.ones(in_dim, out_dim, device=device) * scale_base * mask).requires_grad_(sb_trainable)  # magnitude of the residual function b(x)
        self.scale_sp = torch.nn.Parameter(torch.ones(in_dim, out_dim, device=device) * scale_sp * mask).requires_grad_(sp_trainable)  # magnitude of the spline function
        self.base_fun = base_fun

        self.mask = torch.nn.Parameter(torch.ones(in_dim, out_dim, device=device)).requires_grad_(False)
        self.grid_eps = grid_eps
        
        self.device = device

    def forward(self, x):
        '''
        KANLayer forward given input x
        
        Args:
        -----
            x : 2D torch.float
                inputs, shape (number of samples, input dimension)
            
        Returns:
        --------
            y : 2D torch.float
                outputs, shape (number of samples, output dimension)
            preacts : 3D torch.float
                fan out x into activations, shape (number of samples, output dimension, input dimension)
            postacts : 3D torch.float
                the outputs of activation functions with preacts as inputs
            postspline : 3D torch.float
                the outputs of spline functions with preacts as inputs
        
        Example
        -------
        >>> model = KANLayer(in_dim=3, out_dim=5)
        >>> x = torch.normal(0,1,size=(100,3))
        >>> y, preacts, postacts, postspline = model(x)
        >>> y.shape, preacts.shape, postacts.shape, postspline.shape
        (torch.Size([100, 5]),
         torch.Size([100, 5, 3]),
         torch.Size([100, 5, 3]),
         torch.Size([100, 5, 3]))
        '''
        batch = x.shape[0]
        # replicate the input for every output unit: preacts has shape (batch, out_dim, in_dim)
        preacts = x[:,None,:].clone().expand(batch, self.out_dim, self.in_dim)
            
        base = self.base_fun(x) # (batch, in_dim)
        y = coef2curve(x_eval=x, grid=self.grid, coef=self.coef, k=self.k, device=self.device)  # y shape: (batch, in_dim, out_dim)
        
        postspline = y.clone().permute(0,2,1) # postspline shape: (batch, out_dim, in_dim)
            
        # combine the residual branch b(x) and the spline branch, each with its own learnable scale
        y = self.scale_base[None,:,:] * base[:,:,None] + self.scale_sp[None,:,:] * y
        y = self.mask[None,:,:] * y  # zero out pruned activations
        
        postacts = y.clone().permute(0,2,1)
            
        y = torch.sum(y, dim=1)  # shape (batch, out_dim)
        return y, preacts, postacts, postspline

    def update_grid_from_samples(self, x):
        '''
        update grid from samples
        
        Args:
        -----
            x : 2D torch.float
                inputs, shape (number of samples, input dimension)
            
        Returns:
        --------
            None
        
        Example
        -------
        >>> model = KANLayer(in_dim=1, out_dim=1, num=5, k=3)
        >>> print(model.grid.data)
        >>> x = torch.linspace(-3,3,steps=100)[:,None]
        >>> model.update_grid_from_samples(x)
        >>> print(model.grid.data)
        tensor([[-1.0000, -0.6000, -0.2000,  0.2000,  0.6000,  1.0000]])
        tensor([[-3.0002, -1.7882, -0.5763,  0.6357,  1.8476,  3.0002]])
        '''
        batch = x.shape[0]
        # sort the samples along each input dimension and evaluate the current splines there
        x_pos = torch.sort(x, dim=0)[0]
        y_eval = coef2curve(x_pos, self.grid, self.coef, self.k, device=self.device)
        num_interval = self.grid.shape[1] - 1 - 2*self.k
        ids = [int(batch / num_interval * i) for i in range(num_interval)] + [-1]
        grid_adaptive = x_pos[ids, :].permute(1,0)
        margin = 0.01
        h = (grid_adaptive[:,[-1]] - grid_adaptive[:,[0]])/num_interval
        grid_uniform = grid_adaptive[:,[0]] + h * torch.arange(num_interval+1,).to(self.device)[None, :]
        grid = self.grid_eps * grid_uniform + (1 - self.grid_eps) * grid_adaptive
        self.grid.data = extend_grid(grid, k_extend=self.k)
        self.coef.data = curve2coef(x_pos, y_eval, self.grid, self.k, device=self.device)

    def initialize_grid_from_parent(self, parent, x):
        '''
        update grid from a parent KANLayer & samples
        
        Args:
        -----
            parent : KANLayer
                a parent KANLayer (whose grid is usually coarser than the current model)
            x : 2D torch.float
                inputs, shape (number of samples, input dimension)
            
        Returns:
        --------
            None
          
        Example
        -------
        >>> batch = 100
        >>> parent_model = KANLayer(in_dim=1, out_dim=1, num=5, k=3)
        >>> print(parent_model.grid.data)
        >>> model = KANLayer(in_dim=1, out_dim=1, num=10, k=3)
        >>> x = torch.normal(0,1,size=(batch, 1))
        >>> model.initialize_grid_from_parent(parent_model, x)
        >>> print(model.grid.data)
        tensor([[-1.0000, -0.6000, -0.2000,  0.2000,  0.6000,  1.0000]])
        tensor([[-1.0000, -0.8000, -0.6000, -0.4000, -0.2000,  0.0000,  0.2000,  0.4000,
          0.6000,  0.8000,  1.0000]])
        '''
        batch = x.shape[0]
        x_eval = x  # evaluate the parent's (coarser) splines at the sample positions
        pgrid = parent.grid # (in_dim, G+2*k+1)
        pk = parent.k
        y_eval = coef2curve(x_eval, pgrid, parent.coef, pk, device=self.device)
        # lay a uniform grid with self.num intervals over the parent's active range, then re-extend it
        h = (pgrid[:,[-pk]] - pgrid[:,[pk]])/self.num
        grid = pgrid[:,[pk]] + torch.arange(self.num+1, device=x.device) * h
        grid = extend_grid(grid, k_extend=self.k)
        self.grid.data = grid
        self.coef.data = curve2coef(x_eval, y_eval, self.grid, self.k, self.device)

    def get_subset(self, in_id, out_id):
        '''
        get a smaller KANLayer from a larger KANLayer (used for pruning)
        
        Args:
        -----
            in_id : list
                id of selected input neurons
            out_id : list
                id of selected output neurons
            
        Returns:
        --------
            spb : KANLayer
            
        Example
        -------
        >>> kanlayer_large = KANLayer(in_dim=10, out_dim=10, num=5, k=3)
        >>> kanlayer_small = kanlayer_large.get_subset([0,9],[1,2,3])
        >>> kanlayer_small.in_dim, kanlayer_small.out_dim
        (2, 3)
        '''
        spb = KANLayer(len(in_id), len(out_id), self.num, self.k, base_fun=self.base_fun, device=self.device)
        spb.grid.data = self.grid[in_id]
        spb.coef.data = self.coef[in_id][:,out_id]
        spb.scale_base.data = self.scale_base[in_id][:,out_id]
        spb.scale_sp.data = self.scale_sp[in_id][:,out_id]
        spb.mask.data = self.mask[in_id][:,out_id]

        spb.in_dim = len(in_id)
        spb.out_dim = len(out_id)
        return spb
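
Finally, a short usage sketch consistent with the docstrings above; it assumes the spline helpers imported at the top of the file (coef2curve, curve2coef, extend_grid) are available, as they are inside the pykan package.

# Example usage, mirroring the docstring examples.
model = KANLayer(in_dim=3, out_dim=5, num=5, k=3)
x = torch.normal(0, 1, size=(100, 3))
y, preacts, postacts, postspline = model(x)
print(y.shape, preacts.shape, postacts.shape, postspline.shape)
# torch.Size([100, 5]) torch.Size([100, 5, 3]) torch.Size([100, 5, 3]) torch.Size([100, 5, 3])

# Adapt the spline grids to the data distribution, then extract a pruned sub-layer.
model.update_grid_from_samples(x)
small = model.get_subset(in_id=[0, 2], out_id=[1, 2, 3])
print(small.in_dim, small.out_dim)   # 2 3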
