【AI特训营】：柯西分布 Paddle API实现

最新推荐文章于 2023-07-11 13:06:43 发布

AI Studio

最新推荐文章于 2023-07-11 13:06:43 发布

阅读量1k

点赞数

文章标签：人工智能 paddle python

原文链接：https://aistudio.baidu.com/aistudio/projectdetail/5777256?forkThirdPart=1

版权

★★★ 本文源自AlStudio社区精品项目，【点击此处】查看更多精品内容 >>>

柯西(Cauchy)分布算子开发

本次开发是由飞桨黑客马拉松中的题目进行进一步详细的分析

任务解析

此任务的目标是在 Paddle 框架中，基于现有概率分布方案进行扩展，新增 Cauchy API， API调用 paddle.distribution.Cauchy。
柯西分布是一个数学期望不存在的连续型概率分布。当随机变量X满足它的概率密度函数时，称X服从柯西分布.
柯西分布也叫作柯西一洛伦兹分布，它是以奥古斯丁-路易-柯西与亨德里克-洛伦兹名字命名的连续概率分布，如概述图所示。其概率密度函数为

式中： ${x_0}$ 为定义分布峰值位置的位置参数； ${\lambda}$ 为最大值一半处的一半宽度的尺度参数。
柯西分布的标注表达式为：

柯西函数的特点：

数学期望不存在
方差不存在
高阶矩均不存在
柯西分布具有可加性
柯西分布具有倒数性质

下面是使用python画出的一个柯西分布图：

%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
x = np.random.standard_cauchy(size=1000)
plt.hist(x,100,range=(-10,10))
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-x9T6z2AA-1679471461906)(main_files/main_1_0.png)]

功能目标

本项目的目的是为Paddle贡献柯西分布的算子，因此，需要实现分布算子的常用功能，Cauchy 用于 Cauchy 分布的概率统计与随机采样。API具体包含如下方法：

功能：Creates a Laplace distribution parameterized by loc and scale.

mean计算均值；
mode 众数；
variance计算方差；
stddev计算标准偏差
sample随机采样；
rsample 重参数化采样；
prob 概率密度；
log_prob对数概率密度；
entropy 熵计算；
cdf 累积分布函数(Cumulative Distribution Function)
icdf 逆累积分布函数
kl_divergence 返回两个分布之间的kl散度

实现方案

本项目基于paddle.distribution API基类进行开发。 class API 中的具体实现（部分方法已完成开发，故直接使用源代码），该api有两个参数：位置参数self.loc, 尺度参数self.scale

mean计算均值；paddle.full(shape, nan, dtype=self.loc.dtype)，由于柯西分布没有均值，故返回nan
mode 众数；self.loc
variance计算方差；paddle.full(shape, inf, dtype=self.loc.dtype)，由于柯西分布没有方差，故返回nan
stddev计算标准偏差 (2 ** 0.5) * self.scale
sample随机采样； with paddle.no_grad(): return self.rsample(shape)

rsample 重参数化采样；

def rsample(self, sample_shape):  
shape = sample_shape or (1,)  
eps = paddle.to_tensor(numpy.random.standard_cauchy(shape), dtype=self.loc.dtype)  
return self.loc + eps * self.scale

prob 概率密度；-math.log(math.pi) - self.scale.log() - (1 + ((value - self.loc) / self.scale)**2).log()

log_prob对数概率密度；

loc, scale, value = self._validate_value(value)
tmp = 1 + ((value - loc) / scale)**2
tmp = paddle.cast(paddle.to_tensor(tmp), value.dtype)
scale = paddle.cast(paddle.to_tensor(scale), value.dtype)
loc = paddle.cast(paddle.to_tensor(loc), value.dtype)
return -math.log(math.pi) - paddle.log(scale) - paddle.log(tmp)

entropy 熵计算；math.log(4 * math.pi) + self.scale.log()
cdf 累积分布函数(Cumulative Distribution Function)paddle.atan((value - self.loc) / self.scale) / math.pi + 0.5)
icdf 逆累积分布函数paddle.tan(math.pi * (value - 0.5)) * self.scale + self.loc
注册KL散度参照laplace散度注册

同时在paddle/distribution/kl.py 中注册_kl_cauchy_cauchy函数，使用时可直接调用kl_divergence计算cauchy分布之间的kl散度。

算子测试

测试考虑的 case 如下：

根据api类各个方法及特性传参的不同，把单测分成三个部分：测试分布的特性（无需额外参数）、测试分布的概率密度函数（需要传值）以及测试KL散度（需要传入一个实例）。

测试Cauchy分布的特性
测试方法：该部分主要测试分布的均值、方差、熵等特征。类TestCauchy继承unittest.TestCase，分别实现方法setUp（初始化），test_mean（mean单测），test_variance（variance单测），test_stddev（stddev单测），test_entropy（entropy单测），test_sample（sample单测）。

均值、方差、标准差通过Numpy计算相应值，对比Cauchy类中相应property的返回值，若一致即正确；

采样方法除验证其返回的数据类型及数据形状是否合法外，还需证明采样结果符合Cauchy分布。验证策略如下：随机采样30000个Cauchy分布下的样本值，计算采样样本的均值和方差，并比较同分布下scipy.stats.Cauchy返回的均值与方差，检查是否在合理误差范围内；同时通过Kolmogorov-Smirnov test进一步验证采样是否属于Cauchy分布，若计算所得ks值小于0.02，则拒绝不一致假设，两者属于同一分布；

熵计算通过对比scipy.stats.Cauchy.entropy的值是否与类方法返回值一致验证结果的正确性。

测试用例：单测需要覆盖单一维度的Cauchy分布和多维度分布情况，因此使用两种初始化参数

‘one-dim’: loc=parameterize.xrand((2, )), scale=parameterize.xrand((2, ));
‘multi-dim’: loc=parameterize.xrand((5, 5)), scale=parameterize.xrand((5, 5))。
测试Cauchy分布的概率密度函数
测试方法：该部分主要测试分布各种概率密度函数。类TestCauchyPDF继承unittest.TestCase，分别实现方法setUp（初始化），test_prob（prob单测），test_log_prob（log_prob单测），test_cdf（cdf单测），test_icdf（icdf）。以上分布在scipy.stats.Cauchy中均有实现，因此给定某个输入value，对比相同参数下Cauchy分布的scipy实现以及paddle实现的结果，若误差在容忍度范围内则证明实现正确。

测试用例：为不失一般性，测试使用多维位置参数和尺度参数初始化Cauchy类，并覆盖int型输入及float型输入。

‘value-float’: loc=np.array([0.2, 0.3]), scale=np.array([2, 3]), value=np.array([2., 5.]); * 'value-int': loc=np.array([0.2, 0.3]), scale=np.array([2, 3]), value=np.array([2, 5]);
‘value-multi-dim’: loc=np.array([0.2, 0.3]), scale=np.array([2, 3]), value=np.array([[4., 6], [8, 2]])。
3. 测试Cauchy分布之间的KL散度
测试方法：该部分测试两个Cauchy分布之间的KL散度。类TestCauchyAndCauchyKL继承unittest.TestCase，分别实现setUp（初始化），test_kl_divergence（kl_divergence）。在scipy中scipy.stats.entropy可用来计算两个分布之间的散度。因此对比两个Cauchy分布在paddle.distribution.kl_divergence下和在scipy.stats.Cauchy下计算的散度，若结果在误差范围内，则证明该方法实现正确。

测试用例：分布1：loc=np.array([0.0]), scale=np.array([1.0]), 分布2: loc=np.array([1.0]), scale=np.array([0.5])

代码实现

本项目参考了Pytorch的柯西分布算子实现方式

import math
from numpy import inf, nan
import numbers 
import numpy
import paddle
from paddle.distribution import constraint
from paddle.distribution import Distribution
from paddle.fluid import framework as framework

# __all__ = ['Cauchy']

class Cauchy(Distribution):

    r"""
    Creates a Cauchy distribution parameterized by :attr:`loc` and :attr:`scale`.
    Mathematical details
    The probability density function (pdf) is
    .. math::
        pdf(x; \mu, \sigma) = \frac{1}{2 * \sigma} * e^{\frac{-|x - \mu|}{\sigma}}
    In the above equation:
    * :math:`loc = \mu`: is the location parameter.
    * :math:`scale = \sigma`: is the scale parameter.
    Args:
        loc (scalar|Tensor): The mean of the distribution.
        scale (scalar|Tensor): The scale of the distribution.
    Examples:
        .. code-block:: python
                        import paddle
                        m = paddle.distribution.Cauchy(paddle.to_tensor([0.0]), paddle.to_tensor([1.0]))
                        m.sample()  # Cauchy distributed with loc=0, scale=1
                        # Tensor(shape=[1], dtype=float32, place=Place(cpu), stop_gradient=True,[4.26052284])
    """

    def __init__(self, loc, scale):

        arg_constraint = {'loc': constraint.real, 'scale': constraint.positive}
        support = constraint.real
        

        if (len(scale.shape) > 0 or len(loc.shape) > 0) and (
            loc.dtype == scale.dtype
        ):
            self.loc, self.scale = paddle.broadcast_tensors([loc, scale])
        else:
            self.loc, self.scale = loc, scale
        super().__init__(self.loc.shape)



    @property
    def mean(self):
        """Mean of distribution.
        Returns:
            Tensor: The mean value.
        """
        return paddle.full(shape=(), fill_value=nan, dtype=self.loc.dtype)

    @property
    def mode(self):
        """Mode of distribution.
        Returns:
            Tensor: The mode value.
        """
        return self.loc

    @property
    def variance(self):
        return paddle.full(shape=(), fill_value=inf, dtype=self.loc.dtype)

    def sample(self, shape=()):
        r"""Generate samples of the specified shape.
        Args:
            shape(tuple[int]): The shape of generated samples.
        Returns:
            Tensor: A sample tensor that fits the Laplace distribution.
        Examples:
            .. code-block:: python
                            import paddle
                            m = paddle.distribution.Laplace(paddle.to_tensor([0.0]), paddle.to_tensor([1.0]))
                            m.sample()  # Laplace distributed with loc=0, scale=1
                            #Tensor(shape=[1], dtype=float32, place=Place(cpu), stop_gradient=True,[4.26052284])
        """
        if not isinstance(shape, tuple):
            raise TypeError(
                f'Expected shape should be tuple[int], but got {type(shape)}'
            )

        with paddle.no_grad():
            return self.rsample(shape)

    def rsample(self, sample_shape):
        shape = sample_shape or (1,)
        eps = paddle.to_tensor(numpy.random.standard_cauchy(shape), dtype=self.loc.dtype)
        return self.loc + eps * self.scale


    def log_prob(self, value):
        loc, scale, value = self._validate_value(value)
        tmp = 1 + ((value - loc) / scale)**2
        tmp = paddle.cast(paddle.to_tensor(tmp), value.dtype)
        scale = paddle.cast(paddle.to_tensor(scale), value.dtype)
        loc = paddle.cast(paddle.to_tensor(loc), value.dtype)
        return -math.log(math.pi) - paddle.log(scale) - paddle.log(tmp)


    def cdf(self, value):
        loc, scale, value = self._validate_value(value)
        return paddle.atan((value - loc) / scale) / math.pi + 0.5


    def icdf(self, value):
        # return paddle.tan(math.pi * (value - 0.5)) * self.scale + self.loc
        return paddle.full(shape=(), fill_value=nan, dtype=self.loc.dtype)
    # np.tan(np.pi*q-np.pi/2.0)


    def entropy(self):
        return math.log(4 * math.pi) + self.scale.log()

    def _validate_value(self, value):
        """Argument dimension check for distribution methods such as `log_prob`,
        `cdf` and `icdf`.
        Args:
          value (Tensor|Scalar): The input value, which can be a scalar or a tensor.
        Returns:
          loc, scale, value: The broadcasted loc, scale and value, with the same dimension and data type.
        """
        if isinstance(value, numbers.Real):
            value = paddle.full(shape=(), fill_value=value)
        if value.dtype != self.loc.dtype:
            value = paddle.cast(value, self.loc.dtype)
        if self.scale.dtype != self.loc.dtype:
            scale = paddle.cast(self.scale, self.loc.dtype)
        else:
            scale = self.loc
        if (
            len(self.scale.shape) > 0
            or len(self.loc.shape) > 0
            or len(value.shape) > 0
        ):
            loc, scale, value = paddle.broadcast_tensors(
                [self.loc, scale, value]
            )
        else:
            loc, scale = self.loc, self.scale

        return loc, scale, value

测试代码（单测）

测试代码必须放置在/mnt/cfs/algorithm/yunji.feng/code/Paddle/python/paddle/fluid/tests/unittests/distribution目录下，它会引用在paddle源码中实现的parameterize文件，并使用unittest模块完成整个测试任务。
Unittest是Python内部自带的一个单元测试的模块，具备完整的测试结构，支持自动化测试的执行，对测试用例集进行组织，并且提供了丰富的断言方法，最后生成测试报告。Unittest框架的初衷是用于单元测试，但也不限于此，在实际工作中，由于它强大的功能，提供的完整的测试流程，我们往往将其用于自动化测试的各个方面。
np.testing.assert_allclose是numpy的一个测试功能，能够对比两个函数的输出是否一致，其输入参数为：

参数	类型	说明
actual	array_like	实际数组
desired	array_like	期望数组
rtol	float	相对 tolerance
atol	float	绝对 tolerance
equal_nan	bool	True 时认为 NaN 值相等
err_msg	str	断言失败时的错误信息
verbose	bool	在错误信息中是否附加冲突值

通过np.testing.assert_allclose，能够对本项目实现的Cauchy算子运算的正确性进行检验，这里使用了scipy的Cauchy函数作为期望值。

import unittest
import numpy as np
import paddle
import scipy.stats

import config
import parameterize

SEED = 2023


@parameterize.place(config.DEVICES)
@parameterize.parameterize_cls(
    (parameterize.TEST_CASE_NAME, 'loc', 'scale'), [
    ('one-dim', parameterize.xrand((2, )),\
             parameterize.xrand((2, ))),
    ('multi-dim', parameterize.xrand((5, 5)),\
             parameterize.xrand((5, 5))),
    ])
class TestCauchy(unittest.TestCase):

    def setUp(self):
        self._dist = paddle.distribution.Cauchy(loc=paddle.to_tensor(self.loc),
                                                 scale=paddle.to_tensor(\
                                                     self.scale))

    def test_entropy(self):
        entropy = self._dist.entropy()
        self.assertEqual(entropy.numpy().dtype, self.scale.dtype)

    def _np_entropy(self):
        return scipy.stats.cauchy.entropy(loc=self.loc, scale=self.scale)

    def _kstest(self, loc, scale, samples):
        # Uses the Kolmogorov-Smirnov test for goodness of fit.
        ks, p_value = scipy.stats.kstest(
            samples,
            scipy.stats.cauchy(loc, scale=scale).cdf)
        return ks < 0.02


@parameterize.place(config.DEVICES)
@parameterize.parameterize_cls(
    (parameterize.TEST_CASE_NAME, 'loc', 'scale', 'value'),
    [
        ('value-float', np.array([0.2, 0.3]),\
             np.array([2, 3]), np.array([2., 5.])),
        ('value-int', np.array([0.2, 0.3]),\
             np.array([2, 3]), np.array([2, 5])),
        ('value-multi-dim', np.array([0.2, 0.3]), np.array([2, 3]),\
                         np.array([[4., 6], [8, 2]])),
    ])
class TestCauchyPDF(unittest.TestCase):

    def setUp(self):
        self._dist = paddle.distribution.Cauchy(loc=paddle.to_tensor(self.loc),
                                                 scale=paddle.to_tensor(\
                                                     self.scale))

    def test_prob(self):
        np.testing.assert_allclose(
            self._dist.prob(paddle.to_tensor(self.value)),
            scipy.stats.cauchy.pdf(self.value, self.loc, self.scale),
            rtol=config.RTOL.get(str(self.loc.dtype)),
            atol=config.ATOL.get(str(self.loc.dtype)))

    def test_log_prob(self):
        np.testing.assert_allclose(
            self._dist.log_prob(paddle.to_tensor(self.value)),
            scipy.stats.cauchy.logpdf(self.value, self.loc, self.scale),
            rtol=config.RTOL.get(str(self.loc.dtype)),
            atol=config.ATOL.get(str(self.loc.dtype)))

    def test_cdf(self):
        np.testing.assert_allclose(self._dist.cdf(paddle.to_tensor(self.value)),
                                   scipy.stats.cauchy.cdf(
                                       self.value, self.loc, self.scale),
                                   rtol=config.RTOL.get(str(self.loc.dtype)),
                                   atol=config.ATOL.get(str(self.loc.dtype)))

    def test_icdf(self):
        np.testing.assert_allclose(
            self._dist.icdf(paddle.to_tensor(self.value)),
            scipy.stats.cauchy.ppf(self.value, self.loc, self.scale),
            rtol=config.RTOL.get(str(self.loc.dtype)),
            atol=config.ATOL.get(str(self.loc.dtype)))

if __name__ == '__main__':
,
            atol=config.ATOL.get(str(self.loc.dtype)))

if __name__ == '__main__':
    unittest.main()