Futures Quantitative Trading Software: Neural Networks Made Easy (Part 11): A Take on GPT

赫兹期货量化软件 (Hertz Futures Quantitative Software)

Tags: neural networks, GPT, artificial intelligence, machine learning, big data

Article link: https://blog.csdn.net/herzqt/article/details/131534801

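In June 2018, OpenAI presented the GPT neural network model, which immediately showed excellent results in a number of language tests. GPT-2 appeared in 2019, and GPT-3 was presented in May 2020. These models demonstrated the ability of a neural network to generate coherent text. Other experiments explored generating music and images. The main drawback of such models is the computing resources they require: training the first GPT took a month on a machine with 8 GPUs. This drawback is partly offset by the ability to reuse pre-trained models for new problems. Even so, given the size of the model, considerable resources are needed to keep it running.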

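1. Understanding GPT models

Conceptually, GPT models are built on the Transformer that we studied earlier. The main idea is to pre-train the model without supervision on a large amount of data and then fine-tune it on a relatively small amount of labeled data.

The reason for training in two steps is the size of the model. Modern deep learning models such as GPT have a very large number of parameters, up to hundreds of millions. Training such a network therefore requires a huge number of training samples. With supervised learning, creating a labeled training set is labor-intensive. At the same time, the web contains a great deal of digitized, unlabeled text that is well suited to unsupervised training. However, statistics show that unsupervised learning gives noticeably worse results than supervised learning. For this reason, after unsupervised training the model is fine-tuned on a relatively small sample of labeled data.

Unsupervised learning lets GPT learn a language model, while further training on labeled data adapts the model to a specific task. One pre-trained model can therefore be copied and fine-tuned to perform different language tasks. The limitation is the language of the original dataset used for unsupervised learning.

Practice has shown that this approach gives good results across a wide range of language problems. For example, the GPT-3 model can generate coherent, fluent text on a given topic. Note, however, that this model contains 175 billion parameters and was pre-trained on a 570 GB dataset.

Although GPT models were developed for natural language processing, they also perform well in music and image generation tasks.

In theory, GPT models can work with any sequence of digitized data. The only prerequisite is that enough data and resources are available for unsupervised pre-training.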

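2. Differences between GPT and the previously studied Transformer

Let us look at how GPT models differ from the Transformer we studied earlier. First, GPT models do not use an encoder; they use only the decoder. Without an encoder, the model no longer has the inner "Encoder-Decoder Self-Attention" layer. The figure below shows a GPT transformer block.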

[Figure: GPT transformer block]

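As in the classic Transformer, the blocks in GPT models are stacked on top of each other. Each block has its own weight matrices for the attention mechanism and its own fully connected Feed Forward layers. The number of blocks determines the size of the model, and the stack can be very large: GPT-1 and the smallest GPT-2 (GPT-2 Small) have 12 blocks, GPT-2 Extra Large has 48, and GPT-3 has 96.

Like traditional language models, GPT can find relationships only with preceding elements of the sequence; it cannot look into the future. Unlike the Transformer, however, GPT does not mask the elements themselves. Instead, it changes the computation: the attention coefficients of subsequent elements in the Score matrix are set to zero.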
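The zeroing of attention coefficients for subsequent elements can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the article's OpenCL kernel: the function name `causal_scores` is hypothetical, and the division by the square root of the key size is carried over from the classic Transformer.

```python
import math

def causal_scores(queries, keys):
    """Return the Score matrix; row i holds the attention weights of element i."""
    n = len(queries)
    dim = len(keys[0])
    scores = []
    for i in range(n):
        # Dot products only with the current and all preceding elements.
        raw = [sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
               for j in range(i + 1)]
        peak = max(raw)                        # subtract the max for stability
        exps = [math.exp(r - peak) for r in raw]
        total = sum(exps)
        # Subsequent elements (j > i) receive a zero attention score.
        scores.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return scores

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = causal_scores(q, k)
```

Each row of `score` sums to 1, and everything above the main diagonal is zero, so an element never attends to elements that come after it.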

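GPT can also be classified as an autoregressive model. Each iteration generates one token of the sequence. The generated token is appended to the input sequence, which is then fed into the model for the next iteration.

As in the classic Transformer, three vectors are generated for each token inside the self-attention mechanism: a query, a key and a value. In an autoregressive model the input sequence changes by only one token per iteration, so there is no need to recalculate the vectors for every token. Each GPT layer therefore calculates vectors only for new elements of the sequence, and each transformer block stores its vectors for later use.

This approach lets the model generate text word by word, until it receives the final token.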
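The idea of storing vectors between iterations can be illustrated with a small single-head example. It is a sketch under stated assumptions rather than the implementation built later in the article: the class name `CachedAttention`, its method names and the identity weight matrices are all hypothetical.

```python
import math

class CachedAttention:
    """Single-head self-attention that keeps key/value vectors between calls."""

    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.keys, self.values = [], []      # cache grows by one entry per token

    @staticmethod
    def _project(weights, token):
        return [sum(w * x for w, x in zip(row, token)) for row in weights]

    def step(self, token):
        """Compute vectors for the newest token only and return its output."""
        query = self._project(self.wq, token)
        self.keys.append(self._project(self.wk, token))
        self.values.append(self._project(self.wv, token))
        scale = math.sqrt(len(query))
        raw = [sum(q * k for q, k in zip(query, key)) / scale for key in self.keys]
        peak = max(raw)
        exps = [math.exp(r - peak) for r in raw]
        total = sum(exps)
        dim = len(self.values[0])
        # Weighted sum over the cached 'value' vectors of all tokens seen so far.
        return [sum(e / total * v[d] for e, v in zip(exps, self.values))
                for d in range(dim)]

identity = [[1.0, 0.0], [0.0, 1.0]]
layer = CachedAttention(identity, identity, identity)
first = layer.step([1.0, 0.0])
second = layer.step([0.0, 1.0])
```

On the second call only one new query, key and value are computed; the key and value of the first token are taken from the cache.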

Of course, GPT models use the Multi-Head attention mechanism.

3. Implementation

Before we start, let us briefly review the algorithm:

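1. The input sequence of tokens is fed into the transformer block.

   The same sequence is used by all Self-Attention heads. The actions in steps 2-5 are identical for each attention head.

2. Three vectors (query, key, value) are calculated for each token by multiplying the token vector by the corresponding trainable weight matrix W.
3. Multiplying 'query' by 'key' determines the dependencies between the elements of the sequence. At this step, the 'query' vector of each element is multiplied by the 'key' vectors of the current element and of all preceding elements.
4. The resulting matrix of attention scores is normalized with the SoftMax function in the context of each query. Subsequent elements of the sequence are given a zero attention score.

   As a result of steps 3 and 4, we obtain a square matrix Score whose size is determined by the number of elements in the sequence, and in which the elements sum to 1 in the context of each 'query'.

5. Multiplying the normalized attention scores by the 'value' vectors of the corresponding elements and summing the resulting vectors gives the attention-corrected value (Z) for each element of the sequence.
6. Next, we determine the weighted Z vector from the results of all attention heads. To do this, the corrected 'value' vectors from all attention heads are concatenated into a single vector and multiplied by the trainable matrix W0.
7. The resulting tensor is added to the input sequence and normalized.
8. The Multi-Head Self-Attention mechanism is followed by the two fully connected layers of the Feed Forward block. The first (hidden) layer contains four times as many neurons as the input sequence has elements and uses the ReLU activation function. The second layer has the same size as the input sequence, and its neurons use no activation function.
9. The result of the fully connected layers is summed with the tensor that was fed into the Feed Forward block. The resulting tensor is then normalized.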
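Steps 5-9 of the algorithm above can be sketched in plain Python. This is an illustration under stated assumptions, not the OpenCL implementation developed in the article: the function names are hypothetical, bias terms are omitted for brevity, and normalization is taken to mean scaling each token vector to zero mean and unit variance. The Score matrices and 'value' vectors are supplied as ready-made inputs.

```python
import math

def matvec(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def add_and_normalize(a, b, eps=1e-9):
    """Sum two vectors, then scale the result to zero mean and unit variance."""
    s = [x + y for x, y in zip(a, b)]
    mean = sum(s) / len(s)
    std = math.sqrt(sum((x - mean) ** 2 for x in s) / len(s))
    return [(x - mean) / (std + eps) for x in s]

def block_tail(inputs, scores, values, w0, w1, w2):
    """Steps 5-9 for one block.

    inputs: [units][window]; scores: [heads][units][units];
    values: [heads][units][key_dim]; w0: [window][heads*key_dim];
    w1: [4*window][window]; w2: [window][4*window].
    """
    outputs = []
    for i, token in enumerate(inputs):
        concat = []
        for head_scores, head_values in zip(scores, values):
            key_dim = len(head_values[0])
            # Step 5: attention-weighted sum of 'value' vectors for element i.
            z = [sum(head_scores[i][j] * head_values[j][d]
                     for j in range(len(inputs))) for d in range(key_dim)]
            concat.extend(z)                     # Step 6: concatenate the heads
        attn = matvec(w0, concat)                # Step 6: multiply by W0
        x = add_and_normalize(attn, token)       # Step 7: add input, normalize
        hidden = [max(0.0, h) for h in matvec(w1, x)]   # Step 8: ReLU layer
        ff = matvec(w2, hidden)                  # Step 8: linear layer
        outputs.append(add_and_normalize(ff, x)) # Step 9: add, normalize
    return outputs

units, window, heads, key_dim = 3, 2, 2, 2
inputs = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
scores = [[[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]] * heads
values = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]] * heads
w0 = [[0.1] * (heads * key_dim) for _ in range(window)]
w1 = [[0.05 * (r + 1), -0.02 * (r + 1)] for r in range(4 * window)]
w2 = [[0.01 * (c + 1) * (r + 1) for c in range(4 * window)] for r in range(window)]
result = block_tail(inputs, scores, values, w0, w1, w2)
```

The output has the same shape as the input sequence, which is what allows identical blocks to be stacked on top of each other.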

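3.1. Creating a new class for our model

To implement our model, we create a new class, CNeuronMLMHAttentionOCL, derived from the CNeuronBaseOCL base class. I deliberately took a step back and did not reuse the attention classes created earlier, because here we apply a new principle for building Multi-Head Self-Attention. Earlier, in Part 10, we created the CNeuronMHAttentionOCL class, which recalculated 4 attention threads sequentially. The number of threads was hard-coded in the methods, so changing it would take considerable effort: the class and its methods would have to be modified.

A caveat. As mentioned above, the GPT model uses a stack of identical transformer blocks with the same (unchangeable) hyperparameters; the blocks differ only in the matrices being trained. I therefore decided to create a multi-layer block whose hyperparameters are passed as parameters when the class is created. These include the number of times the transformer block is repeated in the stack.

As a result, we have a class that can create almost the entire model from a few specified parameters. In the protected section of the new class we declare five variables to store the block parameters: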

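iLayers	Number of transformer blocks in the model
iHeads	Number of Self-Attention heads
iWindow	Input window size (one token of the input sequence)
iWindowKey	Dimension of the internal Query, Key and Value vectors
iUnits	Number of elements (tokens) in the input sequence

Also in the protected section, we declare six arrays to store the collections of buffers that hold our tensors and trainable weight matrices:

QKV_Tensors	Array storing the Query, Key and Value tensors and their gradients
QKV_Weights	Array storing the collection of Wq, Wk and Wv weight matrices and their momentum matrices
S_Tensors	Array storing the collection of Score matrices and their gradients
AO_Tensors	Array storing the output tensors of the Self-Attention mechanism and their gradients
FF_Tensors	Array storing the input, hidden and output tensors of the Feed Forward block and their gradients
FF_Weights	Array storing the weight matrices of the Feed Forward block and their momentum matrices

We will look at the class methods later, as we implement them.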

class CNeuronMLMHAttentionOCL : public CNeuronBaseOCL
  {
protected:
   uint              iLayers;       ///< Number of inner layers
   uint              iHeads;        ///< Number of heads
   uint              iWindow;       ///< Input window size
   uint              iUnits;        ///< Number of units
   uint              iWindowKey;    ///< Size of Key/Query window
   //---
   CCollection      *QKV_Tensors;   ///< The collection of tensors of Queries, Keys and Values
   CCollection      *QKV_Weights;   ///< The collection of Matrix of weights to previous layer
   CCollection      *S_Tensors;     ///< The collection of Scores tensors
   CCollection      *AO_Tensors;    ///< The collection of Attention Out tensors
   CCollection      *FF_Tensors;    ///< The collection of tensors of Feed Forward output
   CCollection      *FF_Weights;    ///< The collection of Matrix of Feed Forward weights
   ///\ingroup neuron_base_ff
   virtual bool      feedForward(CNeuronBaseOCL *NeuronOCL);
   ///< \brief Feed Forward method of calling kernel ::FeedForward().@param NeuronOCL Pointer to previous layer.
   virtual bool      ConvolutionForward(CBufferDouble *weights, CBufferDouble *inputs,
                                        CBufferDouble *outputs, uint window, uint window_out,
                                        ENUM_ACTIVATION activ);
   ///< \brief Convolution Feed Forward method of calling kernel ::FeedForwardConv().
   virtual bool      AttentionScore(CBufferDouble *qkv, CBufferDouble *scores, bool mask=true);
   ///< \brief Multi-heads attention scores method of calling kernel ::MHAttentionScore().
   virtual bool      AttentionOut(CBufferDouble *qkv, CBufferDouble *scores, CBufferDouble *out);
   ///< \brief Multi-heads attention out method of calling kernel ::MHAttentionOut().
   virtual bool      SumAndNormilize(CBufferDouble *tensor1, CBufferDouble *tensor2, CBufferDouble *out);
   ///< \brief Method sum and normalize 2 tensors by calling 2 kernels ::SumMatrix() and ::Normalize().
   ///\ingroup neuron_base_opt
   virtual bool      updateInputWeights(CNeuronBaseOCL *NeuronOCL);
   ///< Method for updating weights.\details Calling one of kernels ::UpdateWeightsMomentum() or ::UpdateWeightsAdam() depending on optimization type (#ENUM_OPTIMIZATION).@param NeuronOCL Pointer to previous layer.
   virtual bool      ConvolutuionUpdateWeights(CBufferDouble *weights, CBufferDouble *gradient,
                                               CBufferDouble *inputs, CBufferDouble *momentum1,
                                               CBufferDouble *momentum2, uint window, uint window_out);
   ///< Method for updating weights in convolution layer.\details Calling one of kernels ::UpdateWeightsConvMomentum() or ::UpdateWeightsConvAdam() depending on optimization type (#ENUM_OPTIMIZATION).
   virtual bool      ConvolutionInputGradients(CBufferDouble *weights, CBufferDouble *gradient,
                                               CBufferDouble *inputs, CBufferDouble *inp_gradient,
                                               uint window, uint window_out, uint activ);
   ///< Method of passing gradients through a convolutional layer.
   virtual bool      AttentionInsideGradients(CBufferDouble *qkv, CBufferDouble *qkv_g,
                                              CBufferDouble *scores, CBufferDouble *scores_g,
                                              CBufferDouble *gradient);
   ///< Method of passing gradients through attention layer.

public:
   /** Constructor */
                     CNeuronMLMHAttentionOCL(void);
   /** Destructor */
                    ~CNeuronMLMHAttentionOCL(void);
   virtual bool      Init(uint numOutputs, uint myIndex, COpenCLMy *open_cl, uint window,
                          uint window_key, uint heads, uint units_count, uint layers,
                          ENUM_OPTIMIZATION optimization_type);
   ///< Method of initialization class.@param[in] numOutputs Number of connections to next layer.@param[in] myIndex Index of neuron in layer.@param[in] open_cl Pointer to #COpenCLMy object.@param[in] window Size of in/out window and step.@param[in] units_count Number of neurons.@param[in] optimization_type Optimization type (#ENUM_OPTIMIZATION)@return Boolean result of operations.
   virtual bool      calcInputGradients(CNeuronBaseOCL *prevLayer);
   ///< Method to transfer gradients to previous layer @param[in] prevLayer Pointer to previous layer.
   //---
   virtual int       Type(void) const { return defNeuronMLMHAttentionOCL; }
   ///< Identificator of class.@return Type of class
   //--- methods for working with files
   virtual bool      Save(int const file_handle);
   ///< Save method @param[in] file_handle handle of file @return logical result of operation
   virtual bool      Load(int const file_handle);
   ///< Load method @param[in] file_handle handle of file @return logical result of operation
  };

In the class constructor, we set initial values for the class hyperparameters and initialize the collection arrays.

CNeuronMLMHAttentionOCL::CNeuronMLMHAttentionOCL(void) : iLayers(0),
                                                         iHeads(0),
                                                         iWindow(0),
                                                         iWindowKey(0),
                                                         iUnits(0)
  {
   QKV_Tensors=new CCollection();
   QKV_Weights=new CCollection();
   S_Tensors=new CCollection();
   AO_Tensors=new CCollection();
   FF_Tensors=new CCollection();
   FF_Weights=new CCollection();
  }

Accordingly, the collection arrays are deleted in the class destructor.

CNeuronMLMHAttentionOCL::~CNeuronMLMHAttentionOCL(void)
  {
   if(CheckPointer(QKV_Tensors)!=POINTER_INVALID)
      delete QKV_Tensors;
   if(CheckPointer(QKV_Weights)!=POINTER_INVALID)
      delete QKV_Weights;
   if(CheckPointer(S_Tensors)!=POINTER_INVALID)
      delete S_Tensors;
   if(CheckPointer(AO_Tensors)!=POINTER_INVALID)
      delete AO_Tensors;
   if(CheckPointer(FF_Tensors)!=POINTER_INVALID)
      delete FF_Tensors;
   if(CheckPointer(FF_Weights)!=POINTER_INVALID)
      delete FF_Weights;
  }

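The class is initialized and the model is built in the Init method. The method receives the following parameters:

numOutputs	Number of elements in the next layer to which connections must be created
myIndex	Index of the neuron in the layer
open_cl	Pointer to the OpenCL object
window	Input window size (one token of the input sequence)
window_key	Dimension of the internal Query, Key and Value vectors
heads	Number of Self-Attention heads (threads)
units_count	Number of elements in the input sequence
layers	Number of blocks (layers) in the model stack
optimization_type	Parameter optimization method used during training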

bool CNeuronMLMHAttentionOCL::Init(uint numOutputs,uint myIndex,COpenCLMy *open_cl,
                                   uint window,uint window_key,uint heads,
                                   uint units_count,uint layers,
                                   ENUM_OPTIMIZATION optimization_type)
  {
   if(!CNeuronBaseOCL::Init(numOutputs,myIndex,open_cl,window*units_count,optimization_type))
      return false;
//---
   iWindow=fmax(window,1);
   iWindowKey=fmax(window_key,1);
   iUnits=fmax(units_count,1);
   iHeads=fmax(heads,1);
   iLayers=fmax(layers,1);

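At the beginning of the method, we call the corresponding method of the parent class to initialize it. Note that we do not perform the basic checks of the received OpenCL object pointer and of the input sequence size, because these checks are already implemented in the parent class method.

After the parent class has been initialized successfully, we save the hyperparameters to the corresponding variables. Each value is clamped to a minimum of 1.

Next, we calculate the sizes of the tensors to be created. Note how the organization of Multi-Head Attention has changed compared with the earlier approach. We do not create separate arrays for the 'query', 'key' and 'value' vectors; they are combined into one array. Nor do we create separate arrays for each attention head. Instead, we create common arrays for QKV (query + key + value), for Scores and for the outputs of the self-attention mechanism. The elements are split into sequences at the level of indices within the tensor. This approach is, of course, harder to understand, and finding the required element in a tensor may be more difficult. However, it makes the model more flexible with respect to the number of attention heads and allows all attention heads to be recalculated concurrently by parallel threads at the kernel level.

The size of the QKV tensor (num) is the product of three sizes of the internal vector (query + key + value), the number of heads and the number of elements in the sequence. The size of the concatenated weight matrix QKV_Weights is the product of three sizes of the input sequence token increased by one bias element, the size of the internal vector and the number of attention heads. The sizes of the remaining tensors are calculated in the same way.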

   uint num=3*iWindowKey*iHeads*iUnits;               //Size of QKV tensor
   uint qkv_weights=3*(iWindow+1)*iWindowKey*iHeads;  //Size of weights' matrix of QKV tensor
   uint scores=iUnits*iUnits*iHeads;                  //Size of Score tensor
   uint mh_out=iWindowKey*iHeads*iUnits;              //Size of multi-heads self-attention
   uint out=iWindow*iUnits;                           //Size of out tensor
   uint w0=(iWindowKey+1)*iHeads*iWindow;             //Size W0 tensor
   uint ff_1=4*(iWindow+1)*iWindow;                   //Size of weights' matrix 1-st feed forward layer
   uint ff_2=(4*iWindow+1)*iWindow;                   //Size of weights' matrix 2-nd feed forward layer
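To make the size formulas concrete, the sketch below evaluates them for one set of hyperparameters. The function name `tensor_sizes` is hypothetical, and the values used (window 5, key size 8, 4 heads, 20 units) are arbitrary examples rather than settings taken from the article.

```python
def tensor_sizes(window, window_key, heads, units):
    """Element counts of the buffers of one block, mirroring the MQL5 listing."""
    return {
        "qkv": 3 * window_key * heads * units,                 # num
        "qkv_weights": 3 * (window + 1) * window_key * heads,  # +1 is the bias element
        "scores": units * units * heads,
        "mh_out": window_key * heads * units,
        "out": window * units,
        "w0": (window_key + 1) * heads * window,
        "ff_1": 4 * (window + 1) * window,
        "ff_2": (4 * window + 1) * window,
    }

sizes = tensor_sizes(window=5, window_key=8, heads=4, units=20)
for name, count in sizes.items():
    print(f"{name:12s} {count}")
```

Note that the Score tensor grows with the square of the sequence length, while the weight matrices do not depend on the sequence length at all.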

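Once all tensor sizes are known, we run a loop over the number of attention layers in the block and create the necessary tensors. The loop body contains two nested loops. The first creates the arrays for the value tensors and their gradients. After it, the weight matrices are created and filled with random values, and the second nested loop creates the arrays for their momentum buffers (one set for SGD, two for Adam). Note that for the last layer no new array is created for the output tensor of the Feed Forward block or for its gradient. Instead, pointers to the output and gradient arrays of the parent class are added to the collection. This simple step avoids an unnecessary pass copying values between arrays and removes unnecessary memory consumption.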

   for(uint i=0; i<iLayers; i++)
     {
      CBufferDouble *temp=NULL;
      //--- d==0: tensors of values, d==1: tensors of gradients
      for(int d=0; d<2; d++)
        {
         //--- Initialize QKV tensor
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(num,0)) return false;
         if(!QKV_Tensors.Add(temp)) return false;
         //--- Initialize scores
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(scores,0)) return false;
         if(!S_Tensors.Add(temp)) return false;
         //--- Initialize multi-heads attention out
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(mh_out,0)) return false;
         if(!AO_Tensors.Add(temp)) return false;
         //--- Initialize attention out
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(out,0)) return false;
         if(!FF_Tensors.Add(temp)) return false;
         //--- Initialize Feed Forward 1
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(4*out,0)) return false;
         if(!FF_Tensors.Add(temp)) return false;
         //--- Initialize Feed Forward 2
         if(i==iLayers-1)
           {
            //--- The last layer reuses the parent class buffers
            if(!FF_Tensors.Add(d==0 ? Output : Gradient)) return false;
            continue;
           }
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(out,0)) return false;
         if(!FF_Tensors.Add(temp)) return false;
        }
      //--- Initialize QKV weights
      temp=new CBufferDouble();
      if(CheckPointer(temp)==POINTER_INVALID) return false;
      if(!temp.Reserve(qkv_weights)) return false;
      for(uint w=0; w<qkv_weights; w++)
        {
         if(!temp.Add(GenerateWeight())) return false;
        }
      if(!QKV_Weights.Add(temp)) return false;
      //--- Initialize Weights0
      temp=new CBufferDouble();
      if(CheckPointer(temp)==POINTER_INVALID) return false;
      if(!temp.Reserve(w0)) return false;
      for(uint w=0; w<w0; w++)
        {
         if(!temp.Add(GenerateWeight())) return false;
        }
      if(!FF_Weights.Add(temp)) return false;
      //--- Initialize FF1 weights
      temp=new CBufferDouble();
      if(CheckPointer(temp)==POINTER_INVALID) return false;
      if(!temp.Reserve(ff_1)) return false;
      for(uint w=0; w<ff_1; w++)
        {
         if(!temp.Add(GenerateWeight())) return false;
        }
      if(!FF_Weights.Add(temp)) return false;
      //--- Initialize FF2 weights
      temp=new CBufferDouble();
      if(CheckPointer(temp)==POINTER_INVALID) return false;
      if(!temp.Reserve(ff_2)) return false;
      for(uint w=0; w<ff_2; w++)
        {
         if(!temp.Add(GenerateWeight())) return false;
        }
      if(!FF_Weights.Add(temp)) return false;
      //--- Momentum buffers: one set for SGD, two sets for Adam
      for(int d=0; d<(optimization==SGD ? 1 : 2); d++)
        {
         //--- QKV weights momentum
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(qkv_weights,0)) return false;
         if(!QKV_Weights.Add(temp)) return false;
         //--- Weights0 momentum
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(w0,0)) return false;
         if(!FF_Weights.Add(temp)) return false;
         //--- FF1 weights momentum
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(ff_1,0)) return false;
         if(!FF_Weights.Add(temp)) return false;
         //--- FF2 weights momentum
         temp=new CBufferDouble();
         if(CheckPointer(temp)==POINTER_INVALID) return false;
         if(!temp.BufferInit(ff_2,0)) return false;
         if(!FF_Weights.Add(temp)) return false;
        }
     }
//---
   return true;
  }

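As a result, for each layer we obtain the following set of tensors, listed in the order in which they are stored in each collection.

QKV_Tensors	Output; Gradient
S_Tensors	Output; Gradient
AO_Tensors	MH Output; MH Gradient
FF_Tensors	FF1 Input (attention output); FF1 Output; FF2 Output; FF1 Input Gradient; FF1 Gradient; FF2 Gradient
QKV_Weights	Weights; Delta Weights (SGD) / First Momentum (Adam); Second Momentum (Adam only)
FF_Weights	Weights 0; FF1 Weights; FF2 Weights; W0 Delta Weights (SGD) / First Momentum (Adam); FF1 Delta Weights (SGD) / First Momentum (Adam); FF2 Delta Weights (SGD) / First Momentum (Adam); W0 Second Momentum (Adam only); FF1 Second Momentum (Adam only); FF2 Second Momentum (Adam only)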
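Because every layer appends its buffers to the same collections, the position of a buffer follows from the layer number. The sketch below derives these positions from the order in which Init adds the buffers. The function name `buffer_indexes` and the dictionary keys are hypothetical labels introduced for illustration, not identifiers from the article's library.

```python
def buffer_indexes(layer, adam=True):
    """0-based positions of one block's buffers inside each collection."""
    per_weight = 3 if adam else 2      # weights + one (SGD) or two (Adam) momentum buffers
    ff_t = 6 * layer                   # FF_Tensors holds 6 buffers per layer
    qkv_w = per_weight * layer         # QKV_Weights holds 2 or 3 buffers per layer
    ff_w = 3 * per_weight * layer      # FF_Weights holds 6 or 9 buffers per layer
    idx = {
        "QKV_Tensors": {"output": 2 * layer, "gradient": 2 * layer + 1},
        "S_Tensors": {"output": 2 * layer, "gradient": 2 * layer + 1},
        "AO_Tensors": {"output": 2 * layer, "gradient": 2 * layer + 1},
        "FF_Tensors": {"attention_out": ff_t, "ff1_out": ff_t + 1, "ff2_out": ff_t + 2,
                       "attention_grad": ff_t + 3, "ff1_grad": ff_t + 4,
                       "ff2_grad": ff_t + 5},
        "QKV_Weights": {"weights": qkv_w, "momentum1": qkv_w + 1},
        "FF_Weights": {"w0": ff_w, "ff1": ff_w + 1, "ff2": ff_w + 2,
                       "w0_momentum1": ff_w + 3, "ff1_momentum1": ff_w + 4,
                       "ff2_momentum1": ff_w + 5},
    }
    if adam:
        # Adam keeps a second momentum buffer for every weight matrix.
        idx["QKV_Weights"]["momentum2"] = qkv_w + 2
        idx["FF_Weights"].update(w0_momentum2=ff_w + 6, ff1_momentum2=ff_w + 7,
                                 ff2_momentum2=ff_w + 8)
    return idx

second_block = buffer_indexes(1, adam=True)
```

For example, with Adam the W0 matrix of the second block (layer index 1) sits at position 9 of FF_Weights, right after the nine buffers of the first block.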