java矩阵乘法函数,java – cpu的矩阵访问和乘法优化

最新推荐文章于 2022-04-04 09:51:15 发布

红白研究家

最新推荐文章于 2022-04-04 09:51:15 发布

阅读量218

点赞数

文章标签： java矩阵乘法函数

我在java中制作了一些内在优化的矩阵包装器(在JNI的帮助下).需要肯定这一点,你能给出关于矩阵优化的一些提示吗？我要实施的是：

Matrix可以表示为四组缓冲区/数组,一组用于水平访问,一组用于垂直访问,一组用于对角访问,命令缓冲区仅在需要时计算矩阵元素.这是一个例子.

Matrix signature:

0 1 2 3

4 5 6 7

8 9 1 3

3 5 2 9

First(hroizontal) set:

horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9}

Second(vertical) set:

verSet[0]={0,4,8,3} verSet[1]={1,5,9,5} verSet[2]={2,6,1,2} verSet[3]={3,7,3,9}

Third(optional) a diagonal set:

diagS={0,5,1,9} //just in case some calculation needs this

Fourth(calcuation list, in a "one calculation one data" fashion) set:

calc={0,2,1,3,2,5} --->0 means multiply by the next element

1 means add the next element

2 means divide by the next element

so this list means

( (a[i]*2)+3 ) / 5 when only a[i] is needed.

Example for fourth set:

A.mult(2), A.sum(3), A.div(5), A.mult(B)

(to list) (to list) (to list) (calculate *+/ just in time when A is needed )

so only one memory access for four operations.

loop start

a[i] = b[i] * ( ( a[i]*2) +3 ) / 5 only for A.mult(B)

loop end

如上所示,当需要访问列元素时,第二组提供连续访问.没有飞跃.第一组水平访问也达到了同样的效果.

这应该让一些事情变得更容易,有些事情更难：

Easier:

**Matrix transpozing operation.

Just swapping the pointers horSet[x] and verSet[x] is enough.

**Matrix * Matrix multiplication.

One matrix gives one of its horizontal set and other matrix gives vertical buffer.

Dot product of these must be highly parallelizable for intrinsics/multithreading.

If the multiplication order is inverse, then horizontal and verticals are switched.

**Matrix * vector multiplication.

Same as above, just a vector can be taken as horizontal or vertical freely.

Harder:

** Doubling memory requirement is bad for many cases.

** Initializing a matrix takes longer.

** When a matrix is multiplied from left, needs an update vertical-->horizontal

sets if its going to be multiplied from right after.(same for opposite)

(if a tranposition is taken between, this does not count)

Neutral:

** Same matrix can be multiplied with two other matrices to get two different

results such as A=A*B(saved in horizontal sets) A=C*A(saved in vertical sets)

then A=A*A gives A*B*C*A(in horizontal) and C*A*A*B (in vertical) without

copying A.

** If a matrix always multiplied from left or always from right, every access

and multiplication will not need update and be contiguous on ram.

** Only using horizontals before transpozing, only using verticals after,

should not break any rules.

主要目的是具有(8的倍数,8的倍数)大小的矩阵并且应用具有多个线程的avx内在函数(每个胎面同时在一组上工作).

我只实现了矢量*矢量dotproduct.如果你掌握了编程方法,我会进入这个方向.

我写的dotproduct(带内在函数)比循环展开的版本快6倍(这是乘法的两倍),当在包装器中启用多线程时(8x – >几乎使用),它也会停留在内存带宽上限20GB / s接近我的ddr3的限制)已经尝试过opencl它对cpu有点慢,但对于gpu来说非常好.

谢谢.

编辑：“块矩阵”缓冲区如何执行？当乘以大矩阵时,小补丁以特殊方式相乘,缓存可能用于减少主存储器访问.但是这需要在垂直 – 水平 – 对角线和该块之间的矩阵乘法之间进行更多更新.

解决方法:

一些库使用Expression Templates来实现非常特定的优化函数的应用,用于级联矩阵运算.

这使得语句的连接成为可能：

M = A*B.transp(); // where M, A, B are matrices

在这种情况下,您将需要3个类：

class Matrix;

class Transposed

{

public:

Transposed(Matrix &matrix) : m_matrix(matrix) {}

Matrix & obj (void) { return m_matrix; }

private:

Matrix & m_matrix;

};

class MatrixMatrixMulTransPosed

{

public:

MatrixMatrixMulTransPosed(Matrix &matrix, Transposed &trans)

: m_matrix(matrix), m_transposed(trans.obj()) {}

Matrix & matrix (void) { return m_matrix; }

Matrix & transposed (void) { return m_transposed; }

private:

Matrix & m_matrix;

Matrix & m_transposed;

};

class Matrix

{

public:

MatrixMatrixMulTransPosed operator* (Transposed &rhs)

{

return MatrixMatrixMulTransPosed(*this, rhs);

}

Matrix& operator= (MatrixMatrixMulTransPosed &mmtrans)

{

// Actual computation goes here and is stored in this.

// using mmtrans.matrix() and mmtrans.transposed()

}

};

你可以推进这个概念,以便能够为任何平均值的任何计算都有一个特殊的函数.

标签：java,c-2,optimization,matrix,intrinsics

来源： https://codeday.me/bug/20190703/1368931.html

红白研究家

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
java矩阵乘法函数,java – cpu的矩阵访问和乘法优化

我在java中制作了一些内在优化的矩阵包装器(在JNI的帮助下).需要肯定这一点,你能给出关于矩阵优化的一些提示吗？我要实施的是：Matrix可以表示为四组缓冲区/数组,一组用于水平访问,一组用于垂直访问,一组用于对角访问,命令缓冲区仅在需要时计算矩阵元素.这是一个例子.Matrix signature:0 1 2 34 5 6 78 9 1 33 5 2 9First(h...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。