《Programming Massively Parallel Processors》 Chapter 5 Exercise Solutions

Latest recommended article published 2024-04-22 09:40:17

CHAOS万有引力

Article link: https://blog.csdn.net/fulva/article/details/9240899

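These are my own solutions to some of the exercises. Because of time constraints they are somewhat rough and incomplete; discussion and corrections are welcome.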

5.1 Consider the matrix addition in Exercise 3.1. Can one use shared memory to reduce the global memory bandwidth consumption?

Hint: analyze the elements accessed by each thread and see if there is any commonality between threads.

Answer: I think there is no need to use shared memory in Exercise 3.1, because each thread reads its input elements only once and no element is needed by more than one thread. There is no data reuse for shared memory to exploit.
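A minimal Python sketch of the hint, assuming the one-thread-per-output-element variant of the matrix addition kernel (the function name `matrix_add_reads` is hypothetical, not from the book). It simply records how many threads read each input element:

```python
from collections import Counter


def matrix_add_reads(n):
    """Count reads of every input element for C = A + B.

    Assumption: one thread per output element, so thread (row, col)
    reads exactly A[row][col] and B[row][col].
    """
    reads = Counter()
    for row in range(n):
        for col in range(n):
            reads[("A", row, col)] += 1
            reads[("B", row, col)] += 1
    return reads


reads = matrix_add_reads(4)
# Every input element is read by exactly one thread, exactly once.
print(max(reads.values()))
```

Since no element is read more than once, staging data in shared memory would only add extra copies.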

5.2 Draw the equivalent of Figure 5.6 for an 8*8 matrix multiplication with 2*2 tiling and 4*4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.

Answer:

1. An 8*8 matrix multiplication with 2*2 tiling (Block0,0)

| Thread | Phase 1 | Phase 2 |
| --- | --- | --- |
| thread0,0 | M0,0 → Mds0,0; N0,0 → Nds0,0; Pvalue0,0 += Mds0,0 * Nds0,0 + Mds0,1 * Nds1,0 | M0,2 → Mds0,0; N2,0 → Nds0,0; Pvalue0,0 += Mds0,0 * Nds0,0 + Mds0,1 * Nds1,0 |
| thread0,1 | M0,1 → Mds0,1; N0,1 → Nds0,1; Pvalue0,1 += Mds0,0 * Nds0,1 + Mds0,1 * Nds1,1 | M0,3 → Mds0,1; N2,1 → Nds0,1; Pvalue0,1 += Mds0,0 * Nds0,1 + Mds0,1 * Nds1,1 |
| thread1,0 | M1,0 → Mds1,0; N1,0 → Nds1,0; Pvalue1,0 += Mds1,0 * Nds0,0 + Mds1,1 * Nds1,0 | M1,2 → Mds1,0; N3,0 → Nds1,0; Pvalue1,0 += Mds1,0 * Nds0,0 + Mds1,1 * Nds1,0 |
| thread1,1 | M1,1 → Mds1,1; N1,1 → Nds1,1; Pvalue1,1 += Mds1,0 * Nds0,1 + Mds1,1 * Nds1,1 | M1,3 → Mds1,1; N3,1 → Nds1,1; Pvalue1,1 += Mds1,0 * Nds0,1 + Mds1,1 * Nds1,1 |

| Thread | Phase 3 | Phase 4 |
| --- | --- | --- |
| thread0,0 | M0,4 → Mds0,0; N4,0 → Nds0,0; Pvalue0,0 += Mds0,0 * Nds0,0 + Mds0,1 * Nds1,0 | M0,6 → Mds0,0; N6,0 → Nds0,0; Pvalue0,0 += Mds0,0 * Nds0,0 + Mds0,1 * Nds1,0 |
| thread0,1 | M0,5 → Mds0,1; N4,1 → Nds0,1; Pvalue0,1 += Mds0,0 * Nds0,1 + Mds0,1 * Nds1,1 | M0,7 → Mds0,1; N6,1 → Nds0,1; Pvalue0,1 += Mds0,0 * Nds0,1 + Mds0,1 * Nds1,1 |
| thread1,0 | M1,4 → Mds1,0; N5,0 → Nds1,0; Pvalue1,0 += Mds1,0 * Nds0,0 + Mds1,1 * Nds1,0 | M1,6 → Mds1,0; N7,0 → Nds1,0; Pvalue1,0 += Mds1,0 * Nds0,0 + Mds1,1 * Nds1,0 |
| thread1,1 | M1,5 → Mds1,1; N5,1 → Nds1,1; Pvalue1,1 += Mds1,0 * Nds0,1 + Mds1,1 * Nds1,1 | M1,7 → Mds1,1; N7,1 → Nds1,1; Pvalue1,1 += Mds1,0 * Nds0,1 + Mds1,1 * Nds1,1 |

2. An 8*8 matrix multiplication with 4*4 tiling (Block0,0)

| Thread | Phase 1 | Phase 2 |
| --- | --- | --- |
| thread0,0 | M0,0 → Mds0,0; N0,0 → Nds0,0; Pvalue0,0 += Mds0,0 * Nds0,0 + Mds0,1 * Nds1,0 + Mds0,2 * Nds2,0 + Mds0,3 * Nds3,0 | M0,4 → Mds0,0; N4,0 → Nds0,0; Pvalue0,0 += Mds0,0 * Nds0,0 + Mds0,1 * Nds1,0 + Mds0,2 * Nds2,0 + Mds0,3 * Nds3,0 |
| thread0,1 | M0,1 → Mds0,1; N0,1 → Nds0,1; Pvalue0,1 += Mds0,0 * Nds0,1 + Mds0,1 * Nds1,1 + Mds0,2 * Nds2,1 + Mds0,3 * Nds3,1 | M0,5 → Mds0,1; N4,1 → Nds0,1; Pvalue0,1 += Mds0,0 * Nds0,1 + Mds0,1 * Nds1,1 + Mds0,2 * Nds2,1 + Mds0,3 * Nds3,1 |
| thread0,2 | M0,2 → Mds0,2; N0,2 → Nds0,2; Pvalue0,2 += Mds0,0 * Nds0,2 + Mds0,1 * Nds1,2 + Mds0,2 * Nds2,2 + Mds0,3 * Nds3,2 | M0,6 → Mds0,2; N4,2 → Nds0,2; Pvalue0,2 += Mds0,0 * Nds0,2 + Mds0,1 * Nds1,2 + Mds0,2 * Nds2,2 + Mds0,3 * Nds3,2 |
| thread0,3 | M0,3 → Mds0,3; N0,3 → Nds0,3; Pvalue0,3 += Mds0,0 * Nds0,3 + Mds0,1 * Nds1,3 + Mds0,2 * Nds2,3 + Mds0,3 * Nds3,3 | M0,7 → Mds0,3; N4,3 → Nds0,3; Pvalue0,3 += Mds0,0 * Nds0,3 + Mds0,1 * Nds1,3 + Mds0,2 * Nds2,3 + Mds0,3 * Nds3,3 |

The rows for thread1,x through thread3,x are omitted; they follow the same pattern.

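As the tables show, the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles. Without tiling, each thread reads 8 elements of M and 8 elements of N from global memory, 16 loads in total. With 2*2 tiles, each thread goes through 4 phases and loads one M element and one N element per phase, so it issues 8 loads, a reduction by a factor of 2. With 4*4 tiles there are only 2 phases, so each thread issues 4 loads, a reduction by a factor of 4. The larger the tile, the more threads share each loaded element and the fewer phases are needed.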
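The counting argument can be checked with a short Python sketch. It assumes the kernel structure shown in the tables (one M load and one N load per thread per phase); the helper name `global_loads_per_thread` is made up for illustration:

```python
def global_loads_per_thread(width, tile_width):
    """Count the global-memory loads one thread issues for one P element.

    Assumption: in every phase a thread loads exactly one element of M
    and one element of N into shared memory, as in the tables.
    """
    assert width % tile_width == 0, "sketch assumes the tile divides the matrix"
    phases = width // tile_width
    loads = 0
    for _ in range(phases):
        loads += 2  # one element of M plus one element of N
    return loads


WIDTH = 8
untiled = 2 * WIDTH  # no tiling: a full row of M and a full column of N
for tile_width in (2, 4):
    tiled = global_loads_per_thread(WIDTH, tile_width)
    print(f"{tile_width}*{tile_width} tiles: {tiled} loads, "
          f"reduction factor {untiled // tiled}")
```

For the 8*8 case this prints a reduction factor of 2 for 2*2 tiles and 4 for 4*4 tiles, matching the tile dimension.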

5.3 What type of incorrect execution behavior can happen if one forgot to use __syncthreads() in the kernel of Figure 5.12?

Answer: The __syncthreads() barrier in line 11 ensures that all threads have finished loading the tiles of d_M and d_N into Mds and Nds before any of them can move forward. The __syncthreads() barrier in line 14 ensures that all threads have finished using the d_M and d_N elements in shared memory before any of them moves on to the next iteration and loads the elements of the next tiles. Without the first barrier, a thread could start computing with tile elements that other threads have not loaded yet. Without the second, a thread could load the next tile too early and overwrite values that other threads are still using. In both cases threads compute with wrong input values and the result is incorrect.
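The effect can be imitated on the host with a small Python simulation of one thread block. This is only a sketch under loud assumptions: threads are run one after another, and "no barrier" is modeled as each thread loading and then immediately computing, which is just one of many interleavings a real GPU might produce. The names `tiled_block` and `reference_block` are hypothetical.

```python
def tiled_block(M, N, tile, use_barriers):
    """Simulate block (0,0) of a tiled matrix multiplication kernel."""
    width = len(M)
    threads = [(ty, tx) for ty in range(tile) for tx in range(tile)]
    Mds = [[0] * tile for _ in range(tile)]  # shared-memory tile of M
    Nds = [[0] * tile for _ in range(tile)]  # shared-memory tile of N
    P = {t: 0 for t in threads}

    def load(ty, tx, ph):
        Mds[ty][tx] = M[ty][ph * tile + tx]
        Nds[ty][tx] = N[ph * tile + ty][tx]

    def compute(ty, tx):
        P[(ty, tx)] += sum(Mds[ty][k] * Nds[k][tx] for k in range(tile))

    for ph in range(width // tile):
        if use_barriers:
            for ty, tx in threads:
                load(ty, tx, ph)
            # first barrier: every load is done before any compute
            for ty, tx in threads:
                compute(ty, tx)
            # second barrier: every compute is done before the next loads
        else:
            # assumed interleaving without barriers: load, then compute at once
            for ty, tx in threads:
                load(ty, tx, ph)
                compute(ty, tx)
    return [[P[(ty, tx)] for tx in range(tile)] for ty in range(tile)]


def reference_block(M, N, tile):
    """Plain dot products for the same tile*tile corner of P."""
    width = len(M)
    return [[sum(M[r][k] * N[k][c] for k in range(width)) for c in range(tile)]
            for r in range(tile)]


W = 8
M = [[i + j + 1 for j in range(W)] for i in range(W)]
N = [[2 * i + j + 1 for j in range(W)] for i in range(W)]
print(tiled_block(M, N, 2, True) == reference_block(M, N, 2))
print(tiled_block(M, N, 2, False) == reference_block(M, N, 2))
```

In this model the version with barriers matches the reference, while the version without them does not, because early threads multiply tile entries that are still zero or left over from the previous phase.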

5.4 Assuming capacity was not an issue for registers or shared memory, give one case in which it would be valuable to use shared memory instead of registers to hold values fetched from global memory.

Explain your answer.

Answer: Leaving capacity aside, the biggest difference between the two is that a register is private to a single thread, whereas shared memory is visible to all threads in a block.

Tiled matrix multiplication is therefore a good example: an element fetched from global memory by one thread is also needed by other threads in the same block, so keeping it in shared memory lets them reuse it without fetching it again.

5.5 For our tiled matrix-matrix multiplication kernel, if we use a 32*32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?

a. 1/8 of the original usage

b. 1/16 of the original usage

c. 1/32 of the original usage

d. 1/64 of the original usage

Answer: c

5.6 Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 512000

Answer: d

5.7 In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?

a. 1

b. 1000

c. 512

d. 51200

Answer: b
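The arithmetic behind the answers to 5.6 and 5.7, as a tiny Python sketch (the constant names are made up for illustration):

```python
NUM_BLOCKS = 1000
THREADS_PER_BLOCK = 512

# A local (automatic) variable is private to each thread,
# so every thread of every block gets its own version.
local_versions = NUM_BLOCKS * THREADS_PER_BLOCK

# A shared memory variable has one version per thread block,
# shared by all threads of that block.
shared_versions = NUM_BLOCKS

print(local_versions)   # 512000
print(shared_versions)  # 1000
```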

5.9 Consider performing a matrix multiplication of two input matrices with dimensions N*N. How many times is each element in the input matrices requested from global memory when:

a. There is no tiling?

b. Tiles of size T*T are used?

Answer: a. N times

b. N/T times
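Both counts can be verified by brute force for a small case. The Python sketch below assumes the usual access patterns (one thread per P element without tiling; one M load and one N load per thread per phase with tiling) and that T divides N. The function names are hypothetical.

```python
from collections import Counter


def untiled_requests(n):
    """Per-element request counts when every thread reads a full row and column."""
    m_reads, n_reads = Counter(), Counter()
    for row in range(n):
        for col in range(n):
            for k in range(n):
                m_reads[(row, k)] += 1
                n_reads[(k, col)] += 1
    return m_reads, n_reads


def tiled_requests(n, t):
    """Per-element request counts for the tiled kernel with t*t tiles."""
    m_reads, n_reads = Counter(), Counter()
    for by in range(n // t):              # block row
        for bx in range(n // t):          # block column
            for ph in range(n // t):      # phase
                for ty in range(t):
                    for tx in range(t):
                        m_reads[(by * t + ty, ph * t + tx)] += 1
                        n_reads[(ph * t + ty, bx * t + tx)] += 1
    return m_reads, n_reads


n, t = 8, 2
m_u, n_u = untiled_requests(n)
m_t, n_t = tiled_requests(n, t)
print(set(m_u.values()), set(n_u.values()))  # every element: N requests
print(set(m_t.values()), set(n_t.values()))  # every element: N/T requests
```

With N = 8 and T = 2, every element is requested 8 times without tiling and 4 times with tiling.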