CUDA Intro to Parallel Programming Notes -- Lesson 1: The GPU Programming Model

1.  Three traditional ways computers run faster

  • Faster clocks
  • More work/clock cycle  
  • More processors

2. Parallelism

  • A high-end GPU contains over 3,000 arithmetic units (ALUs) that can simultaneously run 3,000 arithmetic operations. A GPU can have tens of thousands of parallel pieces of work all active at the same time.
  • A modern GPU may be running up to 65,000 concurrent threads.

3. GPGPU--General Purpose Programmability on the Graphics Processing Unit.

4.  How Are CPUs Getting Faster?

    More transistors available for computation.

5. Why don't we keep increasing clock speed?

  Running a billion transistors generates an awful lot of heat, and we can't keep all those processors cool.

6. What kind of processors are we building?

  Q: Why are traditional CPU-like processors not the most energy-efficient processors?

  A: Traditional CPU-like processors deliver great flexibility and single-thread performance, but that flexibility is expensive in terms of power.

  We might instead choose to build simpler control structures and devote those transistors to supporting more computation in the data path. The way we build that data path in the GPU is with a large number of parallel compute units. Individually, these compute units are small, simple, and power-efficient.

7.  Building a power-efficient processor

  Two quantities to optimize:

    • Latency (execution time): minimize it
    • Throughput (tasks completed per unit time, i.e., stuff/time, e.g., jobs/hour): maximize it

       Note: these two goals are not necessarily aligned.
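
  A made-up illustration of why they diverge (numbers invented for these notes, not from the lecture): move people a distance of 4,500 km.

    • Sports car: 2 people at 200 km/h. Latency = 4500/200 = 22.5 h; throughput = 2/22.5 ≈ 0.09 people/h.
    • Bus: 40 people at 50 km/h. Latency = 4500/50 = 90 h; throughput = 40/90 ≈ 0.44 people/h.

  The bus is 4x worse on latency but about 5x better on throughput; the GPU is designed like the bus.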

8. Latency vs Bandwidth

  Improved latency often leads to improved throughput, and vice versa, but GPU designers really prioritize throughput.

9. Core GPU design tenets

  • Lots of simple compute units, trading simple control for more compute
  • An explicitly parallel programming model
  • Optimize for throughput, not latency

10. GPU from the point of view of the developer

  An 8-core Intel Ivy Bridge processor has 8 cores, each core supports 8-wide AVX vector operations, and each core runs 2 simultaneous threads (hyperthreading). Multiplying those together (8 × 8 × 2) gives 128-way parallelism.

11. Squaring numbers using CUDA

  #include <stdio.h>

  // Kernel: each thread squares one element of the input
  __global__ void square(float *d_out, float *d_in){
      int idx = threadIdx.x;   // this thread's index within the (single) block
      float f = d_in[idx];
      d_out[idx] = f * f;
  }

  int main(int argc, char **argv){
      const int ArraySize = 96;
      const int ArrayBytes = ArraySize * sizeof(float);

      // Generate the input array on the host
      float h_in[ArraySize];
      for (int i = 0; i < ArraySize; i++){
          h_in[i] = float(i);
      }
      float h_out[ArraySize];

      // Declare GPU memory pointers
      float *d_in;
      float *d_out;

      // Allocate GPU memory
      cudaMalloc((void **) &d_in, ArrayBytes);
      cudaMalloc((void **) &d_out, ArrayBytes);

      // Transfer the input array to the GPU
      cudaMemcpy(d_in, h_in, ArrayBytes, cudaMemcpyHostToDevice);

      // Launch the kernel: 1 block of ArraySize threads
      square<<<1, ArraySize>>>(d_out, d_in);

      // Copy back the result to the CPU
      cudaMemcpy(h_out, d_out, ArrayBytes, cudaMemcpyDeviceToHost);

      // Print the results, 4 values per line
      for (int i = 0; i < ArraySize; ++i){
          printf("%f", h_out[i]);
          printf(((i % 4) == 3) ? "\n" : "\t");
      }

      // Free GPU memory
      cudaFree(d_in);
      cudaFree(d_out);

      return 0;
  }
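
  To build and run it (assuming the CUDA toolkit is installed): nvcc -o square square.cu && ./square. The launch configuration square<<<1, ArraySize>>> means one block of 96 threads. A single block is limited to 1,024 threads on current GPUs, so larger arrays use many blocks plus a bounds check. A minimal sketch of that pattern (the kernel name square_n, the block size 256, and the n parameter are my additions, not from the lesson):

  // Hypothetical multi-block variant: each thread derives a global index
  // from its block and checks it against the array length.
  __global__ void square_n(float *d_out, float *d_in, int n){
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n){               // guard: the last block may overshoot n
          float f = d_in[idx];
          d_out[idx] = f * f;
      }
  }

  // Launch with enough 256-thread blocks to cover n elements:
  //   square_n<<<(n + 255) / 256, 256>>>(d_out, d_in, n);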

 

Reposted from: https://www.cnblogs.com/robertgao/p/7455460.html
