统计C++程序运行时间

honky_tonk_man

已于 2023-02-13 11:15:43 修改

阅读量865

点赞数

分类专栏： GPU c/c++ 文章标签： c++ 算法开发语言

于 2022-09-16 10:36:45 首次发布

本文链接：https://blog.csdn.net/qq_37026934/article/details/126885246

版权

c/c++ 同时被 2 个专栏收录

33 篇文章 2 订阅

订阅专栏

GPU

4 篇文章 1 订阅

订阅专栏

前言

为了分析C++程序的效率，一个非常重要的指标就是其算法的运行时间，我们C++11提供了chrono库专门进行时间的测量，具体使用方法如下

chrono

首先我们要说明chrono库中有自己的chrono命名空间，且这个命名空间包含在std中，所以我们要使用他先指定在std命名空间中，再指定在chrono命名空间中，比如我们计算现在的时间是多少

#include <chrono>
std::chrono::high_resoulution_clock::now();

std::chrono有3个clock可以选择，分别是ststem_clock, steady_clock, high_resolution_clock

ststem_clock
主要是从unix时间开始计算(1970/1/1 00:00:00)到现在的系统时间，但是系统时间可以被管理员修改，所以我们一般不用他计算interval
steady_clock：用time tick精确的计算时间间隔
high_resolution_clock：同上

我们测算的基本思想是先统计当前时间，再在算法执行完毕后再统计一下当前时间，最后用chrono的内置转换函数，将其转换成我们想要的单位返回，最后输出，例子如下

#include <iostream>
#include <chrono>

using namespace std;

int main()
{
    auto start = chrono::high_resolution_clock::now();
   // do somthine 
    auto end = chrono::high_resolution_clock::now();
    auto ms = chrono::duration_cast< std::chrono::milliseconds>(start - end);
    
    cout << ms.count() << endl;

    return 0;

具体例子如下，我们写一个矩阵相乘算法，矩阵相乘算法具体放到device(GPU)上执行

#include <iostream>
#include <math.h>
#include <chrono>

void
__global__
matrix_miltiple(int *a, int *b, int *c, int n){
    int row = blockIdx.y * blockDim.y + threadIdx.y; //每个block中的thread_id都是从0开始
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    int temp = 0;

    if((row < n) && (col < n)){ //one thread iterate over matrix a's row and matrix b'x column
        for(int i = 0; i < n; i++) temp += a[row * n + i] * b[i * n + col];
    }

    c[row * n + col] = temp;
}

void
init_matrix(int *a, int n){
    for(int i = 0; i < n; i++){
        a[n] = rand() % 100;
    }
}

int
main(){
    //matrix size of row or col (1024)
    int n = 1 << 10;

    // size of matrix n * n (1024*1024)
    size_t bytes = n * n * sizeof(int);

    //host pointer
    int *h_a, *h_b, *h_c;

    //malloc host
    h_a = (int*)malloc(bytes);
    h_b = (int*)malloc(bytes);
    h_c = (int*)malloc(bytes);

    //device pointer 
    int *d_a, *d_b, *d_c;

    //alloc device mem
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    //init martrix
    init_matrix(h_a, n);
    init_matrix(h_b, n);

    //memcopy from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    //thread pre block
    int BLOCK_SIZE = 16;

    //how many block in each demension
    int GRID_SIZE = (int)ceil(n / BLOCK_SIZE);

    //开启二维
    dim3 grid(GRID_SIZE, GRID_SIZE);
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);

    //计时
    auto start = std::chrono::high_resolution_clock::now();

    matrix_miltiple<<<grid, threads>>>(d_a, d_b, d_c, n);

    //wait 
    cudaDeviceSynchronize();

    //copy back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    //计时
    auto end = std::chrono::high_resolution_clock::now();

    //转换
    auto ms = std::chrono::duration_cast< std::chrono::milliseconds>(end - start);

    std::cout << ms.count() << "ms" << std::endl; 
    return 0;
}