Table of Contents
Experiment Objectives and Requirements
- Using threads to accelerate matrix multiply
- Using command to pass arguments
- Timing your solution and expected speed-up
Experiment Content
In this assignment you have to write a version of matrix multiply that uses threads to divide up the work necessary to compute the product of two matrices. There are several ways to improve the performance using threads. You need to divide the product in row dimension among multiple threads in the computation. That is, if you are computing A * B and A is a 10 x 10 matrix and B is a 10 x 10 matrix, your code should use threads to divide up the computation of the 10 rows of the product which is a 10 x 10 matrix. If you were to use 5 threads, then rows 0 and 1 of the product would be computed by thread 0, rows 2 and 3 would be computed by thread 1,…and rows 8 and 9 would be computed by thread 4.
- If the product matrix had 100 rows, then each "strip" of the matrix would have 20 rows for its thread to compute. This form of parallelization is called a "strip" decomposition of the matrix, since the effect of partitioning one dimension only is to assign a "strip" of the matrix to each thread for computation. Note that the number of threads may not divide the number of rows evenly. For example, if the product matrix has 20 rows and you are using 3 threads, some threads will need to compute more rows than others. In a good strip decomposition, the "extra" rows are spread as evenly as possible among the threads. For example, with 20 rows and 3 threads, there are two "extra" rows (20 mod 3 is 2). A good solution will not give both of the extra rows to one thread but, instead, will assign 7 rows to one thread, 7 rows to another, and 6 to the last. Note that for this assignment you can pass all of the A and B matrices to each thread.
- What you need to do: Your program must conform to the following prototype: my_matrix_multiply -a a_matrix_file.txt -b b_matrix_file.txt -t thread_count, where the -a and -b parameters specify input files containing matrices and thread_count is the number of threads to use in your strip decomposition. The input matrix files are text files with the following format. The first line contains two integers: rows columns. Then each line is an element in row-major order. Lines that begin with "#" should be considered comments and should be ignored. Here is an example matrix file:
3 2
#Row 0
0.711306
0.890967
#Row 1
0.345199
0.380204
#Row 2
0.276921
0.026524
This matrix has 3 rows and 2 columns and contains comment lines showing row boundaries in the row-major ordering.
- Your program will need to print out the result of A * B, where A is contained in the file passed via the -a parameter and B is contained in the file passed via the -b parameter. It must print the product in the same format (with comments indicating rows) as the input matrix files: rows and columns on the first line, each element in row-major order on a separate line.
- Your solution will also be timed. If you have implemented the threading correctly, you should expect to see quite a bit of speed-up when the machine you are using has multiple processors. For example, on one machine,
my_matrix_multiply -a a1000.txt -b b1000.txt -t 1
completes in 8.8 seconds when a1000.txt and b1000.txt are both 1000 x 1000 matrices, while with
my_matrix_multiply -a a1000.txt -b b1000.txt -t 2
the time drops to 4.9 seconds (where both of these times come from the "real" number reported by the Linux time command).
Note that this method of timing includes the time necessary to read in both the A and B matrix files. For smaller products, this I/O time may dominate the total execution time. So that you can time the matrix multiply itself during development, a C function has been provided (c-timer.c) that returns the Linux epoch as a double (including fractions of a second); it may be used to time different parts of your implementation. For example, using this timing code, the same execution as shown above completes in 7.9 seconds with one thread and 4.1 seconds with two threads. Thus the I/O (and any argument parsing, etc.) takes approximately 0.8 seconds for the two input matrices.
Matrix Multiply
I used two functions to implement the matrix multiply. My goal is to divide up the rows of the input matrix and of the output matrix. Hence I use the variables start and end to record the first and last row that each thread handles; both are defined in the main function. Besides, I use a struct named matrix_info to pass arguments:
struct matrix_info
{
    float *A;
    float *B;
    int n, m, k;
    float *result;
};
The code below is the function that receives the arguments and computes its strip of the matrix product (the argument unpacking is elided here; the full version appears in the code section at the end).
void *matrix_operator(void *info)
{
    ...
    for (int i = 0; i < n; ++i)
    {
        for (int k_ = 0; k_ < m; ++k_)
        {
            for (int j = 0; j < k; ++j)
            {
                C[i * k + j] += A[i * m + k_] * B[k_ * k + j];
            }
        }
    }
    return NULL;
}
Usually there are some extra rows that need to be assigned across the threads properly. The variables involved are:
- nthread: the number of threads used in the program
- rows_each_thread: the number of rows each ordinary thread computes in matrix_operator()
- extra_rows_to_assign: the index of the first thread that runs matrix_operator() with an extra row (threads ranked at or above this value take rows_each_thread + 1 rows)
- n: the number of rows of the matrix
- it: the rank of each thread
rows_each_thread = n / nthread;
extra_rows_to_assign = nthread - n + rows_each_thread * nthread;
Moreover, I tried pthread_setaffinity_np(tid[it], sizeof(cpu_set_t), &cpu);
to bind threads to specific CPU cores.
The code is below.
struct matrix_info *t_info = malloc(sizeof(struct matrix_info) * num_of_thread);
pthread_t *tid = malloc(sizeof(pthread_t) * num_of_thread);
clock_gettime(CLOCK_MONOTONIC, &t);
matrix_mul_start = TIME_MS(t);
printf("[Info] Done IO in %lf ms\n", ((double) (matrix_mul_start - start)));
for (int it = 0; it < num_of_thread; ++it)
{
    cpu_set_t cpu;
    CPU_ZERO(&cpu);
    int start = rows_each_thread * it;
    if (it >= extra_rows_to_assign)
        start += it - extra_rows_to_assign;
    int end = start + rows_each_thread;
    if (it >= extra_rows_to_assign)
        end += 1;
    if (end > n_A) end = n_A;
    CPU_SET(it % 8, &cpu);
    t_info[it].A = A + start * m_A;
    t_info[it].B = B;
    t_info[it].n = end - start;
    t_info[it].m = m_A;
    t_info[it].k = m_B;
    t_info[it].result = C + start * m_C;
#ifndef DEBUG
    printf("n, m, k = %d, %d, %d\n", t_info[it].n, t_info[it].m, t_info[it].k);
#endif
    pthread_create(tid + it, NULL, matrix_operator, t_info + it);
    pthread_setaffinity_np(tid[it], sizeof(cpu_set_t), &cpu);
}
Command line
In this program, the following command is required to pass arguments:
./m -a A.txt -b B.txt -t 5
Hence, I need to do some preprocessing. Some steps are similar to the steps in Assignment 1.
int main(int argc, char **argv)
{
    //Start of preprocessing
    int m_A, m_B, n_A, n_B;
    float *A, *B, *C;
    float start, matrix_mul_start, end;
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    start = TIME_MS(t);
    A = read_matrix(&m_A, &n_A, argv[2]);
    B = read_matrix(&m_B, &n_B, argv[4]);
    int num_of_thread = atoi(argv[6]);
    if (m_A != n_B)
    {
        printf("[Error] Illegal shape for matmul, got m1, m2 = [%d, %d]\n", m_A, n_B);
        return 1;
    }
    int m_C = m_B;
    int n_C = n_A;
    C = (float *)malloc(sizeof(float) * n_C * m_C);
    //End of preprocessing
    //Pass arguments, do the matrix multiply
    //Free space
}
Time calculation
In a multi-threaded environment, I chose clock_gettime(clockid_t clk_id, struct timespec *res) to measure the time elapsed during my program:
#define TIME_MS(t) (((float) t.tv_sec) * 1000.f + ((float) t.tv_nsec) / 1e6f)
//codes below are in main function
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
start = TIME_MS(t);
Experiment
Test cases
I used this Python script to generate the test matrices.
import random

def generate_random_nums(rows=10, cols=10, writefile='data.txt'):
    '''
    Generate random numbers and write a data file
    with the given number of rows and columns.
    '''
    matrix = []
    for i in range(rows):
        tmp_list = []
        for j in range(cols):
            tmp_list.append(random.randint(1, 100))
        matrix.append([str(o) for o in tmp_list])
    with open(writefile, 'w') as f:
        # first line: rows columns, as the matrix file format requires
        f.write('%d %d\n' % (rows, cols))
        for one_list in matrix:
            f.write(' '.join(one_list) + '\n')

if __name__ == '__main__':
    generate_random_nums(rows=100, cols=100, writefile='A.txt')
    generate_random_nums(rows=100, cols=100, writefile='B.txt')
The tables below show the time used with different thread counts, first for a 100 × 100 test matrix and then for a 1000 × 1000 test matrix.
According to the tables above, I find that using more threads accelerates the program. The effect is obvious when the number of threads goes from 1 to 2 or from 1 to 4. However, as the number of threads keeps increasing, the time cost does not decrease much further, and it even increases slightly.
Personally, I think the reason behind this phenomenon may be that the program spends more time assigning the matrix strips to the different threads.
Summary
In this experiment, I learnt how to use threads (some of the functions in pthread.h). I need to add -lpthread to build programs that use threads, just as I need to add -lm when my program includes math.h; there are many such details to pay attention to on the command line.
Besides, I learnt how to use threads to speed up programs. During the experiment, I observed that increasing the number of threads does accelerate the program, but with too many threads the program spends too much time assigning tasks to the different threads. Therefore, I need to find the proper number of threads if I want to maximize the speed of the program.
Code
//main.cpp
#define _GNU_SOURCE
#include <stdlib.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sched.h>
#define TIME_MS(t) (((float) t.tv_sec) * 1000.f + ((float) t.tv_nsec) / 1e6f)
#define DEBUG
struct matrix_info
{
    float *A;
    float *B;
    int n, m, k;
    float *result;
};
void *matrix_operator(void *info)
{
    struct matrix_info *pinfo = (struct matrix_info *)info;
    float *A = pinfo->A;
    float *B = pinfo->B;
    int n = pinfo->n;
    int m = pinfo->m;
    int k = pinfo->k;
    float *C = pinfo->result;
    for (int i = 0; i < n * k; ++i)
        C[i] = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        for (int k_ = 0; k_ < m; ++k_)
        {
            for (int j = 0; j < k; ++j)
            {
                C[i * k + j] += A[i * m + k_] * B[k_ * k + j];
            }
        }
    }
    return NULL;
}
//read txt file; lines beginning with '#' are comments and are skipped
/*
m:columns
n:rows
fn:A or B
*/
static void skip_comments(FILE *file)
{
    int c;
    while ((c = fgetc(file)) != EOF)
    {
        if (c == '#')
        {
            //consume the rest of the comment line
            while ((c = fgetc(file)) != EOF && c != '\n')
                ;
        }
        else if (c != ' ' && c != '\t' && c != '\n' && c != '\r')
        {
            ungetc(c, file);
            break;
        }
    }
}
float *read_matrix(int *m, int *n, char *fn)
{
    FILE *file = fopen(fn, "r");
    if (file == NULL)
    {
        printf("[Error] Failed to read file %s.\n", fn);
        exit(1);
    }
    fscanf(file, "%d", n);
    fscanf(file, "%d", m);
    float *mat = (float *)malloc(sizeof(float) * (*n) * (*m));
    for (int i = 0; i < (*n) * (*m); ++i)
    {
        skip_comments(file);
        fscanf(file, "%f", &(mat[i]));
    }
    fclose(file);
    return mat;
}
void write_matrix(float *mat, int *m, int *n, char *fn)
{
    FILE *file = fopen(fn, "w");
    //first line: rows columns; then each element on its own line,
    //with a comment marking each row, matching the input format
    fprintf(file, "%d %d\n", *n, *m);
    for (int i = 0; i < *n; ++i)
    {
        fprintf(file, "#Row %d\n", i);
        for (int j = 0; j < *m; ++j)
        {
            fprintf(file, "%f\n", mat[i * (*m) + j]);
        }
    }
    fclose(file);
}
int main(int argc, char **argv)
{
    int m_A, m_B, n_A, n_B;
    float *A, *B, *C;
    float start, matrix_mul_start, end;
    struct timespec t;
    if (argc < 7)
    {
        printf("Usage: %s -a A.txt -b B.txt -t thread_count\n", argv[0]);
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t);
    start = TIME_MS(t);
    A = read_matrix(&m_A, &n_A, argv[2]);
    B = read_matrix(&m_B, &n_B, argv[4]);
    int num_of_thread = atoi(argv[6]);
    if (m_A != n_B)
    {
        printf("[Error] Illegal shape for matmul, got m1, m2 = [%d, %d]\n", m_A, n_B);
        return 1;
    }
    int m_C = m_B;
    int n_C = n_A;
    C = (float *)malloc(sizeof(float) * n_C * m_C);
    int rows_each_thread = n_A / num_of_thread;
    int extra_rows_to_assign = num_of_thread - n_A + rows_each_thread * num_of_thread;
    struct matrix_info *t_info = malloc(sizeof(struct matrix_info) * num_of_thread);
    pthread_t *tid = malloc(sizeof(pthread_t) * num_of_thread);
    clock_gettime(CLOCK_MONOTONIC, &t);
    matrix_mul_start = TIME_MS(t);
    printf("[Info] Done IO in %lf ms\n", ((double) (matrix_mul_start - start)));
    for (int it = 0; it < num_of_thread; ++it)
    {
        cpu_set_t cpu;
        CPU_ZERO(&cpu);
        int start = rows_each_thread * it;
        if (it >= extra_rows_to_assign)
            start += it - extra_rows_to_assign;
        int end = start + rows_each_thread;
        if (it >= extra_rows_to_assign)
            end += 1;
        if (end > n_A) end = n_A;
        CPU_SET(it % 8, &cpu);
        t_info[it].A = A + start * m_A;
        t_info[it].B = B;
        t_info[it].n = end - start;
        t_info[it].m = m_A;
        t_info[it].k = m_B;
        t_info[it].result = C + start * m_C;
#ifndef DEBUG
        printf("n, m, k = %d, %d, %d\n", t_info[it].n, t_info[it].m, t_info[it].k);
#endif
        pthread_create(tid + it, NULL, matrix_operator, t_info + it);
        pthread_setaffinity_np(tid[it], sizeof(cpu_set_t), &cpu);
    }
    for (int it = 0; it < num_of_thread; ++it)
    {
        pthread_join(tid[it], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t);
    end = TIME_MS(t);
    printf("[Info] Done MM in %lf ms...\n", ((double) (end - matrix_mul_start)));
    write_matrix(C, &m_C, &n_C, "result.txt");
    clock_gettime(CLOCK_MONOTONIC, &t);
    end = TIME_MS(t);
    printf("[Info] Done in %lf ms...\n", ((double) (end - start)));
    free(A); A = NULL;
    free(B); B = NULL;
    free(C); C = NULL;
    free(tid); tid = NULL;
    free(t_info); t_info = NULL;
    return 0;
}