Table of Contents
Experiment Objectives and Requirements
- Using threads to accelerate matrix multiply
- Using command to pass arguments
- Timing your solution and expected speed-up
Experiment Content
In this assignment you have to write a version of matrix multiply that uses threads to divide up the work necessary to compute the product of two matrices. There are several ways to improve the performance using threads. You need to divide the product in row dimension among multiple threads in the computation. That is, if you are computing A * B and A is a 10 x 10 matrix and B is a 10 x 10 matrix, your code should use threads to divide up the computation of the 10 rows of the product which is a 10 x 10 matrix. If you were to use 5 threads, then rows 0 and 1 of the product would be computed by thread 0, rows 2 and 3 would be computed by thread 1,…and rows 8 and 9 would be computed by thread 4.
- If the product matrix had 100 rows, then each "strip" of the matrix would have 20 rows for its thread to compute. This form of parallelization is called a "strip" decomposition of the matrix, since the effect of partitioning one dimension only is to assign a "strip" of the matrix to each thread for computation. Note that the number of threads may not divide the number of rows evenly. For example, if the product matrix has 20 rows and you are using 3 threads, some threads will need to compute more rows than others. In a good strip decomposition, the "extra" rows are spread as evenly as possible among the threads. For example, with 20 rows and 3 threads, there are two "extra" rows (20 mod 3 is 2). A good solution will not give both of the extra rows to one thread but, instead, will assign 7 rows to one thread, 7 rows to another, and 6 to the last. Note that for this assignment you can pass all of the A and B matrices to each thread.
- What you need to do: Your program must conform to the following prototype: my_matrix_multiply -a a_matrix_file.txt -b b_matrix_file.txt -t thread_count, where the -a and -b parameters specify input files containing matrices and thread_count is the number of threads to use in your strip decomposition. The input matrix files are text files with the following format. The first line contains two integers: rows columns. Then each line is an element in row-major order. Lines that begin with "#" should be considered comments and should be ignored. Here is an example matrix file:
3 2
#Row 0
0.711306
0.890967
#Row 1
0.345199
0.380204
#Row 2
0.276921
0.026524
This matrix has 3 rows and 2 columns and contains comment lines showing row boundaries in the row-major ordering.
- Your program will need to print out the result of A * B, where A is contained in the file passed via the -a parameter and B is contained in the file passed via the -b parameter. It must print the product in the same format (with comments indicating rows) as the input matrix files: rows and columns on the first line, each element in row-major order on a separate line.
- Your solution will also be timed. If you have implemented the threading correctly, you should expect to see quite a bit of speed-up when the machine you are using has multiple processors. For example, on one machine,
my_matrix_multiply -a a1000.txt -b b1000.txt -t 1
completes in 8.8 seconds when a1000.txt and b1000.txt are both 1000 x 1000 matrices, while with
my_matrix_multiply -a a1000.txt -b b1000.txt -t 2
the time drops to 4.9 seconds (where both of these times come from the "real" number reported by the Linux time command).
Note that this method of timing includes the time necessary to read in both the A and B matrix files. For smaller products, this I/O time may dominate the total execution time. So that you can time the matrix multiply itself during development, a C function has been provided (c-timer.c) that returns the Linux epoch as a double (including fractions of a second); it may be used to time different parts of your implementation. For example, using this timing code, the same execution as shown above completes in 7.9 seconds with one thread and 4.1 seconds with two threads. Thus the I/O (and any argument parsing, etc.) takes approximately 0.8 seconds for the two input matrices.
Matrix Multiply
I used two functions to implement the matrix multiply. My goal is to divide up the rows of the input matrix and of the output matrix. Hence I use the variables start and end to record the first and last row that each thread handles; both are defined in the main function. Besides, I use a struct named matrix_info to pass arguments:
struct matrix_info
{
    float *A;
    float *B;
    int n, m, k;
    float *result;
};
The code below is the function that receives the arguments and computes its strip of the matrix product (the argument unpacking is elided here; the full version appears in the code section at the end).
void *matrix_operator(void *info)
{
    ...
    for (int i = 0; i < n; ++i)
    {
        for (int k_ = 0; k_ < m; ++k_)
        {
            for (int j = 0; j < k; ++j)
            {
                C[i * k + j] += A[i * m + k_] * B[k_ * k + j];
            }
        }
    }
    return NULL;
}
Usually there are some extra rows that need to be assigned across the threads properly. The variables involved are:
- nthread: the number of threads used in the program
- rows_each_thread: the number of rows each ordinary thread computes in matrix_operator()
- extra_rows_to_assign: the index of the first thread that runs matrix_operator() with an extra row (threads ranked at or above this value take rows_each_thread + 1 rows)
- n: the number of rows of the matrix
- it: the rank of each thread
rows_each_thread = n / nthread;
extra_rows_to_assign = nthread - n + rows_each_thread * nthread;
Moreover, I tried pthread_setaffinity_np(tid[it], sizeof(cpu_set_t), &cpu);
to bind threads to specific CPU cores.
The code is below.
struct matrix_info *t_info = malloc(sizeof(struct matrix_info) * num_of_thread);
pthread_t *tid = malloc(sizeof(pthread_t) * num_of_thread);
clock_gettime(CLOCK_MONOTONIC, &t);
matrix_mul_start = TIME_MS(t);
printf("[Info] Done IO in %lf ms\n", ((double) (matrix_mul_start - start)));
for (int it = 0; it < num_of_thread; ++it)
{
    cpu_set_t cpu;
    CPU_ZERO(&cpu);
    int start = rows_each_thread * it;
    if (it >= extra_rows_to_assign)
        start += it - extra_rows_to_assign;
    int end = start + rows_each_thread;
    if (it >= extra_rows_to_assign)
        end += 1;
    if (end > n_A) end = n_A;
    CPU_SET(it % 8, &cpu);
    t_info[it].A = A + start * m_A;
    t_info[it].B = B;
    t_info[it].n = end - start;
    t_info[it].m = m_A;
    t_info[it].k = m_B;
    t_info[it].result = C + start * m_C;
#ifndef DEBUG
    printf("n, m, k = %d, %d, %d\n", t_info[it].n, t_info[it].m, t_info[it].k);
#endif
    pthread_create(tid + it, NULL, matrix_operator, t_info + it);
    pthread_setaffinity_np(tid[it], sizeof(cpu_set_t), &cpu);
}
Command line
In this program, the following command is required to pass arguments:
./m -a A.txt -b B.txt -t 5
Hence, I need to do some preprocessing. Some steps are similar to the steps in Assignment 1.
int main(int argc, char **argv)
{
    //Start of preprocessing
    int m_A, m_B, n_A, n_B;
    float *A, *B, *C;
    float start, matrix_mul_start, end;
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    start = TIME_MS(t);
    A = read_matrix(&m_A, &n_A, argv[2]);
    B = read_matrix(&m_B, &n_B, argv[4]);
    int num_of_thread = atoi(argv[6]);
    if (m_A != n_B)
    {
        printf("[Error] Illegal shape for matmul, got m1, m2 = [%d, %d]\n", m_A, n_B);
        return 1;
    }
    int m_C = m_B;
    int n_C = n_A;
    C = (float *)malloc(sizeof(float) * n_C * m_C);
    //End of preprocessing
    //Pass arguments, do the matrix multiply
    //Free space
}
Time calculation
In a multi-threaded environment, I chose clock_gettime(clockid_t clk_id, struct timespec *res) to measure the time elapsed during my program:
#define TIME_MS(t) (((float) t.tv_sec) * 1000.f + ((float) t.tv_nsec) / 1e6f)
//codes below are in main function
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
start = TIME_MS(t);
Experiment
Test cases
I used this Python script to generate the test matrices.
import random

def generate_random_nums(rows=10, cols=10, writefile='data.txt'):
    '''
    Generate random numbers and write a data file
    with the given number of rows and columns.
    '''
    matrix = []
    for i in range(rows):
        tmp_list = []
        for j in range(cols):
            tmp_list.append(random.randint(1, 100))
        matrix.append([str(o) for o in tmp_list])
    with open(writefile, 'w') as f:
        # first line: rows columns, as the matrix file format requires
        f.write('%d %d\n' % (rows, cols))
        for one_list in matrix:
            f.write(' '.join(one_list) + '\n')

if __name__ == '__main__':
    generate_random_nums(rows=100, cols=100, writefile='A.txt')
    generate_random_nums(rows=100, cols=100, writefile='B.txt')
The tables below show the time used with different thread counts, first for a 100 × 100 test matrix and then for a 1000 × 1000 test matrix.
According to the tables above, I find that using more threads accelerates the program. The effect is obvious when the number of threads goes from 1 to 2 or from 1 to 4. However, as the number of threads keeps increasing, the time cost does not decrease much further, and it even increases slightly.
Personally, I think the reason behind this phenomenon may be that the program spends more time assigning the matrix strips to the different threads.
Summary
In this experiment, I learnt how to use threads (some of the functions in pthread.h). I need to add -lpthread to build programs that use threads, just as I need to add -lm when my program includes math.h; there are many such details to pay attention to on the command line.
Besides, I learnt how to use threads to speed up programs. During the experiment, I observed that increasing the number of threads does accelerate the program, but with too many threads the program spends too much time assigning tasks to the different threads. Therefore, I need to find the proper number of threads if I want to maximize the speed of the program.
Code
//main.cpp
#define _GNU_SOURCE
#include <stdlib.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sched.h>
#define TIME_MS(t) (((float) t.tv_sec) * 1000.f + ((float) t.tv_nsec) / 1e6f)
#define DEBUG
struct matrix_info
{
    float *A;
    float *B;
    int n, m, k;
    float *result;
};
void *matrix_operator(void *info)
{
    struct matrix_info *pinfo = (struct matrix_info *)info;
    float *A = pinfo->A;
    float *B = pinfo->B;
    int n = pinfo->n;
    int m = pinfo->m;
    int k = pinfo->k;
    float *C = pinfo->result;
    for (int i = 0; i < n * k; ++i)
        C[i] = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        for (int k_ = 0; k_ < m; ++k_)
        {
            for (int j = 0; j < k; ++j)
            {
                C[i * k + j] += A[i * m + k_] * B[k_ * k + j];
            }
        }
    }
    return NULL;
}
//read txt file; lines beginning with '#' are comments and are skipped
/*
m:columns
n:rows
fn:A or B
*/
static void skip_comments(FILE *file)
{
    int c;
    while ((c = fgetc(file)) != EOF)
    {
        if (c == '#')
        {
            //consume the rest of the comment line
            while ((c = fgetc(file)) != EOF && c != '\n')
                ;
        }
        else if (c != ' ' && c != '\t' && c != '\n' && c != '\r')
        {
            ungetc(c, file);
            break;
        }
    }
}
float *read_matrix(int *m, int *n, char *fn)
{
    FILE *file = fopen(fn, "r");
    if (file == NULL)
    {
        printf("[Error] Failed to read file %s.\n", fn);
        exit(1);
    }
    fscanf(file, "%d", n);
    fscanf(file, "%d", m);
    float *mat = (float *)malloc(sizeof(float) * (*n) * (*m));
    for (int i = 0; i < (*n) * (*m); ++i)
    {
        skip_comments(file);
        fscanf(file, "%f", &(mat[i]));
    }
    fclose(file);
    return mat;
}
void write_matrix(float *mat, int *m, int *n, char *fn)
{
    FILE *file = fopen(fn, "w");
    //first line: rows columns; then each element on its own line,
    //with a comment marking each row, matching the input format
    fprintf(file, "%d %d\n", *n, *m);
    for (int i = 0; i < *n; ++i)
    {
        fprintf(file, "#Row %d\n", i);
        for (int j = 0; j < *m; ++j)
        {
            fprintf(file, "%f\n", mat[i * (*m) + j]);
        }
    }
    fclose(file);
}
int main(int argc, char **argv)
{
    int m_A, m_B, n_A, n_B;
    float *A, *B, *C;
    float start, matrix_mul_start, end;
    struct timespec t;
    if (argc < 7)
    {
        printf("Usage: %s -a A.txt -b B.txt -t thread_count\n", argv[0]);
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t);
    start = TIME_MS(t);
    A = read_matrix(&m_A, &n_A, argv[2]);
    B = read_matrix(&m_B, &n_B, argv[4]);
    int num_of_thread = atoi(argv[6]);
    if (m_A != n_B)
    {
        printf("[Error] Illegal shape for matmul, got m1, m2 = [%d, %d]\n", m_A, n_B);
        return 1;
    }
    int m_C = m_B;
    int n_C = n_A;
    C = (float *)malloc(sizeof(float) * n_C * m_C);
    int rows_each_thread = n_A / num_of_thread;
    int extra_rows_to_assign = num_of_thread - n_A + rows_each_thread * num_of_thread;
    struct matrix_info *t_info = malloc(sizeof(struct matrix_info) * num_of_thread);
    pthread_t *tid = malloc(sizeof(pthread_t) * num_of_thread);
    clock_gettime(CLOCK_MONOTONIC, &t);
    matrix_mul_start = TIME_MS(t);
    printf("[Info] Done IO in %lf ms\n", ((double) (matrix_mul_start - start)));
    for (int it = 0; it < num_of_thread; ++it)
    {
        cpu_set_t cpu;
        CPU_ZERO(&cpu);
        int start = rows_each_thread * it;
        if (it >= extra_rows_to_assign)
            start += it - extra_rows_to_assign;
        int end = start + rows_each_thread;
        if (it >= extra_rows_to_assign)
            end += 1;
        if (end > n_A) end = n_A;
        CPU_SET(it % 8, &cpu);
        t_info[it].A = A + start * m_A;
        t_info[it].B = B;
        t_info[it].n = end - start;
        t_info[it].m = m_A;
        t_info[it].k = m_B;
        t_info[it].result = C + start * m_C;
#ifndef DEBUG
        printf("n, m, k = %d, %d, %d\n", t_info[it].n, t_info[it].m, t_info[it].k);
#endif
        pthread_create(tid + it, NULL, matrix_operator, t_info + it);
        pthread_setaffinity_np(tid[it], sizeof(cpu_set_t), &cpu);
    }
    for (int it = 0; it < num_of_thread; ++it)
    {
        pthread_join(tid[it], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t);
    end = TIME_MS(t);
    printf("[Info] Done MM in %lf ms...\n", ((double) (end - matrix_mul_start)));
    write_matrix(C, &m_C, &n_C, "result.txt");
    clock_gettime(CLOCK_MONOTONIC, &t);
    end = TIME_MS(t);
    printf("[Info] Done in %lf ms...\n", ((double) (end - start)));
    free(A); A = NULL;
    free(B); B = NULL;
    free(C); C = NULL;
    free(tid); tid = NULL;
    free(t_info); t_info = NULL;
    return 0;
}