2011 Intel® Threading Challenge: Tiling Rectangles

Deng Hui (邓辉) denghui0815@hotmail.com

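Problem Description

Given a rectangular area with integer dimensions, the area can be subdivided into square subregions that also have integer dimensions. This process is called tiling the rectangle. A square-tiled rectangle can be encoded as an ordered list of grouped integers. Starting from the top horizontal side of the rectangle, the squares are "read" from left to right and from top to bottom. The side lengths of squares whose top sides lie on the same horizontal level are grouped together in parentheses and listed from left to right.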

For example, a 4x7 rectangle tiled as follows...

 _ _ _ _ _ _ _
|       |   |_|
|       |_ _|_|
|       |_|   |
|_ _ _ _|_|_ _|

...is encoded as (4 2 1) (1) (1 2) (1)

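Problem: Write a multithreaded program that reads an unknown number of ordered integer sets and determines whether each set is a valid encoding of some square-tiled rectangle. The integer sets to test are stored in the first text file named on the program's command line. For every valid encoding found, the program outputs the height and width of the rectangle together with the correctly formatted tiling encoding. The output is written to the second text file named on the command line.

Input: The input comes from a text file named on the command line. Consecutive lines belong, in order, to the same candidate tiling of a rectangle. A "0" (zero) marks the end of a tiling integer set. Each text line of the input file contains 20 integers; the last line of a tiling may contain fewer than 20 integers and ends with a zero. The end of the file marks the end of the tiling integer sets.

Output: For each candidate integer set in the input file, the program outputs the dimensions of each rectangle and its tiling encoding, and stores the result in the second file named on the command line. If no feasible encoding can be derived from an input set, a message saying so should be printed. The sets must be output in the same order as they were read.

Command line example: tiling.exe setsin.txt rectout.txt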

Example input file, setsin.txt:

4 2 1 1 1 2 1 0
2 1 1 0
17 10 9 5 4 1 2 1 8 1 5 0
36 33 5 28 25 9 2 7 16 0
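The input rules above can be illustrated with a small reader. This is only a sketch under simplifying assumptions: it ignores the 20-integers-per-line layout and simply splits the stream at every 0. The names ParseSets and ParseSetsFromString are hypothetical and do not appear in the original program.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original program): reads
// whitespace-separated integers and closes the current set at every 0.
std::vector<std::vector<int>> ParseSets(std::istream& in) {
    std::vector<std::vector<int>> sets;
    std::vector<int> current;
    int value = 0;
    while (in >> value) {
        if (value == 0) {            // 0 terminates the current set
            sets.push_back(current);
            current.clear();
        } else {
            current.push_back(value);
        }
    }
    return sets;                     // a trailing set without a 0 is dropped
}

// Convenience wrapper so the parser can be tried on an in-memory string.
std::vector<std::vector<int>> ParseSetsFromString(const std::string& text) {
    std::istringstream in(text);
    return ParseSets(in);
}
```

Fed the four sample lines, this yields four sets with 7, 3, 11 and 9 side lengths respectively.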

Example output file, rectout.txt:

Set 1
dimensions: 4 x 7
(4 2 1) (1) (1 2) (1)

Set 2
dimensions: 2 x 3
(2 1) (1)
dimensions: 3 x 2
(2) (1 1)

Set 3
Cannot encode a rectangle

Set 4
dimensions: 61 x 69
(36 33) (5 28) (25 9 2) (7) (16)
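To make the output format concrete, here is a minimal formatter sketch. It assumes the caller already knows the index at which each parenthesised group starts; FormatTiling and groupStart are illustrative names, not the original XTilingRectOutPutXString.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical formatter: groupStart[g] is the index of the first square
// of group g; the last group runs to the end of `squares`.
std::string FormatTiling(int height, int width,
                         const std::vector<int>& squares,
                         const std::vector<std::size_t>& groupStart) {
    std::string out = "dimensions: " + std::to_string(height) + " x " +
                      std::to_string(width) + "\n";
    for (std::size_t g = 0; g < groupStart.size(); ++g) {
        const std::size_t begin = groupStart[g];
        const std::size_t end =
            (g + 1 < groupStart.size()) ? groupStart[g + 1] : squares.size();
        if (g > 0) out += ' ';       // groups are separated by one space
        out += '(';
        for (std::size_t i = begin; i < end; ++i) {
            if (i > begin) out += ' ';
            out += std::to_string(squares[i]);
        }
        out += ')';
    }
    out += '\n';
    return out;
}
```

Note that the samples print the height first and the width second.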

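Timing: The total execution time of the program is used for scoring. For the most accurate timing, the submitted code should include timing code and print the computed total execution time to standard output; otherwise an external stopwatch is used.

Serial Algorithm

Following the problem description, the squares are placed in the order they are read, from left to right and from top to bottom. If the outline is still a rectangle after all squares have been placed, the input is a valid encoding.

Let nArea be the total area of all squares and Wn the sum of the side lengths of the first n squares. nArea % Wn must be zero for a rectangle of width Wn and area nArea to be possible. This condition is necessary but not sufficient, so every Wn that passes it still has to be checked by actually placing the squares.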
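As a small illustration of the divisibility filter, the following sketch lists every first-row length n whose width Wn divides the total area. CandidateFirstRows is a hypothetical name, and the function only applies the necessary condition.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns every n (1-based) for which the sum of the first n side lengths
// divides the total area. Passing this test does not guarantee a tiling.
std::vector<int> CandidateFirstRows(const std::vector<int>& squares) {
    std::uint64_t area = 0;
    for (int s : squares) area += static_cast<std::uint64_t>(s) * s;

    std::vector<int> result;
    std::uint64_t width = 0;
    for (std::size_t n = 0; n < squares.size(); ++n) {
        width += static_cast<std::uint64_t>(squares[n]);
        if (width != 0 && area % width == 0) {
            result.push_back(static_cast<int>(n + 1));
        }
    }
    return result;
}
```

For the first sample set (4 2 1 1 1 2 1, area 28) the candidates are n = 1 (width 4) and n = 3 (width 7); only n = 3 leads to a valid tiling.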

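While tiling, we record the bottom edge of every placed square as a triple (y, x, len): the height y of the edge, its starting coordinate x and its length. If the distance from an edge to the origin is defined as y * W + x, the next square is placed at the start of the edge with the smallest distance (marked in red in the figures).

Example data: 100 150 50 50 100 50 50 50 50 50 0

1. When n is 3, the first row holds three squares, so W = 100 + 150 + 50 = 300. Placing these three squares yields three edges.

The edge with the smallest distance to the origin is marked in red.

2. Place the 4th square at the red point.

3. Place the 5th square at the red point.

4. Place the 6th square at the red point.

The current edge (150, 100, 150) and the edge (150, 250, 50) are now contiguous and at the same height, so they are merged into the edge (150, 100, 200).

5. Place the 7th square at the red point.

6. Place the remaining squares in the same way.

7. After the final merge a single edge (200, 0, 300) remains and the tiling is complete.

Because the algorithm repeatedly needs the edge closest to the origin, the edges are kept in a binary heap. Extracting the minimum costs O(log n), which speeds up the search considerably.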
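The walkthrough above can be condensed into a short, self-contained sketch. It uses std::priority_queue in place of the hand-written heap and orders edges by (y, x), which is the same order as y * W + x because x < W. Edge, FartherFromOrigin and TryTile are illustrative names; the sketch assumes positive side lengths and only reports the height, not the grouping.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct Edge {
    std::uint32_t y;    // height of the bottom edge
    std::uint32_t x;    // left end of the edge
    std::uint32_t len;  // length of the edge
};

// Puts the edge with the smallest (y, x) on top of the priority queue.
struct FartherFromOrigin {
    bool operator()(const Edge& a, const Edge& b) const {
        return a.y != b.y ? a.y > b.y : a.x > b.x;
    }
};

// Tries to tile with the first `firstRow` squares as the top row.
// Returns the rectangle height on success and 0 on failure.
std::uint32_t TryTile(const std::vector<int>& squares, std::size_t firstRow) {
    std::uint64_t area = 0, width = 0;
    for (int s : squares) area += static_cast<std::uint64_t>(s) * s;
    for (std::size_t i = 0; i < firstRow && i < squares.size(); ++i) width += squares[i];
    if (width == 0 || area % width != 0) return 0;

    std::priority_queue<Edge, std::vector<Edge>, FartherFromOrigin> heap;
    std::uint32_t x = 0;
    std::size_t next = 0;
    for (; next < firstRow && next < squares.size(); ++next) {
        const std::uint32_t s = static_cast<std::uint32_t>(squares[next]);
        heap.push(Edge{s, x, s});
        x += s;
    }
    while (!heap.empty()) {
        Edge cur = heap.top();
        heap.pop();
        // Merge edges that continue the current one at the same height.
        while (!heap.empty() && heap.top().y == cur.y &&
               heap.top().x == cur.x + cur.len) {
            cur.len += heap.top().len;
            heap.pop();
        }
        // One flat edge left and no squares remaining: a full rectangle.
        if (heap.empty() && next == squares.size()) return cur.y;
        // Cover the current edge exactly with the next squares.
        std::uint32_t pos = cur.x;
        const std::uint32_t end = cur.x + cur.len;
        while (pos < end && next < squares.size()) {
            const std::uint32_t s = static_cast<std::uint32_t>(squares[next++]);
            heap.push(Edge{cur.y + s, pos, s});
            pos += s;
        }
        if (pos != end) return 0;    // overshoot or ran out of squares
    }
    return 0;
}
```

Calling TryTile with the example data and firstRow = 3 follows exactly the seven steps listed above and ends with the single edge (200, 0, 300).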

[Figure: flowchart of the algorithm]

Source of the core functions:

#define XCOMPAREEDGE(ValA, ValB) ((ValA).nVal64 < (ValB).nVal64)

// Binary heap sift-down: move element i down until both children are not smaller
__inline void XFixDown(int i, int nHeap, XEdge* pHeap)
{
    int j = i * 2 + 1;
    // Pick the smaller of the two children
    if (j + 1 < nHeap && XCOMPAREEDGE(pHeap[j + 1], pHeap[j])) ++j;
    while (j < nHeap && XCOMPAREEDGE(pHeap[j], pHeap[i]))
    {
        // Swap parent and child with a single 128-bit move each way
        const __m128i tmp = pHeap[i].m128Val; pHeap[i].m128Val = pHeap[j].m128Val; pHeap[j].m128Val = tmp;
        i = j; j = i * 2 + 1;
        if (j + 1 < nHeap && XCOMPAREEDGE(pHeap[j + 1], pHeap[j])) ++j;
    }
}

// Binary heap sift-up: move element i up while it is smaller than its parent
__inline void XFixUp(int i, XEdge* pHeap)
{
    int j = (i - 1) / 2;
    while (i > 0 && XCOMPAREEDGE(pHeap[i], pHeap[j]))
    {
        const __m128i tmp = pHeap[i].m128Val; pHeap[i].m128Val = pHeap[j].m128Val; pHeap[j].m128Val = tmp;
        i = j; j = (i - 1) / 2;
    }
}

void XTilingRect(XProblem* pArray)
{
    int* pRect = pArray->pRect;
    int nRect = pArray->nRect;
    int i, j, xs, nHeap = 0;
    uint64 nArea = 0, nRetCnt = 0;
    uint32 nW, nH;
    // One allocation holds the heap followed by the group-start table
    XEdge* pHeap = (XEdge*)scalable_malloc(nRect * sizeof(pHeap[0]) + (nRect + 2) * sizeof(int));
    if (pHeap == NULL) XError("XTilingRect: out of memory!");
    int* pCnt = (int*)(pHeap + nRect);
    XStringExpand(&pArray->xRet, 32 + nRect * 8);
    XStringAddStr(&pArray->xRet, "Set ", 4);
    XStringAddInt(&pArray->xRet, pArray->nIndex + 1, 0);
    XStringAddStr(&pArray->xRet, "\n", 1);
    // Total area of all squares
    for (i = 0; i < nRect; ++i)
    {
        nArea += (uint64)pRect[i] * pRect[i];
    }
    // Try every possible number of squares in the first row
    for (nW = 0, i = 0; i < nRect; ++i)
    {
        nW += pRect[i];
        nH = (uint32)(nArea / nW);
        if ((uint64)nW * nH == nArea)
        {
            // Push the bottom edges of the first-row squares onto the heap
            for (xs = 0, nHeap = 0; nHeap <= i; ++nHeap)
            {
                pHeap[nHeap].ys = pRect[nHeap];
                pHeap[nHeap].xs = xs;
                pHeap[nHeap].len = pRect[nHeap];
                xs += pRect[nHeap];
                XFixUp(nHeap, pHeap);
            }
            int nPreY = 0, nCnt = 0, nCur = nHeap;
            pCnt[nCnt++] = 0;
            while (nHeap)
            {
                // Take the edge closest to the origin and remove it from the heap
                XEdge xEdgeCur = pHeap[0];
                pHeap[0] = pHeap[--nHeap];
                XFixDown(0, nHeap, pHeap);
                while (nHeap > 0 && pHeap[0].ys == xEdgeCur.ys && pHeap[0].xs == xEdgeCur.xs + xEdgeCur.len)
                {   // Merge contiguous edges at the same height
                    xEdgeCur.len += pHeap[0].len;
                    pHeap[0] = pHeap[--nHeap];
                    XFixDown(0, nHeap, pHeap);
                }
                if (nPreY != xEdgeCur.ys)
                {   // A new height starts a new group in the encoding
                    pCnt[nCnt++] = nCur;
                    nPreY = xEdgeCur.ys;
                }
                // Cover the current edge; len now holds the right end of the edge
                xEdgeCur.len += xEdgeCur.xs;
                while (xEdgeCur.xs < xEdgeCur.len && nCur < nRect)
                {   // Push the bottom edge of each newly placed square
                    pHeap[nHeap].ys = xEdgeCur.ys + pRect[nCur];
                    pHeap[nHeap].xs = xEdgeCur.xs;
                    pHeap[nHeap++].len = pRect[nCur];
                    xEdgeCur.xs += pRect[nCur++];
                    XFixUp(nHeap - 1, pHeap);
                }
                if (nHeap == 0 && nCur == nRect)
                {   // Tiling complete, emit the result
                    ++nRetCnt;
                    XTilingRectOutPutXString(&pArray->xRet, pArray->pRect, nW, nH, pCnt, nCnt - 1);
                    break;
                }
                else if (xEdgeCur.xs != xEdgeCur.len)
                {   // The squares do not cover the edge exactly
                    break;
                }
            }
        }
    }
    scalable_free(pHeap);
    if (nRetCnt == 0)
        XStringAddStr(&pArray->xRet, "\nCannot encode a rectangle\n\n", 28);
    else
        XStringAddStr(&pArray->xRet, "\n", 1);
}
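The listing does not show the definition of XEdge. The comparison macro reads nVal64 while the tiling code writes ys and xs, so one layout consistent with that usage (an assumption, not confirmed by the source) is a 64-bit key with ys in the high 32 bits and xs in the low 32 bits. The sketch below shows why a single integer comparison on such a key gives the same order as the distance y * W + x; PackEdgeKey is a hypothetical name.

```cpp
#include <cstdint>

// Assumed key layout: height in the high half, x offset in the low half.
// Comparing two keys then compares (ys, xs) lexicographically, which is
// the same order as ys * W + xs whenever xs < W.
inline std::uint64_t PackEdgeKey(std::uint32_t ys, std::uint32_t xs) {
    return (static_cast<std::uint64_t>(ys) << 32) | xs;
}
```

With W = 300, the edge at (y = 50, x = 250) sorts before (y = 100, x = 0), which in turn sorts before (y = 100, x = 250), matching the order used in the worked example.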

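Hotspot Analysis

Hotspots were analysed with Intel Amplifier (data: 17sisrs.txt replicated to 36192 lines).

[Figure: Amplifier hotspot results]

The analysis shows hotspots in the following functions: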

1. XLoadData_File

// Load the test data, FILE* version
void XLoadData_File(const char* szInput, XProblemAry* pProblemAry)
{
    char szLine[2048] = {0};
    FILE* fp = fopen(szInput, "rb");
    if (fp == NULL) XError("Failed to open the input file!");
    int nReadSize = 1024;
    XProblem xProblemRead = {0};
    XProblemAryExpand(pProblemAry, 512);
    XProblemExpand(&xProblemRead, nReadSize);
    while (fgets(szLine, sizeof(szLine), fp) != NULL)
    {
        // Keep room for up to 20 more integers
        if (xProblemRead.nRect + 20 > nReadSize)
        {
            nReadSize += nReadSize;
            XProblemExpand(&xProblemRead, nReadSize);
        }
        int* pRead = xProblemRead.pRect + xProblemRead.nRect;
        int nRead = sscanf(szLine, "%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d",
            pRead, pRead + 1, pRead + 2, pRead + 3, pRead + 4,
            pRead + 5, pRead + 6, pRead + 7, pRead + 8, pRead + 9,
            pRead + 10, pRead + 11, pRead + 12, pRead + 13, pRead + 14,
            pRead + 15, pRead + 16, pRead + 17, pRead + 18, pRead + 19);
        if (nRead <= 0) continue;   // skip blank lines instead of reading pRead[-1]
        if (pRead[nRead - 1] == 0)
        {   // A trailing 0 closes the current set
            xProblemRead.nRect += nRead - 1;
            xProblemRead.nIndex = pProblemAry->nProblem;
            XProblemAryExpand(pProblemAry, 8);
            XProblemCopy(&pProblemAry->pProblem[pProblemAry->nProblem++], &xProblemRead);
            xProblemRead.nRect = 0;
        }
        else if (nRead == 20)
        {   // A full line: the set continues on the next line
            xProblemRead.nRect += nRead;
        }
        else
        {
            XError("Bad data: fewer than 20 integers and no terminating 0!");
        }
    }
    XProblemFree(&xProblemRead);
    fclose(fp);
}

This function loads the data. It accesses the file through FILE* and parses the numbers with sscanf, which is slow. Reading the data through a memory-mapped file is faster. The optimised code is:

// Load the test data, memory-mapped serial version
void XLoadData_Serial(const char* szInput, XProblemAry* pProblemAry)
{
    // File size
    uint64 nFileSize = 0;
    // File mapping
    XFILEMAPHANDLE hFileMap = XFileMapOpen(szInput, XFILEMAP_READONLY, nFileSize);
    if (hFileMap == NULL) XError("XFileMapOpen Error!");
    // Map the whole file into memory
    const char *pInput = (const char*)XFileMapView(hFileMap, XFILEMAP_READONLY, 0, (uint32)nFileSize);
    if (pInput == NULL) XError("XFileMapView Error!");
    int nReadSize = 1024;
    XProblem xProblemRead = {0};
    XProblemAryExpand(pProblemAry, 512);
    XProblemExpand(&xProblemRead, nReadSize);
    const char *pReadTmp = pInput;
    const char *pReadEnd = pInput + nFileSize;
    while (pReadTmp < pReadEnd)
    {
        xProblemRead.nRect = 0;
        // Skip everything up to the next digit
        while (pReadTmp < pReadEnd && (*pReadTmp < '0' || *pReadTmp > '9')) ++pReadTmp;
        while (pReadTmp < pReadEnd)
        {
            if (xProblemRead.nRect + 20 > nReadSize)
            {
                nReadSize += nReadSize;
                XProblemExpand(&xProblemRead, nReadSize);
            }
            int j, *pReadRect = xProblemRead.pRect + xProblemRead.nRect;
            // Read up to 20 integers; stop at the terminating 0 or at end of data (-1)
            for (j = 0; j < 20; ++j)
            {
                XREAD_INT(pReadTmp, pReadEnd, pReadRect[j]);
                if (pReadRect[j] == 0 || pReadRect[j] == -1) break;
            }
            if (j < 20)
            {
                if (pReadRect[j] == 0)
                {   // The set is complete
                    xProblemRead.nRect += j;
                    xProblemRead.nIndex = pProblemAry->nProblem;
                    XProblemAryExpand(pProblemAry, 8);
                    XProblemCopy(&pProblemAry->pProblem[pProblemAry->nProblem++], &xProblemRead);
                    break;
                }
                else
                {
                    XError("Bad data: fewer than 20 integers and no terminating 0!");
                }
            }
            else
            {
                xProblemRead.nRect += j;
            }
        }
    }
    XProblemFree(&xProblemRead);
    XFileMapUnView((void*)pInput, (unsigned int)nFileSize);
    XFileMapClose(&hFileMap);
}
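The XREAD_INT macro is not shown in the listing. Judging from how it is called, it parses one decimal integer directly from the mapped buffer and yields -1 when the data runs out. A minimal function with that behaviour might look like the sketch below; ReadInt is a hypothetical stand-in, and the -1 convention is inferred from the caller rather than taken from the macro itself.

```cpp
// Hypothetical stand-in for XREAD_INT: skips non-digit characters, then
// parses one non-negative decimal integer. Returns -1 if the buffer ends
// before any digit is found. `p` is advanced past the parsed number.
inline int ReadInt(const char*& p, const char* end) {
    while (p < end && (*p < '0' || *p > '9')) ++p;
    if (p == end) return -1;
    int value = 0;
    while (p < end && *p >= '0' && *p <= '9') {
        value = value * 10 + (*p - '0');
        ++p;
    }
    return value;
}
```

Compared with sscanf, this avoids format-string interpretation and touches every input byte only once, which is where most of the gain over the FILE* version would be expected to come from.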

2. XStringAddInt

This function appends an integer to a string. It uses sprintf; a lookup table is faster.

// Append an integer
void XStringAddInt(XString* pString, uint32 nOut, int nSpace)
{
    char szNum[32];
    sprintf(szNum, nSpace ? "%u " : "%u", nOut);
    XStringAddStr(pString, szNum, (int)strlen(szNum));
}

Add a function that initialises the lookup tables, and change XStringAddInt to use them.

char g_pNum2String[100000][8];
char g_pNum2String_Zero[100000][8];
uint8 g_nNumeStringLen[100000];

// Initialise the number-to-string tables
void XInitNum2StringTab()
{
    for (int i = 0; i < 100000; ++i)
    {
        sprintf(g_pNum2String[i], "%d ", i);         // plain digits plus a trailing space
        sprintf(g_pNum2String_Zero[i], "%05d ", i);  // zero-padded to 5 digits, for the low part
        g_nNumeStringLen[i] = (uint8)(strlen(g_pNum2String[i]) - 1);  // length without the space
    }
}

// Append an integer; nSpace is 1 to append a trailing space, 0 otherwise
void XStringAddInt(XString* pString, uint32 nOut, int nSpace)
{
    const uint32 nHigh = nOut / 100000;
    if (nHigh)
    {
        // Values of 100000 and above: high part, then the zero-padded low five digits
        XStringAddStr(pString, g_pNum2String[nHigh], g_nNumeStringLen[nHigh]);
        XStringAddStr(pString, g_pNum2String_Zero[nOut % 100000], 5 + nSpace);
    }
    else
    {
        XStringAddStr(pString, g_pNum2String[nOut], g_nNumeStringLen[nOut] + nSpace);
    }
}

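3. XTilingRect

This is the tiling algorithm itself. It already uses a binary heap, so there is little room left for optimisation.

Hotspots were then analysed with Intel Amplifier again (data enlarged: 17sisrs.txt replicated to 108576 lines).

[Figure: Amplifier hotspot results after the first round of optimisation]

Going through the hotspot functions shows that XStringAddStr can still be improved. Every appended string is short, so the fixed overhead of memcpy dominates; copying the bytes in a plain loop is faster.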

// Append a string
__inline void XStringAddStr(XString* pString, const char* pData, int nLen)
{
    XStringExpand(pString, nLen);
    memcpy(pString->pData + pString->nUse, pData, nLen);
    pString->nUse += nLen;
}

Modified code:

// Append a string
__inline void XStringAddStr(XString* pString, const char* pData, int nLen)
{
    XStringExpand(pString, nLen);
    char* pDst = pString->pData + pString->nUse;
    // Byte-by-byte copy replaces memcpy for short strings
    for (int i = 0; i < nLen; ++i)
    {
        pDst[i] = pData[i];
    }
    pString->nUse += nLen;
}

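When all squares have the same size, the tilings can be found far more cheaply: every way of writing the number of squares as rows times columns is a valid tiling. XTilingRect is changed to detect this case and handle it separately.

BOOL bEqual = TRUE;
// Check whether all squares have the same size
for (i = 1; bEqual && i < pArray->nRect; ++i)
{
    if (pArray->pRect[i] != pArray->pRect[i - 1]) bEqual = FALSE;
}
if (bEqual)
{
    // Try every possible number of squares in the first row
    for (nW = 1; nW <= nRect; ++nW)
    {
        nH = nRect / nW;
        if (nW * nH == nRect)
        {
            // Row j starts at square nW * j
            for (j = 0; j <= nH; ++j)
            {
                pCnt[j] = nW * j;
            }
            // Tiling complete, emit the result
            ++nRetCnt;
            XTilingRectOutPutXString(&pArray->xRet, pArray->pRect, nW * pRect[0], nH * pRect[0], pCnt, nH);
        }
    }
}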
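The equal-squares shortcut amounts to enumerating the divisors of the number of squares. The following sketch shows just that enumeration, separated from the output code; EqualSquareRectangles is a hypothetical name.

```cpp
#include <utility>
#include <vector>

// For n identical squares of side s, every factorisation n = rows * cols
// gives a valid rectangle. Returns (height, width) pairs in order of
// increasing column count, the same order in which the loop visits them.
std::vector<std::pair<int, int>> EqualSquareRectangles(int n, int s) {
    std::vector<std::pair<int, int>> dims;
    for (int cols = 1; cols <= n; ++cols) {
        if (n % cols == 0) {
            dims.emplace_back((n / cols) * s, cols * s);
        }
    }
    return dims;
}
```

For six squares of side 50 this gives four rectangles: 300 x 50, 150 x 100, 100 x 150 and 50 x 300.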

Parallel Algorithm

Each input file contains many test sets, and solving them in parallel involves no shared data, so cilk_for is enough to parallelise the work. The serial loop is:

for (int i = 0; i < xProblemAry.nProblem; ++i)
{
    XTilingRect(xProblemAry.pProblem + i);
}

With Cilk the loop becomes:

cilk_for (int i = 0; i < xProblemAry.nProblem; ++i)
{
    XTilingRect(xProblemAry.pProblem + i);
}
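For readers without a Cilk-capable compiler, the same independent-iterations pattern can be approximated with standard C++ threads. This is a substitute technique, not what the original program does: cilk_for balances load through work stealing, whereas the hypothetical ParallelFor below uses a fixed strided split.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs work(i) for every i in [0, count) on up to `threads` threads.
// Iterations must be independent of each other, as the test sets are here.
void ParallelFor(std::size_t count, unsigned threads,
                 const std::function<void(std::size_t)>& work) {
    threads = std::max(1u, threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([=, &work] {
            // Thread t handles indices t, t + threads, t + 2 * threads, ...
            for (std::size_t i = t; i < count; i += threads) work(i);
        });
    }
    for (auto& th : pool) th.join();
}
```

A call such as ParallelFor(n, std::thread::hardware_concurrency(), solveOne), where solveOne is a caller-supplied function that solves set i, would play the role of the cilk_for loop. Each iteration must write only to its own result slot so that the output order can be restored afterwards.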

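After compiling, the Concurrency analysis in Amplifier gives the following result:

[Figure: Amplifier concurrency results]

The algorithm already shows good concurrency, but XLoadData_Serial and XStringFree still run serially (XStringFree is called from XSaveDataAndFree_Serial), so optimising the I/O becomes important as well. cilk_for is likewise used to implement a parallel data loader, XLoadData_Parallel, and a parallel save function, XSaveDataAndFree_Parallel.

After recompiling, the Concurrency analysis in Amplifier gives the following result:

[Figure: Amplifier concurrency results with parallel I/O]

Only one function, __security_init_cookie, still runs serially. It is a system function that the runtime calls from mainCRTStartup.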

Performance Test

Test environment:

Operating System: Windows XP Professional (5.1, Build 2600) Service Pack 3

Language: Chinese (Regional Setting: Chinese)

System Manufacturer: Dell Inc.

System Model: Vostro 1088

BIOS: Phoenix ROM BIOS PLUS Version 1.10 A02

Processor: Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10GHz (2 CPUs)

Memory: 2046MB RAM

Page File: 908MB used, 3027MB available

Windows Dir: C:\WINDOWS

DirectX Version: DirectX 9.0c (4.09.0000.0904)

DX Setup Parameters: Not found

DxDiag Version: 5.03.2600.5512 32bit Unicode

Test results:

Version / test data	Setsin.txt	19sisrs.txt	20sisrs.txt
Serial	0.070894 s	0.146068 s	0.357285 s
Parallel	0.041005 s	0.089352 s	0.207256 s
Speedup	1.729	1.635	1.723