Key Parameters of the A100 GPU

Global View

Start with the top-level diagram.
GA100 is the full, defect-free die: every unit on it is functional, as shown below.
![GA100 top-level diagram](https://i-blog.csdnimg.cn/direct/18df1f1b7dbd43a8b2b73499f502cd25.png)

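The A100 has 108 SMs. A block can hold at most 1024 threads, and an SM can hold at most 2048 resident threads.
Put simply, a GA100 die with defects is sold as the A100, with the unusable GPC disabled.
[Figure: A100 block diagram]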

SM View

The diagrams below show a single SM:
[Figures: SM block diagram]

Compute Throughput

Peak compute throughput:
[Figure: compute throughput]

Manufacturing Process

The chip is built on a 7 nm process:
[Figure: process node]

Key Parameters

[Figures: key parameters]

Measured Parameters

The key parameters, as queried from the device:

device properties : 
	name : NVIDIA A100-PCIE-40GB
	totalGlobalMem : 42298834944
	sharedMemPerBlock : 49152
	regsPerBlock : 65536
	warpSize : 32
	memPitch : 2147483647
	maxThreadsPerBlock : 1024
	maxThreadsDim[0] : 1024
	maxThreadsDim[1] : 1024
	maxThreadsDim[2] : 64
	maxGridSize[0] : 2147483647
	maxGridSize[1] : 65535
	maxGridSize[2] : 65535
	clockRate : 1410000
	totalConstMem : 65536
	major : 8
	minor : 0
	textureAlignment : 512
	texturePitchAlignment : 32
	deviceOverlap : 1
	multiProcessorCount : 108
	kernelExecTimeoutEnabled : 0
	integrated : 0
	canMapHostMemory : 1
	computeMode : 0
	concurrentKernels : 1
	ECCEnabled : 1
	pciBusID : 64
	pciDeviceID : 0
	pciDomainID : 0
	tccDriver : 0
	asyncEngineCount : 3
	unifiedAddressing : 1
	memoryClockRate : 1215000
	memoryBusWidth : 5120
	l2CacheSize : 41943040
	persistingL2CacheMaxSize : 31457280
	maxThreadsPerMultiProcessor : 2048
	streamPrioritiesSupported : 1
	globalL1CacheSupported : 1
	localL1CacheSupported : 1
	sharedMemPerMultiprocessor : 167936
	regsPerMultiprocessor : 65536
	managedMemory : 1
	isMultiGpuBoard : 0
	multiGpuBoardGroupID : 0
	singleToDoublePrecisionPerfRatio : 2
	pageableMemoryAccess : 0
	concurrentManagedAccess : 1
	computePreemptionSupported : 1
	canUseHostPointerForRegisteredMem : 1
	cooperativeLaunch : 1
	cooperativeMultiDeviceLaunch : 1
	pageableMemoryAccessUsesHostPageTables : 0
	directManagedMemAccessFromHost : 0
	accessPolicyMaxWindowSize : 134213632

device limit : 
	deviceLimitStackSize : 1024
	deviceLimitPrintfFifoSize : 7077888
	deviceLimitMallocHeapSize : 8388608
	deviceLimitDevRuntimeSyncDepth : 2
	deviceLimitDevRuntimePendingLaunchCount : 2048
	deviceLimitMaxL2FetchGranularity : 64
	deviceLimitPersistingL2CacheSize : 7864320

summary : 
	register total size : 27.00 MiB
	shared memory size per sm : 164.00 KiB
	shared memory total size : 17.30 MiB
	constant memory total size : 64.00 KiB
	level 2 cache total size : 40.00 MiB
	device memory total size : 39.39 GiB
	device memory bandwidth : 1.56 TB/s
	stack memory total size : 216.00 MiB

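Block Distribution Across SMs

  • Work is assigned to SMs in units of blocks.
  • Even-numbered SMs are assigned first, followed by odd-numbered SMs.
  • <<<108, 1024>>> places a block on every SM.
    [Figure: block distribution across SMs]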
Threads resident on each SM (sm0 to sm107) for each launch configuration. For every grid_dim, block_dim was swept over 1, 32, 64, 128, 256, 512 and 1024. In every case an occupied SM holds exactly one block, that is, block_dim threads, and every other SM holds 0.

| grid_dim | SMs holding one block | SMs holding 0 threads |
|----------|-----------------------|-----------------------|
| 1 | sm0 | sm1 to sm107 |
| 16 | even SMs sm0, sm2, ..., sm30 | all others |
| 32 | even SMs sm0, sm2, ..., sm62 | all others |
| 64 | all 54 even SMs (sm0 to sm106) plus odd SMs sm1, sm3, ..., sm19 | odd SMs sm21 to sm107 |
| 108 | all SMs, sm0 to sm107 | none |
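
The placement pattern in the table can be modeled in a few lines. This is a sketch of the observed order only (even SMs first, then odd), not of any documented scheduler policy. The function name `threads_per_sm` is hypothetical, and the model covers only grids of at most one block per SM, which is all the measurement shows.

```cpp
#include <vector>

// Threads resident on each SM for a launch <<<grid_dim, block_dim>>>,
// assuming the even-first placement order seen in the measurement.
// Blocks beyond num_sms are ignored because the table does not cover that case.
std::vector<int> threads_per_sm(int grid_dim, int block_dim, int num_sms = 108) {
    std::vector<int> order;
    for (int sm = 0; sm < num_sms; sm += 2) order.push_back(sm);  // even SMs first
    for (int sm = 1; sm < num_sms; sm += 2) order.push_back(sm);  // then odd SMs

    std::vector<int> threads(num_sms, 0);
    for (int block = 0; block < grid_dim && block < num_sms; ++block) {
        threads[order[block]] += block_dim;  // one block lands on the next SM in order
    }
    return threads;
}
```

For example, `threads_per_sm(64, 1)` fills the 54 even SMs and then sm1 through sm19, leaving sm21 and the later odd SMs empty, which matches the grid_dim 64 rows of the table.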

SM/TPC/GPC Grouping on the A100

Note that all SM indices here are logical indices.
[Figure: SM/TPC/GPC grouping]
The method:

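// __cooperative__
// Each thread records the logical ID of the SM it runs on, plus a clock sample for that SM.
__global__ void KERNEL_NAME(__TEST_CASE_NAME__)(uint32_t *id, uint64_t *clocks) {
  // Flatten the 3D thread and block coordinates into one global thread index.
  int threadInBlock = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
  int blockInGrid = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y;
  int oneBlockSize = blockDim.x * blockDim.y * blockDim.z;
  int tidx = threadInBlock + oneBlockSize * blockInGrid;

  // Execute SYNC_LOOP block-level barriers before sampling the clock.
#pragma unroll
  for (int i = 0; i < SYNC_LOOP; i++) {
      __syncthreads();
  }

  uint64_t start = rt::Clock();

  id[tidx] = __mysmid();       // logical SM ID seen by this thread
  clocks[__mysmid()] = start;  // one slot per SM
}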

static void gpc_test_kernel(int grid_dim, int block_dim, uint32_t *h_id, uint64_t *h_clocks) {
  rt::Error_t err;
  Stream_t stream;
  uint32_t *d_id;
  CHECK_ERROR(rt::Malloc((void **)&d_id, sizeof(uint32_t) * grid_dim * block_dim));
  CHECK_ERROR(rt::Memset(d_id, 0, sizeof(uint32_t) * grid_dim * block_dim));


  uint64_t *d_clocks;
  CHECK_ERROR(rt::Malloc((void **)&d_clocks, sizeof(uint64_t) * grid_dim * block_dim));
  CHECK_ERROR(rt::Memset(d_clocks, 0, sizeof(uint64_t) * grid_dim * block_dim));

  CHECK_ERROR(rt::StreamCreate(&stream));

  // kernel function
  void *args[] = {(void *)&d_id, (void *)&d_clocks};
  err = rt::LaunchCooperativeKernel((const void *)(KERNEL_NAME(__TEST_CASE_NAME__)), grid_dim, block_dim, args, 0,
                                    stream);
  CHECK_ERROR(err);
  CHECK_ERROR(rt::GetLastError());

  // Wait for the kernel to finish before copying the results back.
  CHECK_ERROR(rt::StreamSynchronize(stream));
  CHECK_ERROR(rt::DeviceSynchronize());

  CHECK_ERROR(rt::Memcpy(h_id, d_id, sizeof(uint32_t) * 1 * grid_dim * block_dim, rt::MemcpyDeviceToHost));
  CHECK_ERROR(rt::Memcpy(h_clocks, d_clocks, sizeof(uint64_t) * 1 * grid_dim * block_dim, rt::MemcpyDeviceToHost));

  CHECK_ERROR(rt::StreamDestroy(stream));
  CHECK_ERROR(rt::Free(d_id));
  CHECK_ERROR(rt::Free(d_clocks));
}

int main() {
......
  rt::Error_t err;
  std::ofstream file2(std::string(test_name) + std::string("_gpc_sm_layout.csv"));


  err = rt::SetDevice(0);
  CHECK_ERROR(err);
  rt::DeviceProp device_prop;
  err = rt::GetDeviceProperties(&device_prop, 0);
  CHECK_ERROR(err);

  grid_dim=device_prop.multiProcessorCount;
  block_dim = 1;

  gpc_test_kernel(grid_dim, block_dim, id, h_clocks);
...
}
Measured output with one thread on each of sm0 to sm107. The 108 clock samples take only 7 distinct values:

| clock sample | logical SMs reporting it |
|--------------|--------------------------|
| 3.25863E+14 | sm0-1, sm14-15, sm28-29, sm42-43, sm56-57, sm70-71, sm84-85, sm98-99 |
| 3.25447E+14 | sm2-3, sm16-17, sm30-31, sm44-45, sm58-59, sm72-73, sm86-87, sm100-101 |
| 3.25654E+14 | sm4-5, sm18-19, sm32-33, sm46-47, sm60-61, sm74-75, sm88-89, sm102-103 |
| 3.26144E+14 | sm6-7, sm20-21, sm34-35, sm48-49, sm62-63, sm76-77, sm90-91, sm104-105 |
| 3.2569E+14 | sm8-9, sm22-23, sm36-37, sm50-51, sm64-65, sm78-79, sm92-93, sm106-107 |
| 3.25881E+14 | sm10-11, sm24-25, sm38-39, sm52-53, sm66-67, sm80-81, sm94-95 |
| 3.25386E+14 | sm12-13, sm26-27, sm40-41, sm54-55, sm68-69, sm82-83, sm96-97 |

[Figure: SM-to-GPC layout]
From this figure you can work out how the SMs are divided among the GPCs.
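
One way to turn the clock samples into groups is to bucket SMs that report the same value. The sketch below is an assumption about how such post-processing could look, not the author's tool: `group_by_clock` and its `tolerance` parameter are hypothetical, and it treats SMs whose samples differ by at most `tolerance` as belonging to the same group.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assigns each SM a group index, numbering groups in order of first appearance.
// Two SMs share a group when their clock samples differ by at most `tolerance`.
std::vector<int> group_by_clock(const std::vector<std::uint64_t>& clocks,
                                std::uint64_t tolerance = 0) {
    std::vector<std::uint64_t> representatives;  // first sample seen for each group
    std::vector<int> group(clocks.size(), -1);
    for (std::size_t sm = 0; sm < clocks.size(); ++sm) {
        for (std::size_t g = 0; g < representatives.size(); ++g) {
            const std::uint64_t a = clocks[sm];
            const std::uint64_t b = representatives[g];
            const std::uint64_t diff = a > b ? a - b : b - a;
            if (diff <= tolerance) {
                group[sm] = static_cast<int>(g);
                break;
            }
        }
        if (group[sm] < 0) {  // no existing group matched: open a new one
            group[sm] = static_cast<int>(representatives.size());
            representatives.push_back(clocks[sm]);
        }
    }
    return group;
}
```

Applied to the samples listed above, this yields 7 groups in which logical SMs 2k and 2k+1 always land together and the group index repeats every 14 SMs.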

Compute Capability

[Figure: compute capability]

MIG

A single A100 GPU can be partitioned into up to 7 independent instances.
[Figures: MIG partitioning]

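Parameter List

On the A100, a block can hold at most 1024 threads.
An SM can hold at most 2048 resident threads.
So at most <216, 1024> can be resident at once: 2 blocks of 1024 threads on each of the 108 SMs.
[Figures: parameter tables]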
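
The <216, 1024> figure follows from the thread limits alone. As a rough sketch, the helper below generalizes it to other block sizes. The function names are hypothetical, and the limit of 32 resident blocks per SM is the value for compute capability 8.0 from the CUDA programming guide, not something printed in the dump above. Register and shared-memory usage can lower the result further and are ignored here.

```cpp
#include <algorithm>

// Upper bound on resident blocks per SM from thread limits only.
// max_blocks_per_sm = 32 is an assumption taken from the compute capability 8.0 limits.
int max_resident_blocks_per_sm(int block_dim,
                               int max_threads_per_sm = 2048,
                               int max_threads_per_block = 1024,
                               int max_blocks_per_sm = 32) {
    if (block_dim < 1 || block_dim > max_threads_per_block) return 0;  // invalid launch
    return std::min(max_threads_per_sm / block_dim, max_blocks_per_sm);
}

// Upper bound on resident blocks across the whole device.
int max_resident_blocks(int block_dim, int num_sms = 108) {
    return max_resident_blocks_per_sm(block_dim) * num_sms;
}
```

For `block_dim = 1024` this gives 2 blocks per SM and 216 blocks in total, that is, 221184 resident threads.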
