Gen9 arch
EU
- combines simultaneous multithreading (SMT) with fine-grained interleaved multithreading (IMT)
- co-issues up to 4 instructions per cycle, each from a different thread; execution is pipelined across multiple threads
- GRF:
  - 28 KB/EU = 128 regs/thread x 32 B/reg (SIMD-8 x 32-bit) x 7 threads
  - 4 KB/thread
- 16 32-bit float ops / cycle:
  - (add + mul) x 2 FPUs x SIMD-4 (each FPU is physically 4 lanes wide)
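The register-file and throughput figures above follow from simple multiplication. A minimal sketch that reproduces the arithmetic, using only the numbers quoted in these notes (the constant names are made up for illustration):

```python
# Figures taken from the EU notes above; names are illustrative only.
REGS_PER_THREAD = 128     # general-purpose registers per hardware thread
BYTES_PER_REG = 8 * 4     # one register = SIMD-8 x 32-bit = 32 bytes
THREADS_PER_EU = 7

grf_per_thread = REGS_PER_THREAD * BYTES_PER_REG  # bytes
grf_per_eu = grf_per_thread * THREADS_PER_EU      # bytes

OPS_PER_LANE = 2          # add + mul counted as two operations
FPUS_PER_EU = 2
LANES_PER_FPU = 4         # SIMD-4

flops_per_cycle = OPS_PER_LANE * FPUS_PER_EU * LANES_PER_FPU

print(grf_per_thread // 1024, "KB per thread")               # 4 KB per thread
print(grf_per_eu // 1024, "KB per EU")                       # 28 KB per EU
print(flops_per_cycle, "32-bit float ops per cycle per EU")  # 16
```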
interconnect
slice
memory hierarchy
- eDRAM: memory-side cache, or bypassed
- coherent region, overhead?
configuration
issue
-
subslice data port
- SIMD gather and scatter
- can access shared local memory
- coalesces scattered memory accesses into fewer 64-byte cache-line requests // so the memory access pattern matters
-
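To see why the access pattern matters for coalescing, the sketch below counts how many distinct 64-byte cache lines a SIMD-16 gather touches. It is a simplified model, not the hardware's actual algorithm; the function name and the three patterns are invented for illustration.

```python
CACHE_LINE_BYTES = 64  # cache-line width quoted elsewhere in these notes

def cache_line_requests(byte_addresses):
    """Return the sorted distinct cache-line indices touched by the lanes."""
    return sorted({addr // CACHE_LINE_BYTES for addr in byte_addresses})

# SIMD-16 gather of 4-byte elements under three access patterns.
contiguous = [4 * lane for lane in range(16)]   # adjacent floats
strided = [64 * lane for lane in range(16)]     # one element per cache line
duplicated = [0] * 16                           # every lane reads the same address

for name, addrs in [("contiguous", contiguous),
                    ("strided", strided),
                    ("duplicated", duplicated)]:
    print(name, len(cache_line_requests(addrs)))
# contiguous 1
# strided 16
# duplicated 1
```

In this model the contiguous and duplicated patterns collapse to a single request, while the strided pattern needs one request per lane.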
L3 data cache on GPU
- banked data cache
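- highly banked shared local memory // watch for bank conflicts; OpenCL refers to it as work-group local memory; it holds programmer-managed data
- supports barriers and atomics; the split of L3 among its three uses (data cache, system buffers, shared local memory) is configurable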
-
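The bank-conflict concern can be illustrated with a toy model. The bank count and bank width below are assumptions chosen for the example, not Gen9 figures, and the model assumes lanes reading the same word are served once (broadcast).

```python
from collections import Counter

NUM_BANKS = 16        # assumed for illustration, not a Gen9 figure
BANK_WIDTH_BYTES = 4  # assumed for illustration, not a Gen9 figure

def worst_case_bank_load(byte_addresses):
    """Largest number of distinct words that map to a single bank.

    Lanes that read the same word are counted once (assumed broadcast).
    A result of 1 means no conflict in this model.
    """
    words = {addr // BANK_WIDTH_BYTES for addr in byte_addresses}
    per_bank = Counter(word % NUM_BANKS for word in words)
    return max(per_bank.values())

unit_stride = [4 * lane for lane in range(16)]        # words 0..15, one per bank
stride_16 = [4 * 16 * lane for lane in range(16)]     # every word lands in bank 0
padded_stride = [4 * 17 * lane for lane in range(16)] # padding spreads the banks

print(worst_case_bank_load(unit_stride))    # 1
print(worst_case_bank_load(stride_16))      # 16
print(worst_case_bank_load(padded_stride))  # 1
```

Padding a row from 16 to 17 words is the usual trick for avoiding column-wise conflicts in a banked memory of this shape.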
LLC-shared
* shared between the CPU cores and Intel HD Graphics // how to use it?
* distributed shared cache // coherence overhead? cache ping-pong?
sharing DRAM
- zero copy
- bandwidth contention
-
eDRAM: memory-side cache, or bypassed
-
64-byte data paths in many places
- 1 a SIMD-16 instruction can source 64-byte-wide operands from 64-byte-wide registers
- 2 such 64-byte-wide registers can be read from or written to L3 over a 64-byte-wide data bus
- 3 within the L3 data cache, each cache line is 64 bytes wide
- 4 the L3 cache's bus to the SoC-shared LLC is also 64 bytes wide
-
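The 64-byte width lines up with SIMD-16 on 32-bit data: 16 lanes x 4 bytes = 64 bytes. A back-of-envelope sketch, assuming one operand simply occupies whole 64-byte transfers (a simplification; the helper name is invented):

```python
DATA_PATH_BYTES = 64  # width quoted in the notes above

def transfers_needed(simd_width, element_bytes):
    """Number of 64-byte transfers one operand occupies (ceiling division)."""
    operand_bytes = simd_width * element_bytes
    return -(-operand_bytes // DATA_PATH_BYTES)

print(transfers_needed(16, 4))  # SIMD-16 x 32-bit = 64 B  -> 1
print(transfers_needed(32, 4))  # SIMD-32 x 32-bit = 128 B -> 2
print(transfers_needed(8, 4))   # SIMD-8  x 32-bit = 32 B  -> 1 (path half used)
```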
EU: flexible SIMD width; 4KB reg file / thread; 28 KB/ EU
-
16-bit float support: mixed-precision computing
-
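A generic illustration of why 16-bit floats are usually paired with wider accumulation (nothing here is Gen9-specific). It uses Python's `struct` formats `"e"` (IEEE half) and `"f"` (IEEE single) to emulate rounding; the helper names are invented for the example.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_single(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, round_fn):
    """Sum values, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total

ones = [to_half(1.0)] * 4096               # inputs stored as 16-bit floats

sum_half = accumulate(ones, to_half)       # 16-bit accumulator
sum_single = accumulate(ones, to_single)   # 32-bit accumulator (mixed precision)

print(sum_half)    # 2048.0: at 2048 the half-precision spacing is 2, so +1 is lost
print(sum_single)  # 4096.0
```

Keeping the inputs in 16 bits saves storage and bandwidth, while the 32-bit accumulator avoids the stall at 2048.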
several parts of the consistency/coherence machinery might affect performance
-
the same virtual addresses can be shared seamlessly across devices; programmable via SVM (shared virtual memory) in OpenCL 2.0
- net effect: pointer-rich data structures can be shared directly between code running on the CPU and code running on the GPU
source:
The Compute Architecture of Intel® Processor Graphics Gen9, Version 1.0