FPGA Architecture Overview

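The first commercial FPGA was the XC2064, released by Xilinx in 1985. Thanks to their reprogrammability and inherent parallelism, FPGAs have since become an important direction for working around the limitations of the von Neumann architecture. In day-to-day use I have long wanted to understand what is actually inside an FPGA, but terms like LUT and CLB always left me lost. This time I was lucky enough to find the answers I was looking for in the book "Data Processing on FPGAs". Its introduction matched exactly the level of detail I was after, and anyone interested is encouraged to read the original.
As a supplement, see also my notes on the Xilinx logic cell (UG474).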


The discussion starts from the design flow shown in Figure 1, which is split into a logical level and a physical level.

Logical
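  • FPGAs are hard to develop for and have a relatively steep learning curve, so a body of work is devoted to high-level synthesis (HLS). A representative example is Prof. Jason Cong's research group, whose spin-off company was later acquired by Xilinx. The goal of HLS is to let developers program directly in a high-level language (such as C, which has a gentler learning curve) and have software translate it automatically into HDL or a circuit implementation. Judging by current results, however, the output still falls noticeably short of hand-written HDL.
  • RTL synthesis converts the RTL description into unoptimized Boolean logic expressions.
  • Logic optimization then optimizes these expressions, for example by removing redundant logic.
  • Technology mapping maps the abstract design representation onto the cells of a technology library (cell library). The library is provided by the device manufacturer and ranges from standard cells (basic logic gates) to more complex megacells (such as a DMA controller). During mapping, the tool chooses among candidate cells according to preset area, power, and timing constraints.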
Physical
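  • The result of the preceding steps is a netlist, which consists of instances (corresponding to the cells from the previous step) and nets (the wires connecting instances). To actually realize the logic in hardware, placement and routing come next.
  • Placement: floorplanning identifies structures that need to sit as close together as possible, and an initial placement is made. The clock tree is then inserted, and the placement is re-evaluated and adjusted.
  • Routing: once placement is complete, the wires are routed, taking the actual geometry of each cell into account.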
    Design flow: formal circuit specification --> physical circuit

Look-Up Tables

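In an ASIC, combinational logic is built by wiring logic gates together in a specific arrangement, and each gate can in turn be traced down to NMOS and PMOS transistors. In an FPGA, logic gates are instead emulated by LUTs.
An n-input LUT needs 2^n SRAM cells to store its truth table. Reprogrammability comes from the fact that the contents of these SRAM cells can be rewritten for each design. As shown on the left of the figure below, the truth table of a 4-input combinational function has 16 rows, one per input combination. The LUT uses the input signals as the select lines of a multiplexer and reads the desired output from the corresponding SRAM cell.
The typical LUT input count has grown from 4 to 6, and LUTs are no longer limited to a single output.
As shown on the right of the figure below, the SRAM cells are usually organized as a shift register, which means that writing the LUT takes 2^n cycles while reading is much faster. This organization is a trade-off made to reduce chip area.
Besides implementing combinational logic, LUTs can also be used as distributed RAM, for example to build small FIFOs (first-in, first-out).
According to the corresponding FPGA datasheet, the propagation delay through one LUT is about 0.28 ns (taking the Xilinx XUPV2P development board as an example).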
Internal structure of a 4-input LUT
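The lookup mechanism described in this section can be sketched in software. The following is a minimal behavioural model, not a description of any real device: the class name `LUT` and its methods are made up for illustration, and the shift-register write path is not modelled. The SRAM is a list of 2^n bits, and the inputs are used only to form the address, just as they drive the multiplexer select lines in hardware.

```python
from itertools import product


class LUT:
    """Behavioural model of an n-input, 1-output LUT (illustrative only)."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        # One configuration bit per truth-table row: 2^n SRAM cells.
        self.sram = [0] * (2 ** n_inputs)

    @staticmethod
    def _address(bits):
        # The input bits act as the multiplexer select lines.
        addr = 0
        for i, b in enumerate(bits):
            addr |= (b & 1) << i
        return addr

    def program(self, func):
        """(Re)program the LUT with the truth table of func(bit0, ..., bit_{n-1})."""
        for bits in product((0, 1), repeat=self.n):
            self.sram[self._address(bits)] = int(bool(func(*bits)))

    def read(self, *bits):
        """Evaluate the LUT: no logic is computed, a stored bit is selected."""
        return self.sram[self._address(bits)]


lut = LUT(4)
lut.program(lambda a, b, c, d: (a & b) ^ (c | d))
print(lut.read(1, 1, 0, 0))

# Reprogramming only rewrites the SRAM contents; the structure stays the same.
lut.program(lambda a, b, c, d: a & b & c & d)
print(lut.read(1, 1, 1, 1))
```

The same object implements two different functions purely by changing the stored bits, which is the sense in which a LUT "emulates" a gate.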

Elementary Logic Units(Slice/ALM)

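An elementary logic unit combines a fixed number of LUTs. Xilinx calls it a slice, and Altera calls it an ALM. The internal structure varies by vendor and product generation, but the four main components are those shown in the figure below: 1) LUTs (typically between 2 and 8), 2) arithmetic/carry logic, 3) 1-bit registers (D flip-flops), and 4) multiplexers. Each LUT is followed by a register that buffers its output, which enables pipelining. A multiplexer selects between the direct LUT output and the registered output, and its select signal is set by additional SRAM cells.
Elementary logic unit
Adjacent LUTs can interact through the carry chain (right side of the figure above). With this design, several LUTs can be combined to implement adders or multipliers.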
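To make the carry chain concrete, here is a small sketch of a ripple-carry adder built from per-bit stages. It assumes a simplified stage in which the LUT computes the propagate signal `a XOR b` and dedicated carry logic adds an XOR gate and a carry multiplexer. This is a common way such stages are drawn, but the exact structure differs between devices, and the function names are hypothetical.

```python
def adder_bit(a, b, carry_in):
    """One stage: a LUT plus the carry logic next to it (simplified model)."""
    propagate = a ^ b                           # computed by the LUT
    total = propagate ^ carry_in                # XOR gate in the carry logic
    carry_out = carry_in if propagate else a    # carry multiplexer
    return total, carry_out


def ripple_carry_add(x, y, width):
    """Chain `width` stages; the carry ripples from bit 0 upward."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = adder_bit((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(5, 6, 4))
print(ripple_carry_add(9, 8, 4))
```

Each stage only needs its neighbour's carry, which is why a short dedicated wire between adjacent LUTs is enough to build wide adders.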

Routing

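As described above, directly connecting adjacent elementary logic units yields adders or multipliers. Implementing a complex design such as an SoC, however, requires interconnect between the different sub-circuits.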
logic island
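A group of elementary logic units is called a logic island, corresponding to the Xilinx CLB and the Altera LAB. As shown in the figure above, both elementary logic units have a vertical carry chain that connects them to their neighbors. They are also connected to other logic islands through the switch matrix (see the figure below). The right side of the figure above already resembles the FPGA structure seen in manuals and slide decks: logic islands are arranged in a two-dimensional array, surrounded by I/O blocks (IOBs).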
switch matrix
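Logic islands communicate over wires, and these wires cross one another to form the switch matrix. At each crossing, a programmable link determines whether the two wires are connected. Take any vertical wire as an example: it crosses three horizontal wires, and the programmable switch decides which crossing is active. In this way the wiring itself is also programmable.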
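The idea of programmable crossings can be illustrated with a toy model. It assumes that each vertical wire is linked to at most one horizontal wire and that a horizontal wire has at most one driver. Real switch matrices are more flexible than this, and the class and method names are invented for the example.

```python
class SwitchMatrix:
    """Toy switch matrix: configuration selects the active crossing per vertical wire."""

    def __init__(self, n_vertical, n_horizontal):
        self.n_horizontal = n_horizontal
        # active[v] is the horizontal wire linked to vertical wire v, or None.
        self.active = [None] * n_vertical

    def connect(self, v, h):
        """Activate the crossing (v, h); any previous link of v is replaced."""
        if not 0 <= h < self.n_horizontal:
            raise ValueError("no such horizontal wire")
        if h in self.active and self.active.index(h) != v:
            raise ValueError("horizontal wire already driven by another wire")
        self.active[v] = h

    def propagate(self, vertical_values):
        """Return the values seen on the horizontal wires (None if undriven)."""
        out = [None] * self.n_horizontal
        for v, h in enumerate(self.active):
            if h is not None:
                out[h] = vertical_values[v]
        return out


sm = SwitchMatrix(n_vertical=2, n_horizontal=3)
sm.connect(0, 2)
sm.connect(1, 0)
print(sm.propagate([1, 0]))

# "Reprogramming the wiring": move vertical wire 0 to a different crossing.
sm.connect(0, 1)
print(sm.propagate([1, 0]))
```

The configuration list plays the role of the SRAM bits that select which crossing is active.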
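The IOBs mentioned earlier support a variety of I/O standards, such as single-ended PCI or differential PCIe and SATA. Modules for serial/parallel conversion and for encoding/decoding are also integrated here.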
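FPGAs as a whole have evolved from "a bag of gates" into "a bag of computer parts", integrating modules such as BRAM and DSP units. A single BRAM can hold a few KB of data, and an FPGA typically has on the order of hundreds of BRAMs. Compared with the distributed RAM described earlier, BRAM is better suited to on-chip storage of larger amounts of data.
On Virtex FPGAs, BRAMs are dual-ported (right side of the figure above): a BRAM can be accessed by two different circuits at the same time, the width of each port is configurable, and the two ports have independent clocks. A dual-port BRAM can also be configured as two single-port BRAMs or as a FIFO queue.
From the soft-core versus hard-core perspective, BRAM and DSP units can be regarded as a kind of hard core.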

FPGA Programming

FPGA design flow

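  • The Xilinx synthesizer (XST) converts the HDL into a gate-level netlist in native generic circuit (NGC) format, mapped onto the Xilinx UNISIM technology library. Third-party synthesis tools, such as those from Synplicity, may also be used at this stage.
  • The ngdbuild tool then combines this gate-level netlist with the constraints into a netlist in native generic database (NGD) format. The constraints are usually stored in a user constraint file (UCF). They attach special elements such as I/O pins and clocks to the modules and specify the timing requirements. The NGD netlist is based on the SIMPRIM library and can be used for timing simulation.
  • The map tool translates the SIMPRIM primitives in the NGD netlist into concrete hardware resources such as logic islands, IOBs, and BRAMs, producing a native circuit description (NCD) file.
  • The par tool performs placement and routing: guided by the timing constraints in the UCF file, it places the physical components of the NCD file at suitable locations and connects them. This is usually the most time-consuming stage.
  • The bitgen tool generates a bitstream that the FPGA can read. The bitstream is downloaded to the board over JTAG using the iMPACT tool. A finite state machine (FSM) on the FPGA is driven by the bitstream: it extracts data from the bitstream and loads it block-wise into the FPGA. Xilinx calls these blocks frames, and each frame is stored in a specific region of the configuration SRAM (see the figure below).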
    configuration regions

Dynamic Partial Reconfiguration

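This technique allows one part of the circuit to be reprogrammed without disturbing the rest (only the frames of the partial reconfiguration region (PRR) are updated). As shown on the right of the figure above, suppose the FPGA loads the configuration data from external memory such as DRAM and passes it to the internal configuration access port (ICAP). A partial bitstream can only be loaded into the region it was prepared for: A1 and A2 can only be loaded into PRR A, not into PRR B. Overcoming this limitation has given rise to the research direction of partial bitstream relocation.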

Built-in Memories in FPGAs

Virtex-7 FPGAs have built-in memories that can be used to implement FIFOs. This section introduces two types of memories, Block RAMs and Distributed RAMs.
It is well known that a FIFO can be implemented as a ring buffer on top of a RAM. The elements of the FIFO are stored in a RAM with separate ports for reading and writing, and two pointers, a read pointer and a write pointer, mark the head and the tail of the stored keys. Hence, a FIFO can be implemented using a simple dual-port RAM, which has independent write-address and read-address inputs. We will show how such RAMs can be configured in FPGAs. We assume that the keys to be stored in FIFOs have 32 bits.
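The ring-buffer scheme described above can be sketched as follows. This is a software model under stated assumptions, not the paper's hardware design: the class name `RingBufferFifo` is hypothetical, the RAM is a plain Python list, and an element counter is used to tell a full buffer from an empty one (hardware designs often use an extra pointer bit instead).

```python
class RingBufferFifo:
    """FIFO on top of a simple dual-port RAM model (one write port, one read port)."""

    def __init__(self, depth, key_bits=32):
        self.depth = depth
        self.mask = (1 << key_bits) - 1
        self.ram = [0] * depth     # the RAM holding the keys
        self.read_ptr = 0          # read address: head of the FIFO
        self.write_ptr = 0         # write address: tail of the FIFO
        self.count = 0             # number of keys currently stored

    def push(self, key):
        if self.count == self.depth:
            raise OverflowError("FIFO full")
        self.ram[self.write_ptr] = key & self.mask
        self.write_ptr = (self.write_ptr + 1) % self.depth   # wrap around
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        key = self.ram[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.depth      # wrap around
        self.count -= 1
        return key


fifo = RingBufferFifo(depth=4)
for k in (1, 2, 3):
    fifo.push(k)
print(fifo.pop())
fifo.push(4)
fifo.push(5)            # this write wraps around to address 0 and 1
print([fifo.pop() for _ in range(4)])
```

Because reads and writes use different addresses, both can in principle happen in the same cycle, which is what the independent address inputs of a simple dual-port RAM provide.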
Virtex-7 FPGAs have a large number of Block RAMs, which can be used as ring buffers for FIFOs. For example, the XC7VX485T has 1,030 Block RAMs, each of which can be configured as one 36kb Block RAM or two 18kb Block RAMs [5]. A 36kb Block RAM and an 18kb Block RAM can be configured as a 1k×36 and a 512×36 simple dual-port memory, respectively, as illustrated in Figure 8. Each has three input ports: write data, write address, and read address. The write ports are used to append a key at the tail, and the read address port is used to read the key at the head. Thus, FIFOs holding 1k and 512 keys of 32 bits can be implemented using 36kb and 18kb Block RAMs, respectively. Larger FIFOs can be implemented using multiple Block RAMs in the obvious way.
Virtex-7 FPGAs also have a large number of Configurable Logic Blocks (CLBs), each of which has two slices [4]. For example, the XC7VX485T has 37,950 CLBs, that is, 75,900 slices. Each slice is either a SLICEM or a SLICEL, and the XC7VX485T has 32,700 SLICEMs and 43,200 SLICELs. Each slice has four 6-input Look-Up Tables (6LUTs), each of which is a 2^6 = 64-bit memory. Those in a SLICEL are read-only, in the sense that the values stored in a 6LUT cannot be updated after the FPGA has been programmed. The values stored in a 6LUT in a SLICEM, on the other hand, can be changed, so it can be used as a 64×1 RAM. Also, each 6LUT in a SLICEM can be configured as four 5-input Look-Up Tables (5LUTs) such that each 5LUT has 2-bit data input and 2-bit data output. Hence, each 5LUT can be used as a 32×2 RAM.
However, the address ports of one of the four LUTs in a SLICEM are shared, so it is not possible to use them independently. In particular, one of the four LUTs has a single address input that specifies both the read and the write address, so it cannot be used to implement a simple dual-port RAM. The remaining three LUTs can be used to implement a simple dual-port RAM. As illustrated in Figure 9, the four LUTs in a SLICEM can be configured as either a 32×6 RAM or a 64×3 RAM. Therefore, we can construct a 32×36 RAM using 6 SLICEMs and a 64×33 RAM using 11 SLICEMs. Thus, FIFOs with 32 keys and with 64 keys can be constructed using 6 SLICEMs (i.e. 24 LUTs) and 11 SLICEMs (i.e. 44 LUTs), respectively. RAMs constructed from LUTs are called Distributed RAMs.
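The SLICEM counts quoted above follow from rounding the 32-bit key width up to a multiple of the per-SLICEM data width. The short calculation below reproduces them. It takes the 32×6 and 64×3 configurations from the text as given, and the helper name `slicems_for_fifo` is made up for this sketch.

```python
import math

# Data bits per SLICEM for each supported depth, as stated in the text:
# a SLICEM can be configured as a 32x6 RAM or as a 64x3 RAM.
BITS_PER_SLICEM = {32: 6, 64: 3}
LUTS_PER_SLICE = 4


def slicems_for_fifo(key_bits, depth):
    """Return (SLICEMs needed, resulting RAM width in bits, LUTs used)."""
    bits = BITS_PER_SLICEM[depth]
    n_slicems = math.ceil(key_bits / bits)
    return n_slicems, n_slicems * bits, n_slicems * LUTS_PER_SLICE


print(slicems_for_fifo(32, 32))   # 32-key FIFO of 32-bit keys
print(slicems_for_fifo(32, 64))   # 64-key FIFO of 32-bit keys
```

This is why the resulting RAMs are 36 and 33 bits wide rather than exactly 32: the width is whatever multiple of 6 or 3 first covers the key.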
The XC7VX485T has 75,900 slices, of which 43,200 are SLICELs and 32,700 are SLICEMs. Since the 4 LUTs in each SLICEM can be configured as a Distributed RAM, a total of 32,700 × 4 = 130,800 LUTs can be used for Distributed RAMs. Also, since both SLICELs and SLICEMs can be used for embedded logic, 75,900 × 4 = 303,600 LUTs can be used for implementing logic. Further, each slice has 8 flip-flops, so registers with a total of 75,900 × 8 = 607,200 bits can be embedded in the XC7VX485T. It also has 1,030 Block RAMs, each of which can be configured as either one 36kb Block RAM or two 18kb Block RAMs.
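As a quick sanity check, the resource totals above can be recomputed from the per-slice figures. The numbers are the ones quoted in the text, and the variable names are only for illustration.

```python
SLICEMS = 32_700
SLICELS = 43_200
LUTS_PER_SLICE = 4
FLIP_FLOPS_PER_SLICE = 8
SLICES_PER_CLB = 2

slices = SLICEMS + SLICELS
clbs = slices // SLICES_PER_CLB
distributed_ram_luts = SLICEMS * LUTS_PER_SLICE   # only SLICEM LUTs are writable
logic_luts = slices * LUTS_PER_SLICE              # every LUT can implement logic
register_bits = slices * FLIP_FLOPS_PER_SLICE

# Block RAM capacities implied by the 1k x 36 and 512 x 36 configurations.
bram36_bits = 1024 * 36
bram18_bits = 512 * 36

print(slices, clbs)
print(distributed_ram_luts, logic_luts, register_bits)
print(bram36_bits, bram18_bits)
```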
From "Optimal Parallel Hardware K-Sorter and Top K-Sorter, with FPGA Implementations"
