Very Long Instruction Word

from: https://en.wikipedia.org/wiki/Very_long_instruction_word

The traditional means to improve performance in processors include dividing instructions into substeps so the instructions can be executed partly at the same time (termed pipelining), dispatching individual instructions to be executed independently, in different parts of the processor (superscalar architectures), and even executing instructions in an order different from the program (out-of-order execution). These methods all complicate hardware (larger circuits, higher cost and energy use) because the processor must make all of the decisions internally for these methods to work. In contrast, the VLIW method depends on the programs providing all the decisions regarding which instructions to execute simultaneously and how to resolve conflicts. As a practical matter, this means that the compiler (software used to create the final programs) becomes far more complex, but the hardware is simpler than in many other means of parallelism.

==> the compiler will take longer to develope, more error prone and specific to VLIW structures, which is the sacrifice for more efficient hardware structure.

In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes one operation only. For most superscalar designs, the instruction width is 32 bits or fewer.

In contrast, one VLIW instruction encodes multiple operations, at least one operation for each execution unit of a device. For example, if a VLIW device has five execution units, then a VLIW instruction for the device has five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and far wider on some architectures.

For example, the following is an instruction for the Super Harvard Architecture Single-Chip Computer (SHARC). In one cycle, it does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits in one 48-bit instruction:

f12 = f0 * f4, f8 = f8 + f12, f0 = dm(i0, m3), f4 = pm(i8, m9);

Since the earliest days of computer architecture,[1] some CPUs have added several arithmetic logic units (ALUs) to run in parallel. Superscalar CPUs use hardware to decide which operations can run in parallel at runtime, while VLIW CPUs use software (the compiler) to decide which operations can run in parallel in advance. Because the complexity of instruction scheduling is moved into the compiler, complexity of hardware can be reduced substantially.[clarification needed]

==> complexity at hardware level enables abstraction of hardware to the software and thus can greatly enhance compatiblity with multiple software environments/frameworks fostering a better application eco-system for the hardware.

A similar problem occurs when the result of a parallelizable instruction is used as input for a branch. Most modern CPUs guess which branch will be taken even before the calculation is complete, so that they can load the instructions for the branch, or (in some architectures) even start to compute them speculatively. If the CPU guesses wrong, all of these instructions and their context need to be flushed and the correct ones loaded, which takes time.

This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original reduced instruction set computing (RISC) designs has been eroded. VLIW lacks this logic, and thus lacks its energy use, possible design defects, and other negative aspects.

In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch takes an unexpected way, the compiler has already generated compensating code to discard speculative results to preserve program semantics.

==> profiling is static, cheap and not necessarily yielding good results; with the application of JIT compiler, the prediction can be dynamic as hardware level preditions.

Vector processor (single instruction, multiple data (SIMD)) cores can be combined with the VLIW architecture such as in the Fujitsu FR-V microprocessor, further increasing throughput and speed.

other references

Very Long Instruction Word (VLIW) Architecture - GeeksforGeeks

Advantages :

  • Reduces hardware complexity.
  • Reduces power consumption because of reduction of hardware complexity.
  • Since compiler takes care of data dependency check, decoding, instruction issues, it becomes a lot simpler.
  • Increases potential clock rate.
  • Functional units are positioned corresponding to the instruction pocket by compiler.

Disadvantages :

  • Complex compilers are required which are hard to design.
  • Increased program code size.
  • Larger memory bandwidth and register-file bandwidth.
  • Unscheduled events, for example a cache miss could lead to a stall which will stall the entire processor.
  • In case of un-filled opcodes in a VLIW, there is waste of memory space and instruction bandwidth.

What is the VLIW Architecture?

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值