Hardware Parallelism

Adding Floating-Point Numbers:

The following steps are involved in computing the sum $\QTR{Large}{.a\times 10^{b}+.c\times 10^{d}}$:

1. Compare the exponents $\QTR{Large}{b}$ and $\QTR{Large}{d}$.
2. If $\QTR{Large}{b<d}$, shift the decimal point of $\QTR{Large}{.a}$ $\QTR{Large}{d-b}$ places to the left; else shift the decimal point of $\QTR{Large}{.c}$ $\QTR{Large}{b-d}$ places to the left. (If $\QTR{Large}{b<d}$, both terms now carry the exponent $\QTR{Large}{d}$.)
3. Again assuming $\QTR{Large}{b<d}$, add the shifted $\QTR{Large}{.a}$ to $\QTR{Large}{.c}$ to get the possibly unnormalized answer.
4. Round and, if necessary, adjust exponent and significand.
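
The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real adder: both operands are positive and normalized, significands are held as P-digit integers (so 1234 stands for .1234), the second significand is called c, and digits shifted out during alignment are simply dropped. The names fp_add and P are made up for this sketch.

```python
P = 4  # number of significand digits kept (an assumption for this sketch)

def fp_add(a, b, c, d):
    """Add .a x 10^b and .c x 10^d.

    a and c are P-digit integers standing for the fractions .a and .c;
    b and d are the exponents. Returns (significand, exponent).
    """
    # Step 1: compare the exponents.
    if b < d:
        # Step 2: move the decimal point of .a left by d - b places
        # (low-order digits fall off the end: truncation).
        a //= 10 ** (d - b)
        exp = d
    else:
        # Step 2 (other case): move the decimal point of .c left by b - d places.
        c //= 10 ** (b - d)
        exp = b

    # Step 3: add the aligned significands; the result may be unnormalized.
    total = a + c

    # Step 4: if the sum carried into an extra digit, round it back to
    # P digits (half up) and adjust the exponent.
    if total >= 10 ** P:
        total = (total + 5) // 10
        exp += 1
    return total, exp


# .1234 x 10^2 + .5678 x 10^3 = 12.34 + 567.8, about .5801 x 10^3
print(fp_add(1234, 2, 5678, 3))
```

A real adder would keep guard digits during the shift and handle signs and zero; those are left out here to keep the four steps visible.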

A black-box implementation of this process would look like:

The pipeline arithmetic is as follows: Assuming each step in the floating-point add takes one clock cycle, a single fp add takes four clock cycles. However, if a second add can start as soon as the first is done with the "Comparer," two fp adds take five clock cycles and $\QTR{Large}{n}$ fp adds take $\QTR{Large}{n+3}$ cycles: four cycles for the first result, then one more cycle for each additional add.
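
The cycle counts above can be checked with a small clock-by-clock simulation. This is a sketch under the same assumption as the text (four stages, one cycle each, a new add issued every cycle); the function names are hypothetical.

```python
def unpipelined_cycles(n, stages=4):
    """Each add must finish all stages before the next one starts."""
    return n * stages

def pipelined_cycles(n, stages=4):
    """Count clock cycles until n adds have left the last stage."""
    pipe = [None] * stages      # pipe[i] = index of the add occupying stage i
    issued = finished = cycles = 0
    while finished < n:
        # A new add enters the first stage if any remain to be issued.
        entering = issued if issued < n else None
        if entering is not None:
            issued += 1
        # Every add in flight moves one stage forward.
        pipe = [entering] + pipe[:-1]
        cycles += 1
        # The add in the last stage completes at the end of this cycle.
        if pipe[-1] is not None:
            finished += 1
    return cycles


for n in (1, 2, 10):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

For large n the pipelined adder approaches one result per cycle, a fourfold gain over the unpipelined count of 4n.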

_______________________________________________________________________

Prefetched Instruction Cache:

The IBM 7094 (1962) used an "Instruction Backup Register" to buffer "the next instruction." This was used to overlap the execution of one instruction with the fetch of the next. The result was about a 25% increase in performance.
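
A simple timing model shows why overlapping fetch and execute helps. It is a sketch only: it assumes every instruction has the same fetch time and the same execute time, and that one instruction can be buffered ahead. The times used in the example are made-up illustrative values, not measured IBM 7094 timings.

```python
def serial_time(n, fetch, execute):
    """No overlap: every instruction is fetched, then executed."""
    return n * (fetch + execute)

def overlapped_time(n, fetch, execute):
    """The next instruction is fetched while the current one executes.

    Only the first fetch and the last execute are exposed; in between,
    each instruction costs the longer of the two phases.
    """
    if n == 0:
        return 0
    return fetch + (n - 1) * max(fetch, execute) + execute


n, fetch, execute = 1000, 1, 4      # hypothetical time units
t_serial = serial_time(n, fetch, execute)
t_overlap = overlapped_time(n, fetch, execute)
print(t_serial, t_overlap, t_serial / t_overlap)
```

In this model the gain is largest when fetch and execute take about the same time, and it shrinks as one phase dominates the other.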

________________________________________________________________________

Separate Instruction and Data Spaces:

Another strategy for overlapping execution steps is to completely separate program memory from data memory, so that an instruction fetch and a data access can proceed at the same time.
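
The effect can be sketched by counting memory cycles for a short program. The model is an assumption made for illustration: every instruction needs one cycle to be fetched, some instructions also need one cycle of data access, and a single shared memory can serve only one access per cycle. The function name memory_cycles is hypothetical.

```python
def memory_cycles(needs_data, separate):
    """Count memory cycles for a program.

    needs_data: one bool per instruction, True if it reads or writes data.
    separate:   True for separate instruction and data memories,
                False for one shared memory.
    """
    n = len(needs_data)
    if not separate:
        # Shared memory: instruction fetches and data accesses take turns.
        return n + sum(needs_data)
    # Separate memories: the data access of instruction i overlaps the
    # fetch of instruction i + 1. Only a data access by the last
    # instruction has no fetch to hide behind.
    return n + (1 if n and needs_data[-1] else 0)


program = [True, False, True, True]   # hypothetical instruction mix
print(memory_cycles(program, separate=False))
print(memory_cycles(program, separate=True))
```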

________________________________________________________________________

Array Processors:

Vector and matrix processing provide examples of programs that perform identical procedures on different data elements. An array of processors, each holding its own data but all executing a single instruction stream, is an important example of hardware parallelism.
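
The idea can be sketched as a list of processing elements that all receive the same instruction. This is a toy model, not a description of any particular machine: the class ProcessingElement, the two opcodes, and the function vector_op are invented for the example, and the inner loop stands in for what the hardware would do in all elements at once.

```python
class ProcessingElement:
    """One processor of the array, holding its own pair of operands."""

    def __init__(self, x, y):
        self.x, self.y, self.acc = x, y, 0

    def execute(self, op):
        # Every element understands the same small instruction set.
        if op == "add":
            self.acc = self.x + self.y
        elif op == "mul":
            self.acc = self.x * self.y
        else:
            raise ValueError(f"unknown instruction: {op}")


def vector_op(xs, ys, op):
    """Broadcast one instruction to an array of processing elements."""
    elements = [ProcessingElement(x, y) for x, y in zip(xs, ys)]
    for pe in elements:       # conceptually simultaneous in hardware
        pe.execute(op)
    return [pe.acc for pe in elements]


print(vector_op([1, 2, 3], [10, 20, 30], "add"))   # [11, 22, 33]
```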