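The following steps are involved in adding two floating-point numbers, each given as a significand and an exponent:
Compare the two exponents. If the first exponent is the larger, shift the decimal point of the second significand to the left by the difference between the exponents; otherwise shift the decimal point of the first significand to the left by that difference. (Both operands now share the larger exponent, so the equation reduces to a sum of two significands multiplied by one common power of the base.)
Add the two aligned significands to get the possibly unnormalized answer.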
Round and, if necessary, adjust exponent and significand.
A black-box implementation of this process would have four stages, one per step, the first of which is the "Comparer."
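The four steps above can also be sketched in code. This is only a minimal illustration under stated assumptions: operands are base-10 (significand, exponent) pairs, the significand is kept to four digits, and rounding is round-half-even. The function names (`compare`, `align`, `add_significands`, `round_and_adjust`, `fp_add`) are hypothetical and chosen to mirror the steps, not taken from any particular hardware design.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Assumption: significands keep one digit before the point and three after.
QUANTUM = Decimal("0.001")


def compare(p, q):
    """Step 1 (the "Comparer"): difference between the two exponents."""
    return p - q


def align(a, p, b, q, diff):
    """Step 2: shift the significand with the smaller exponent to the left."""
    if diff >= 0:
        return a, b.scaleb(-diff), p
    return a.scaleb(diff), b, q


def add_significands(a, b):
    """Step 3: add the aligned significands (result may be unnormalized)."""
    return a + b


def round_and_adjust(s, e):
    """Step 4: normalize to 1 <= |s| < 10, round, and fix up the exponent."""
    if s == 0:
        return Decimal(0).quantize(QUANTUM), 0
    shift = s.adjusted()            # position of the leading digit
    s, e = s.scaleb(-shift), e + shift
    s = s.quantize(QUANTUM, rounding=ROUND_HALF_EVEN)
    if abs(s) >= 10:                # rounding carried into a new digit
        s, e = s.scaleb(-1).quantize(QUANTUM), e + 1
    return s, e


def fp_add(a, p, b, q):
    """Compute a*10**p + b*10**q by running the four steps in order."""
    diff = compare(p, q)
    a, b, e = align(a, p, b, q, diff)
    return round_and_adjust(add_significands(a, b), e)


# 9.95 x 10^1 + 7.0 x 10^-1 = 99.5 + 0.7 = 100.2
print(fp_add(Decimal("9.95"), 1, Decimal("7.0"), -1))  # (Decimal('1.002'), 2)
```

Each function depends only on the output of the one before it, which is what lets the steps be overlapped across successive additions.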
The pipeline arithmetic is as follows: Assuming each step in the
floating-point add takes one clock cycle, a single fp add would take four
clock cycles. However, if we could start a second add as soon as the first is
done with the "Comparer," it would take five clock cycles to do two fp adds
and $n + 3$ cycles to do $n$ fp adds, compared with $4n$ cycles without overlap.
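The cycle counts can be checked with a small model. The sketch below assumes an idealized pipeline: every stage takes exactly one clock cycle, a new add enters the first stage on every cycle, and nothing ever stalls. The function names are hypothetical.

```python
def unpipelined_cycles(n_ops, n_stages=4):
    """Cycles when each add must finish before the next one starts."""
    return n_stages * n_ops


def pipelined_cycles(n_ops, n_stages=4):
    """Closed form: the first add takes n_stages cycles, each later add one more."""
    if n_ops == 0:
        return 0
    return n_stages + (n_ops - 1)


def simulate(n_ops, n_stages=4):
    """Count cycles by stepping an idealized pipeline one clock at a time."""
    stages = [None] * n_stages      # stages[0] plays the role of the "Comparer"
    issued = finished = cycles = 0
    while finished < n_ops:
        cycles += 1
        # A new add enters the first stage if any remain to be issued.
        new_op = issued if issued < n_ops else None
        if new_op is not None:
            issued += 1
        # Every add in flight moves one stage further along.
        stages = [new_op] + stages[:-1]
        # An add sitting in the last stage completes at the end of this cycle.
        if stages[-1] is not None:
            finished += 1
    return cycles


for n in (1, 2, 100):
    print(n, unpipelined_cycles(n), pipelined_cycles(n), simulate(n))
# 1 4 4 4
# 2 8 5 5
# 100 400 103 103
```

For large numbers of adds the ratio of unpipelined to pipelined cycles approaches 4, the number of stages, under these assumptions.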
The IBM 7094 (1962) used an "Instruction Backup Register" to buffer "the next instruction." This was used to overlap the execution of one instruction with the fetch of the next. The result was about a 25% increase in performance.
Another strategy for overlapping execution steps is to completely separate program memory from data memory, so that an instruction fetch and a data access can proceed at the same time.
Vector and Matrix processing provide examples of programs that perform identical procedures on different data elements. An array of processors operating on a single instruction stream provides a very important example of hardware parallelism.