The Design of pirules and pirulesfaster:
Each node is sent an interval and is asked to compute the area under the curve over that interval. The results are returned to the head node, which adds them up. So we can think of the split as follows: the parallel part, when all the processors do the work, and then the sequential part, when the head node adds them all together.
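A minimal sketch of this design in C with MPI. The integrand 4/(1 + x^2), the even block split, and all the variable names here are my own assumptions about what pirules does, not its actual source; each rank also derives its slice from its rank instead of having it sent by the head node, but the parallel/sequential split is the same.

    /* pirules-style design: each rank integrates its own slice of [0,1]
       (the parallel part), then every worker sends its partial area to
       rank 0, which adds them up one at a time (the sequential part). */
    #include <mpi.h>
    #include <stdio.h>

    static double f(double x) { return 4.0 / (1.0 + x * x); }  /* area = pi */

    int main(int argc, char **argv)
    {
        int rank, size;
        long n = 10000000;                 /* total intervals; assume size divides n */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long per = n / size;
        double h = 1.0 / n, local = 0.0;
        for (long i = rank * per; i < (rank + 1) * per; i++)
            local += f((i + 0.5) * h) * h; /* midpoint rule on this slice */

        if (rank == 0) {                   /* sequential part: one add per worker */
            double total = local, piece;
            for (int src = 1; src < size; src++) {
                MPI_Recv(&piece, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += piece;
            }
            printf("pi ~= %.12f\n", total);
        } else {
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }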
How do we parallel reduce pirules?
To improve this, let's change the sequential part to add a parallel component. We send the data to the head node by a tree process, so we get parallel adds going on. First, half of the nodes send their partial sums to the other half, and each receiving node adds the incoming number to its own. Then half of those receivers send their running sums to the other half, and so forth, until a single node sends its sum to the head node.
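A sketch of that tree process, written as a replacement for the receive loop in the sketch above (mpi.h included and MPI initialized as before). It assumes the number of processes is a power of two; the function name and structure are mine, not taken from the course code.

    /* Tree reduction: at each step the upper half of the still-active
       ranks sends its partial sum to the lower half, which adds it in.
       After log2(size) steps, rank 0 (the head node) holds the total. */
    double tree_reduce(double local, int rank, int size)
    {
        for (int half = size / 2; half >= 1; half /= 2) {
            if (rank >= half && rank < 2 * half) {
                /* sender: ship the partial sum down, then go idle */
                MPI_Send(&local, 1, MPI_DOUBLE, rank - half, 0, MPI_COMM_WORLD);
                break;
            } else if (rank < half) {
                /* receiver: add the incoming partial sum to our own */
                double piece;
                MPI_Recv(&piece, 1, MPI_DOUBLE, rank + half, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                local += piece;
            }
        }
        return local;   /* the full sum, but only on rank 0 */
    }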
At this point he drew a true binomial tree on the board: even though the drawn tree had 15 nodes, we only need 8 processors, because some of the nodes are "reused" at different levels of the tree. Since, in the code, the binomial tree is mapped onto these processors, from an algorithmic point of view we can still implement the parallel reduction.
4 Hypercubes
An early structure for powerful computers, since the alternative was everyone sharing a party line. With a hypercube, let's call it "direct ethernet", some machines were tied directly together. We do pay some penalties for this: while we can send data quickly to machines directly connected to us, we cannot talk directly to those at a distance, and their data must be sent in multiple hops. At the time switches weren't available, so this was the best they could do. A diagram of the hypercubes is in the class notes, as well as a discussion of Hamming distance, which is the minimum number of bit flips required to change one binary number into another. We can label the nodes in a hypercube using binary numbers, with the Hamming distance between two node labels giving us the number of hops required to communicate between them.
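A small illustration of that labeling, assuming nothing beyond what is said above: XOR two node labels and count the differing bits to get the Hamming distance, i.e. the hop count.

    /* Hamming distance between two hypercube node labels: XOR the labels
       and count the bits that differ; each differing bit is one hop. */
    #include <stdio.h>

    static int hamming(unsigned a, unsigned b)
    {
        int d = 0;
        for (unsigned x = a ^ b; x; x >>= 1)
            d += x & 1;
        return d;
    }

    int main(void)
    {
        /* nodes 010 and 111 differ in two bits, so they are two hops apart */
        printf("hops from 010 to 111: %d\n", hamming(2, 7));
        return 0;
    }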
Draws a diagram on the board showing a Beowulf cluster, with all nodes tied to the center, next to a hypercube of the same nodes, and labels both of these "physical". Then draws the binomial tree diagram and labels it "abstract". Now, thinking about our Beowulf cluster, all nodes are tied to the center with no direct connections between them, so we can map our abstract structure onto the Beowulf cluster pretty much any way we want. However, when mapping the tree onto our hypercube, we have to map it according to the hardware, so that each "hop" of the tree takes only one "hop" on the hypercube. If we did not take this underlying structure into account, we could be forced to wait as data was routed through multiple hops, wasting time.
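A quick check, using a little program of my own, of why the partner choice in tree_reduce above is already hardware-aware on a hypercube: a rank below half and its partner rank + half differ in exactly one bit, so every tree edge is a single hypercube hop, whereas node 111 talking straight to node 000 would have to be routed across three links.

    /* Verify the tree-to-hypercube mapping on a 3-dimensional hypercube. */
    #include <stdio.h>

    static int hops(unsigned a, unsigned b)        /* Hamming distance */
    {
        int d = 0;
        for (unsigned x = a ^ b; x; x >>= 1)
            d += x & 1;
        return d;
    }

    int main(void)
    {
        int size = 8;
        for (int half = size / 2; half >= 1; half /= 2)
            for (int rank = 0; rank < half; rank++)
                printf("tree edge %d <- %d : %d hop(s)\n",
                       rank, rank + half, hops(rank, rank + half));
        printf("direct   0 <- 7 : %d hop(s)\n", hops(0, 7));   /* 3 hops */
        return 0;
    }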
5 Logpirules
Shows a diagram of communication in logpirules. First the master node sends data out to the workers. They work, but instead of all the data being sent back to the master node and added up there, the data is added up with a binomial tree. He points out that if we have 1024 nodes, we only need 10 reduction steps to add all the numbers.
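The notes don't show the logpirules source, so as a sketch: the tail of the first pirules sketch would change from the receive-and-add loop to a single tree reduction, either the hand-rolled tree_reduce above or MPI's built-in MPI_Reduce collective (that logpirules uses MPI_Reduce is my assumption; typical MPI implementations do carry the reduction out as a tree).

    /* logpirules-style ending: one reduction call replaces the loop in
       which rank 0 received and added every worker's value in turn. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.12f\n", total);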
Sounds good, but what is the expected savings for logpirules? Will it really help us in a problem like pirules? The notes contain some rough calculations, but the basic argument is this: suppose we did pirules with 10M intervals on 1000 processors, using the midpoint rule, which requires 8 operations per interval. We divide the 10M intervals among the 1000 processors, so each processor gets 10,000 intervals and has to do roughly 80,000 operations. So the whole pirules run takes about 81,000 time steps (1,000 of them for the REDUCE at the end).
Now with logpirules, we would have 80,000 + 10. So obviously not a big savings. There is just no getting around the fact that, in this example, the parallel reduction can't do anything about our real "time killer": the 10,000 intervals that each processor must compute.
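The rough cost model behind both totals, in my own notation (N intervals, P processors, c operations per interval):

    \[
      T_{\text{pirules}} \approx \frac{N}{P}\,c + P
        = \frac{10^7}{10^3}\cdot 8 + 1000 = 81{,}000,
      \qquad
      T_{\text{logpirules}} \approx \frac{N}{P}\,c + \log_2 P
        \approx 80{,}000 + 10 = 80{,}010.
    \]

Either way the (N/P)*c term dominates, which is exactly the point above.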