StackThreads/MP: version 0.77 User's Guide
There are many ways to express the parallelism that exists in your
program. Suppose you want to perform 100,000 procedure calls f(i)
(i = 0, ..., 99,999) in parallel. If they were done in series, they
would most likely be written using a for loop:

    for (i = 0; i < 100000; i++) {
      f(i);
    }
Therefore, you are likely to parallelize it as follows (in practice,
f may take extra arguments to achieve synchronization, and we may have
to insert some ST_POLLING calls, but we ignore them for simplicity):

    for (i = 0; i < 100000; i++) {
      ST_THREAD_CREATE(f(i));
    }
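As a concrete (but still simplified) illustration, here is one way the
loop might look with the polling made explicit. This is only a sketch:
it assumes ST_POLLING is invoked as ST_POLLING(), that f is declared
elsewhere, the wrapper name start_all is arbitrary, and it still omits
the synchronization that lets the parent wait for all 100,000 children
to complete.

    /* Sketch only: polling made explicit, synchronization still omitted. */
    start_all() {
      int i;
      for (i = 0; i < 100000; i++) {
        ST_POLLING();            /* respond to steal requests from idle workers */
        ST_THREAD_CREATE(f(i));  /* spawn f(i) as a new thread */
      }
    }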
This works only when each f
involves a significant amount of work.
To get directly to the point, the recommended style is the so-called
divide-and-conquer method. We introduce an auxiliary function
f_aux(p, q), which performs f(p), ..., f(q-1), and performs the
recursive calls in parallel (again, synchronization and ST_POLLING
are omitted):
    f_aux(p, q) {
      if (p + 1 == q) f(p);
      else {
        m = (p + q) / 2;
        ST_THREAD_CREATE(f_aux(p, m));
        f_aux(m, q);
      }
    }
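The whole computation is then started by a single top-level call,
using the same bound as the loop above:

    f_aux(0, 100000);   /* performs f(0), ..., f(99999) */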
This works well even when each f(p)
does a small amount of
work. The previous section has explained why this style works.
(This code performs roughly the same number of procedure calls to
f_aux
as the number of procedure calls to f
. Therefore it
adds the overhead of an extra procedure call to each f
. If
f
is too tiny to tolerate this overhead, change the condition in
the second line so that f_aux
performs several calls to f
in a sequential loop. For example:
    f_aux(p, q) {
      if (p + T >= q) {
        for (i = p; i < q; i++) f(i);
      } else {
        m = (p + q) / 2;
        ST_THREAD_CREATE(f_aux(p, m));
        f_aux(m, q);
      }
    }
Here, T
is chosen so that calling f
T times is likely
to mask the overhead of a procedure call).
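To get a feel for how large T should be: with the blocking above,
f_aux makes roughly one extra procedure call per T calls to f, instead
of one per call. As a purely illustrative calculation (these numbers
are not measurements of StackThreads/MP): if a procedure call costs on
the order of 10 instructions and each f costs on the order of 20
instructions, then T = 10 already reduces the amortized overhead to
about one instruction per f, a few percent of the useful work.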
What's wrong with the first program? Well, it is not always wrong, and
I know it is preferable in terms of maintenance cost and simplicity.
Here I explain what happens when we execute the first program, in which
circumstances it is OK, and in which circumstances it should be avoided.
When a worker, W, creates a thread for f(0), W starts f(0). W keeps
its continuation (representing f(1) ... f(99,999)) on its stack. A
request from an idle worker, X, will soon arrive and W gives the
continuation to X. Now, X creates a thread for f(1), leaving the rest
of the work, f(2) ... f(99,999), on its stack. Another idle worker then
steals that continuation and invokes f(2), leaving f(3) ... f(99,999)
on its stack, and so on. Observe that all these thread migrations occur
in sequence. Assuming for simplicity that each f(i) takes approximately
the same time, this program assigns iterations in a round-robin manner,
resulting in approximately 100,000 sequential thread migrations. This
places a lower bound on the time required to finish this loop, namely
100,000 times M, where M is the migration cost. The original sequential
loop takes 100,000 times F, where F is the cost of a single f(i).
Therefore, the achievable speedup on any number of processors is at
most F/M, which may be quite small when f performs only a tiny amount
of work (M is on the order of a thousand instructions).
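To put illustrative numbers on this bound (they are not measurements
of StackThreads/MP): if M is about 1,000 instructions, as suggested
above, and each f(i) costs about 100 instructions, then F/M = 0.1, so
the parallel loop can never run faster than one tenth of the speed of
the sequential loop, no matter how many processors are used. Only when
each f(i) costs well over 1,000 instructions does the simple loop have
a chance of a reasonable speedup.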
Here is the summary: if each f(i) represents a significant amount of
work, or if you have only a small number of processors so that the
above limit is not reached, the simple parallel loop will be OK;
otherwise, use the divide-and-conquer style described above.