StackThreads/MP: version 0.77 User's Guide
There are many ways to express the parallelism that exists in your
program. Suppose you want to perform 100,000 procedure calls f(i)
(i = 0, ..., 99,999) in parallel. If they were done in series, they
would most likely be written using a for loop:

    for (i = 0; i < 100000; i++) {
      f(i);
    }
Therefore, you are likely to parallelize it as follows (in practice,
f may take extra arguments to achieve synchronization, and we may have
to insert some ST_POLLING calls, but we ignore them for simplicity):

    for (i = 0; i < 100000; i++) {
      ST_THREAD_CREATE(f(i));
    }
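As a concrete (but still simplified) illustration, here is one way the
loop might look with the polling made explicit. This is only a sketch:
it assumes ST_POLLING is invoked as ST_POLLING(), that f is declared
elsewhere, the wrapper name start_all is arbitrary, and it still omits
the synchronization that lets the parent wait for all 100,000 children
to complete.

    /* Sketch only: polling made explicit, synchronization still omitted. */
    start_all() {
      int i;
      for (i = 0; i < 100000; i++) {
        ST_POLLING();            /* respond to steal requests from idle workers */
        ST_THREAD_CREATE(f(i));  /* spawn f(i) as a new thread */
      }
    }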
This works only when each f
involves a significant amount of work.
To get directly to the point, the recommended style is the so-called
divide-and-conquer method. We introduce an auxiliary function
f_aux(p, q), which performs f(p), ..., f(q-1), and performs the
recursive calls in parallel (again, synchronization and ST_POLLING
are omitted):
    f_aux(p, q) {
      if (p + 1 == q) f(p);
      else {
        m = (p + q) / 2;
        ST_THREAD_CREATE(f_aux(p, m));
        f_aux(m, q);
      }
    }
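The whole computation is then started by a single top-level call,
using the same bound as the loop above:

    f_aux(0, 100000);   /* performs f(0), ..., f(99999) */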
This works well even when each f(p)
does a small amount of
work. The previous section has explained why this style works.
(This code performs roughly the same number of procedure calls to
f_aux
as the number of procedure calls to f
. Therefore it
adds the overhead of an extra procedure call to each f
. If
f
is too tiny to tolerate this overhead, change the condition in
the second line so that f_aux
performs several calls to f
in a sequential loop. For example:
    f_aux(p, q) {
      if (p + T >= q) {
        for (i = p; i < q; i++) f(i);
      } else {
        m = (p + q) / 2;
        ST_THREAD_CREATE(f_aux(p, m));
        f_aux(m, q);
      }
    }
Here, T
is chosen so that calling f
T times is likely
to mask the overhead of a procedure call).
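To get a feel for how large T should be: with the blocking above,
f_aux makes roughly one extra procedure call per T calls to f, instead
of one per call. As a purely illustrative calculation (these numbers
are not measurements of StackThreads/MP): if a procedure call costs on
the order of 10 instructions and each f costs on the order of 20
instructions, then T = 10 already reduces the amortized overhead to
about one instruction per f, a few percent of the useful work.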
What's wrong with the first program? Well, it is not always wrong, and
I know it is preferable in terms of maintenance cost and simplicity.
Here I explain what happens when we execute the first program, in which
circumstances it is OK, and in which circumstances it should be avoided.
When a worker, W, creates a thread for f(0), W starts f(0). W keeps
its continuation (representing f(1) ... f(99,999)) on its stack. A
request from an idle worker, X, will soon arrive and W gives the
continuation to X. Now, X creates a thread for f(1), leaving the rest
of the work, f(2) ... f(99,999), on its stack. Another idle worker then
steals that continuation and invokes f(2), leaving f(3) ... f(99,999)
on its stack, and so on. Observe that all these thread migrations occur
in sequence. Assuming for simplicity that each f(i) takes approximately
the same time, this program assigns iterations in a round-robin manner,
resulting in approximately 100,000 sequential thread migrations. This
places a lower bound on the time required to finish this loop, namely
100,000 times M, where M is the migration cost. The original sequential
loop takes 100,000 times F, where F is the cost of a single f(i).
Therefore, the achievable speedup on any number of processors is at
most F/M, which may be quite small when f performs only a tiny amount
of work (M is on the order of a thousand instructions).
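To put illustrative numbers on this bound (they are not measurements
of StackThreads/MP): if M is about 1,000 instructions, as suggested
above, and each f(i) costs about 100 instructions, then F/M = 0.1, so
the parallel loop can never run faster than one tenth of the speed of
the sequential loop, no matter how many processors are used. Only when
each f(i) costs well over 1,000 instructions does the simple loop have
a chance of a reasonable speedup.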
Here is the summary: if each f(i) represents a significant amount of
work, or if you have only a small number of processors so that the
above limit is not reached, the simple parallel loop will be OK;
otherwise, use the divide-and-conquer style described above.