StackThreads/MP: version 0.77 User's Guide
Q: My program didn't speed up at all. Why?
A: I don't know :-) Again, it is not likely to be my fault, but here are some guidelines for tracking down the problem.
ST_POLLING. The program never speeds up when you do that.
If the graph is mostly busy (red), the parallelization itself is good. If the graph contains a wide steal (green) area, there is something wrong with the parallelization. One possibility is that you did not create enough threads in the green sections. If you are sure that there should be enough runnable threads there, check whether you call ST_POLLING often enough.
In general, if the graph is mostly busy and yet the program does not exhibit speedup, the total busy time must be increasing as the number of processors increases. This in turn implies either that (1) your program performs extra work when run on multiple processors or (2) workers are invisibly descheduled by the OS. Note that when a worker is invisibly descheduled by, for example, a blocking system call, the event is not observed by the StackThreads/MP runtime system and is hence counted as busy time.
When you are fairly sure that (1) is not the case, (2) is likely the case. Most frequently, this happens when the program calls malloc often. Some malloc implementations simply serialize all malloc requests using a mutex lock of the underlying thread library (such as Pthreads), so calling such a malloc within a parallel section is bad practice. To our knowledge, Solaris malloc serializes all malloc calls. We provide several alternatives that avoid this problem. See Memory Management for more information.
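To make the malloc problem concrete, here is a minimal sketch of one general way to keep malloc out of a parallel section: grab one large block up front and hand out pieces of it without taking any lock. This is not the StackThreads/MP memory management interface; the names arena_t, arena_init, arena_alloc, arena_release and the constant ARENA_ALIGN are all hypothetical, and the sketch assumes that each arena is used by only one thread of control at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for every piece handed out (hypothetical constant). */
#define ARENA_ALIGN 16u

/* Hypothetical arena: one block obtained with a single malloc call. */
typedef struct {
    char  *base;   /* start of the block */
    size_t size;   /* total bytes in the block */
    size_t used;   /* bytes handed out so far */
} arena_t;

/* Call outside the parallel section. Returns 0 on success, -1 on failure. */
static int arena_init(arena_t *a, size_t size)
{
    assert(a != NULL);
    a->base = (char *)malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Takes no lock, so it is only safe while a single thread of control
   uses this arena. Returns NULL when the arena is exhausted. */
static void *arena_alloc(arena_t *a, size_t n)
{
    size_t avail = a->size - a->used;
    size_t rounded;
    void *p;

    if (n > SIZE_MAX - (ARENA_ALIGN - 1))
        return NULL;                      /* rounding would overflow */
    rounded = (n + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
    if (rounded > avail)
        return NULL;
    p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* Call after the parallel section; frees every piece at once. */
static void arena_release(arena_t *a)
{
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

The trade-off in this sketch is that individual pieces cannot be freed; everything is released together. Whether that is acceptable depends on the program, and the alternatives described in Memory Management may fit better.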
You can tell how much your busy time increases by using the profiler. When you run the mkxgrp command, it displays a single-line message that shows how much time workers spent in each state. Each number is the total over all workers. You can run the same program with various numbers of workers (using the -nw option) and see how the busy time increases. When (2) is the case, the typical behavior is that the busy time is roughly proportional to the number of processors.
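The comparison can be written down as a small calculation. If the work is effectively serialized, the elapsed time stays about the same however many workers you use, and every worker is counted as busy for that whole time, so the total busy time grows roughly in step with the number of workers. The sketch below takes busy-time totals that you have read off the profiler output by hand; it does not parse mkxgrp output. The function names and the tolerance parameter are hypothetical.

```c
/* Ratio of the total busy time with several workers to the total busy
   time with one worker. A value near 1.0 means no extra busy time. */
static double busy_growth(double busy_1, double busy_nw)
{
    return busy_nw / busy_1;
}

/* Rough check for case (2): returns 1 when the busy time grew at least
   (1 - tol) * nw times compared with the single-worker run, else 0.
   tol is a hypothetical tolerance, for example 0.2. */
static int looks_serialized(double busy_1, double busy_nw, int nw, double tol)
{
    return busy_growth(busy_1, busy_nw) >= nw * (1.0 - tol);
}
```

For example, with made-up totals of 10 seconds on one worker and 38 seconds on four workers, looks_serialized(10.0, 38.0, 4, 0.2) returns 1, which points toward case (2). With 11 seconds on four workers it returns 0, and the busy time is not the problem.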