StackThreads/MP provides fine-grain multithreading in GNU C/C++, without a sophisticated preprocessor or a native code generator specifically implemented for multithreading. Instead, it works with unmodified GNU C compiler and preprocessor.
With StackThreads/MP, programmers can parallelize existing C code. Unlike traditional thread libraries such as Pthreads, StackThreads/MP supports fine-grain threads, threads whose granularities are much smaller than typical threads in traditional thread libraries. StackThreads/MP creates and finishes a thread with a very small overhead (about 10 RISC instructions). Thus it enables a programming style in which a thread is dynamically created when an independent task is identified. This allows a natural description of many parallel programs, such as parallel divide-and-conquer.
Other goals of StackThreads/MP include (1) helping parallel language developers implement parallel primitives, (2) providing a common runtime framework in which multiple parallel languages can interoperate, and (3) proposing minimal changes to the current calling standard so that any standard-conforming language can support a common multithreading model in the future.
Please report bugs in, and comments on, the StackThreads/MP system and this document to the author at sthreads-author@yl.is.s.u-tokyo.ac.jp. Please join the StackThreads/MP mailing list to get help and hear announcements (visit http://www.yl.is.s.u-tokyo.ac.jp/sthreads).
Currently, StackThreads/MP runs on the following platforms:
The following software must be installed on your platform.
GNU C compiler and GNU awk are mandatory; we do not recommend trying to dispense with them. GNU make is somewhat optional, but we use `ifeq' notation in our Makefiles. Be sure these commands are invoked by typing gcc, gawk, and gmake (or make), respectively. Also be sure that nm (either native or GNU) works correctly.
Installation consists of the following steps.
% gzip -dc sthreads.tar.gz | tar xf -

You will find directory /usr/local/src/sthreads created. We hereafter refer to this directory as the toplevel directory.
Set ST_DIR to the toplevel directory to which you unzipped this software.
Set ST_OS_TYPE to the operating system on which you use this software. Possible values are:

  solaris  (for Solaris)
  irix     (for IRIX)
  osf1     (for Digital UNIX)
  linux    (for Linux)
  winnt    (for Windows NT)
Set ST_CPU_TYPE to the CPU on which you use this software. Possible values are:

  sparc  (for SPARC)
  mips   (for MIPS)
  alpha  (for Alpha)
  i386   (for Intel)
Set ST_THREAD_PKG to the underlying thread package. This specifies the thread package that StackThreads/MP uses. Possible values are:

  st_solaris_thread  (for Solaris threads; possible only on Solaris; recommended on Solaris)
  st_pthread         (for Pthreads; possible except on Windows NT; the only choice on Linux and Digital UNIX)
  st_sfork_thread    (for sproc; possible only on IRIX; recommended on IRIX)
  st_nt_thread       (for NT threads; possible only on NT; the only choice on NT)
  st_no_thread       (for uniprocessors; available on every platform)
Add $(ST_DIR)/bin to your path. These variables are needed not only when you build StackThreads/MP, but also when you compile applications that use StackThreads. Here is an example startup file for csh, to use StackThreads/MP on SPARC Solaris.

setenv ST_DIR /usr/local/src/sthreads
setenv ST_THREAD_PKG st_solaris_thread
setenv ST_OS_TYPE solaris
setenv ST_CPU_TYPE sparc

Here is an example startup for sh, to use StackThreads/MP on SPARC Solaris.

ST_DIR=/usr/local/src/sthreads
export ST_DIR
ST_THREAD_PKG=st_solaris_thread
export ST_THREAD_PKG
ST_OS_TYPE=solaris
export ST_OS_TYPE
ST_CPU_TYPE=sparc
export ST_CPU_TYPE
Build the core library (libst.a) by:

% cd src
% make

On success, this generates library file libst.a as $(ST_DIR)/lib/libst.a. Build the extension library (libstext.a) by:

% cd ../ext_src
% make

On success, this generates library file libstext.a as $(ST_DIR)/lib/libstext.a.
% cd ../ex/small
% make

This builds many small test programs. When everything is compiled successfully, run any of them. Most programs do not take any command line arguments, and those that do display appropriate messages when invoked with no arguments. Some programs are verbose and others are not, but all programs say OK in the last line when nothing went wrong.
On i386, program nestcall does not run successfully. This is a limitation of the current implementation with unmodified GCC. See section Floating SP Problem for a description of the problem. We have a patch to GCC that fixes this problem; see section Applying Safety Patches to GCC for how to apply patches.

On SPARC, strpass does not run successfully. This is another limitation of the current implementation with unmodified GCC. See section Structure passing on SPARC for a description of the problem. We have a patch to GCC that fixes this problem; see section Applying Safety Patches to GCC for how to apply patches.
Try program bin with various numbers of processors. The number of processors can be specified with -nw P, where P is the number of processors. Specify a value between 1 and the number of processors you have. Generally, performance is maximized when P is the number of processors on small systems (like four-processor systems), and a value slightly smaller than that on medium- or large-scale systems (like systems with 10 or more processors). Giving a value higher than the number of processors is allowed, but is generally a bad idea. For example, if you have 16 processors, try:

./bin -nw 1
./bin -nw 5
./bin -nw 10
./bin -nw 14
./bin -nw 15
./bin -nw 16

and see how speed improves.
% cd ../doc
% make

This generates stman.dvi and stman.info. The stman.info file should be copied into the directory where info files are installed on your system.
You may copy lib, bin, and include to whatever directory you like. If you do that, run ranlib on the archives and change your ST_DIR environment variable to the directory you copied them into. For example, if your StackThreads/MP source is in /usr/local/src/sthreads and you want to copy lib, bin, and include to /usr/local/lib/sthreads:

% cd /usr/local/src/sthreads
% cp -r lib bin include /usr/local/lib/sthreads
% ranlib /usr/local/lib/sthreads/lib/libst.a
% ranlib /usr/local/lib/sthreads/lib/libstext.a

and change your ST_DIR setting to:

setenv ST_DIR /usr/local/lib/sthreads

in csh, or to:

ST_DIR=/usr/local/lib/sthreads
export ST_DIR

in sh.
Once you have done that, you may delete the source tree if you wish.
If you are not going to use StackThreads/MP on multiple platforms, or
you do not mind having a separate source tree on each platform, do not
bother to copy the above directories. You can just keep them in place.
This step is normally optional (the known problem concerns the -mflat option; see below), so you can normally skip it and consider applying patches later. These patches guarantee the safety of StackThreads/MP programs compiled by gcc. The distribution includes diffs for gcc 2.7.2.3 and gcc 2.8.1, in gccpatch/gcc-2.7.2.3 and gccpatch/gcc-2.8.1, respectively. See section Applying Safety Patches to GCC for more detail.
Patch gcc 2.8.1 on SPARC! gcc 2.8.1 on SPARC processors has a simple bug in handling the -mflat option, which StackThreads/MP heavily relies upon. If you are going to use StackThreads/MP on SPARC with gcc 2.8.1, apply the patches.
The following program illustrates the use of StackThreads/MP using the "Fibonacci" function.
 1
 2 #include <st.h>
 3
 4
 5 int fib(int n)
 6 {
 7   if (n < 2) {
 8
 9     return 1;
10   } else {
11     int a, b;
12
13
14
15
16     a = fib(n - 1);
17     b = fib(n - 2);
18
19     return a + b;
20
21
22   }
23 }
24
25 void pfib(int n, int * r, st_join_counter_t *c)
26 {
27   if (n < 2) {
28     *r = 1;                      /* write the result */
29     st_join_counter_finish(c);   /* say I have done */
30   } else {
31     int a, b;
32     st_join_counter_t cc[1];
33     ST_POLLING();                /* check if somebody is asking work
34                                     and free stack */
35     st_join_counter_init(cc, 2); /* initialize join counter */
36     ST_THREAD_CREATE(pfib(n - 1, &a, cc)); /* fork fib(n-1, ...) */
37     pfib(n - 2, &b, cc);         /* fork fib(n-2, ...) */
38     st_join_counter_wait(cc);    /* wait for child's completion */
39     *r = a + b;                  /* write the result */
40     st_join_counter_finish(c);   /* say I have done */
41     ST_POLLING();                /* check if somebody is asking work
42                                     and free stack */
43   }
44 }
45
46 int st_main()
47 {
48
49   int n = 33;
50   long pfib_time, sfib_time;
51
52   {
53     long t0, t1, t2, t3;
54     int pr, sr;
55     st_join_counter_t c[1];
56
57     /* do parallel fib */
58     t0 = st_current_time_ms();
59     st_join_counter_init(c, 1);
60     pfib(n, &pr, c);
61     st_join_counter_wait(c);
62     t1 = st_current_time_ms();
63     pfib_time = t1 - t0;
64
65     /* do sequential fib */
66     t0 = st_current_time_ms();
67     sr = fib(n);
68     t1 = st_current_time_ms();
69     sfib_time = t1 - t0;
70
71     if (sr != pr) {
72       fprintf(st_errout, "wrong\n");
73       st_stack_trace_and_die();
74     }
75   }
76
77   printf("pfib: %ld ms on %d processors, sfib: %ld ms\n",
78          pfib_time, st_n_toplevel_workers(), sfib_time);
79   return 0;
80 }
81
fib is a sequential function that computes the Nth Fibonacci number. pfib is a parallel version. The algorithm simply creates a thread for the first recursive call (line 36). Before going into the details of the program, we summarize some practical points.
Include <st.h> (line 2).

Define st_main instead of main. The main function defined in the library processes some command-line arguments that are common to all StackThreads/MP programs (such as -nw), performs necessary setup, and invokes st_main, passing unrecognized command-line arguments in argc, argv, and envp. The return value of st_main becomes the exit status of the program. There is a way to use StackThreads/MP from your own main (see section Calling StackThreads/MP Procedure from Sequential Procedure).
Use the STHREADS predefined macro to tell how the program is being compiled; it is defined iff the program is compiled for StackThreads/MP (see section STGCC/STG++: A wrapper for GCC for more details).
Compile the sample program in the previous section (call it fib.c) by:
% stgcc fib.c -o fib
We use a simple frontend driver for gcc, called stgcc (stg++ is provided for C++ programs). It is NOT a preprocessor, but a simple script that invokes gcc with appropriate options (such as include search paths and library search paths). Also, it automatically links the StackThreads/MP library and utilities.
It is intended to accept ALL gcc options (unless one fundamentally conflicts with StackThreads/MP), so that any C program that can be compiled with gcc can at least be compiled with stgcc.
StackThreads/MP provides a small utility that generates a simple Makefile for new applications. Try:
stmkmf "app_name" > Makefile
For example, if you want to create an application called "fib", you type:
stmkmf fib > Makefile
The generated Makefile has some necessary settings such as CC=stgcc and some useful targets (such as clean). It assumes that you write a single source file (fib.c or fib.cc). For programs with multiple sources, look for the BASES variable in the generated Makefile and add the base names of the other object files there. For example, if the executable fib is going to be generated from fib.o, myutil.o, and main.o, you add myutil main to the BASES variable, like this:

BASES = $(APP) myutil main
If you already have a Makefile for your program, do not bother to generate a new one and merge it into the old one. The generated Makefile does nothing particularly important; simply set the CC variable to stgcc in your original Makefile.
If you successfully compiled the sample program, you should have obtained the executable fib (or fib.exe on Windows NT). Run the program normally by typing the program name from the command line, like this:

harp:366% ./fib

You will see the following message on the terminal:

harp:366% ./fib
pfib: 10714 ms on 1 processors, sfib: 2114 ms
By default, a StackThreads/MP program runs using a single processor. You can use multiple processors by adding the -nw option at the command line. For example, if you want to use 10 processors, try this:

harp:367% ./fib -nw 10
pfib: 1091 ms on 10 processors, sfib: 2062 ms
st_main does not see the command line arguments -nw 10 in its argv; they are removed by the runtime system before entering st_main. (See section Command Line Options Common for All StackThreads/MP Programs for more options.)
Assuming you create a large number of threads, performance is generally maximized when you give a number equal to the number of available processors (on small-scale systems like 4 processor systems) or a number slightly smaller than it (on medium- or large-scale systems like 16 or 64 processor systems). It generally does not improve performance to give a number larger than the number of processors. Also do not try to use many processors when the machine is heavily loaded.
You may notice that the parallel version runs much (5 times) slower than the sequential version. There are simple hacks that improve this to something like 3 times, but for very small functions like fib, overhead of this magnitude is the current state of the art. In practice, the granularity of a thread is typically much larger; StackThreads/MP tolerates well a thread whose granularity is something like 200 instructions. (See section A recommended programming style for performance for advice on achieving good performance with StackThreads/MP.)
NOTE: The number of processors specified by -nw is irrelevant to the number of threads you can create in the program. By giving -nw 10, the program spawns 10 OS-level threads during its initialization. An OS-level thread is a thread directly supported by the OS, such as an LWP on Solaris. We hereafter call an OS-level thread a worker, to avoid confusion between threads supported by StackThreads/MP (created via ST_THREAD_CREATE) and OS-level threads. The StackThreads/MP runtime system dispatches dynamically created threads to workers. Also note that although we say -nw specifies the number of processors, it actually specifies the number of workers. In particular, a worker is not permanently bound to a particular processor. Confusing workers with processors does not normally cause any problem.
Let's now move our focus to the parallel Fibonacci function.
 1 void pfib(int n, int * r, st_join_counter_t *c)
 2 {
 3   if (n < 2) {
 4     *r = 1;                      /* write the result */
 5     st_join_counter_finish(c);   /* say I have done */
 6   } else {
 7     int a, b;
 8     st_join_counter_t cc[1];
 9     ST_POLLING();                /* check if somebody is asking work
10                                     and free stack */
11     st_join_counter_init(cc, 2); /* initialize join counter */
12     ST_THREAD_CREATE(pfib(n - 1, &a, cc)); /* fork fib(n-1, ...) */
13     pfib(n - 2, &b, cc);         /* fork fib(n-2, ...) */
14     st_join_counter_wait(cc);    /* wait for child's completion */
15     *r = a + b;                  /* write the result */
16     st_join_counter_finish(c);   /* say I have done */
17     ST_POLLING();                /* check if somebody is asking work
18                                     and free stack */
19   }
20 }
The algorithm is a simple parallel recursion in which the first recursive call is done in parallel with the caller. This is done by
ST_THREAD_CREATE(pfib(n - 1, &a, cc));
at line 12. Unlike the sequential function, the parallel version takes two additional parameters. The second parameter, r, is a pointer to the storage to which the result value will be stored. The third parameter, c, is a pointer to the storage through which the function signals the end of its computation to the caller.
The third parameter is necessary because, once a procedure call is done in parallel with the caller, the caller at some point needs to make sure that the procedure has finished. This synchronization is done via what we call a "synchronization counter." The type of the counter is st_join_counter_t. Storage for a counter is declared at line 8:

st_join_counter_t cc[1];
Line 11 initializes the counter to 2.
st_join_counter_init(cc, 2); /* initialize join counter */
This declares that two procedures will be associated with the counter cc. The two procedures are called at lines 12 and 13.

ST_THREAD_CREATE(pfib(n - 1, &a, cc)); /* fork fib(n-1, ...) */
pfib(n - 2, &b, cc);                   /* fork fib(n-2, ...) */

cc is given to both procedure calls.
When pfib is finished, it decrements the counter received through the third argument, at lines 5 and 16.
st_join_counter_finish(c);
When the two procedures are finished, the value of the counter becomes zero, because it is initialized to two and decremented twice. At line 14, the parent waits for the value of the counter to become zero.
st_join_counter_wait(cc);
Note that the procedure first writes the return value and then decrements the counter. Conversely, the parent first waits on the counter and then reads the return value. This order is important; this guarantees that when the parent reads the value, the called procedure has written a value. If, for example, the procedure first decrements the counter and then writes the result value, it is possible for the parent to observe that a counter becomes zero before the callee has written the result value.
If you feel that this program is very awkward just for writing small things like fib, you are right. This example is meant to describe the primitives and to mention general cautions in parallel programming. You can usually make the program syntactically nicer by defining a data structure that embeds the synchronization counter, together with wrapper functions that encapsulate the synchronization. For the above pfib function, for example, you might define a data structure "int_box" as follows:
typedef struct int_box {
  int val;
  st_join_counter_t c[1];
} * int_box_t;
and two wrapper functions, void int_box_write(int_box_t b, int x) and int int_box_read(int_box_t b), as follows:

void int_box_write(int_box_t b, int x)
{
  b->val = x;
  st_join_counter_finish(b->c);
}

int int_box_read(int_box_t b)
{
  st_join_counter_wait(b->c);
  return b->val;
}
It is also not difficult to add a new synchronization primitive for different constraints. See section Implementing Synchronization Primitives for the bottom-level machinery for implementing new synchronization primitives.
Up to this point, we have explained that a thread is created by ST_THREAD_CREATE and threads can be synchronized by a synchronization counter. What remains are the calls to ST_POLLING at lines 9 and 17. ST_POLLING does not have any visible effect. One thing that happens in ST_POLLING is that the current processor checks whether an idle processor is requesting a thread from the current processor, and gives it a thread if so (and if the current processor has enough threads). You do not have to understand the details of how threads are migrated or when a thread belongs to a particular processor, but you have to understand that threads never get executed in parallel unless you occasionally call ST_POLLING, which distributes work if there are idle processors. See section Where should you insert ST_POLLING? for hints as to where to insert ST_POLLING.
In summary, converting a sequential program into a parallel one involves the following steps:

Insert ST_THREAD_CREATE where parallelism is necessary. Unlike traditional thread libraries, creating many more threads than the number of processors is no problem in StackThreads/MP.

Insert ST_POLLING to periodically distribute work among processors.
Syntax:
ST_THREAD_CREATE(e);
where e is a procedure call expression in C/C++ that does not have nested procedure calls in argument positions. On SPARC, special care must be taken when you pass structures as arguments (see section Structure passing on SPARC). Unlike similar thread-creation functions in traditional thread libraries such as Pthreads, the procedure can take any number of parameters and, more importantly, the cost of a thread creation is very small (typically something like 10 instructions in addition to the procedure call overhead). Hence, if your computation can tolerate the overhead of a procedure call, it is likely to tolerate the overhead of creating a thread as well.
The actual constraint on argument positions is slightly more relaxed; we actually permit calls to procedures that never call StackThreads/MP primitives, most notably procedures in standard libraries (such as printf, atoi, etc.). But you can remember the above rule as a simple safety net. Here are some examples:

ST_THREAD_CREATE(f(1, 2, 3, 4, 5));

is valid,

ST_THREAD_CREATE(f(atoi(argv[1])));

is valid, because we know atoi never calls StackThreads primitives. Finally,

ST_THREAD_CREATE(f(g(x)));

is invalid if g calls any StackThreads primitive.
Syntax:
ST_POLLING();
Although it does not have any immediately visible effect, you must make sure that this macro is called periodically. It performs two tasks. One is to check whether a thread migration request from an idle processor has arrived, and to respond to it if so. The other is to free some stack frames. See section Where should you insert ST_POLLING? for where to insert this in your program.
Do not worry so much about the overhead of ST_POLLING. If all the workers are already busy and no stack frames need to be freed, the overhead is something like 10 -- 20 instructions.
StackThreads/MP provides standard synchronization primitives including mutexes, semaphores, and condition variables. Mutexes and condition variables behave like their Pthread counterparts, and semaphores (not supported by Pthreads) behave like their Solaris thread counterparts. StackThreads/MP also supports a very commonly used producer-consumer synchronization primitive, called a join counter.
All these waiting primitives (mutex lock, semaphore wait, condition wait, and join counter wait) block the currently running thread (the thread created by ST_THREAD_CREATE) and schedule another thread when the waiting condition is not immediately satisfied.
NOTE: In the following descriptions, we say "xxx signals an error when ...". The meaning of this statement depends on how the library was compiled. These procedures are defined in file src/st_sync.c in the StackThreads/MP distribution, and the file contains a macro #define ST_DIE_ON_ERROR 1 at its head. When this flag is set to 1 (the default), "signals an error" means the application will die when an error occurs (it has nothing to do with UNIX signals). This is useful when debugging (and also when your application is stable enough not to cause such errors). When this flag is zero, it simply means the procedure returns an error value (ST_FAULT).
Syntax:
int st_join_counter_init(j, v);
int st_join_counter_wait(j);
int st_join_counter_finish_1(j, v);
int st_join_counter_finish(j);
int st_join_counter_spawn_1(j, v);
int st_join_counter_spawn(j);
int st_join_counter_destroy(j);

st_join_counter_t * j;
unsigned long v;
st_join_counter_t x = ST_JOIN_COUNTER_INITIALIZER;
A join counter provides simple producer-consumer style synchronization. It internally maintains a non-negative integer (counter). st_join_counter_init(j, v) initializes the counter of j to v. It must be called when no other threads access j concurrently (e.g., immediately after creation). st_join_counter_finish_1(j, v) and st_join_counter_spawn_1(j, v) atomically decrement and increment the counter of j by v, respectively. st_join_counter_finish(j) and st_join_counter_spawn(j) are synonyms for st_join_counter_finish_1(j, 1) and st_join_counter_spawn_1(j, 1), respectively. st_join_counter_wait(j) waits for the counter of j to become zero. st_join_counter_destroy(j) destroys j. It must be called when no other threads access j concurrently. It also checks that the counter is zero; if this is not the case, it signals an error. st_join_counter_t x = ST_JOIN_COUNTER_INITIALIZER(c) defines a global or static variable that is initialized to c.
It is allowed to recycle the storage for a counter many times, as long as the counter is correctly initialized before each use and destroyed after each use.
Typically, you use a synchronization counter to wait for the completion of (or arrival at a certain point by) child threads. Here is a simple example.
{
  st_join_counter_t j[1];
  st_join_counter_init(j, n);
  for (i = 0; i < n; i++) {
    ST_THREAD_CREATE(f(..., j));
  }
  st_join_counter_wait(j);
}
Assuming that f(..., j) calls st_join_counter_finish(j) when it finishes, st_join_counter_wait(j) waits for the completion of all threads created in the loop.
st_join_counter_spawn and st_join_counter_spawn_1 are provided for cases where you do not know how many threads are going to be associated with the counter. Each time you create a thread, you first increment the counter with st_join_counter_spawn and then create the thread.
All these functions return zero upon success and signal an error if an error occurred.
Syntax:
int st_sema_init_1(s, c, lim);
int st_sema_init(s, c);
int st_sema_wait(s);
int st_sema_trywait(s);
int st_sema_post(s);
int st_sema_destroy(s);

st_sema_t * s;
int c, lim;
st_sema_t x = ST_SEMA_INITIALIZER_1(c, p);
A semaphore manages limited resources shared by threads. A semaphore internally maintains a counter. st_sema_init_1(s, c, lim) initializes the counter of s to c. When lim is positive, it specifies the upper limit on the counter. st_sema_init(s, c) is a synonym for st_sema_init_1(s, c, 0) (no upper limit). st_sema_wait(s) waits for the counter of s to become positive and decrements the counter by one. st_sema_trywait(s) behaves like st_sema_wait(s), except that it immediately returns the value ST_BUSY when the counter is not positive. st_sema_post(s) increments the counter of s by one. If an upper limit was imposed by st_sema_init_1 and the counter would exceed that value after the increment, an error is signaled. st_sema_destroy(s) destroys s. It signals an error if the counter of s is not the same as the value given at initialization (i.e., st_sema_post was not called as many times as st_sema_wait). st_sema_t x = ST_SEMA_INITIALIZER_1(c, p) defines a global or static variable x whose counter is c and whose upper limit is p.
All these functions return zero upon success and signal an error if an error occurred. Read standard thread books for an introduction to semaphores.
Syntax:
int st_mutex_init(m);
int st_mutex_trylock(m);
int st_mutex_lock(m);
int st_mutex_unlock(m);
int st_mutex_destroy(m);

st_mutex_t * m;
st_mutex_t x = ST_MUTEX_INITIALIZER;
A mutex is essentially a binary semaphore with upper limit 1 (a semaphore that can only have 0 or 1 as its value). A mutex is either in the locked or the unlocked state. st_mutex_init(m) initializes it to the unlocked state. st_mutex_lock(m) waits for m to become unlocked and locks it. st_mutex_trylock(m) behaves like st_mutex_lock(m), except that it immediately returns the value ST_BUSY when m is locked. st_mutex_unlock(m) unlocks m. It signals an error if m is not locked. st_mutex_destroy(m) destroys m. It signals an error if m is locked. st_mutex_t x = ST_MUTEX_INITIALIZER defines a global or static mutex.
Syntax:
int st_cond_init(c);
int st_cond_wait(c, m);
int st_cond_signal(c);
int st_cond_broadcast(c);
int st_cond_destroy(c);

st_cond_t * c;
st_mutex_t * m;
Condition wait is a general mechanism upon which various kinds of synchronization can be built. Read standard thread books on how to use it.
StackThreads/MP provides a simple set of spin-lock primitives. A location can be atomically read and locked by the read-and-lock primitives. They wait for the location to become unlocked and then atomically lock it. They return the value previously written in the location. A locked location can be unlocked with the write-and-unlock primitives. They take a value and a locked location as arguments, write the value to the location, and unlock it. The location must be locked, or the behavior is undefined. These primitives are useful for performing an atomic read-modify-write on a single location without a separate "lock" word. In addition, a value can be read even when the location is currently locked.
CAUTION: All locking primitives described below (st_read_and_lock_xxx) fail after a certain number of locking attempts. When they fail, the program terminates after printing stack trace information. This is typically a good thing, as it makes debugging easier (a simple error such as forgetting to unlock a location is detected at the next lock attempt). On the other hand, it is in theory possible to kill a program that is otherwise correct. However, in most cases, failing to obtain a lock indicates that you hold a lock too long or you spin-lock a highly contended location, both of which are considered wrong. In most cases, you should change the locking strategy for performance anyway. See section Initialization, Lock, and Unlock for more about this.
Syntax:
typedef struct st_int_loc st_int_loc_t;
typedef struct st_long_loc st_long_loc_t;
typedef struct st_ptr_loc st_ptr_loc_t;

int st_read_and_lock_int(il);
long st_read_and_lock_long(ll);
void * st_read_and_lock_ptr(pl);
void st_write_and_unlock_int(il, i);
void st_write_and_unlock_long(ll, l);
void st_write_and_unlock_ptr(pl, p);
int ST_INT_LOC_INIT(il, i);
long ST_LONG_LOC_INIT(ll, l);
void * ST_PTR_LOC_INIT(pl, p);

st_int_loc_t * il; int i;
st_long_loc_t * ll; long l;
st_ptr_loc_t * pl; void * p;

st_int_loc_t v = ST_INT_LOC_INITIALIZER(x);
st_long_loc_t v = ST_LONG_LOC_INITIALIZER(x);
st_ptr_loc_t v = ST_PTR_LOC_INITIALIZER(x);
Type st_int_loc_t is an int that is `tagged' with `lock bits.' Similarly, st_long_loc_t and st_ptr_loc_t are tagged long and void *, respectively. st_read_and_lock_int(il) waits for the location pointed to by il to become unlocked. When it is unlocked, it atomically locks the location and returns the value previously written at the location pointed to by il. Lock bits are stripped. st_read_and_lock_long(ll) and st_read_and_lock_ptr(pl) work similarly, but on locations that hold values of type st_long_loc_t and st_ptr_loc_t, respectively.
st_write_and_unlock_int(il, i) writes value i to the location pointed to by il and unlocks the location. The location must be locked upon calling the procedure. st_write_and_unlock_long(ll, l) and st_write_and_unlock_ptr(pl, p) are similar procedures for st_long_loc_t * and st_ptr_loc_t *, respectively.
ST_INT_LOC_INIT(il, i) initializes the location pointed to by il to i. The location is initialized to the `unlocked' state. ST_LONG_LOC_INIT(ll, l) and ST_PTR_LOC_INIT(pl, p) are similar procedures for st_long_loc_t * and st_ptr_loc_t *, respectively.
st_int_loc_t v = ST_INT_LOC_INITIALIZER(x); defines a global or static variable of type st_int_loc_t that is initialized to x. ST_LONG_LOC_INITIALIZER(x) and ST_PTR_LOC_INITIALIZER(x) are similar initializers for st_long_loc_t and st_ptr_loc_t, respectively.
st_read_and_lock and st_write_and_unlock can be used to perform an atomic read-modify-write sequence on a single location. Here is a typical example.
st_int_loc_t il = ST_INT_LOC_INITIALIZER(1);

void read_modify_write()
{
  int x = st_read_and_lock_int(&il);
  st_write_and_unlock_int(&il, 3 * x);
}
This code atomically multiplies the value at the location il by 3.
Remark 1: The primitives st_read_and_lock_xxx never block the calling thread; they just wait for the location to become unlocked by spinning. Therefore, make every effort to keep the critical section as short as possible. How short should it be? As a rough approximation, avoid anything more than a couple of arithmetic operations and a couple of loads and stores. Avoid dynamic memory allocation such as malloc while you are holding a lock. Also, never try to lock a location while you are locking another location; that is a common source of deadlocks.
If you cannot avoid having a long critical section, you should not use spin-lock primitives in the first place, but use blocking synchronization primitives. As already shown, StackThreads/MP provides synchronization primitives that block the calling thread and schedule another thread when necessary. st_mutex_t will be most useful for such purposes.
Nonetheless, spin-locks are useful for short critical sections. You should not unconditionally adopt blocking primitives, as critical sections can be made short in many cases, and in such cases spin-locks are both time- and space-efficient.
To avoid deadlocks and severe performance degradation due to wrong use of spin-locks, the st_read_and_lock primitives time out after a certain number of attempts. More specifically, when they fail to obtain a lock 1,000 times in a row, they put the calling worker to sleep for 1 millisecond and retry. They repeat this process 10 times. When the lock is still not granted, they display an error message and the application dies. This may kill an application that would otherwise be correct. However, it can detect a common programming error in which the programmer forgets to write a necessary unlock operation. Even if the application works, holding a lock for 10 milliseconds typically degrades the performance of the application, so you should take the error message as a strong suggestion that your locking strategy is wrong.
Remark 2: These primitives assume that two bits are embedded
for tags within a single word. For example, on machines where int
is
32 bits, st_int_loc_t
uses the least significant two bits for
tags. The value is left-shifted by two bits and stored in the most
significant 30 bits. Consequently the most significant two bits are
discarded when a value is stored. Conversely, when a value is read, the
most significant two bits are sign-extended. Therefore, values
representable by st_int_loc_t
are restricted to -2^30 ... 2^30 - 1. For pointers, we assume the value
of a pointer is 4-byte aligned; that is, the least significant two bits
of a pointer are zero, and we use these two bits for tags. When a value
that is not properly aligned is given to st_write_and_unlock_ptr,
the least significant two bits are simply discarded.
Syntax:
int st_read_int(il);
long st_read_long(ll);
void * st_read_ptr(pl);
int ST_INT_LOC_CHECK(il, i);
int ST_LONG_LOC_CHECK(ll, l);
int ST_PTR_LOC_CHECK(pl, p);
int ST_INT_LOC_LOCKED(il);
int ST_LONG_LOC_LOCKED(ll);
int ST_PTR_LOC_LOCKED(pl);
st_read_int(il)
reads the value at the location pointed to by
il, but does not lock the location. It can immediately read the
value even if the location is locked. st_read_long(ll) and
st_read_ptr(pl) are similar procedures for locations of type
st_long_loc_t * and st_ptr_loc_t *, respectively.
ST_INT_LOC_CHECK(il, i)
is 1 when il is currently unlocked and its value equals i. This
is semantically equivalent to (st_read_int(il) == i), but on some
CPUs the former is
considerably faster. ST_LONG_LOC_CHECK(ll, l)
and
ST_PTR_LOC_CHECK(pl, p)
are similar primitives for
st_long_loc_t *
and st_ptr_loc_t *
, respectively.
ST_INT_LOC_LOCKED(il)
is 1 if the location il is currently
locked. ST_LONG_LOC_LOCKED(ll)
and ST_PTR_LOC_LOCKED(pl)
are similar primitives for st_long_loc_t *
and st_ptr_loc_t *
,
respectively.
Syntax:
int st_read_and_lock_any_int(il, els, n, ires);
int st_read_and_lock_any_long(ll, els, n, lres);
int st_read_and_lock_any_ptr(pl, els, n, pres);
int st_try_read_and_lock_int(il, ires);
int st_try_read_and_lock_long(ll, lres);
int st_try_read_and_lock_ptr(pl, pres);
int els;
int n;
st_int_loc_t * il;
int * ires;
st_long_loc_t * ll;
long * lres;
st_ptr_loc_t * pl;
void ** pres;
Suppose il points to address a.
st_read_and_lock_any_int(il, els, n, ires) locks any one of
the locations { a, a + els, a + 2 els, ... ,
a + (n - 1) els } and returns the index of the locked
location; that is, if it locks address a + k els, it returns k.
It also writes the value previously stored at a + k els to
*ires. The locked location can then be unlocked by applying
st_write_and_unlock_int to the address a + k els.
st_read_and_lock_any_long
and st_read_and_lock_any_ptr
are
similar functions for st_long_loc_t *
and st_ptr_loc_t *
,
respectively.
These primitives are useful when multiple resources are available and you want to lock whatever resource is available.
st_try_read_and_lock_int(il, ires)
tries to lock the
location il. If it successfully obtains the lock, it writes the
value previously written at il to *ires and returns zero. It
otherwise returns -1. st_try_read_and_lock_long
and
st_try_read_and_lock_ptr
are similar functions for
st_long_loc_t *
and st_ptr_loc_t *
, respectively.
Syntax:
int st_fetch_and_add_int(il, ix);
long st_fetch_and_add_long(ll, lx);
st_int_loc_t * il;
int ix;
st_long_loc_t * ll;
long lx;
st_fetch_and_add_int
atomically increments the value written at
il by ix and returns the value previously written at il.
st_fetch_and_add_long
is a similar function for
st_long_loc_t *
.
Syntax:
void st_yield();
st_yield() temporarily deschedules the current thread; the thread will be scheduled again later. For most parallel programs, there should be no reason to deschedule a runnable thread. We discourage using this procedure to wait for a particular event to occur; in such circumstances, suspend the thread and let another thread resume it when the condition is satisfied.
There are several procedures that report the stack usage of the running worker. All of these procedures are normally of no interest to you; they are provided primarily for debugging, for understanding stack usage, and for precisely controlling when stack frames are reused.
Syntax:
long st_stack_used_bytes();

st_stack_used_bytes() returns the stack size of the calling worker in
bytes: the distance between the current stack top and the stack bottom.
It does not return the stack limit, but the size of the stack currently
occupied. The stack size increases when a procedure is called and may
decrease when a procedure returns or when ST_POLLING or
ST_FREE_STACK is called.
Syntax:
void st_stack_trace()
void st_show_context(c)
st_context_t c;
void st_show_exported_frames()
int st_n_live_exported_frames()
st_stack_trace() prints a stack trace of the calling worker, from
the current frame down to the stack bottom. It stops printing when it
reaches the stack bottom of the calling worker, when 50 frames have been
printed, or when it encounters any procedure that is not
postprocessed. The 50-frame limit is there to safely print corrupted
stacks; if you want to print more frames, find the definition of
MAX_STACK_TRACE_DEPTH in st.c, change it as you like, and
recompile the library.
st_show_context(c)
, where c is a pointer to a
context filled by st_suspend_thread
, displays frames represented
by c.
st_show_exported_frames() displays suspended contexts that are
allocated in the calling worker's stack. st_n_live_exported_frames()
returns the number of such contexts.
Syntax:
void ST_FREE_STACK();

ST_FREE_STACK() invokes stack management. It examines frames that are
no longer used and resets the stack pointer of the calling worker to an
appropriate point (the point above which there are no frames still in
use). It is implicitly called when you perform ST_POLLING(), so it
should usually be of no interest to you. Use this API only when you
understand how StackThreads/MP works and you want to free stack frames
without causing thread migration.
Syntax:
void * asm_get_fp()
void * asm_get_sp()
void * asm_get_gp()
asm_get_fp()
returns the frame pointer, asm_get_sp()
the
stack pointer. On machines that have a global register,
asm_get_gp()
returns the value of the global register. On
machines that do not have a global register, asm_get_gp()
is not
available. Every procedure returns the value visible to the
calling procedure. For example, when a procedure f
calls
asm_get_fp()
, it returns the value of the frame pointer register
when f
is running.
When a procedure calls asm_get_fp()
and asm_get_sp()
, you
may be surprised at their values; in StackThreads/MP execution, the
frame pointer register always points to the running frame, whereas the
stack pointer register always points to the stack top of the
worker. These two values do not significantly differ in sequential
execution, because both point to the frame allocated at the stack
top. In StackThreads/MP, however, a running frame may not be on the
stack top and may be far away from the stack top. It may even be another
worker's stack.
Also, when a procedure calls asm_get_sp() several times, it may
return a different value each time, again because other frames may be
above the running frame. asm_get_fp(), on the other hand, returns
the same value as long as it is called from the same procedure
invocation; the returned value thus serves as an identifier of the
procedure call.
Syntax:
void st_config_profile_resolution(resolution);
int resolution;
void st_config_profile_max_intervals(buffer_size);
int buffer_size;
void st_config_profile_filename_prefix(filename);
char * filename;
void st_begin_profile();
void st_end_profile();
See section Performance Profiler for more information.
Syntax:
long st_current_time_ms()
long st_current_time_us()
These return the current (real) time in milliseconds and microseconds,
respectively. The absolute value is meaningless; only the difference
between two calls is. The resolution depends on the operating system;
on Windows NT, for example, the resolution is 10 ms, so
st_current_time_us is no more useful than st_current_time_ms.
A worker refers to an underlying OS-thread such as a thread of Solaris
Thread library or the Pthreads library. Normally, a StackThreads/MP
program creates a fixed number of workers at program initialization, and they
share threads created by st_main
, directly or indirectly. In this
simple use of StackThreads/MP, you do not have to understand the
following procedures and do not even have to know the notion of workers
or worker groups.
StackThreads/MP allows other ways to access the provided multithreading
facility. Specifically, you can dynamically create another group of
workers and add workers to an existing group. Creating a group of
workers dynamically is useful in several circumstances. First, when you
want to have multiple groups of workers and to keep them isolated;
threads are shared only among workers within a single group and threads
are never migrated from one group to another. Second, creating a group
of workers at runtime allows you to increase or decrease the number of
OS-threads (the number of threads visible to OS) according to the
application's need. In particular, an application can spend most of the
time with a single thread and create a group of workers to perform a
compute-intensive task in parallel. When the task is finished, all
workers will die and the application goes back to its single-threaded
state. Actually, such an application could be built without dynamically
creating worker groups, if you do not care about the fact that many idle
threads may consume CPU time when the application has no parallelism.
Third, unlike other StackThreads/MP procedures, the procedure that
creates a worker group (stf_create_worker_group) can be called from
programs that are not compiled by stgcc. Thus, it can be used as a
callback from a library whose source is not available.
Syntax:
#include <st_foreign.h>
void* stf_create_sync_worker_group(wgc, n, f, a0, a1, a2, a3)
void stf_create_async_worker_group(wg, wgc, n, f, a0, a1, a2, a3)
void* stf_wait_for_wg_to_exit(wg)
int stf_did_wg_exit(wg)
struct worker_group * wg;
struct worker_group_conf * wgc;
int n;
void * (*f)(void *, void *, void *, void*);
void * a0, * a1, * a2, * a3;
Both stf_create_sync_worker_group
and
stf_create_async_worker_group
create a group of workers that
initially consists of n workers. f specifies the entry function
and a0 ... a3 its arguments; the group as a whole performs
f(a0, a1, a2, a3)
, sharing threads created
directly or indirectly from f. wg is a pointer to an
uninitialized storage of type struct worker_group
.
Created workers exit when f returns, no matter whether there are
running threads that belong to the
group. stf_create_sync_worker_group
returns when workers exit. The
return value is the return value of f.
On the other hand, stf_create_async_worker_group
returns immediately.
stf_wait_for_wg_to_exit(wg)
waits for workers that belong to
wg to exit (by blocking the underlying worker) and returns the
return value of the entry function of wg. stf_did_wg_exit(wg)
returns 1 if workers that belong to wg have already exited and 0
otherwise.
Syntax:
void st_add_slave_worker()
st_add_slave_worker() adds a worker to the worker group that the current thread belongs to. Currently there is no way to remove a worker from a group.
Syntax:
void ST_STEAL_REQ_CHECK();
ST_STEAL_REQ_CHECK() checks whether an idle worker in the same group as
the calling worker is requesting a thread migration from the calling
worker. If one is, it gives an appropriate thread to the requesting
worker. It is implicitly called by ST_POLLING(), so it should usually
be of no interest to you. Use this API only when you understand how
StackThreads/MP works and you want to migrate threads without freeing
stack frames.
Syntax:
void st_yield_os_thread()
void st_sleep_os_thread_ms(ms)
int ms;
st_yield_os_thread()
instructs the operating system to temporarily
deschedule the calling worker (not thread). Recall that in
general many threads are running on top of a worker. Descheduling a
worker effectively deschedules all threads that are running on top of
it.
The exact effect depends on the operating system and the underlying execution machinery of OS-level threads. Specifically, it is likely to have no effect when the number of workers is less than the number of available CPUs.
st_sleep_os_thread_ms(ms) instructs the operating system to
deschedule the calling worker (not thread) for ms milliseconds.
Use these APIs only when you understand how StackThreads/MP works and you are sure they are really necessary. In particular, it is usually a bad idea to call these primitives to achieve synchronization; you should normally suspend the current thread (not the entire worker) and explicitly resume it when it can make further progress.
Syntax:
ST_CALLBACK_BEGIN(); ST_CALLBACK_END();
These two procedures together declare that the surrounding procedure may
be called from a sequential procedure (a procedure that is compiled by
plain gcc
). When a procedure f is called from a sequential
procedure, ST_CALLBACK_BEGIN() must be placed at the top of
f's variable declarations (ST_CALLBACK_BEGIN() is a macro that
expands into a variable declaration), and ST_CALLBACK_END() must
be placed immediately before returning from f. More precisely, f
must execute ST_CALLBACK_BEGIN() before any sub-procedure call in
f and ST_CALLBACK_END() after any sub-procedure call in
f. ST_CALLBACK_END() may syntactically appear twice or more,
provided only one of them is executed at runtime. See section Cooperating with Sequential Modules for more about this topic.
Syntax:
void st_wg_exit(wv);
void st_wg_die(wv);
void st_app_exit(av);
void st_app_die(av);
void * wv;
int av;
st_wg_exit(wv)
exits the worker group the calling thread
belongs to and returns wv as the return status of the worker
group. When the worker group is the toplevel group, the group that
executes st_main
, it exits the entire application. This is the
preferred way of terminating an application.
st_wg_die(wv) behaves like st_wg_exit(wv), except
that it prints a stack trace of the calling worker before exiting. It is
advisable to call this procedure whenever a fatal error occurs, so that
the application prints some useful information for debugging.
st_app_exit(av) simply kills the entire application, by
exit(av) on UNIX. st_app_die(av) behaves like
st_app_exit(av), except that it prints the stack trace of the
calling worker before dying.
Syntax:
void st_suspend_thread(st_context_t c);
where st_context_t
is defined as follows:
typedef struct st_context * st_context_t;
This procedure and the next (st_resume_context) are the bottom-level primitives on top of which you can build synchronization. We do not expect (and do not want) programmers to write programs at this level. We still describe them because they are sometimes useful for implementing a new synchronization primitive that cannot easily be synthesized from existing ones.
st_suspend_thread suspends (or blocks) the current thread and fills
the structure pointed to by c with pieces of information, which we call
the `context' of the thread, so that another thread can later resume it.
Before calling this procedure, the field `valid' must be cleared (filled
with zero). Here is an example that illustrates the correct use of this
procedure:
struct st_context c[1];
c->valid = 0;
st_suspend_thread(c);
... (xxx) ...
Control does not immediately return to ... (xxx) ...; it will
be executed only when somebody explicitly resumes the thread (by
st_resume_context, described next).
When c->valid == 0, it indicates that the structure is going to be
filled with the context of a thread but has not been completely
filled. st_suspend_thread sets `valid' to 1 when the context is ready
to be scheduled. Before anything else, st_suspend_thread checks that
c->valid is cleared and signals a runtime error if it is not.
In most cases, you want to store the pointer c somewhere before
calling st_suspend_thread, so that another thread can find it.
Typically, c is stored in some sort of "waiting queue," in
which threads are waiting for a condition to be satisfied. To do this,
you must perform the following steps in the right order:

1. Clear c->valid.
2. Store c where another thread can find it (e.g., the waiting queue).
3. Call st_suspend_thread.

The order between 1 and 2 is important; otherwise, another thread may
find c and try to resume it while c has not been
completely filled by st_suspend_thread.
Syntax:
void st_resume_context(st_context_t c);
st_resume_context
is the bottom level primitive that resumes a
thread that has been suspended by st_suspend_thread
. It takes a
pointer to a structure struct st_context
that has been passed to
st_suspend_thread
, and resumes the execution of the thread that
called st_suspend_thread
.
Syntax:
st_proc_info_t st_get_proc_info(pc, zero_if_not_found, pi)
void * pc;
int zero_if_not_found;
st_proc_info_t pi;
int st_is_fork_point(pc)
void * pc;
void st_add_procedure_information_table(table)
st_proc_info_t * table;
Not documented yet.
Syntax:
extern long tls(worker_id);
long st_n_current_workers();
extern long tls(thread_id);
tls(worker_id)
is an identifier of the calling worker within the
worker group it belongs to. It holds a value from zero to n-1, where
n is the number of workers within the worker group the calling
worker belongs to. When a worker group is created by
stf_create_async_worker_group
or
stf_create_sync_worker_group
, the entry function starts execution
on the worker zero.
tls(thread_id) is a global identifier of the calling worker. It
holds a value from zero to n-1, where n is the number of workers
that have ever been created since the beginning of the program
execution. The same identifier is (currently) never reused.
st_n_current_workers()
returns the number of workers that
currently belongs to the worker group that the calling worker belongs
to. The worker group is initially created with the number of workers
specified with the argument to stf_create_async_worker_group
or
stf_create_sync_worker_group
(or the argument to -nw
option in the case of the toplevel worker group) and may later be
extended by st_add_slave_worker()
. The value returned by
st_n_current_workers()
at any moment may be smaller than the
number of workers you have requested to create. For example, if you
create a worker group that initially has 4 workers and performed
st_add_slave_worker()
3 times, st_n_current_workers()
at a given
moment may return any value from 1 to 7.
tls(worker_id)
and tls(thread_id)
can be used to
guarantee that accesses to a single location do not occur
concurrently. For example, the following code increments an element of a
counter counters_array
by one. The sum of all the elements in counters_array holds the
total number of updates performed. Assuming counters_array is
accessed only from this worker group, there are no race conditions on
counters_array, because each access that may occur in parallel at
runtime is given a unique value of tls(worker_id).
Syntax:
x = compute_increment();
counters_array[tls(worker_id)] += x;
Note that the value of tls(worker_id)
or tls(thread_id)
may change even within a single function, as a result of thread
suspension or migration. Therefore, in practice, it is always dangerous
to assume that they are persistent across a procedure call (including
ST_POLLING()
and ST_STEAL_REQ_CHECK()
). For example, you
cannot manually optimize away tls(worker_id)
by caching it to a
local variable; the following code is probably unsafe.
Syntax:
w = tls(worker_id);
x = compute_increment();
counters_array[w] += x;
Here, there is no guarantee that w at the increment corresponds to
the current worker id, because compute_increment may cause the rest
of the computation to block or migrate.
The difference between tls(worker_id)
and tls(thread_id)
is important only when you create multiple worker groups within a
program. However, it is recommended that you use tls(worker_id)
where possible, because such programs are likely to be reusable.
Having said all this, using tls(worker_id) or tls(thread_id) to
guarantee freedom from races is not recommended at all. It is advisable
to use a general synchronization method or a spin-lock method, such as a
semaphore or st_read_and_lock_any. For example, the
following code achieves the same effect without using
tls(worker_id), given that counters_array is an array of
st_int_loc_t.
Syntax:
int v;
int x = compute_increment();
int idx = st_read_and_lock_any_int(counters_array, sizeof(st_int_loc_t), n, &v);
st_write_and_unlock_int(counters_array + idx, v + x);
It dynamically finds a location that is not locked. The size of
counters_array
is now arbitrary; it does not have to match the
number of workers just to avoid out-of-range indices.
It may sound strange to talk about memory management in this manual. Yes, you can call any memory allocation function, including the standard malloc, from StackThreads/MP programs. However, once you start using StackThreads/MP in any serious parallel program, you will soon notice that the system's default malloc becomes a severe bottleneck. More specifically, typical malloc implementations (including the one in Solaris libc) simply serialize all malloc calls using a mutex. If many threads contend, they cause many context switches of the underlying workers. In my experience, calling malloc within a parallel section almost always has a significant (negative) effect on performance. There are many xxx-malloc libraries, but most of them similarly serialize all malloc calls or are, apparently, simply MT-unsafe.
We therefore provide some recommended alternatives to malloc. We also provide a mechanism with which you can easily replace the underlying memory allocator without changing application sources. With this mechanism, you can switch between several memory allocators (including the system's malloc). In this way, you will see how important they are for the performance of parallel programs, and that what I am saying is not just a threat :-)
There are two (orthogonal) alternatives to the default malloc. One is to replace malloc with the SGC library; the other is a region-based allocator built on top of an underlying malloc. You can use either of them or combine both. We describe them in the following sections and then describe a mechanism with which you can easily switch between different memory-management alternatives.
SGC is a conservative garbage collection (GC) library for shared-memory
platforms. Its interface is the same as malloc, but calls to malloc are
serialized only occasionally. In addition, it provides GC. A GC is also
performed in parallel. If your particular program suffers from GC, you
can control when to invoke a GC for better performance. If you are a
creature who irrationally thinks GC is a slow thing, you can turn off GC
and explicitly free objects (in this case you are effectively using SGC
as a concurrent malloc/free library). SGC is provided as a separate
library. You can obtain it from
http://www.yl.is.s.u-tokyo.ac.jp
. To use it from StackThreads/MP
programs, you must define WITH_STHREADS
in config.mk
when
you compile it, and you must define WITH_SGC
when you compile
StackThreads/MP library. See README.parallel
in SGC for
details. It is currently available on SPARC (Solaris), Pentium
(Solaris), and R10000 (IRIX). Port to Pentium on Linux will come
soon. SGC was derived from Boehm's conservative GC library (which is
mainly for sequential platforms) and can be used with or without
StackThreads/MP.
The basic API is almost identical to malloc/free:
Syntax:
#include <sgc.h>
void * GC_MALLOC(sz);
unsigned int sz;
void GC_FREE(ptr)
void * ptr;
The header file sgc.h comes with SGC, so you have to give an
appropriate include search path to stgcc/stg++ (with
-I$SGC_DIR, where $SGC_DIR is the directory where SGC is
installed).
GC_MALLOC(sz)
allocates sz bytes of
memory. GC_FREE(ptr)
, where ptr is a value previously
returned by a call to GC_MALLOC
, returns the allocated memory to
the allocator. The following functions and variables control the behavior of SGC:
Syntax:
void GC_gcollect(void);
int GC_dont_gc = 0;
int GC_free_space_divisor = 4;
GC_gcollect()
explicitly triggers a GC at a point you
desire. When GC_dont_gc
is zero (this is the default),
GC_MALLOC
automatically triggers a GC when necessary. In this
mode, you neither have to call GC_FREE
nor
GC_gcollect
. When GC_dont_gc
is non-zero, GC_MALLOC
does not trigger GC (it always expands heap when heap
overflows). You reclaim objects either by calling GC_gcollect
at
a point you desire or by calling GC_FREE
on individual objects.
GC_free_space_divisor
controls heap expansion policy. It only
makes sense when GC_dont_gc
is zero. The larger that value is,
the more frequently GCs occur and the less aggressively the system
expands heap. When heap overflows, SGC either triggers a GC or expands
heap. Roughly, when this value is D, it tries to keep one Dth
of the heap available for allocation. More specifically, for a simple
application that repeatedly allocates objects and keeps M
bytes of them live all the time, the system tries to keep the heap size
H at roughly M + M/(D-1), so that H/D bytes
are available for allocation. The long-term behavior in this case is
that every H/D bytes of allocations trigger a GC, which
reclaims H/D bytes from somewhere in heap. The default value
is 4, meaning that it tries to keep 25% of the heap free for allocation.
If you feel the system spends too much time on GC, set it to 3 or
2. Other values are hardly useful (note that D = 1
effectively
prohibits GC).
There are other functions in SGC. See sgc.h
and
README.parallel
in SGC.
Memory management based on explicit regions is available on top of any malloc/free implementation. In this model, an object is allocated in a specified region. You can dynamically create a region and then allocate an arbitrary number of objects in that region. Unlike malloc and free, you do not free individual objects; instead, you reset a region. Resetting a region reclaims all objects allocated in it. The model is borrowed from Aiken and Gay, Memory Management with Explicit Regions, PLDI '98, but differs from theirs in that (1) we do not provide any safety guarantee, (2) we do not require a dedicated compiler that understands regions, and (3) we provide a reset operation, which makes the blocks associated with a region immediately reusable for subsequent allocations from the region.
Syntax:
#include <ralloc.h>
st_region_t st_make_region(void);
void * st_region_malloc(sz, rg);
void st_reset_region(rg);
void st_abandon_region(rg);
st_region_t rg;
unsigned sz;
ralloc.h
is located at a place stgcc
automatically
searches. st_make_region()
creates a new
region. st_region_malloc(sz, rg)
allocates sz bytes
of memory from region rg. st_reset_region(rg)
resets
region rg. That is, it makes all blocks allocated in region rg
reusable for subsequent requests with st_region_malloc(...,
rg)
. st_abandon_region(rg)
is similar to
st_reset_region(rg)
, but it abandons all blocks associated
with rg; it gives them back to the underlying memory allocator.
A region abstracts objects' death point. The basic observation is that objects allocated during a course of computation often die together. Such objects can be allocated from a single region and then freed all at once. This saves time for memory management and avoids fragmentation without a general and sophisticated mechanism.
A region internally keeps as many free lists as the number of workers. Each worker has its own private free list. Each free list is a list of fixed-sized chunks. A single chunk is, by default, 7KB. Associated with a free list are the current chunk pointer (C), which points to the chunk from which allocations are served, the current chunk limit pointer (L), which points to the end of the current chunk, and the allocation pointer (A), which points to a point in the current chunk from which the next allocation will be served. An allocation simply checks if A + the requested size exceeds L, and if it does not, it returns A and advances it by the requested size. Otherwise, the allocation goes on to the next chunk, changing C, L, and A appropriately. When a worker runs out of its private free list, it calls the underlying memory allocator to obtain a new chunk. No attempts are made to obtain one from another worker's free list.
st_reset_region(rg)
simply resets its current chunk pointer
to the first chunk of the free list and sets other pointers accordingly.
It does not give chunks to the underlying
allocator. It is st_abandon_region(rg)
that does this.
When a region is created by st_make_region()
, free lists are
empty. Therefore, allocations from a region gradually grow its free
lists, occasionally calling the underlying allocator. As mentioned at
the beginning of this section, this may still hurt the performance of
parallel sections if the underlying allocator serializes all allocation
requests. One idea is to use SGC as the underlying allocator. If you
want to (or must) deal with this problem with the default malloc, you
may pre-allocate some chunks upon region creation. Here are the full
APIs for creating regions. st_make_region()
is repeated for
completeness.
Syntax:
#include <ralloc.h>
st_region_t st_make_region_aux(chunk_size, init_size, pre_alloc_option, alloc, free);
st_region_t st_make_region_highly_concurrent_1(init_size, alloc, free);
st_region_t st_make_region_highly_concurrent(size);
st_region_t st_make_region();
With st_make_region_aux
, you can control all the parameters.
chunk_size is the size of a single chunk. It is the unit of
allocation via the underlying allocator. alloc and free
specify the underlying allocator; alloc is a function that behaves
similarly to the malloc
in libc, whereas free
is a function
that behaves similarly to free
in libc.
init_size specifies the initial size of the region in bytes. Each worker allocates init_size/N bytes from the underlying allocator when it makes the first allocation from the region, where N is the number of workers in the worker group the calling worker belongs to. The net effect is that, the region soon becomes approximately init_size bytes in total as long as all workers will soon attempt to allocate at least one byte from this region. If you can approximate the total size of allocations that are going to be made from the region, you can give this value to avoid occasionally calling underlying allocators in parallel sections.
pre_alloc_option specifies whether this first allocation is made at
the region's creation time. When pre_alloc_option ==
st_region_pre_allocated_option_none
, no pre-allocation is performed
upon creation. Each worker allocates the specified bytes upon first
allocation. When pre_alloc_option ==
st_region_pre_allocated_option_self
, the calling worker allocates
the specified bytes only for the calling worker. When
pre_alloc_option == st_region_pre_allocated_option_all
, the
calling worker allocates the specified bytes for all the workers.
A commonly used combination is provided by
st_make_region_highly_concurrent_1. It is intended to be useful for
regions from which many allocations are going to be made in a parallel
section. It allocates init_size bytes upon creation and evenly
distributes them to all the workers.
st_make_region_highly_concurrent
is a simple wrapper to
st_make_region_highly_concurrent_1
. It gives the value of a macro
RG_ALLOC_FUNC
and RG_FREE_FUNC
to
st_make_region_highly_concurrent_1
as the underlying alloc and free
functions, respectively. Their default values are the system malloc and
free, respectively, but you can override them before including
ralloc.h
. For example,
#define RG_ALLOC_FUNC my_malloc
#define RG_FREE_FUNC my_free
#include <ralloc.h>
replaces malloc/free
with my_malloc/my_free
.
On top of the two memory allocators described above, StackThreads/MP provides a simple layer with which you can easily switch both the underlying allocator and whether the region-based allocator is used. The idea is that once you write a program that allocates and frees memory using the following primitives, you can experiment with several memory management mechanisms without changing your program.
Syntax:
#include <st_mm.h>
void * MM_ALLOC(sz);
void MM_FREE(ptr);
void MM_ZERO_CLEAR(ptr, sz);
void MM_GC_POINT();
void MM_SETUP(fd);
They are macros whose behavior changes according to the following parameter.
#define UNDERLYING_MM ...
It chooses the basic memory allocator and can be selected from the following values:
UNDERLYING_MM_SGC_AUTO_GC
UNDERLYING_MM_SGC_EXP_GC
UNDERLYING_MM_SGC_EXP_FREE
UNDERLYING_MM_MALLOC
UNDERLYING_MM_SGC_AUTO_GC
uses SGC and relies on its automatic
memory management. UNDERLYING_MM_SGC_EXP_GC
uses SGC, but
triggers GC only when explicitly requested by
MM_GC_POINT()
. UNDERLYING_MM_SGC_EXP_FREE
uses SGC, but
never frees objects by GC. Finally, UNDERLYING_MM_MALLOC
uses the
system's default malloc
.
MM_ALLOC
calls the appropriate allocator. MM_FREE(ptr)
frees the object pointed to by ptr when
UNDERLYING_MM_SGC_EXP_FREE
or UNDERLYING_MM_MALLOC
is
given, and otherwise has no effect. MM_ZERO_CLEAR(ptr,
sz)
clears a block of sz bytes from ptr, only when
MM_ALLOC
does not zero-fill allocated memory. Whenever you need
to zero-clear allocated memory, use MM_ZERO_CLEAR
instead of
bzero
, to avoid duplication of work in case MM_ALLOC
already does it. MM_GC_POINT()
invokes a GC when
UNDERLYING_MM_SGC_EXP_GC
is given, and has no effect
otherwise. MM_SETUP(fd)
sets GC_dont_gc
parameter
appropriately when a UNDERLYING_MM_SGC_xxx
is given. It also sets
GC_free_space_divisor
to fd.
Interfaces to the region-based memory manager are also wrapped.
Syntax:
#include <st_mm.h>
st_region_t MM_MAKE_REGION();
st_region_t MM_MAKE_REGION_AUX(chunk_size, init_size, pre_alloc_option, alloc, free);
st_region_t MM_MAKE_REGION_HIGHLY_CONCURRENT_1(init_size, alloc, free);
void * MM_REGION_MALLOC(sz, rg);
void MM_REGION_FREE(ptr);
void MM_RESET_REGION(rg);
void MM_ABANDON_REGION(rg);
They are macros whose behavior changes according to the following parameter.
#define RALLOC ...
RALLOC
is either 0 or 1. When RALLOC
is 1, all
MM_REGION_xxx
, except for MM_REGION_FREE(ptr)
, are
equivalent to st_region_xxx. For example, MM_MAKE_REGION()
is
equivalent to st_make_region()
. MM_REGION_FREE(ptr)
has no effects. When RALLOC
is 0, all MM_REGION_xxx
operations are redirected to the underlying allocator. For example,
MM_REGION_MALLOC(sz, rg)
ignores the region parameter and
is equivalent to MM_ALLOC(sz)
. MM_REGION_FREE(ptr)
behaves like MM_FREE(ptr)
.
alloca
is sometimes used to optimize memory allocation. It
behaves like malloc
, but automatically frees the allocated
storage when the procedure that called it returns. It is normally
implemented by allocating a block of memory on top of the stack,
effectively extending the currently running frame.
I have to tell you that alloca
is currently not supported with
StackThreads/MP, as the standard implementation of alloca
fundamentally conflicts with the execution scheme of
StackThreads/MP. Please use the above alternatives (malloc, SGC, or
the region-based allocator). The region-based allocator will be particularly
useful when a single procedure calls alloca
many times, as blocks
allocated in a single procedure die together.
stgcc
currently redefines alloca
as the name of an
undefined function, so that programs that use alloca
fail to
link. If you want to optimize allocation on thin ice, you can call
the builtin alloca function (__builtin_alloca
),
as long as you know it is safe. Calling __builtin_alloca
from a procedure is safe, as long as the procedure does not call any
procedure compiled by stgcc
. In other words, if you place all
calls to __builtin_alloca
at the procedure entry, you are safe.
We are planning to support alloca
in the future, although it will not
allocate memory from stack but from heap. Like standard alloca
,
it automatically frees memory allocated by a procedure after the
procedure exits. However, it will not be as efficient as
__builtin_alloca
; its overhead is likely to be similar to that of
the region-based allocator.
StackThreads/MP has a simple performance profiler with which you can
view the status of each processor after running a program. It is not a
real time profiler and the interface is not sophisticated, but it is
still very useful for identifying performance bottlenecks. To use it, you
run your program with -tp
option, convert the log into xgraph format,
and then look at it using xgraph. You need xgraph to use this feature.
The most convenient way is to run your program with -tp
option. For example,
harp:367% ./fib -nw 10 -tp
pfib: 1091 ms on 10 processors, sfib: 2062 ms
It produces a number of files whose names are 00stprof.xx.yy
(where
xx and yy are numbers) in the current directory. Using 10 processors,
you typically get 10 files.
You may want to profile a particular section of your program. In such
cases, add a call to st_begin_profile()
where you want to begin
profiling and st_end_profile()
where you want to finish
profiling. Currently, you can call them only once in a program run.
The resolution of profiling is, by default, 100 microseconds. Each
processor measures how much time it spends in each state (busy, idle,
etc.) and, at every 100 microseconds, calculates the dominating state of
that period. The log file records a state of each period. You can change
the resolution of profiling by command line option
--time_profile_resolution S
, where S specifies the length
of a period in microseconds. Specifying a large number saves space but
the result may be inaccurate. Specifying a smaller number makes the result
more reliable at the expense of space.
Each processor keeps a fixed sized buffer for accumulating profiles and
saves it into a file when the buffer overflows (and when the profile is
finished). It may introduce a large Heisenberg effect into your
profiling. You can increase the size of the in-memory buffer by
--time_profile_buffer_size N
where N specifies the number
of entries in the in-memory buffer. The default is 8100. N does not
necessarily represent the number of periods you can tolerate without
saving the in-memory profile into a secondary storage, because a single
entry describes a number of consecutive periods in a single state. For
example, if a processor is busy most of the time, the necessary storage
will be quite small.
To convert generated log files into xgraph format, use mkxgrp
. It takes
two parameters. One is the prefix of filename (00stprof
if you
did not give one explicitly when you ran the application) and the other
is the number of processors; it writes the result to standard output. For example,
harp:367% mkxgrp 00stprof 10 > data.xg
produces an xgraph file called data.xg from files named
00stprof.xx.yy
in the current directory. Files 00stprof.xx.yy
are removed when the output file data.xg
is successfully generated.
This chapter describes issues that may arise when StackThreads/MP
modules cooperate with sequential modules. In this chapter, a procedure
is said to be a StackThreads/MP procedure when it is compiled via
stgcc
(or stg++
). A procedure is otherwise said to be a
sequential procedure. Whether a procedure actually does something
in parallel is irrelevant. There is a set of rules for calling
sequential procedure from within a StackThreads/MP procedure and another
set of rules for calling in the other direction.
Remember that stgcc
can almost always substitute for gcc
,
therefore you can make any C procedure a StackThreads/MP procedure by
recompiling it with stgcc
. So you can normally avoid complication
by recompiling everything with stgcc
. You may still want to have a
non-StackThreads/MP module either because the source of the sequential
module is not available or because recompiling for StackThreads/MP is
undesirable.
There are almost no issues in this direction. A StackThreads/MP procedure can call any sequential procedure, as long as it does not call back any StackThreads/MP procedure. Issues that arise due to callbacks are discussed in the next section.
More precisely, StackThreads/MP procedures can safely call a sequential procedure provided (1) it does not perform a call back to StackThreads/MP, and (2) it conforms to the calling standard on the platform. Most sequential compilers generate procedures that conform to the calling standard, therefore, as an approximation, you can call virtually any sequential procedure, whether it is written in C, Pascal, Eiffel or whatever.
By "you can call a sequential procedure", we simply mean that you can correctly pass arguments and get the result of the procedure call. It remains entirely the programmer's responsibility to guarantee that the procedure functions properly in the presence of multiple threads of control. If the sequential procedure is already MT-safe (i.e., works correctly in the presence of multiple threads of control), you do not have to do anything special. Otherwise, you may have to make it MT-safe or guard the procedure call with some sort of mutual exclusion.
First of all, you cannot call any StackThreads/MP procedure from a sequential procedure by default. A special setup procedure is mandatory. This is primarily because StackThreads/MP procedures reserve a register and assume that the register always holds the pointer to the worker-local storage, while a sequential procedure is not aware of this fact. Even if a StackThreads/MP procedure does not call any StackThreads/MP primitive that explicitly accesses the worker-local storage, it does so in its epilogue code. Therefore, as an approximation, remember that you cannot call any StackThreads/MP procedure unless you follow the necessary steps described below.
We provide two ways for different situations. One is suitable when you want to extract parallelism in the callback. The other is suitable when the amount of work done in the callback is very small.
StackThreads/MP provides a convenient way to create a worker group that
executes a given StackThreads/MP procedure
(stf_create_sync_worker_group
and
stf_create_async_worker_group
). They serve as gates from a sequential
procedure to any StackThreads/MP procedure. They create a specified
number of workers, setup pointers to the worker local storages, and then
call the specified entry procedure. The entry procedure is a
StackThreads/MP procedure and thus is allowed to access any StackThreads/MP
primitives.
void * stf_create_sync_worker_group(wgc, n, f, a0, a1, a2, a3)
     worker_group_conf_t wgc;
     int n;
     void * (*f)(void *, void *, void *, void *);
     void * a0, * a1, * a2, * a3;
creates n workers and forms them into a single worker group. The job of the
worker group is given by an entry procedure
f and arguments to it (a0 ... a3). Specifically, the worker
group's job is to complete a procedure call f(a0, a1,
a2, a3)
. stf_create_sync_worker_group
returns when
f(a0, a1, a2, a3)
returns. The return value
of f(a0, a1, a2, a3)
becomes the return value
of the stf_create_sync_worker_group
. If a non-zero value is given
to wgc, it is a pointer to a structure that describes configuration
options of the worker group. If wgc is zero, a default configuration is
used. stf_create_sync_worker_group
can be called both from a
sequential procedure and from a StackThreads/MP procedure. If you call
this procedure from a sequential module (a module that is compiled by
plain gcc
), include <st_foreign.h>
, which defines the
correct prototype for the procedure.
For example, suppose you are writing an X Window application and wish to add a callback function that uses StackThreads/MP primitives. Also assume that you do not want to recompile Xlib, or that your administrator does not allow you to replace or duplicate it, so Xlib itself must remain sequential. In this case, you first write a StackThreads/MP procedure that accomplishes your job.
... my_st_proc(...)
{
  /* do whatever you can do with StackThreads/MP */
  ST_THREAD_CREATE(...);
  ...
  ST_THREAD_CREATE(...);
  ...
}
You may want to write a wrapper procedure of the above function, so that
the entry procedure conforms to the interface of
stf_create_sync_worker_group
. The wrapper's job is simply to pass
parameters to the above procedure, possibly casting them to meaningful
types. It must be a StackThreads/MP procedure.
void * my_st_proc_wrapper(void * a0, void * a1, void * a2, void * a3)
{
  /* you translate arguments somehow */
  ...
  /* and call my_st_proc to do the real job */
  my_st_proc(...);
}
Finally, you write a callback procedure that invokes the entry procedure and registers it. This must be a sequential procedure (compiled by a plain C compiler), because X library calls that procedure.
void my_callback(...)
{
  int n = ...;
  void * a0 = ..., * a1 = ..., * a2 = ..., * a3 = ...;
  stf_create_sync_worker_group(0, n, my_st_proc_wrapper, a0, a1, a2, a3);
}
...
XAddCallback(my_callback, ...);
...
stf_create_async_worker_group
is an asynchronous version of
stf_create_sync_worker_group
. See section Workers and Worker Groups for details.
The method just described is suitable when the job is large enough to
amortize the cost of creating a worker group each time
stf_create_sync_worker_group
or
stf_create_async_worker_group
is called. This will be the case when
the job you wish to accomplish is fairly large (taking a second or more
on a single CPU).
There are situations where the method is not very appropriate. For
example, you may want to write a tiny procedure for comparing two items
and pass it to qsort
function or write a tiny callback procedure
that just displays a menu. Such procedures have nothing to do with
parallelism, hence creating a worker group for each callback is
impractical.
StackThreads/MP provides a way to declare that a StackThreads/MP procedure may be called from a sequential procedure. Specifically, A StackThreads/MP procedure can be safely called from a sequential procedure when the following conditions are met:
It calls
ST_CALLBACK_BEGIN();
at the head of the procedure. This is a macro that expands into a variable declaration, so make sure you put it in a place where a variable declaration is allowed.
It calls
ST_CALLBACK_END();
immediately before returning from the procedure (more precisely, before returning from the procedure and after any procedure call performed within the procedure).
It does not use any thread primitives (ST_THREAD_CREATE
,
ST_POLLING
, and any blocking synchronization primitives such as
st_mutex_lock
and st_mutex_unlock
).
Procedures written according to these rules may be called both from sequential procedures and from StackThreads/MP procedures. Here is an example:
int less_than_p(void * a, void * b)
{
  ST_CALLBACK_BEGIN();
  int xa = evaluate(a);
  int xb = evaluate(b);
  ST_CALLBACK_END();
  return xa - xb;
}
Here, evaluate
may be a sequential procedure, a regular
StackThreads/MP procedure, or another StackThreads/MP procedure that
again surrounds its body by ST_CALLBACK_BEGIN()
and
ST_CALLBACK_END()
.
With this setup, you can pass the function less_than_p
to the
qsort
:
qsort(a, n, sizeof(element), less_than_p);
In the above three rules, the last one is very restrictive. In essence,
it says that even though such a callback procedure is compiled as a
StackThreads/MP procedure, it must stay sequential, in the sense that it does
not use any threading operation. In our example, evaluate
may not
use StackThreads/MP primitives such as ST_POLLING
,
ST_THREAD_CREATE
, and other synchronization primitives.
You might ask, "Given a callback cannot call any StackThreads/MP
primitives after all, why do I want to write it as a StackThreads/MP
procedure?" As a matter of fact, you can always write all such
procedures in a separate file and compile them using plain
gcc
. What StackThreads/MP currently supports is no
more powerful than this solution. We provide these declarations to allow
callbacks and proper StackThreads/MP procedures to be written in a
single file.
stgcc
or stg++
are shell scripts that call gcc
or
g++
, respectively. Below we say gcc
to refer to both
gcc
and g++
, and use stgcc
to refer to both
stgcc
and stg++
.
What it basically does is intercept the compilation process just after gcc emits assembly, extract information about the stack frame format of each procedure from the assembly file, and insert some tricky pieces of code at each procedure's epilogue sequence.
stgcc
is intended to be usable wherever gcc
is used,
unless a command line option fundamentally conflicts with what
StackThreads/MP assumes. Most command line options are passed as is to
gcc
, but it must recognize some of the parameters that control
compilation. Certainly you do not have to know what it is exactly doing
and you should not even have to learn anything, as long as you already
know gcc
.
What you have to know, however, are things that might conflict with your
tricky way of using gcc
. At present, stgcc
does not
produce any error even when a conflicting option is given. So, it is
possible that even when a compilation succeeds, the resulting program is
erroneous. We are trying to improve this in the future. To be sure, give
--show_commands
option to stgcc
, which displays invoked
subcommands.
-ffixed-REG
in gcc
). The reserved register is:
This register is reserved automatically by stgcc
, so you do not normally
notice this. But if you want to reserve this register or another for
your purpose, this is likely to be a problem. Contact me if you want to
do this anyway.
stgcc
turns off inlining by giving
-fno-inline-functions
and -fno-default-inline
to
gcc
. We assume that a procedure call written in
ST_THREAD_CREATE(e)
is translated into a procedure
call at the assembly level. In other words, e must not be inlined. This
may severely hurt performance of some programs, but for now there are no
ways to prohibit inlining of a particular call site or a particular
function.
You can instruct the compiler to explicitly inline some functions by
giving inline
directive before a procedure definition.
stgcc
gives -fno-omit-frame-pointer
to gcc
,
so that each procedure keeps a separate frame pointer. Therefore giving
-fomit-frame-pointer
to stgcc
conflicts with this. This
has no performance effects on i386 and SPARC, but may have some on Mips
and Alpha.
stgcc
turns off the use of register windows by
giving -mflat
to gcc
. Hence giving -mno-flat
to
stgcc
conflicts with this.
To summarize, the following options are prohibited in StackThreads/MP.
-finline-functions, -fdefault-inline, -fomit-frame-pointer
,
and -ffixed-REG
are prohibited on all CPUs
-mno-flat
is prohibited on SPARC
Here is the list of options supported by stgcc
, in addition to
all options supported by gcc
.
--show_commands
shows all subcommands invoked by
stgcc
. This should be of no interest to you, but when you get
strange compile-time errors, you might want to look at what stgcc
is doing.
--leave_tmps
does not remove temporary files created during
compilation. This should be of no interest to you, but when you get
strange compile-time errors, you might want to look at temporary files.
--dbg
is equivalent to giving both --show_commands
--leave_tmps
.
--no_postprocess
does not postprocess the assembly
file generated by gcc
. Modules compiled with this option cannot
use StackThreads/MP primitives. This option is primarily useful when you
(1) want to squeeze performance out of some modules that never use
StackThreads/MP primitives, or (2) investigate the performance impact of
postprocessing. Normally, compiling everything without this option will
suffice.
--compiler=cc
specifies the underlying C compiler
invoked from stgcc
. We currently assume the underlying C compiler
is gcc
in many ways, so the presence of this option does not
imply you can use any C compiler. Do not give anything except for
gcc
or g++
.
Normally, stgcc
defines STHREADS
to 1. When
--no_postprocess
is given, stgcc
does not define
STHREADS
but instead defines STHREADS_NOPP
to 1.
stlink
is a command that is invoked from stgcc
. It is
again a shell script that eventually calls a linker (it actually calls
gcc
or g++
). When an assembly file is postprocessed,
stgcc
generates pieces of information about procedures in the
file and attaches them to the assembly file. The assembly file is then
compiled into an object file. At link time, stlink
collects all
these tables and generates a single table that points to the tables of all
files. The table is generated in a form of a C source file, compiled by
gcc
(by default), and then linked together with all other files.
It also collects thread startup hooks and worker startup hooks and
generates a table of them.
Since this command is implicitly invoked by stgcc
, it should usually be
of no interest to the programmer. We nevertheless describe it
for its potential use in debugging strange link-time errors.
Accepted command line options are again the same as those of gcc
or g++
(specified by --linker=
option described
below). Most options are passed as is to gcc
or g++
. Here
is the list of options specific to stlink
.
--dbg
shows the command lines of the invoked
subcommands. If you get a strange link time error, run the same command
with this option. Since you normally use stlink
implicitly from
stgcc
, you may not know how stlink
is invoked. In this
case, first run stgcc
with --dbg
option and see its
output. You will find a line that invokes stlink
. Copy the line
and add --dbg
option.
--linker=cc
specifies the underlying linker (which
actually must be either gcc
or g++
) invoked by
stlink
. The default is gcc
.
--compiler=cc
specifies the compiler used to compile
the generated table file. The default is gcc
and I cannot think of
any reason why you have to change this (even if you are linking C++
programs).
We describe its semantics and machinery in a little more detail. It collects all procedure tables and thread/worker startup hooks from files specified via explicit filenames (such as xxx.o or xxx.a) or libraries found under explicit search paths (via -Lxxx). For example,
stlink a.o b.a c.o -L. -L../lib -lmylib
will collect tables and hooks from a.o
, b.a
, and
c.o
which are assumed to be found in the current directory and
may collect them also from libmylib.a
, if it is found in the
current directory (.
) or ../lib
directory. Tables and
hooks in libmylib.a
will not be collected when it is found in
other directories that do not explicitly appear in the command
line. Specifically, standard libraries such as libc
or
pthread
are not examined. In short, stlink
does not know
standard library search paths.
stlink
uses nm
command to collect tables and hooks from
object files. For each file to be examined, it invokes nm
and
scans the output to find the table and hooks. Simple naming rules are
assumed to find them. A surprising thing might occur when you accidentally
use names that are recognized as hooks or tables. To list them,
_proc_info
is assumed to be the
name of the procedure table of the file.
sthreads_thread_start_hook_
is
assumed to be the name of a thread startup hook.
sthreads_worker_start_hook_
is
assumed to be the name of a worker startup hook.
Any StackThreads/MP program accepts the following command line
arguments. A shorter synonym is provided for commonly used ones. These
command line arguments are effective only when you do not write your own
main function but write st_main
.
--n_workers P
, or -nw P
: specifies
the number of processors by P.
--stack_size S
, or -ss S
: specifies
the stack size of an OS-thread by S. S is either an integer, or
an integer followed by a k
or m
, which says the number
should be interpreted as kilobytes and megabytes, respectively. For
example, -ss 4m
sets a stack size to 4 megabytes.
--print_toplevel_worker_stat,
or -ps
: prints
simple statistics after a run. It prints statistics such as how many
threads were created and how many times blocking occurred.
--time_profile
, or -tp
: profiles processor
status and generates log files. The file names begin with 00stprof by
default. This can be changed by --time_profile_filename below.
--time_profile_filename N
: specifies the prefix of log
files by N. For example, --time_profile_filename
my_important_logs
generates files named
my_important_logs.xx.yy
where xx and yy are
numbers.
--time_profile_resolution S
: specifies the
resolution of the profile. S is the length of a period in
microseconds. Each period is represented by a state in which a processor
spends its longest time during that period.
--time_profile_buffer_size N
: specifies the
size of the in-memory buffer of each processor. N is the number of
entries in the buffer. The larger N is, the longer a profiling run each
processor can tolerate without saving the log into secondary
storage.
These command line arguments are removed from the command line before
entering st_main
and only remaining arguments are passed to
st_main
. For example, when you write:
./fib -nw 10 30 -ps
then the st_main
of this program will receive only
{ "./fib" "30" }
as its second parameter (argv
) and receive 2 as its first
parameter argc
.
Although StackThreads/MP can be used practically with unmodified GCC, there are certain cases where the execution scheme of StackThreads/MP breaks assumptions that the code generator of GCC exploits for optimization (see section Known Limitations and Subtleties for descriptions of such cases). We provide simple extensions to gcc that fix the problem, in the form of patches to gcc (2.7.2.3 and 2.8.1). Note that this is not a bug of gcc, but just a subtle mismatch between the StackThreads/MP execution scheme and the sequential execution scheme that is assumed by the compiler. The patches are quite small (the essential modifications are just a few lines) and the patched gcc supports a couple of additional command-line options. Without these command-line options, the patched gcc functions exactly the same as the original gcc. Even when these command-line options are given, generated code differs only in cases where the originally generated code is unsafe. Of course, generated code is always compatible (inter-operable) with code compiled normally (code compiled with such an option can call functions compiled normally, and vice versa). Therefore, the patched gcc can safely replace the existing gcc, if you do not like having two copies of gcc. You need not recompile any sequential library compiled by the original gcc (such as the X library and Tk library).
Modifications to source code are always guarded by:
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
modified source
#else /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
original source
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
so that you can revert to the original code by defining flag
STHREADS_SAFE
to zero. Note that if STHREADS_SAFE
is
undefined (this is the default), the modified source is used.
The patch is not a requirement for StackThreads/MP installation. In practice, you can start using the library with unmodified gcc and apply the patch later, when you want to use the library for serious purposes. If you do not want to patch gcc, you can work around the problems by hand. Problematic cases are described below and you can avoid them (in particular, if StackThreads/MP is used as a compilation target, the frontend compiler is likely to be able to deal with them). If you do not want to worry about them and want to guarantee safety, however, we recommend that you apply the patch to gcc at some point. These patches are important only on SPARC and i386.
Patches are placed in directory gccpatch
in a form of diffs to
sources of gcc 2.7.2.3
and gcc 2.8.1
. The names of the
files are gcc-patch.2.7.2.3
and gcc-patch.2.8.1
,
respectively. If you are going to use gcc 2.7.2.3
or gcc
2.8.1
, just copy the appropriate patch file to the source directory
(the directory where files like calls.c
and flags.h
exist)
and do:
patch -p 0 < patch-file
where patch-file is either
gcc-patch.2.7.2.3
or gcc-patch.2.8.1
.
If the version number of your gcc source is neither 2.7.2.3 nor 2.8.1, you may still have a chance to apply the patches mechanically, as long as the major version number matches. Just try the same command and look at messages from patch command. Even if you fail, since patches are very small, it is practical to apply them by hand.
Once you successfully patch the gcc sources and rebuild gcc, set things
up so that stgcc
exploits it.
Define the environment variable ST_GCC_IS_PATCHED
. The value does
. The value does
not matter. When stgcc
finds that this variable is defined, it
gives necessary options when it calls gcc
and g++
(we
hereafter only mention gcc, but the following statements also apply to
g++).
If you set your path
variable so that the patched gcc runs when you just
type gcc
(g++
), you are done.
Recall that the patched gcc produces exactly the same output as the
original does, unless explicit options are given, so replacing the
original gcc is almost always practical. You may still want to call
patched gcc
only from stgcc
and use the original gcc
for other cases. If you are such a skeptical person, put the patched
gcc anywhere and tell stgcc
its location by setting the
environment variable ST_COMPILER_PATH
. For example, you may write
the following line in your .cshrc
.
setenv ST_COMPILER_PATH /home/nomo/my_gcc/bin
Then,
stgcc
calls /home/nomo/my_gcc/bin/gcc
and
/home/nomo/my_gcc/bin/g++
instead of gcc
and g++
.
To check that the patched gcc works, on i386 compile and run
ex/nestcall
. On SPARC, compile and run
ex/strpass
. Supply -O4
option to enable optimization. When
you run it with no arguments, if it says OK
, the patched gcc is
working. When gcc is not patched, these programs do not say OK
and will produce the following outputs:
strpass
on SPARC:
harp:1014% ./strpass
1 : st_main forks foo
2 : foo blocks
3 : foo blocked, restart foo
4 : foo restarted and finishes
*************************************************************
Structure argument to ST_THREAD_CREATE(foo(p)) is corrupted.
This is a known problem on SPARC. You may either patch gcc to
fix this problem (see manual), or just avoid passing struct
args to procedure calls with ST_THREAD_CRREATE
*************************************************************
nestcall
on i386:
Here we describe in detail what these patches do. The purpose is to give information to experts. The following description makes sense only if you understand what the problem with the original gcc is. See section Floating SP Problem and section Structure passing on SPARC for descriptions of the problems with the original gcc.
calls.c
in function expand_call
:
There is a loop that determines, for each argument to a procedure call
expression, whether the evaluated value of the argument should be placed
on a temporary location or can be directly placed on the stack. The loop
in gcc 2.7.2.3 looks like this (including the preceding comments).
 1: /* If we preallocated the stack space, and some ...
 2:
 3:    ...
 4:
 5:    ... into the stack. */
 6:
 7: for (i = 0; i < num_actuals; i++)
 8:   if (is_const
 9:       || ((args_size.var != 0 || args_size.constant != 0)
10:           && calls_function (args[i].tree_value, 1))
11:       || (must_preallocate && (args_size.var != 0 || args_size.constant != 0)
12:           && calls_function (args[i].tree_value, 0)))
13:     {
14:       /* If this is an addressable type, we cannot pre-evaluate it. */
15:       if (TREE_ADDRESSABLE (TREE_TYPE (args[i].tree_value)))
16:         abort ();
17:       ...
This loop judges whether each argument in turn includes a nested procedure call or a call to the
alloca
function.
calls_function(a, 1)
returns 1 if expression a includes a
call to alloca
, and calls_function(a, 0)
returns 1 if
expression a includes any procedure call. The essential modification
is to change the second parameter to calls_function
at line 10 to
zero, which was originally one. This essentially treats any function
call as if it were a call to alloca
.
The patch passes zero to calls_function at line 10 only when a flag (flag_calls_clobber_sp) is set. It also guards the modification by an #if directive in the way described earlier in this chapter. The resulting code looks like this:
/* If we preallocated the stack space, and some arguments must be passed
   on the stack, then we must precompute any parameter which contains a
   function call which will store arguments on the stack.
   Otherwise, evaluating the parameter may clobber previous parameters
   which have already been stored into the stack.  */

for (i = 0; i < num_actuals; i++)
  if (is_const
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
      || ((args_size.var != 0 || args_size.constant != 0)
          && calls_function (args[i].tree_value,
                             (flag_calls_clobber_sp ? 0 : 1)))
#else /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
      || ((args_size.var != 0 || args_size.constant != 0)
          && calls_function (args[i].tree_value, 1))
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
      || (must_preallocate
          && (args_size.var != 0 || args_size.constant != 0)
          && calls_function (args[i].tree_value, 0)))
    {
      /* If this is an addressable type, we cannot pre-evaluate it.  */
      if (TREE_ADDRESSABLE (TREE_TYPE (args[i].tree_value)))
        abort ();

What this patch essentially does is tell gcc that any function call may alter the value of the stack pointer in an unexpected way, just as alloca does. On Pentium, where the stack pointer is allowed to move to push and pop arguments to procedure calls, the compiler can push an argument on the stack top as soon as it is evaluated, no matter whether other arguments contain a nested procedure call. A nested procedure call can push its arguments on top of the arguments under construction for the outer procedure call. When the nested procedure call returns, its arguments can be freed and the rest of the arguments to the outer procedure call can then be pushed on the stack. There is one exception to this rule; when a nested procedure call is a call to alloca, the compiler cannot infer the value of the stack pointer after the call, hence cannot place arguments evaluated before alloca on top of the stack at that point. It instead must perform the calls to alloca first and then evaluate the rest of the arguments. The results from allocas must be saved in temporary slots, which are addressed via the frame pointer and then moved to the stack top just before calling the outer function. By treating any function call as if it were a call to alloca, we effectively force gcc to pre-evaluate all nested procedure calls.
flags.h: Just add a declaration of the flag flag_calls_clobber_sp, which is set when the command line option -fcalls-clobber-sp is given. Add the following piece of code at the end of the series of declarations of flags.
/* Non-zero any function call is supposed to clobber SP */
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
extern int flag_calls_clobber_sp;
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
toplev.c: Just add a definition of the flag flag_calls_clobber_sp and update the table of flags (f_options). Add the following piece of code at the end of the series of definitions of flags.
/* Non-zero any function call is supposed to clobber SP */
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
int flag_calls_clobber_sp = 0;
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */

Also add the following at the end of the table f_options:

#if !defined(STHREADS_SAFE) || STHREADS_SAFE
  , {"calls-clobber-sp", &flag_calls_clobber_sp, 1}
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
function.c in function assign_parms:
There is a code fragment that copies incoming parameters that are passed by an invisible reference, if necessary. Search for FUNCTION_ARG_CALLEE_COPIES. You will find a fragment that looks like:
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
#ifdef FUNCTION_ARG_MAY_BE_DESTROYED
      /* If we are passed an arg by reference and it may be destroyed
         for any unknown reason (even if the callee does not modify it
         by itself).  PASSED_TYPE and PASSED mode now refer to the
         pointer, not the original argument, so we must recreate them
         in the call to FUNCTION_ARG_CALLEE_COPIES.  */
      else if (passed_pointer
               && FUNCTION_ARG_MAY_BE_DESTROYED (args_so_far,
                                                 TYPE_MODE (DECL_ARG_TYPE (parm)),
                                                 DECL_ARG_TYPE (parm),
                                                 ! last_named)
               && ! TREE_ADDRESSABLE (DECL_ARG_TYPE (parm)))
        {
          rtx copy;
          tree type = DECL_ARG_TYPE (parm);

          /* This sequence may involve a library call perhaps clobbering
             registers that haven't been copied to pseudos yet.  */
          ...

This is an else if branch of a larger if statement. Make a copy of the original else if branch, guard it by an #if directive, and insert it before the original one. Then, replace FUNCTION_ARG_CALLEE_COPIES with FUNCTION_ARG_MAY_BE_DESTROYED in the new branch. That is, the entire if statement now looks like:
/* If we were passed a pointer but the actual value
   can safely live in a register, put it in one.  */
if (passed_pointer
    && TYPE_MODE (TREE_TYPE (parm)) != BLKmode
    ... )
  {
    ...
  }
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
#ifdef FUNCTION_ARG_MAY_BE_DESTROYED
/* ... */
else if (passed_pointer
         && FUNCTION_ARG_MAY_BE_DESTROYED (args_so_far, ...)
         && ! TREE_ADDRESSABLE (DECL_ARG_TYPE (parm)))
  {
    rtx copy;
    tree type = DECL_ARG_TYPE (parm);
    ...
  }
#endif /* FUNCTION_ARG_MAY_BE_DESTROYED */
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
#ifdef FUNCTION_ARG_CALLEE_COPIES
/* ... */
else if (passed_pointer
         && FUNCTION_ARG_CALLEE_COPIES (args_so_far, ...)
         && ! TREE_ADDRESSABLE (DECL_ARG_TYPE (parm)))
  {
    rtx copy;
    tree type = DECL_ARG_TYPE (parm);
    ...
  }
#endif /* FUNCTION_ARG_CALLEE_COPIES */

The inserted branch handles the case in which an incoming argument passed by reference may be destroyed for any unknown reason. The callee copies such arguments to its own place. GCC happens to do the same thing to handle the case in which an argument is passed by reference and it is the callee's responsibility to copy it before modifying the location. Although duplicating code in this way makes maintenance troublesome, and there is a way to handle both cases in a single else if branch, we currently treat them in two separate else if branches. The reasons are as follows. First, the condition under which the original branch is taken (i.e., FUNCTION_ARG_CALLEE_COPIES(...) is non-zero) may be improved in future; if gcc recognizes cases where the argument is not modified by the callee, the callee does not have to copy it. On the other hand, the inserted branch must copy it even when the callee itself does not modify it. Second, the original else if branch is guarded by #ifdef FUNCTION_ARG_CALLEE_COPIES. This makes it difficult to add anything that must be effective even if FUNCTION_ARG_CALLEE_COPIES is not defined, without modifying the line #ifdef FUNCTION_ARG_CALLEE_COPIES.
config/sparc/sparc.h:
Define two macros, MASK_CALLEE_COPIES_STRUCT_ARGS and TARGET_CALLEE_COPIES_STRUCT_ARGS.
To determine the correct value for MASK_CALLEE_COPIES_STRUCT_ARGS, look through sparc.h and search for all definitions that look like #define MASK_xxx. Each mask is a power of two and is used to check whether a particular SPARC-specific flag is set. Add a new mask to this series of masks. For example, if you find that the last mask defined in the file is 0x400000, you can use 0x800000 for MASK_CALLEE_COPIES_STRUCT_ARGS, leading to the following definition:
#define MASK_CALLEE_COPIES_STRUCT_ARGS 0x800000

Always define TARGET_CALLEE_COPIES_STRUCT_ARGS like this:

#define TARGET_CALLEE_COPIES_STRUCT_ARGS \
  (target_flags & MASK_CALLEE_COPIES_STRUCT_ARGS)

To summarize, you add the following lines:

#if !defined(STHREADS_SAFE) || STHREADS_SAFE
#define MASK_CALLEE_COPIES_STRUCT_ARGS 0x800000
#define TARGET_CALLEE_COPIES_STRUCT_ARGS \
  (target_flags & MASK_CALLEE_COPIES_STRUCT_ARGS)
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
Add a command line option -mcallee-copies-struct-args to the TARGET_SWITCHES macro. Search for #define TARGET_SWITCHES and add a line {"callee-copies-struct-args", MASK_CALLEE_COPIES_STRUCT_ARGS}, \ in an appropriate place. Do not forget the trailing backslash, which is necessary because it is in a macro definition. To guard the modification by an #if directive, we have to copy the entire definition, leading to the following code:

#if !defined(STHREADS_SAFE) || STHREADS_SAFE
#define TARGET_SWITCHES \
  { {"fpu", MASK_FPU | MASK_FPU_SET}, \
    {"no-fpu", -MASK_FPU}, \
    ...
    {"64", MASK_64BIT}, \
    {"stack-bias", MASK_STACK_BIAS}, \
    {"no-stack-bias", -MASK_STACK_BIAS}, \
    {"callee-copies-struct-args", MASK_CALLEE_COPIES_STRUCT_ARGS}, \
    SUBTARGET_SWITCHES \
    { "", TARGET_DEFAULT}}
#else /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
#define TARGET_SWITCHES \
  { {"fpu", MASK_FPU | MASK_FPU_SET}, \
    {"no-fpu", -MASK_FPU}, \
    ...
    {"64", MASK_64BIT}, \
    {"stack-bias", MASK_STACK_BIAS}, \
    {"no-stack-bias", -MASK_STACK_BIAS}, \
    SUBTARGET_SWITCHES \
    { "", TARGET_DEFAULT}}
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */
Define macro FUNCTION_ARG_MAY_BE_DESTROYED(CUM, MODE, TYPE, NAMED) as follows:

/* non-zero if an argument passed by reference may be destroyed for
   any unknown reason, even if the callee does not write to it by
   itself. */
#if !defined(STHREADS_SAFE) || STHREADS_SAFE
#define FUNCTION_ARG_MAY_BE_DESTROYED(CUM, MODE, TYPE, NAMED) \
  TARGET_CALLEE_COPIES_STRUCT_ARGS
#endif /* !defined(STHREADS_SAFE) || STHREADS_SAFE */

This macro takes parameters just for possible future extensions and for consistency with the existing macro FUNCTION_ARG_CALLEE_COPIES.
These patches essentially force a callee to copy incoming structure parameters to its private place. In the original SPARC v8 calling convention, when a procedure passes a structure to another procedure by value, it copies the structure to a place in the caller's frame and passes a pointer to it to the callee. The callee can assume that the structure persists throughout the lifetime of the procedure call. Setting FUNCTION_ARG_MAY_BE_DESTROYED(CUM, MODE, TYPE, NAMED) to non-zero tells gcc that this no longer holds; therefore the callee must copy its incoming structure parameters to its own place, no matter whether it modifies these arguments or not.
This chapter describes what happens on ST_THREAD_CREATE and how threads are distributed across processors. We then make a few remarks on programming practices that should be adopted or avoided, and on where you should insert ST_POLLING.
When procedure f calls ST_THREAD_CREATE(g()), f performs a normal C procedure call g(). What is important here is that it performs the procedure call just as if it were a sequential call. It also records the fact that the procedure call is a fork point. This way, the runtime system knows that the rest of f can progress even when g is blocked, or when an idle worker shows up. The exact method by which we record a fork point is unimportant; we note that it only takes a couple of instructions.
If g() terminates without blocking, it returns to f just as a normal procedure returns. What happens when g blocks? A thread blocks when, for example, g calls st_mutex_lock(m) but m is currently locked by another thread. In this case, we must be able to suspend the execution of g for now and later resume it when m is unlocked. We achieve this by splitting the stack at the fork point where g was spawned; the activation frame of g is logically decoupled from the stack and the rest of f continues (when f has not been migrated to another worker). g's activation frame physically remains in the original stack, occupying the top portion of the stack. The stack pointer is kept above g, while the frame pointer now points to f (we use terminology as if the stack grows upward: a frame x is above y if x is nearer to the growing end of the stack than y). Subsequent procedure calls by f allocate frames above g. When g later restarts its computation, g's frame is logically connected to the current frame and becomes its child. This frame may of course be different from g's original caller. When g terminates, control returns to the new parent, not the original caller.
In general, each worker keeps its stack pointer above any unfinished frame. A procedure call or ST_THREAD_CREATE allocates a new frame above the stack pointer. Frames of blocked (yet unfinished) threads are logically decoupled from the stack, but physically stay within the stack. Frames of restarted threads are reconnected to the stack. By leaving blocked frames in their original positions and making further allocations on top of them, we may leave free space in the middle of a stack. We do not try to reuse such holes within a stack; we always allocate a new frame on top of the stack. Once a region becomes free in the middle of a stack, it is reused only when no space is occupied above it. When such a situation occurs, the stack pointer is reset to the top of the unfinished frames. The exact details of how such space is reclaimed are not described here; we only mention that this reclamation occurs when the programmer calls ST_POLLING. This is the first reason why you have to call ST_POLLING periodically.
What about thread migration? Interestingly, a thread migration is just a combination of thread blocks and resumes. Suppose that both f and g are in worker W's stack and W wants to migrate f to another (idle) worker U. At this point, W temporarily blocks g and resumes f. Next, f also temporarily blocks and resumes the parent of f. The parent of f passes the context of f to U through shared memory and then restarts g. U picks up the context of f and restarts it.
In general, to migrate a thread from W to U, W first blocks all frames above the migrating thread. Next, W blocks the migrating thread too. W then restarts the threads above the migrating thread. The migrating thread is picked up by U and restarted. This way, the migrating thread is pulled out of one stack and reconnected to another.
A thread migration also occurs when ST_POLLING is called. When a worker becomes idle, it randomly selects another worker and sends a request. ST_POLLING checks whether a request has arrived at the calling worker and, if it has, performs the above operation to migrate a thread to the idle worker. This is the second (and the last) reason why you have to call ST_POLLING periodically.
The remaining question is, when a worker gets a request, how does it select a thread to migrate? We simply migrate the thread at the bottom of the stack. For example, suppose f performs ST_THREAD_CREATE(g()), g performs ST_THREAD_CREATE(h()), and ST_POLLING() finds a request from an idle worker. In this case, f migrates. When a second request soon arrives at the same worker, g migrates. When the call tree is fairly balanced, as in the Fibonacci example (see section An Example), this strategy works very well; all workers will soon get a sub-tree that is fairly near the root and concentrate on their local work. With this task distribution strategy in mind, let's go on to the next section.
There are many ways to express the parallelism that exists in your program. Suppose you want to perform 100,000 procedure calls f(i) (i = 0, ..., 99,999) in parallel. If they were done in series, they would most likely be written using a for loop:

for (i = 0; i < 100000; i++) { f(i); }
Therefore, you are likely to parallelize it as follows (actually, f may take extra arguments to achieve synchronization, and we may have to insert some ST_POLLINGs, but we ignore them for simplicity):

for (i = 0; i < 100000; i++) { ST_THREAD_CREATE(f(i)); }
This works only when each f involves a significant amount of work.
To get directly to the point, the recommended style is the so-called divide-and-conquer method. We introduce an auxiliary function f_aux(p, q), which performs f(p), ..., f(q-1), and perform the recursive calls in parallel (again, synchronization and ST_POLLING are omitted).

f_aux(p, q)
{
  if (p + 1 == q) f(p);
  else {
    m = (p + q) / 2;
    ST_THREAD_CREATE(f_aux(p, m));
    f_aux(m, q);
  }
}
This works well even when each f(p) does a small amount of work. The previous section has explained why this style works.
(This code performs roughly the same number of procedure calls to f_aux as procedure calls to f. Therefore it adds the overhead of an extra procedure call to each f. If f is too tiny to tolerate this overhead, change the condition in the second line so that f_aux performs several calls to f in a sequential loop. For example:

f_aux(p, q)
{
  if (p + T >= q) {
    for (i = p; i < q; i++) f(i);
  } else {
    m = (p + q) / 2;
    ST_THREAD_CREATE(f_aux(p, m));
    f_aux(m, q);
  }
}
Here, T is chosen so that calling f T times is likely to mask the overhead of a procedure call.)
What's wrong with the first program? Well, it is not always wrong, and I know it is preferable in terms of maintenance cost and simplicity. Here I explain what happens when we execute the first program, in which circumstances it is OK, and in which circumstances it should be avoided.
When a worker, W, creates a thread for f(0), W starts f(0). W keeps its continuation (representing f(1) ... f(99,999)) on its stack. A request from an idle worker, X, will soon arrive and W gives the continuation to X. Now X creates a thread for f(1), leaving the rest of the work f(2) ... f(99,999) on its stack. A request from another idle worker then steals it and invokes f(2), leaving f(3) ... f(99,999) on its stack, and so on. Observe that all these thread migrations occur in sequence. Assuming for simplicity that each f(i) takes approximately the same time, this program assigns iterations in a round-robin manner, resulting in approximately 100,000 sequential thread migrations. This places a lower bound on the time required to finish this loop, which is 100,000 times M, where M is the migration cost. The original sequential loop takes 100,000 times F, where F is the cost of one f(i). Therefore, the achievable speedup on any number of processors is F/M, which may be quite small when f performs only a tiny amount of work (M is on the order of a thousand instructions).
Here is a summary: if F represents a significant amount of work, or you have only a small number of processors so that the above limit is never reached, the simple parallel loop is OK.
The final topic is where to insert ST_POLLING. Recall that ST_POLLING serves two purposes.
One constraint is that you must insert enough ST_POLLINGs to make them work. Another constraint is that you want to minimize ST_POLLINGs to save your coding time (and to save some overhead, though the overhead is not large, around 10 instructions; when you are not sure, play safe and put it in).
Here are rules as to where you should insert ST_POLLING to make thread migration and stack management work. Each imposes similar but different constraints; the combined result is as follows:
In the last case, you need not do it when you know the loop is very small (and therefore never runs for a long time).
For example, in pfib, the original code without ST_POLLING is structured like this:

pfib (...)
{
  ... (a) ...
  if (...) {
    ... (b) ...
    return;
  } else {
    ... (c) ...
    pfib (...);
    pfib (...);
    ... (d) ...
  }
}
According to rule 1, you insert ST_POLLING at (a) or (c). We advise (c), because no procedure calls are made along the path (a)->(b)->return. This saves the overhead of ST_POLLING at leaves. According to rule 2, you insert ST_POLLING at (d).
Do you want to know why? Let me describe the constraints more precisely for thread migration and stack management.
First, for thread migration, you must ensure that ST_POLLING is called often enough that each worker is responsive to requests from idle workers. Following the above rules, the maximum interval in which no ST_POLLINGs are called is bounded (in terms of instruction counts), regardless of the input to the program. If the first rule were not followed, a recursive function could create an arbitrarily long polling-free interval. For example,

f(int n) { ...; f(n - 1); }

If there were no ST_POLLINGs in ..., the length of a polling-free interval would be proportional to n, which is not constant across all inputs.
If the second rule were not followed, a similar situation occurs while a chain of recursive calls is shrinking. For example,

f(int n) { f(n - 1); ...; }

If there were no ST_POLLINGs in ..., the length of a polling-free interval would be proportional to n, which is not constant across all inputs.
The third rule is irrelevant for thread migration.
The fourth rule should be obvious. In practice, you might want to save the overhead of calling ST_POLLING if the body of the loop is very small. If you know the entire loop does not take a long time, you may skip it.
For stack management, you must ensure that you do not miss a point where the entire space above the current frame (up to the stack pointer) is free. Notice that a new frame is always allocated on top of the stack pointer. Therefore, if you do not call ST_POLLING at such a point and make another procedure call, these frames do not reuse space that is otherwise reusable. In the worst case, this results in a stack overflow that could otherwise have been avoided.
A frame becomes dead only when a procedure returns; thus it suffices to call ST_POLLING just after a procedure returns. This is where the third rule comes from. Strictly speaking, all you need for stack management is this rule, but it may seem impractical when inserted manually by the programmer. As a rough approximation, the second rule is OK in practice. This is because, in common programming styles, a procedure F finishes only after all its descendants have finished. When this is the case, it is guaranteed that, if procedure F performs the last ST_POLLING and returns, the free space left on top of the stack is either none or the frame of F, because all descendants of F have also finished.
This section describes how to implement your own synchronizing data structure on top of the bottom-level primitives (st_suspend_thread and st_resume_thread). Before using the bottom-level primitives, examine whether it is possible to implement your synchronization just by combining the higher-level primitives (mutex, semaphore, and condition variables). High-level primitives often save you debugging time. In particular, condition variables are very general. You need the bottom-level primitives only when they are necessary for performance or when your synchronization is most conveniently expressed with them.
Consider a simple example in which two threads, a producer (writer) and a consumer (reader), synchronize through a location. The synchronization constraint is that the consumer must wait for the producer to put the result. We assume there is only a single writer and a single reader. We must still resolve the races between the writer and the reader.
 1 typedef struct buffer
 2 {
 3   st_int_loc_t status;     /* 0 for empty, 1 for full */
 4   int val;
 5   st_context_t waiter;     /* waiting context */
 6 } * buffer_t;
 7
 8 void init_buffer(buffer_t b)
 9 {
10   ST_INT_LOC_INIT(&b->status, 0);
11   b->waiter = 0;           /* no threads are waiting */
12 }
13
14 int GET(buffer_t b)
15 {
16   if (st_read_int(&b->status) == 1) {
17     return b->val;         /* full */
18   } else {
19     /* atomically swap -1 and the current status */
20     int s = st_read_and_lock_int(&b->status);
21     if (s == 1) {
22       int val = b->val;
23       st_write_and_unlock_int(&b->status, s);  /* unlock */
24       return val;
25     } else {
26       struct st_context c[1];
27       c->valid = 0;
28       b->waiter = c;
29       st_write_and_unlock_int(&b->status, 0);  /* unlock */
30       st_suspend_thread(c);
31       return b->val;
32     }
33   }
34 }
35
36 void PUT(buffer_t b, int val)
37 {
38   /* atomically swap -1 and the current status */
39   st_read_and_lock_int(&b->status);
40   b->val = val;
41   if (b->waiter) {
42     /* somebody is waiting */
43     st_context_t w = b->waiter;
44     st_write_and_unlock_int(&b->status, 1);    /* unlock */
45     st_resume_thread(w);
46   } else {
47     st_write_and_unlock_int(&b->status, 1);    /* unlock */
48   }
49 }
The buffer has three fields. The first is val, the value written by the writer. The second is waiter, a pointer to a context. When a reader is blocked waiting for the value, its context is written to waiter; waiter is zero when nobody is waiting. Finally, it has a status field describing the status of the buffer. It also serves as a lock variable to make modifications to multiple words atomic. It is either full, in which case the writer has already performed a PUT, or empty, in which case it has not.
GET checks b->status and, if it is not full, suspends the current thread by putting a context in b->waiter (lines 26--31). PUT checks b->waiter and, if it is not zero, resumes the waiting thread (lines 42--45).
To resolve possible race conditions between the writer and the reader, the above code looks more complicated than it ought to. When the reader decides to suspend, we must guarantee that the writer sees b->waiter after the reader writes to b->waiter. We guarantee this by associating a lock with the status field. Before the reader decides to suspend, it first obtains the lock by st_read_and_lock_int(&b->status). Since the writer also obtains the lock before reading b->waiter, it is guaranteed that at line 26, the writer is not reading b->waiter. The rest of the code is straightforward.
Here are some very important notes on the use of st_read_and_lock_int. First, do not forget to unlock. Second, do not spend a long time while you hold a lock on any location via st_read_and_lock_int. Once you obtain a lock, you must release it as soon as possible. Typically, what you do while holding a lock is a very short sequence of simple primitives like arithmetic. Do not call any function that may take a long time; for example, do not call malloc or printf. Finally, never try to hold two or more locks at the same time. This is naturally avoided if you keep critical sections as short as possible.
In short, st_read_and_lock_int itself should not be regarded as a thread synchronization primitive, but as a primitive to support atomic transactions on a single location. If for some reason you want exclusive access to a data structure for a long time (though this is already an inadvisable programming style), use synchronization primitives such as mutex.
stgcc (stg++) is a simple script that drives compilation. It retains the same interface as gcc (g++); it accepts the same set of command line options, except for those that fundamentally conflict with the execution scheme of StackThreads/MP. The following descriptions of stgcc also apply to stg++.
stgcc is not a preprocessor or a compiler that understands C/C++ syntax and translates C/C++ programs in non-trivial ways. In fact, it feeds the given C/C++ programs directly to gcc, passing the command line (almost) as is. As a consequence, you can use almost all command line options and non-standard extensions supported by gcc.
What is stgcc then? It is a postprocessor of gcc. Whatever options you give to stgcc, it lets gcc generate assembly code and postprocesses the assembly. The postprocessor is written in an AWK script and does the following jobs:
At link time, it runs another script (stlink.awk) to collect tables for all object files (therefore it is mandatory to call stgcc to generate the final executable).
Options of gcc that specify where to stop the compilation process (namely -E, -S, and -c) are also supported. Compilation transparently proceeds, if necessary, by compiling the postprocessed assembly file into an object file and by linking object files into an executable.
There are other minor things that stgcc does: it passes -I$ST_DIR to gcc, so that the standard include files are automatically searched.
Undocumented yet.
While StackThreads/MP tries to support multithreading with unmodified GCC/G++, which has no notion of multiple threads of control, there are certain limitations that are hard to fix without changing the compiler or calling conventions. Here I list all the problems that I currently know of.
If you apply the patches under gccpatch to your gcc, you need not worry about this problem (see section Applying Safety Patches to GCC for how to apply the patches).
In practice, this problem occurs only on Pentium, and only when you write a nested procedure call to a procedure that calls ST_THREAD_CREATE, ST_POLLING, ST_STEAL_REQ_CHECK, or any other synchronization primitive (mutex, semaphore, or a procedure compiled by stgcc). For example, if you write f(g(x), 1, 2, 3, 4) and g may call st_mutex_wait, this is a nested procedure call. In such cases, the outer procedure may receive wrong arguments. In this example, f may observe wrong arguments (more specifically, 1, 2, 3, and 4 may not be correctly passed to f).
To work around this problem without patching gcc, avoid nested procedure calls (at least on Pentium), and do not call the above primitives within C++ constructors (because constructors are often implicitly called to cast arguments).
In theory, this problem may occur whenever the compiler becomes cleverer. The execution scheme of StackThreads/MP manages the stack pointer so that it always points to the `top of the stack,' above which no unfinished frames are allocated. In particular, the stack pointer may not point to the currently running frame. Therefore, the value of the stack pointer may differ from gcc's idea of it. This mismatch causes a problem if gcc exploits its idea to optimize things (see section How does it work, basically? for more information about how StackThreads/MP manages the stack pointer).
The calling convention on a particular CPU determines how the stack pointer is managed. The compiler may assume that the stack pointer after a procedure call has a certain statically-computable offset from its original value, according to the calling convention. More specifically, in most calling conventions, the value of the stack pointer does not change across a procedure call; in other cases (e.g., the `Pascal' convention on Pentium), the callee is responsible for cleaning up arguments pushed on the stack. In either case, the compiler knows where the stack pointer points after a procedure call (more precisely, it statically knows how much it differs from its original value), and if it wishes, it may exploit this fact to optimize argument passing in procedure calls.
Suppose, for simplicity, a hypothetical convention in which the ith argument to a procedure is stored at SP[4(i-1)], where SP is the stack pointer, and SP does not change across a procedure call. Given a procedure call f(g(x), 1, 2, 3, 4), for example, the compiler may generate the following sequence:
 1  /* write 1, 2, 3, 4 */
 2  SP[ 4] = 1;
 3  SP[ 8] = 2;
 4  SP[12] = 3;
 5  SP[16] = 4;
 6  /* call g(x) */
 7  SP[ 0] = x;
 8  r = call g;
 9  /* write the result value */
10  SP[ 0] = r;
11  call f;
Since g is nested inside f, the sequence first performs the procedure call r = g(x) (line 8), and then f(r, 1, 2, 3, 4) (line 11). Before calling g, however, it puts some of the arguments to f at SP[4] ... SP[16] (lines 2--5). When g returns (at line 10), the value of SP is the same as its value at line 5, so f can correctly receive arguments 1 ... 4.
The fundamental problem here is that the argument-writing sequence for f was interrupted by another procedure call g. This was safe in the original convention, because a procedure call does not change SP, but it is not safe in the StackThreads/MP convention, in which SP points to whatever place happens to be the top of the stack at that point. In general, we must tell the compiler that SP becomes `unknown' after any procedure call.
As described, this problem in theory occurs whenever the compiler does a good job of placing arguments to procedures. It may in theory occur on any architecture and for almost any program, even if the program does not use a literally nested procedure call (note that there is no reason why the compiler cannot do the same optimization for the non-nested version of the above example, which is { r = g(x); f(r, 1, 2, 3, 4); }).
Let us now focus on what the current gcc does, and let me describe why I think this problem occurs only for literally nested procedure calls on Pentium. First of all, compilers use SP as the base register when they write arguments to procedures. For other purposes, the frame pointer (FP) is used. This is because FP never changes within a single procedure, whereas SP in general does because of alloca. Therefore named locations (e.g., local variables) are accessed via FP. On the other hand, arguments to procedures are going to be accessed via the FP of the called procedure, which is the SP of the current procedure. Therefore arguments are written via SP. This limits the situations in which a compiler can exploit the fact that SP is known after a procedure call. To summarize, a compiler exploits this fact only to schedule argument passing for procedure calls.
On all CPUs I know of except Pentium, the first several (typically 4 to 6 words of) arguments are passed in registers. This makes the optimization less important on RISC CPUs. Second, the optimization is potentially complex because alloca makes SP essentially unknown. Therefore any compiler that tries to optimize argument passing must take it into account; a compiler cannot simply assume "SP is constant."
We must make sure that a procedure call expression that appears in ST_THREAD_CREATE is not inline-expanded. For example, when you write ST_THREAD_CREATE(f(x)), you must make sure the procedure call f(x) is not inlined by gcc. When this condition holds, separate stack frames are allocated for f and the caller of f. The StackThreads/MP runtime can then execute f and the caller in parallel by executing the two frames on different processors. This is impossible if f(...) is inlined, because their stack frames are merged. The stgcc command automatically disables inlining by giving gcc suitable options (-fno-inline-functions for normal functions and -fno-default-inline for C++ member functions defined in a class definition). This is safe, but sometimes results in a noticeable slowdown compared to gcc, particularly for C++ programs.
Giving -fno-inline-functions and -fno-default-inline is very pessimistic, because we only have to make sure that inlining does not occur on parallel calls (ST_THREAD_CREATE). Inlining normal procedure calls is no problem; the problem is that we have no way to prohibit inlining at designated call sites. Here are a couple of suggestions to enable inlining of sequential calls.
Attach the inline keyword to functions that are called only sequentially. These functions are subject to inlining even when -fno-inline-functions and -fno-default-inline are given.
If attaching the inline keyword is onerous (as is typically the case in C++ programs), and if you are sure that procedures called with ST_THREAD_CREATE are somehow never inlined, you can supply --inline_functions_is_ok and --default_inline_is_ok to stgcc. This enables inlining by stgcc. The remaining problem is how you make sure that procedure calls in ST_THREAD_CREATE are never inline-expanded.
In practice, GCC/G++ never inline-expands a procedure call whose function part is not a simple function name. For example, in the following code, the procedure call f(x) is likely to be inline-expanded:
int f(int x) { return x + 1; }
int g(int x) { return 2 * f(x); }

On the other hand, the following procedure call is, in practice, never inline-expanded. Therefore you can always make sure that inlining won't occur by calling a function through a function pointer.
int f(int x) { return x + 1; }
int g(int x) { int (*pf)(int) = f; return 2 * pf(x); }

In this way, for all procedures whose definitions are visible in the same compilation unit, you can make the call go through a function pointer. The above code would not compile if the declaration of f were not given, but in that case gcc does not inline-expand the call either.
To summarize, if the called function is not a simple function name, the procedure call won't be inlined. If the called function is a simple function name and its definition is not known in the compilation unit, again gcc does not inline it. Finally, if the definition is known, you can obscure the called function by transforming the function call in the way described above.
If you apply the patches under gccpatch to your gcc, you can forget about this problem.
The problem arises when a procedure performs two or more logically concurrent procedure calls that take a structure as an argument. In such programs, a structure argument passed to one procedure call may be overwritten by subsequent ones.
For example,
 1 typedef struct S { ... } S;
 2 typedef struct T { ... } T;
 3
 4 void g(S x) { ... }
 5 void h(T y) { ... }
 6
 7 void f()
 8 {
 9   S s; T t;
10   ...
11   ST_THREAD_CREATE(g(s));
12   h(t);
13 }
Procedure calls g(s) and h(t) are logically concurrent and both take a structure argument. Procedure g may observe that its incoming parameter (x) is overwritten.
This problem occurs basically because the SPARC calling convention uses the same storage to pass structure arguments to procedures. In this example, the storage for passing s to g and the storage for passing t to h are the same. The above program is equivalent to:
void f()
{
  union { S s; T t; } struct_passing_area;
  ...
  ST_THREAD_CREATE(g(&struct_passing_area.s));
  h(&struct_passing_area.t);
}
This is not a problem in sequential programs, because when f calls h(t), the procedure call g(s) has terminated. This is not the case when g(s) is evaluated by a new thread. What happens is that the concurrent activities h and g unknowingly share an unnamed storage.
To sidestep this problem, any procedure that takes a structure parameter can copy the parameter into local storage just after it is called. For example, the above procedure g can be rewritten as follows.
void g(S _x) { S x = _x; ... }
If such a copy is too expensive, and if you can change the signature of g, you may pass the structure by reference as follows.
void f()
{
  S s; T t;
  ...
  ST_THREAD_CREATE(g(&s));
  h(&t);
}
This does not work if f itself modifies s after calling g. Here we are facing essentially the same problem as before, but this time the programmer can easily reason about it, because the storage is explicitly named by the programmer. The real problem in the first program is that an unnamed storage is shared, so the programmer cannot reason about it.
You may wonder why this problem does not occur on other CPUs, or for scalar arguments. Procedures normally place arguments to a procedure call in registers and at SP-relative locations (memory locations addressed via the stack pointer). On Mips, Alpha, and Pentium, this is the case for arguments of any type. SPARC places scalar arguments in registers and at SP-relative locations, but places structure arguments at FP-relative locations (memory locations addressed via the frame pointer). For parameters passed via registers, the compiler-generated code, if necessary, saves them when control departs from the procedure (by calling another procedure). For parameters passed via SP-relative locations, the StackThreads/MP runtime system ensures that, if control returns to f before g exits, the stack pointer is adjusted so that h(t) does not use the same storage as g(s). When control returns to f before g exits, the values of the stack pointer at line 11 and line 12 are different, effectively renaming the storage for SP-relative locations.
Machine code generated by stgcc assumes that the StackThreads/MP runtime has been initialized. The StackThreads/MP library defines a main function that initializes the runtime before anything else and then calls a user-supplied entry point, st_main. When you have a C++ class that has a constructor and you define a global or static variable of that class, the constructor is called before main. Therefore these constructors run before StackThreads/MP is initialized.
The problem occurs only when a constructor explicitly calls some StackThreads/MP primitives. If the constructor is a simple sequential constructor, as is typically the case, there are no problems.
If you ever wish to call a StackThreads/MP primitive from within a global constructor that runs before main, you must treat such a constructor call as a callback (see section Setup TLS Pointers for more information). To summarize, when you have such a constructor, write ST_CALLBACK_BEGIN() at the entry and ST_CALLBACK_END() before returning. For example, here is a class Foo, whose constructor creates a thread whenever an instance is created.
class Foo {
  Foo() {
    ST_CALLBACK_BEGIN();
    ST_THREAD_CREATE(fib(n, ...));
    ...
    ST_CALLBACK_END();
  }
};
See section Alloca is not supported (yet) for information and possible workarounds.
This problem rarely occurs. It arises when a procedure calls another procedure with extremely large arguments. In practice the problem occurs only when you have a structure that embeds a large array and you pass an instance of the structure by value. Here is an example that may cause the problem.
typedef struct bigstr { int a[1000]; } bigstr;
void bigstr_eater(bigstr bs) { ... }
void caller()
{
  bigstr b;
  bigstr_eater(b);
}
The stgcc command invokes gcc and examines the assembly code generated by gcc. It collects information about the stack frame formats of compiled procedures. Collected information includes, among other things, the size of the SP-relative region (the region accessed via the stack pointer) of each stack frame. The size of an SP-relative region is determined by gcc depending on the size of outgoing parameters. stgcc must know the size of the region to do stack management. It computes the size by scanning the assembly instruction sequence and finding instructions that store something using the stack pointer as the base register. For all such instructions, it checks the offset plus the operand size, and computes the maximum of these values. This is the size of the SP-relative region for that procedure.
In programs like the one above, the compiler may obscure such SP-relative stores. For example, if the location to be written cannot be reached by a store instruction using the stack pointer as the base, the address may be computed into a general-purpose register and the store then performed using that register as the base. Even worse, arguments may be written by calling functions like memcpy. Thus, stgcc may not be able to determine the size of SP-relative regions accurately for such programs, and as a consequence the stack management scheme may be broken.
I have never encountered this problem. You will not often encounter it either, because C/C++ programmers know that passing large structures by value is expensive in C/C++ and normally pass such structures by pointer.
There are some ways to make this problem even less likely to occur, but I cannot think of a perfect solution right now. To guarantee 100% safety, I guess we must eventually hack the compiler. We simply want to know gcc's idea of the size of SP-relative regions, in the form of a comment or a directive in the generated assembly. Such information should be readily available inside the compiler, and gcc seems to generate it on some CPUs. Let's make it available on all machines!
The problem occurs when you are using a C++ header file that uses #pragma interface. When multiple C++ sources include such a header file and are linked together, you observe many errors saying that symbols are duplicated. The symbols are the names of member functions defined in the header file. This is probably a gcc bug (it exists at least in version 2.7.2.1 on Solaris; I don't know the situation on other platforms or other versions of gcc), but I describe it here because the problem occurs fairly often (most importantly when you use the iostream.h that comes with g++).
The stgcc command gives -fno-default-inline to gcc. When gcc compiles a C++ source that includes a header file that uses #pragma interface, it generates all the symbols defined in the header as global symbols, rather than as file-scope or weak symbols. If multiple sources include such headers, the result is a link error. This is purely a gcc bug, I believe.
The problem does not occur without -fno-default-inline, in which case definitions of these symbols are not generated. The problem does not occur without #pragma interface either, in which case the symbols are defined as weak symbols. The gcc manual says the specification of this pragma is in transition.
It would be nice if you never encountered this problem unless you use this pragma. Unfortunately, however, library header files that come with gcc use it extensively! Most importantly, iostream.h uses it. A possible workaround is to give --default_inline_is_ok to stgcc, which instructs stgcc not to pass -fno-default-inline. But then you must make sure that gcc never inlines member function calls that appear in ST_THREAD_CREATE. You can achieve this by, for example, never calling member functions via ST_THREAD_CREATE. See section Possible Slowdown by Disabling Inline for more details about --default_inline_is_ok.
Q: I am already accustomed to Pthreads, why must I use StackThreads/MP in the first place?
A: Pthreads and StackThreads/MP provide a similar set of basic primitives (thread create and synchronization). The most fundamental difference is that StackThreads/MP creates a thread very cheaply and tolerates a very large number of threads. This makes a significant difference in parallel programming styles.
In parallel programming with Pthreads, the programmer normally creates a constant number of threads and threads share a common task pool to distribute work (even more simply, each thread is statically assigned to a particular task at compile time). A piece of work is represented by a datum and it enters the task pool when it is found necessary. This explicit representation of a task is already onerous (Recall that in sequential programs, you simply call a function to perform a particular task. You don't have to define a data structure that represents a task). Worse, to achieve scalability, you are often forced to make the task pool concurrently accessible from multiple threads. This complicates the structure of the task pool, making overheads of queue manipulation larger and programming errors more likely. Most importantly, the resulting parallel program obscures the logic of the computation and looks very different from the sequential program, making the cost of maintenance significantly larger.
In parallel programming with StackThreads/MP, the programmer is encouraged to create threads dynamically when a piece of work is found necessary. It provides a primitive that attaches a thread to any procedure call in C, and a thread creation costs only a little more than a sequential procedure call. The runtime system regards a thread as a unit of work and schedules whatever number of threads you create on whatever number of processors you have. Therefore, you need not manage a task pool explicitly. It is a library built on top of traditional thread libraries such as Pthreads. It internally spawns a fixed number of coarse-grain threads (i.e., threads in the Pthreads sense); each such thread creates an arbitrary number of fine-grain threads (i.e., threads in the StackThreads/MP sense) and distributes them among the underlying coarse-grain threads. The StackThreads/MP programming model is particularly attractive when starting from sequential source. Basically, some procedure calls in the sequential source are replaced by thread creations. In most cases, you can retain the overall structure of the sequential source. Everything comes from the fact that thread creation is very cheap; therefore you need not worry about "too many threads."
There are some other reasons why you might be interested in StackThreads/MP.
Alternatives to malloc in the libc: malloc is often a source of bottlenecks in parallel programs. Alternatives include a simple replacement of malloc that is much more scalable than the libc malloc, a parallel conservative garbage collector, and explicit memory management based on regions (a.k.a. zones).
No. It is intended to be a parallel programming tool. Pthreads, on the other hand, is intended to be both a parallel programming tool and a concurrent programming tool. Pthreads are sometimes used to spawn a `background' job in programs that handle asynchronous inputs. For example, GUI programs may spawn a thread to redraw the screen while another thread is listening to events. Server programs may spawn a thread to process a request from a client while another thread is servicing another. These programs create threads not so much because they want to shrink the total computation time on multiprocessors, but because they want to be responsive to inputs. In particular, it makes sense to create a thread even on uniprocessors, as long as threads are scheduled fairly.
Threads of StackThreads/MP do not replace such threads. In particular, StackThreads/MP does not guarantee any fairness. It only guarantees that, as long as there is any thread in the system (and the main procedure has not returned yet), it schedules some thread. However, StackThreads/MP provides a way to directly create a thread (or a group of threads) of the underlying thread package. Therefore you can comfortably mix fine-grain threads, which are cheap to create but not scheduled fairly, and threads of the underlying thread package, which are expensive to create but are scheduled fairly by the OS (see section Create a Worker Group for details).
Q: Compilation fails. Why?
A: stgcc simply calls gcc without any preprocessing, so it virtually never happens that an error reported by gcc is caused by stgcc. Be sure that your program compiles with gcc.
There is still a possibility that a program that compiles successfully with gcc does not compile successfully with stgcc.
Check the environment variables ST_DIR, ST_CPU_TYPE, ST_OS_TYPE, and ST_THREAD_PKG. See section Installation for details.
Be sure that your PATH environment variable contains $ST_DIR/bin.
Some command-line options of gcc are prohibited in stgcc. When stgcc detects such an option, it displays an error message and does not call gcc. See section STGCC/STG++: A wrapper for GCC for the list of prohibited options.
Be sure that gawk is correctly installed.
stgcc calls gcc to generate assembly and postprocesses the generated code. There may be errors (that I do not know of at this moment) in this process. When this happens, it displays a simple diagnostic message (that is useful at least for me :-) and aborts compilation. This is a bug of the postprocessor. When it happens, please send the assembly program before postprocessing and what the postprocessor said. To obtain the assembly program before postprocessing, give the --dbg flag to stgcc. It generates, among other things, a .t file. Send it to me (see section STGCC/STG++: A wrapper for GCC for command line options of stgcc, and section Reporting Bugs for how to report bugs).
The coding principle of the postprocessor is "never go a wrong way," meaning that it stops whenever it encounters something unfamiliar. It assumes stereotyped code generation by gcc and aborts when it finds anything unexpected.
Q: Link fails. Why?
A: First, see the previous item (see section Compilation fails). In addition, stgcc invokes the nm command at link time. Be sure that the command invoked by typing nm displays the contents of object files correctly.
Q: How do I obtain the necessary tools?
A: To use StackThreads/MP, you need make, sh, nm, gawk, and gcc. Virtually all UNIX platforms already have make, sh, and nm, although we recommend that you install GNU make. gawk is mandatory; awk cannot replace gawk. If you do not have nm, or nm on your system doesn't work, you can install GNU nm. It is included in binutils.
For UNIX systems, all these tools are available from a variety of sites. On Windows NT, all these tools are available as the Cygnus GNUWIN32 package. If you have no idea, visit the StackThreads/MP home page and follow links from there.
Q: Why does compilation take longer than gcc?
A: To be honest, compilation is noticeably slower than with plain gcc (as a rough estimate, stgcc takes twice as long as gcc). It probably comes mostly from the fact that the postprocessor is written in gawk. I want to keep it in awk for now, because that makes it very easy to fix bugs and add new platforms. I will eventually rewrite it for speed, possibly in a compiled language like C.
Q: When I ran a program compiled via stgcc, I got a segmentation fault. Why?
A: Ninety-nine times out of a hundred, it is not my fault :-) Here are some possible ways to isolate the problem.
Compile and run the program with gcc if possible (by commenting out all StackThreads/MP primitives).
Locate the fault in a debugger (Emacs's M-x gdb command is convenient, because an arrow displays the location of the offending instruction).
If this is the case, you can increase the stack size by giving the -ss S option to your program, where S specifies a stack size in bytes (-ss S is automatically recognized by the StackThreads/MP runtime; see section Command Line Options Common for All StackThreads/MP Programs for details). S can be a number, or a number followed by k (for Kbytes) or m (for Mbytes). For example, -ss 4m sets the stack size of each worker to 4 Mbytes. Also see section Stacks and Contexts for examining stack size at various program points.
Q: When I run a program compiled by stgcc on a single processor, it is slower than the serial version compiled by gcc. Why?
A: Although StackThreads/MP tries to minimize the overhead of running parallel programs, there are several sources.
ST_THREAD_CREATE adds something like ten instructions in addition to a normal procedure call.
ST_POLLING costs something like ten instructions.
These overheads are insignificant when thread and synchronization granularity are large enough (several hundred instructions between two events is large enough). Moreover, these overheads are fairly easy to reason about, because they all appear in the source program. There are other sources of overhead that are not directly visible in C/C++ sources.
stgcc by default prohibits inlining. This may slow down C++ programs considerably. See section Possible Slowdown by Disabling Inline for details and workarounds.
stgcc prohibits omitting frame pointers, an optimization commonly adopted on Mips and Alpha. This may add a few instructions to the procedure call overhead.
The last two items are not noticeable unless the program calls a very small function (one that performs, say, only 10 instructions in its body) too often. Such procedures are most likely to be inline-expanded if you attach the inline keyword. This eliminates the procedure call overhead altogether.
Q: My program didn't speedup at all. Why?
A: I don't know :-) Again, it is not likely to be my fault, but here are some guidelines for tracking down the problem.
Check whether you forgot to insert ST_POLLING; the program never speeds up when you omit it.
If the graph is mostly busy (red), parallelization itself is good. If the graph contains a wide steal (green) area, there is something wrong in the parallelization. One possibility is that you did not create enough threads in the green sections. If you are sure that there should be enough runnable threads there, check whether you call ST_POLLING often enough.
Using the mkxgrp command, it displays a single-line message that shows how much time workers spent in each state. Each number is the total over all workers. You can run the same program with various numbers of workers (using the -nw option) and see how the busy time increases. When (2) is the case, the typical behavior is that the busy time is roughly proportional to the number of processors.
Q: Why must I insert ST_POLLING?
A: ST_POLLING does two things. One is to distribute threads among processors; the other is to manage the stack. Without inserting ST_POLLING at all, your program never runs faster than on a single processor. See section Where should you insert ST_POLLING? for details.
Q: StackThreads/MP provides both spin-lock primitives and mutex locks. Why? And when should I use which?
A: Because each is useful in different situations. Mutex is useful when the critical section is not very short. When threads contend on a mutex, they enter the waiting queue of the mutex and the workers schedule other runnable threads. The effect is to serialize computation within these critical sections and perform other tasks in parallel.
To achieve scalable speedup, you must be careful that the total time spent in any critical section is far smaller than the ideal execution time on whatever number of processors you wish to scale to. For example, if the program spends T seconds on a single processor and the total amount of time that must be spent in a critical section is X, you must make sure that X << T/P, where P is the maximum number of processors on which you wish to run the program. Note that blocking a thread takes a fairly long time, so the above X must account for blocking/unblocking threads on a mutex.
A common practice is to do your best to make critical sections as short as possible. A typical critical section is as short as 10 instructions. Spin locks are useful for such short critical sections.
This document was generated on 3 April 2000 using the texi2html translator version 1.51.