[You won't be able to execute parallel Haskell programs unless PVM3
(Parallel Virtual Machine, version 3) is installed at your site.]
To compile a Haskell program for parallel execution under PVM, use the
-parallel option, both when compiling and
linking. You will probably want to import
Parallel into your Haskell modules.
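As a minimal sketch of the kind of program this section assumes (the function names are illustrative, not from this guide): under the PVM-based system you would write import Parallel for the par and seq combinators; the self-contained version below takes par and pseq from GHC.Conc in base instead, so it also compiles with an ordinary GHC.

```haskell
-- Minimal sketch of a parallel Haskell program. Under the PVM-based
-- system described here you would write "import Parallel"; this
-- self-contained version takes par/pseq from GHC.Conc (in base).
import GHC.Conc (par, pseq)

-- nfib n counts the calls needed to compute fib n, sparking the first
-- recursive call for possible evaluation on another processor.
nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = x `par` (y `pseq` x + y + 1)
  where
    x = nfib (n - 1)
    y = nfib (n - 2)

main :: IO ()
main = print (nfib 10)   -- prints 177
```

Compiled with -parallel and run with a -qp option, each par application becomes a spark that the runtime may evaluate on another processor; built without parallel support, the same program simply runs sequentially.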
To run your parallel program, once PVM is going, just invoke it
“as normal”. The main extra RTS option is
-qp<n>, to say how many PVM
“processors” your program is to run on. (For more details of
all relevant RTS options, please see Section 4.14.4.)
In truth, running Parallel Haskell programs and getting information
out of them (e.g., parallelism profiles) is a battle with the vagaries of
PVM, detailed in the following sections.
Before you can run a parallel program under PVM, you must set the
required environment variables (PVM's idea, not ours); something like the
following, probably in your .cshrc or equivalent:
setenv PVM_ROOT /wherever/you/put/it
setenv PVM_ARCH `$PVM_ROOT/lib/pvmgetarch`
setenv PVM_DPATH $PVM_ROOT/lib/pvmd
Creating and/or controlling your “parallel machine” is a purely-PVM
business; nothing specific to Parallel Haskell. The following paragraphs
describe how to configure your parallel machine interactively.
If you use parallel Haskell regularly on the same machine configuration it
is a good idea to maintain a file with all machine names and to make the
environment variable PVM_HOST_FILE point to this file. Then you can avoid
the interactive operations described below by passing that host file to
pvm when you start it.
You use the pvm command to start PVM on your
machine. You can then do various things to control/monitor your
“parallel machine”; the most useful being:

  Control-D       exit pvm, leaving it running
  halt            kill off this “parallel machine” & exit
  add <host>      add <host> as a processor
  delete <host>   delete <host>
  reset           kill what's going, but leave PVM up
  conf            list the current configuration
  ps              report processes' status
  pstat <pid>     status of a particular process
The PVM documentation can tell you much, much more about pvm!
With Parallel Haskell programs, we usually don't care about the
results, only about “how parallel” it was! We want pretty pictures.
Parallelism profiles (à la hbcpp) can be generated with the
-qP RTS option. The
per-processor profiling info is dumped into files named
<full-path><program>.gr. These are then munged into a PostScript picture,
which you can then display. For example, to run your program
a.out on 8 processors, then view the parallelism profile, do:
$ ./a.out +RTS -qP -qp8
$ grs2gr *.???.gr > temp.gr # combine the 8 .gr files into one
$ gr2ps -O temp.gr # cvt to .ps; output in temp.ps
$ ghostview -seascape temp.ps # look at it!
The scripts for processing the parallelism profiles are distributed
with GHC.
The “garbage-collection statistics” RTS options can be useful for
seeing what parallel programs are doing. If you do either
+RTS -Sstderr or +RTS -sstderr, then
you'll get mutator, garbage-collection, etc., times on standard
error. The standard error of all PEs other than the “main thread”
appears in /tmp/pvml.nnn, courtesy of PVM.
Whether doing +RTS -Sstderr or not, a handy way to watch
what's happening overall is: tail -f /tmp/pvml.nnn.
Besides the usual runtime system (RTS) options
(Section 4.16), there are a few options particularly
for concurrent/parallel execution.
(PARALLEL ONLY) Use <N> PVM processors to run this program;
the default is 2.
Set the context switch interval to <s> seconds.
A context switch will occur at the next heap block allocation after
the timer expires (a heap block allocation occurs every 4k of
allocation). With -C0 or -C,
context switches will occur as often as possible (at every heap block
allocation). By default, context switches occur every 20
milliseconds. Note that GHC's internal timer ticks every 20ms, and
the context switch timer is always a multiple of this timer, so 20ms
is the maximum granularity available for timed context switches.
(PARALLEL ONLY) Produce a quasi-parallel profile of thread activity,
in the file <program>.qp. In the style of hbcpp, this profile
records the movement of threads between the green (runnable) and red
(blocked) queues. If you specify the verbose suboption (-qv), the
green queue is split into green (for the currently running thread
only) and amber (for other runnable threads). We do not recommend
that you use the verbose suboption if you are planning to use the
hbcpp profiling tools or if you are context switching at every heap
check (with -C).
(PARALLEL ONLY) Limit the thread pool size, i.e. the number of concurrent
threads per processor, to <num>. The default is
32. Each thread requires slightly over 1K words in
the heap for thread state and stack objects. (For 32-bit machines, this
translates to 4K bytes, and for 64-bit machines, 8K bytes.)
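The threads being limited here are ordinary lightweight Haskell threads. As a self-contained sketch (runThreads is an illustrative name, not part of any API in this guide), forking several threads and collecting a result from each:

```haskell
-- Sketch: lightweight Haskell threads of the kind the thread pool limit
-- applies to. Each forked thread deposits its number in a shared MVar
-- and the parent collects them all. Control.Concurrent is in base.
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM)

runThreads :: Int -> IO Int
runThreads n = do
  box <- newEmptyMVar
  forM_ [1 .. n] $ \i -> forkIO (putMVar box i)
  xs <- replicateM n (takeMVar box)
  return (sum xs)

main :: IO ()
main = runThreads 8 >>= print   -- sums 1..8, i.e. prints 36
```

Each such thread costs the heap space described above, which is why the pool is bounded per processor.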
(PARALLEL ONLY) Limit the spark pool size,
i.e. the number of pending sparks per processor, to
<num>. The default is 100. A larger number may be
appropriate if your program generates large amounts of parallelism.
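Every par application adds one pending spark to this pool, so a list-wide parallel map can easily create hundreds of sparks. A self-contained sketch (parMapSketch is an illustrative name, not a GUM API; par is taken from GHC.Conc in base rather than the PVM-era Parallel module):

```haskell
-- Sketch: each `par` here adds one pending spark per list element,
-- which is exactly what the per-processor spark pool has to hold.
import GHC.Conc (par)

parMapSketch :: (a -> b) -> [a] -> [b]
parMapSketch _ []       = []
parMapSketch f (x : xs) = fx `par` (fx : parMapSketch f xs)
  where fx = f x

main :: IO ()
main = print (sum (parMapSketch (* 2) [1 .. 1000 :: Int]))  -- prints 1001000
```

Sparks that overflow the pool are simply discarded, so a program like this still computes the right answer; it just exposes less parallelism.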
(PARALLEL ONLY) Set the size of packets transmitted between processors
to <num>. The default is 1024 words. A larger number may be
appropriate if your machine has a high communication cost relative to
computation.
(PARALLEL ONLY) Select a packing scheme. Set the number of non-root thunks to pack in one packet to
<num>-1 (0 means infinity). By default GUM uses full-subgraph
packing, i.e. the entire subgraph with the requested closure as root is
transmitted (provided it fits into one packet). Choosing a smaller value
reduces the amount of pre-fetching of work done in GUM. This can be
advantageous for improving data locality, but it can also worsen the load
balance in the system.
(PARALLEL ONLY) Select a globalisation
scheme. This option affects the
generation of global addresses when transferring data. Global addresses are
globally unique identifiers required to maintain sharing in the distributed
graph structure. Currently this is a binary option. With <num>=0 full globalisation is used
(default). This means a global address is generated for every closure that
is transmitted. With <num>=1 a thunk-only globalisation scheme is
used, which generates global addresses only for thunks. The latter case may
lose sharing of data but has a reduced overhead in packing graph structures
and maintaining internal tables of global addresses.