Parallel programming has been around for decades, though before the
advent of multi-core processors, it was an esoteric discipline.
Numerous programmers have tripped over the common stumbling blocks by
now. By recognizing these problems you can avoid stumbling.
Furthermore, it is important to understand the common problems
before designing a parallel program, because many of the problems arise
from the overall decomposition of the program, and cannot be easily
patched later. This series of articles looks at some of these common
problems, their symptoms, and ways to circumvent them.
Too Many Threads
It may seem that if a little threading is good, then a lot must be
better. In fact, having too many threads can seriously degrade program
perform-ance. The impact comes in two ways. First, partitioning a fixed
amount of work among too many threads gives each thread too little
work, so that the overhead of starting and terminating threads swamps
the useful work. Second, having too many concurrent software threads
incurs overhead from having to share fixed hardware resources.
When there are more software threads than hardware threads, the
operating system typically resorts to round robin scheduling. The
scheduler gives each software thread a short turn, called a time slice,
to run on one of the hardware threads.
When a software thread's time slice runs out, the scheduler
preemptively suspends the thread in order to run another software
thread on the same hardware thread. The software thread freezes in time
until it gets another time slice.
Time slicing ensures that all software threads make some progress.
Otherwise, some software threads might hog all the hardware threads and
starve other software threads. However, this equitable distribution of
hardware threads incurs overhead. When there are too many software
threads, the overhead can severely degrade performance. There are
several kinds of overhead, and it helps to know the culprits so you can
spot them when they appear.
Thread register
state. The most obvious overhead is the process of saving and
restoring a thread's register state. Suspending a software thread
requires saving the register values of the hardware thread, so the
values can be restored later, when the software thread resumes on its
next time slice. Typically, thread schedulers allocate big enough time
slices so that the save/restore overheads for registers are
insignificant, so this obvious overhead is in fact not much of a
concern.
A more subtle overhead of time slicing is saving and restoring a
thread's cache state. Modern processors rely heavily on cache memory,
which can be about 10 to 100 times faster than main memory. Accesses
that hit in cache are not only much faster; they also consume no
bandwidth of the memory bus. Caches are fast, but finite.
When the cache is full, a processor must evict data from the cache
to make room for new data. Typically, the choice for eviction is the
least recently used data, which more often than not is data from an
earlier time slice. Thus threads tend to evict each other's data. The
net effect is that too many threads hurt performance by fighting each
other for cache.