Today’s embedded devices are increasingly based on multi-core
designs. Systems-on-a-chip (SoCs) often contain two or more processor
cores in homogeneous or heterogeneous combinations, and FPGA-based
designs can include a virtually unlimited number and variety of cores.
An asymmetric multiprocessing (AMP)-based RTOS is one approach to
utilizing multi-core processors; symmetric multiprocessing (SMP) is
another.
But from a historical perspective, multiprocessor/multicore designs
have been around almost as long as the computer in general. I believe
that Burroughs introduced the first commercially available one back in
1962; many more have followed. Most of them had very specific purposes
in mind.
Many of the early ones were used in data centers, and for scientific
applications. You can also make a case for them being in the early
desktops of the late 80s with the introduction of the math co-processor
and device controllers such as an Ethernet controller.
The question is: why are embedded device manufacturers becoming so
enamored with multi-core devices now? One reason is that embedded
devices have more and more tasks being heaped on them.
My first cell phone was a StarTek. It had a rudimentary LCD display
and frankly was not very useful for anything other than making a phone
call. It had a 24-hour standby time and maybe an hour talk time once
the battery was broken in. That was not a real problem though since my
first plan was for 30 minutes a month.
But look at today’s phones and the demands made on them: built-in
cameras, video capabilities, nice graphical interfaces for browsing the
Web, calendars and all sorts of games. It is no wonder that designers
are looking at ways to get more performance, and better form factor
while extending battery life.
Why multicores? Adam Smith's answer
One answer to my question may lie in economic theory, specifically Adam
Smith’s “Wealth of Nations.” It opens with a dramatic recommendation
that rather than each worker performing all tasks required for pin
making: rolling the wire, cutting the wire into pieces, sharpening the
wire and bobbing the end, each worker should specialize in performing a
single task. One worker would roll the wire; another would sharpen the
wire, another bob the ends and so on.
Smith suggested that a pin factory that had adopted such a
"specialization of labor" might produce tens of thousands of pins a day
whereas a pin factory in which each worker attempted to produce pins,
from start to finish, would produce very few pins.
Heterogeneous multicore designs such as the OMAP processor, which
employ both a DSP and an ARM core, embody Adam Smith’s concepts of both
division of labor and a specialization of labor. This approach is a
good one for manufacturers of cell phones and other electronic devices,
which perform a multitude of roles: the DSP efficiently performs
signal-processing activities and the ARM core performs the role of a
general-purpose processor. (Panel to
right)
Since power consumption is both a function of voltage and frequency
(power consumption increases proportionally with frequency and power
consumption increases to the square of voltage increase, the division
of labor aspect of multiprocessing is of particular interest to
manufacturers of portable devices.
Most portable electronic devices do not operate at peak capacity of
their processors all the time. However, to handle peak demand,
manufacturers have to provide a processor that can deliver the
processing power to meet peak demands. While the processor can be
throttled down between periods of peak performance, this may not be an
option for devices that require operation in an intermediate
performance range. For most processors they must either run at full
speed or be in an idle state.
This is where adding a number of slower processors may be appealing
to device manufacturers. If the issue of peak performance can be
segmented so that multiple processors can address peak performance
needs, they can then be throttled down or put to sleep during periods
of inactivity. For periods where performance demands are low, one or
two of the low performance processors can provide satisfactory
processing power.
The limits on your multicore design are essentially the same as the
constraints on classical multiprocessing systems: Amdahl’s law; serial
code execution; code execution granularity; load-imbalance;
synchronization; and cache coherency.
Illustrated in Figure 1 above,
Amdahl’s Law states, “The
non-parallel fraction of the code (i.e., overhead) imposes the upper
limit on the scalability of the code”. Thus, the maximum parallel
Speedup S(p) for a program that has a parallel fraction f:
S(p) < 1/(1-f)
The non-parallel (serial) fraction of the program includes the
communication and synchronization overhead.
So according to Amdahl, if you have a program or a suite of tasks
that is comprised of say 50 percent serial code, no matter how many
processors you add, you will only get a performance increase of 200
percent. On the other hand, if your code contains no serial code, like
four tasks that have no communication requirements, no shared
resources, etc, then you can achieve a 1000 percent speed up by going
from one processor to 10.
Serial code execution
What is serial execution? Execution is considered serial when one or
more cores in a multi-core system are not executing code. This can
occur due to a variety of reasons. Sometimes it is because there is no
reason for code to be executing on multiple cores. Basically, the
system is idle. From a performance perspective, this is nothing to
worry about. However, there are a number of situations where it would
be desirable for all of the processors to be executing code, but can’t
because of data or resource conflict, load-imbalance, synchronization
or the need to maintain cache coherency.
Spinlocks. One
prime example of how code becomes serialized is when there is the
potential for processes running on different processors to attempt to
gain access to a resource that is already in use by another process
(Figure 2, below).
OS developers use what is called a Spinlock to act as a gatekeeper
for a resource. A Spinlock (sometimes called a Spinlock Mutex) is a
special data structure used in multiprocessing. It is similar to a
Mutex or a semaphore in that it protects access to resources that
cannot be accessed by more than one process at a time. The primary
difference between a Spinlock and a Mutex is that the Spinlock depends
on a hardware ”test and set” mechanism that arbitrates between two or
more processes attempting access at the same time.
When a process attempts to access a resource that is in use by
another process, the “blocked” process(s) will “spin” until the
resource becomes available; hence the name Spinlock. In essence, the
processes queue up until they can get access to the resource. There are
different algorithms that can be used to assign priority to processes
waiting in a Spinlock. Regardless of the algorithm used, once a process
gets access to a resource, they keep it until they decide to give it
up.
The granularity of Spinlocks and the resources that they are used to
protect has dramatic impact on the degree to which application code is
serialized. Devices such as hard drives and other IO devices come to
mind when I think of shared resources but, depending on the type of OS,
kernel objects can also be protected by Spinlocks. The granularity of a
Spinlock is a function of the number and type of resources that they
are used to protect.
A coarse-grained implementation of a Spinlock would be used to
protect all IO devices. So, for example, once a process gains access to
any IO device, even a process that needs access to an unused resource
will spin in the Spinlock until the lock is released. The other extreme
would be to have a Spinlock protecting each shared resource in the
system.
Each approach has its advantages. A coarse-grained approach is
simple to implement, but has the potential to serialize the
application. The fine-grained approach makes more resources available
in parallel, but is more complicated and has the potential to exhibit
problems such as deadlock and other classical problems associated with
resource protection schemes used in single processor systems. This
problem can be overcome to a large degree through the use of watchdog
applications and other types of heuristics. However, it does add to the
overall overhead of the system, so it comes at a cost.
One final note: Spinlocks don’t just protect system data or
resources, they can also be applied to application data. For example,
if some device configuration data consists of a string of bytes, it
will need to be locked during update to ensure data integrity.
Granularity of Parallel Code
Execution
There are two general types of coding approaches that you can take with
multicore devices: fine-grained parallelism (in the case of the
connection machine and others that have hundreds or thousands of
processors – this is called massively parallel) and coarse-grained
parallelism.
Fine-grained architectures (Table
1, above) have between two and a
couple of dozen processors. An example of a massively parallel problem
approach would be if I want to add 2048 numbers together and had 1024
processors to do it with. In the first iteration, all 1024 processors
each add two numbers together. The second iteration of 512 processors
adds the 1024 numbers together and so on. It would only take 11
iterations to add all 2048 numbers together in this fashion.
Course-grained programming (Table
2, above and Figure 3 below), on
the other hand, approaches the problem at a level we are all more
familiar with; segmenting our programs at the task level. Legacy code
can be easily converted to run on multi-core architectures using the
course-grained approach.
It is usually a good fit for engineering departments as teams are
usually tasked at the application level. A good understanding of how
the overall system works is necessary so that the different tasks are
prioritized so that the execution is ordered in a way that optimizes
performance. The down side of course-grained parallelism is that it may
lead to load imbalance.
There are advantages and disadvantages to both approaches. The
fine-grained approach brings massive amounts of processing power to
bear on a problem. It also has the potential to utilize the processors
more equally in a load balancing system. That being said, other than
data plane applications there are very few control plane applications
in the embedded space that could be easily implemented that would
benefit from massive parallelism. Furthermore fine-grained segmentation
is almost impossible to do without the use of a compiler that is
explicitly designed to exploit massive parallelism.
On the other hand, if one has a compiler that is designed to exploit
fine-grained parallelism, or you have hand written the code to exploit
a large number of processors, then it is easy for all the processors to
be utilized efficiently. Another disadvantage in exploiting
fine-grained parallelism is that there is a large overhead to pay in
terms of synchronizing and communication overhead.
Synchronization Overhead
In the context of parallel execution, synchronization overhead is the
waiting and communicating incurred by interdependent tasks waiting for
other tasks to intermediate results in order to begin execution again.
Unlike waiting for a Spinlock to release because of resource
contention, synchronization overhead is caused when a task cannot
continue execution until an intermediate result is provided by another
task.
In the sum of 2048 numbers example used above, delays incurred due
to memory latency, interrupt or other factors cause the execution times
to vary between the different processing elements. As a result of the
different execution times, tasks that are dependent on these
intermediate results essentially stall until the result is ready. As
the parallelism of an application increases, the potential for
synchronism overhead also increases. In general, there is a point for
almost all problem domains where the addition of additional processing
elements will cause a slow down in performance.
Load Imbalance
Fine-grained applications typically take advantage of doing the same
activity many, many times in parallel. The example above used a loop
unrolling approach. Course-grained segmentation approaches parallelism
at the task level, because it is almost impossible to design and
segment code in such a way that each task has the same execution time.
The different execution times lead to what is called “load imbalance.”
As shown in Figure 4 below, if there are a discrete number of tasks
that need to be performed and the execution times are unequal, some
cores will be idle while others are still executing.
Cache Coherency in uniprocessor
designs
Cache memory is used to speed execution of a microprocessor. Cache
memory is typically implemented as a relatively small amount of
high-speed memory. Unlike main memory, cache memory can satisfy memory
reads and writes at processor frequency.
Main memory on the other hand typically takes two or more cycles to
access. When a main memory fetch occurs, the process stalls. This is
avoided by keeping anticipated or frequently used data and instructions
in cache. The processor will only stall when the information it needs
is not in cache.
One side effect of using this two-layer approach to memory access is
that if cache is written to, then cache and main memory no longer
contain the same information. When this occurs, cache and main memory
are considered incoherent. There are different approaches that can
address this problem. The two most common are “write through” and
“write back.”
The simplest cache coherency mechanism is “write through.” When
writing to cache using “write through” each time the cache is written
to, it also writes through to main memory. Thus coherency is
maintained. The down side is that the processor stalls while the write
through completes.
A more advanced implementation for maintaining cache coherency tries
to address stalling during a write by buffering up writes so that they
occur less frequently. When the buffered writes are written to memory,
they are written in “burst mode.” This technique is called “write
back.”
The write back implementation for obvious reasons is more
complicated than write through. The write back implementation requires
that a list of “dirty” cache lines be maintained. The dirty entries
indicate cache lines whose contents are not the same as in main memory.
When the cache loads a cache line from main memory, its contents are
marked as clean. When the processor writes to a cache location, an
entry is made to the cache list to indicate that the cache entry is
dirty. Later, depending on when the “caching algorithm” decides, the
dirty cache entries are written to main memory.
Both write through and write back schemes can be implemented either
in hardware or in software. All modern commercial microprocessors I am
aware of use hardware to enforce cache coherency.
Multi-core Cache Coherency
Utilizing cache in a multi-core design takes the problem of cache
coherency to the next level. There are a number of multicore
architectures on the market now that utilize shared memory. These
architectures also utilize cache memory to speed execution.
Now, in addition to maintaining coherency between memory and a
single cache, the cache coherency scheme must address simultaneous
reads and writes to an individual processor’s cache while maintaining
coherency with the shared memory as well with all the other cache
memories that are used by the other processors.
The only efficient mechanism for doing this is by using cache
coherency hardware. Regardless of the hardware assist method used to
maintain cache coherency, using cache in multi-core systems tends to
serialize process execution. The question is to what degree. In later
articles, we will explore how different hardware architectures address
this.
Todd Brian is product marketing
manager at Accelerated
Technology, a Mentor Graphics division.
Part
Two of this series will cover various aspects of multiprocessing from
an RTOS perspective, looking at two different approaches - symmetric
(SMP) and asymmetric (AMP) multiprocessing - their benefits and
limitations in real time, deterministic environments.
To learn more about this topic, go to More
about multicores, multiprocessing and tools