Part 2 looks at the tradeoffs between program and data cache optimizations, and shows how to choose the best compromise.
As we saw in the first two parts of this series, cache optimization is often based on the rearrangement of data and instructions. In this article we discuss an optimization technique for instruction caches based on the placement of functional blocks of code in memory.
The Optimization Concept
The main idea behind code placement is to position functional blocks in memory based on the program's function call tree flow. Linking functions in this manner allows us to harness spatial locality in memory accesses patterns. (For definitions of locality, see part 1.)
Modern caches have mechanisms which prefetch memory that is in the vicinity of the currently executing block. During the execution of a function, code placed after it in main memory is brought into the instruction cache. If the functions have been linked wisely, instruction cache misses can be minimized, resulting in a high cache hit rate.
As an example, consider the function call tree flow in Figure 1. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.

Figure 1. Function call tree flow example.
The Optimization Method
We saw that rearranging functions in memory can reduce instruction cache misses. The techniques of code positioning (which are performed at the linker level) can be extended to basic blocks of code (which can be performed at compiler level). A basic block is a linear sequence of code with one entry point, one exit point and no jump instructions.
Software optimization techniques using code positioning typically involve the following steps:
- Analyze the behavior of the code
- Generate a new function placement that yields better performance
To analyze the behavior of the code, we profile code using a variety of input data and observe the resulting path of code flow. We can evaluate the code flow for either the most likely code path (typical case) or the most cycle-consuming path (worst case). To evaluate the typical case, we build a tree with every node or leaf representing a function. A node's children represent the functions called by the parent node function, and are placed left to right in the order in which they are invoked by the parent. If a function is called from several places, it is placed only once in the tree in the first place it is called.
The functions are numbered in the order that they are first called. If the tree is traversed using the depth-first (DF) method and the numbers are recorded in the nodes, we should get 1,2,3,4,.,last_number, where last_number equals the number of functions involved.