Improved Parallel Cache-Oblivious Algorithms for
Dynamic Programming and Linear Algebra
Abstract
Emerging nonvolatile main memory (NVRAM) technologies provide byte-addressability, low idle power, and improved memory density, and are likely to be a key component in the future memory hierarchy. However, a critical challenge in achieving high performance is accounting for the asymmetry that NVRAM writes can be significantly more expensive than NVRAM reads.
In this paper, we consider a large class of cache-oblivious algorithms for dynamic programming (DP) and linear algebra, and try to reduce the writes in the asymmetric setting while maintaining high parallelism. To achieve this, our key approach is to show the correspondence between these problems and an abstraction of their computation, which we refer to as the $k$-d grids. Then, by showing lower bounds and new algorithms for computing $k$-d grids, we obtain a list of improved cache-oblivious algorithms for many DP recurrences and linear algebra problems in the asymmetric setting, both sequentially and in parallel.
Surprisingly, even without considering the read-write asymmetry (i.e., setting the write cost to be the same as the read cost in the algorithms), the new algorithms improve the existing cache complexity of many problems. We believe the reason is that the extra level of abstraction of $k$-d grids helps us to better understand the complexity and difficulty of these problems. We believe that the novelty of our framework is of interest and leads to many new questions for future work.
1 Introduction
The ideal-cache model [46] is widely used in designing algorithms that optimize the communication between CPU and memory. The model is comprised of an unbounded memory and a cache of size $M$. Data are transferred between the two levels using cache lines of size $B$, and all computation occurs on data in the cache. An algorithm is cache-oblivious if it is unaware of both $M$ and $B$. The goal of designing such algorithms is to reduce the cache complexity (in this paper referred to as the symmetric cache complexity, to distinguish it from the case when reads and writes have different costs; equivalently, the I/O cost) of an algorithm, which is the number of cache lines transferred between the cache and the main memory assuming an optimal (offline) cache replacement policy. Sequential cache-oblivious algorithms are flexible and portable, and adapt to all levels of a multi-level memory hierarchy. Such algorithms are well studied [8, 28, 39], and in many cases they asymptotically match the best cache complexity of cache-aware algorithms. Regarding parallelism, Blelloch et al. [20] suggest that analyzing the depth and sequential cache complexity of an algorithm is sufficient for deriving upper bounds on parallel cache complexity.
Table 1: Summary of the cache complexity bounds (symmetric and asymmetric) shown in this paper.

Dimension | Problems | Cache Complexity (Symmetric / Asymmetric)
2 | LWS/GAP*/RNA/knapsack recurrences | (see Sections 7–8)
3 | Combinatorial matrix multiplication, Kleene's algorithm (APSP), Parenthesis recurrence | (see Sections 7–8)
Recently, emerging nonvolatile main memory (NVRAM) technologies, such as Intel's Optane DC Persistent Memory, have become readily available on the market, and provide byte-addressability, low idle power, and improved memory density. Due to these advantages, NVRAMs are likely to be the dominant main memories in the near future, or at least a key component in the memory hierarchy. However, a significant programming challenge arises due to an underlying asymmetry between reads and writes—reads are much cheaper than writes in terms of both latency and throughput. This property requires researchers to rethink the design of algorithms and software, and to optimize existing ones to reduce writes. Such algorithms are referred to as write-efficient algorithms [52].
Many cache-oblivious algorithms are affected by this challenge. Take matrix multiplication as an example: for square matrices of size $n$ by $n$, the cache-aware tiling-based algorithm [4] uses $O(n^3/(B\sqrt{M}))$ cache-line reads but only $O(n^2/B)$ cache-line writes. The cache-oblivious algorithm [46], despite the advantages described above, uses $\Theta(n^3/(B\sqrt{M}))$ cache-line reads and writes. When writes are more expensive than reads, the cache-oblivious algorithm is no longer asymptotically optimal. Can we asymptotically improve the cache complexity of these cache-oblivious algorithms? Can they match the best counterparts that do not require cache-obliviousness? These have been open problems since the very beginning of the study of write-efficiency of algorithms [16, 29].
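To make the read/write gap concrete, here is a minimal Python sketch (our own illustration; names and the tile-size parameter are not from the paper) of the cache-aware tiled algorithm. Each tile of the output matrix is completed before moving on, so it is written back only once, which is the source of the $O(n^2/B)$ write bound; reads remain $O(n^3/(B\sqrt{M}))$ for tile size $\Theta(\sqrt{M})$.

```python
def tiled_matmul(A, B, n, t):
    """Multiply n x n matrices A and B (lists of lists) using t x t tiles.

    Each t x t tile of C is fully accumulated across the whole kk-loop
    before being abandoned, so every output element is written back to
    memory only once; the repeated traffic is on the (cheaper) reads.
    """
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):            # tile row of C
        for jj in range(0, n, t):        # tile column of C
            for kk in range(0, n, t):    # inner dimension, tile by tile
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + t, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Choosing $t = \Theta(\sqrt{M})$ (so that three tiles fit in cache) yields the stated read bound; a smaller $t$ only increases reads, never writes.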
In this paper, we provide answers to these questions for a large class of cache-oblivious algorithms that have computation structures similar to matrix multiplication and can be coded as nested for-loops. Their implementations are based on a divide-and-conquer approach that partitions the ranges of the loops and recurses on the subproblems until the base case is reached. Such algorithms arise in dynamic programming (e.g., the LWS/GAP/RNA/Parenthesis problems) and linear algebra (e.g., matrix multiplication, Gaussian elimination, LU decomposition) [46, 34, 36, 31, 20, 60, 77, 74, 83, 78].
Since we try to cover many problems and algorithms, in this paper we propose a level of abstraction of the computation in these cache-oblivious algorithms, which we refer to as the $k$-d grid computation structures ($k$-d grids for short). A more formal definition is given in Section 3. This structure and similar ones were first used by Hong and Kung [56] (implicitly) in their seminal paper in 1981, and then in a sequence of later works (e.g., [9, 2, 59, 10, 36]), mostly for analyzing lower bounds for matrix multiplication and linear algebra problems in a variety of settings. In this paper, we show the relationship between the $k$-d grids and many other dynamic programming problems, as well as new results (algorithms and lower bounds) for the $k$-d grids themselves.
The first intellectual contribution of this paper is to draw the connection between many dynamic programming (DP) problems and the $k$-d grids. Previous DP algorithms are usually designed and analyzed based on the number of nested loops, or the number of dimensions in which the input and output are stored and organized. However, we observe that the key underlying factor in determining the cache complexity of these computations is the number of input entries involved in each basic computation cell; this relationship is defined formally in Section 3. A few examples (e.g., matrix multiplication, tensor multiplication, the RNA and GAP recurrences) are also provided in Section 3 to illustrate the idea. This property is reflected by the nature of the $k$-d grids, and the correspondence between the problems and the $k$-d grids is introduced in Sections 7 and 8. We note that this relationship can be much more complicated than for the linear algebra algorithms, and in many cases the computation of one algorithm consists of many $k$-d grids together with restrictions on the order of computation.
The second intellectual contribution of this paper is a list of new results for the $k$-d grids. We first discuss lower bounds for computing such $k$-d grids considering the asymmetric cost between writes and reads (Section 4). Based on the analysis of the lower bounds, we then propose algorithms with matching bounds to compute a $k$-d grid (Section 5). Finally, we show how to parallelize the algorithms in Section 6. We note that the approach for parallelism is independent of the asymmetric read-write cost, so the parallel algorithms apply in both the symmetric and asymmetric settings.
In summary, we have shown the correspondence between the problems and the $k$-d grids, new lower and upper cache complexity bounds for computing the $k$-d grids in the asymmetric setting, and parallel algorithms in both the symmetric and asymmetric settings. Putting the pieces together, we obtain lower and upper cache complexity bounds for the original problems in both the symmetric and asymmetric settings, as well as spans (lengths of the dependence chains) of the algorithms. The cache complexity bounds are summarized in Table 1, and the results for the asymmetric setting answer the open problem in [16]. The span bound is analyzed for each specific problem and given in Sections 7 and 8 and the appendices.
Surprisingly, even without considering the read-write asymmetry (i.e., setting the write cost to be the same as the read cost in the algorithms), the new algorithms proposed in this paper improve the existing cache complexity of many DP problems. We believe the reason is that the extra level of abstraction of $k$-d grids helps us to better understand the complexity and difficulty of these problems. Since $k$-d grids are used as a tool for lower bounds, they decouple the computation structures from the complicated data dependencies, which exposes techniques to improve the bounds that were previously obscured. Also, $k$-d grids reveal the similarities and differences between these problems, which allows the optimizations in some algorithms to be applied to others.
In summary, we believe that the framework for analyzing cache-oblivious algorithms based on $k$-d grids provides a better understanding of these algorithms. In particular, the new theoretical results in this paper include:

We provide write-efficient cache-oblivious algorithms (i.e., in the asymmetric setting) for all problems discussed in this paper, including matrix multiplication and several other linear algebra algorithms, all-pairs shortest paths, and a number of dynamic programming recurrences. If a write costs $\omega$ times more than a read (the formal computational model is given in Section 2), the asymmetric cache complexity is improved on each problem compared to the best previous results [18]. In some cases, we show that this improvement is optimal under certain assumptions (the CBCO paradigm, defined in Section 4.2).

We show algorithms with improved symmetric cache complexity for many problems, including the GAP recurrence, protein accordion folding, and the RNA recurrence. We show that the previous cache complexity bounds for the GAP recurrence and protein accordion folding are not optimal, and we improve the bounds for both problems. (The improvement is from an asymptotic perspective, i.e., as the input size approaches infinity; for smaller inputs the improvement is correspondingly smaller, and the computation fully fits into the cache when the input is small enough.) For the RNA recurrence, we show an optimal cache complexity bound that improves the best existing result.

We show the first race-free linear-span cache-oblivious algorithms for all-pairs shortest paths, LWS recurrences, and protein accordion folding. Some previous algorithms [75, 41] have linear span, but they are not race-free and rely on a much stronger model (discussed in Section 2). Our approaches are in the standard nested-parallel model, race-free, and arguably simpler. Our algorithms are not in-place, but we discuss the extra storage needed in Section 6.1.
We believe that the analysis framework is concise. In this single paper, we discuss lower bounds and parallel algorithms for a dozen or so computations and DP recurrences, which can be further applied to dozens of real-world problems (for instance, we abstract the "2-knapsack recurrence", which fits into our $k$-d grid computation structure and applies to many algorithms). The results are shown in both settings, with and without the asymmetric cost between reads and writes.
2 Preliminaries and Related Work
Ideal-cache model and cache-oblivious algorithms. In modern computer architecture, a memory access is much more expensive than an arithmetic operation due to larger latency and limited bandwidth (especially in the parallel setting). To capture the cost of memory accesses, the ideal-cache model, a widely-used cost model, is a two-level memory model comprised of an unbounded memory and a cache of size $M$. (In this paper, we often make a constant-factor assumption on the cache size since it simplifies the description and only affects the bounds by a constant factor.) Data are transferred between the two levels using cache lines of size $B$, and all computation occurs on data in the cache. The cache complexity (equivalently, the I/O cost) of an algorithm is the number of cache lines transferred between the cache and the main memory assuming an optimal (offline) cache replacement policy. An algorithm in this model is cache-oblivious if, additionally, it is not aware of the values of $M$ and $B$. In this paper, we refer to this cost as the symmetric cache complexity (as opposed to the asymmetric setting discussed later). Throughout the paper, we assume that the input and output do not fit into the cache, since otherwise the problems become trivial. We usually make the tall-cache assumption that $M = \Omega(B^2)$, which holds for real-world hardware and is used in the analysis in Section 7.3.
The nested-parallel model and work-span analysis. In this paper, the parallel algorithms are given in the standard nested-parallel model, a computation model that provides an easy analysis of work-efficiency and parallelism. In this model, a computation starts and ends with a single root task. Each task has a constant number of registers and runs a standard instruction set from a random access machine, except that it has one additional instruction called FORK, which creates two independent tasks that can be run in parallel. When the two tasks finish, they join back and the computation continues.
A computation can be viewed as a (series-parallel) DAG in the standard way. The cost measures in this model are the work and the span—the work $W$ is the total number of operations in the DAG, and the span (depth) $D$ is the length of the longest path in the DAG. The randomized work-stealing scheduler can execute such a computation on a PRAM with $p$ processors in $W/p + O(D)$ time with high probability [26]. All algorithms in this paper are race-free [43]—no two logically parallel parts of an algorithm access the same memory location with at least one of the accesses being a write. Here we do not account for the extra write cost of asymmetric memory in $W$ and $D$, to simplify the statement of the results; we capture the asymmetry only in the cache complexity.
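As a small illustration (our own, not from the paper), the work and span of the classic eight-way recursive matrix-multiplication DAG can be computed by unrolling their recurrences $W(n) = 8\,W(n/2) + O(1)$ and $D(n) = 2\,D(n/2) + O(1)$ (the two halves of the inner dimension update the same output block, so they run sequentially, while the four output blocks run in parallel):

```python
def work(n):
    """W(n) = 8 W(n/2) + 1: all eight subproblems contribute, W(n) = Theta(n^3)."""
    return 1 if n == 1 else 8 * work(n // 2) + 1

def span(n):
    """D(n) = 2 D(n/2) + 1: only the two sequential rounds add up, D(n) = Theta(n)."""
    return 1 if n == 1 else 2 * span(n // 2) + 1
```

For example, `span(n)` evaluates to $2n - 1$, matching the $\Theta(n)$ span of this algorithm without auxiliary memory.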
Regarding parallel cache complexity, Blelloch et al. [20] suggest that analyzing the span and sequential cache complexity of an algorithm is sufficient for deriving upper bounds on parallel cache complexity. In particular, let $Q$ be the sequential cache complexity. Then for a $p$-processor shared-memory machine with private caches (i.e., each processor has its own cache) using a work-stealing scheduler, the total number of cache misses across all processors is at most $Q + O(pDM/B)$ with high probability [1]. For a $p$-processor shared-memory machine with a shared cache of size $M + pBD$ using a parallel-depth-first (PDF) scheduler, the number of misses is at most $Q$ [15]. These bounds can be extended to multi-level hierarchies of private or shared caches, respectively [20].
Parallel and cache-oblivious algorithms for dynamic programming and linear algebra. Dynamic programming (DP) is an optimization strategy that decomposes a problem into subproblems with optimal substructure. It has been studied for over sixty years [12, 5, 38]. For the problems that we consider in this paper, parallel DP algorithms were already discussed in a rich literature in the eighties and nineties (e.g., [49, 51, 42, 58, 57, 72]). Later work considers not only parallelism but also symmetric cache complexity (e.g., [46, 34, 36, 31, 20, 60, 77, 74, 75, 41, 73, 32]). Algorithms in linear algebra that share similar computation structures (but with different orders of computation) have also been discussed (e.g., [36, 41, 83, 78, 25, 40, 11, 65]).
Problem definitions. Since we show many optimal cache-oblivious algorithms, we need formal problem definitions. It is hard to show general lower bounds when any type of operation is allowed; for example, for matrix multiplication on a semiring, the only known lower bound on the number of operations is for Boolean matrix multiplication assuming SETH (more details on fine-grained complexity in [82]). Here we make no assumptions about the semiring other than that its "$+$" and "$\times$" operations are atomic, take unit cost, and cannot be decomposed or batched (i.e., no integer tricks). We borrow the term combinatorial matrix multiplication for this specific problem. Such an algorithm requires $\Theta(n^3)$ operations on square matrices of size $n$ ($n^2$ inner products). Regarding the dynamic programming in Section 7, we discuss the recurrences rather than the problems, and make the assumptions given in Sections 3 and 7.
Algorithms with asymmetric read and write costs. Intel has announced the Optane DC Persistent Memory, which can now be bought from many retailers. The new memories sit on the main memory bus and are byte-addressable. As opposed to DRAM, the new memories are persistent, so we refer to them as non-volatile RAMs (NVRAMs). In addition, compared to DRAM, NVRAMs require significantly lower energy, and have good read latencies and higher density. Due to these advantages, NVRAMs are likely to be the dominant main memories in the near future, or at least a key component in the memory hierarchy. However, a new property of NVRAMs is the asymmetric read and write cost—write operations are more expensive than reads in terms of energy, bandwidth, and latency (benchmarking results in [79]). This property requires researchers to rethink the design of algorithms and software, and motivates the need for write-efficient algorithms [52] that reduce the number of writes compared to existing algorithms.
Blelloch et al. [13, 16, 17] formally defined and analyzed several sequential and parallel computation models that take the asymmetric read-write cost into account. The Asymmetric RAM (ARAM) model extends the two-level memory model with a parameter $\omega$, which corresponds to the cost of a write relative to a read of the non-volatile main memory. In this paper, we refer to the asymmetric cache complexity as the number of write transfers to the main memory multiplied by $\omega$, plus the number of read transfers. This model captures different system considerations (latency, bandwidth, or energy) simply by plugging in different values of $\omega$, and allows algorithms to be analyzed both theoretically and practically. Similar scheduling results (upper bounds) on parallel running time and cache complexity are discussed in [13, 17] based on the work $W$, span $D$, and asymmetric cache complexity of an algorithm. Based on this idea, many interesting algorithms and lower bounds have been designed and analyzed in various recent works [13, 16, 17, 23, 61, 14, 21, 19, 53].
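The ARAM cost measure is easy to operationalize. The following sketch (a hypothetical helper of our own, not an API from the cited works) tallies the asymmetric cache complexity as the number of read transfers plus $\omega$ times the number of write transfers:

```python
class ARAMCost:
    """Tally asymmetric cache complexity Q = reads + omega * writes."""

    def __init__(self, omega):
        self.omega = omega      # relative cost of a write transfer (omega >= 1)
        self.reads = 0          # cache lines transferred from main memory
        self.writes = 0         # cache lines transferred to main memory

    def read(self, lines=1):
        self.reads += lines

    def write(self, lines=1):
        self.writes += lines

    def cost(self):
        return self.reads + self.omega * self.writes
```

Setting `omega = 1` recovers the symmetric cache complexity, so the same instrumented algorithm can be analyzed in both settings.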
We assume that the input and output are larger than the cache in the asymmetric setting (i.e., the input is not smaller than or close to the cache size), since this is the more interesting case both theoretically and practically. Otherwise, the asymmetric cache complexity of every algorithm in this paper is trivially dominated by the cost of writing the output ($\omega$ times the output size, divided by $B$). In the analysis, we always assume that the input size is much larger than the cache size (which is usually the case in practice); otherwise, both the upper and the lower bounds on cache complexity also include this additive output term. For simplicity, this term is ignored in the asymptotic analysis.
Carson et al. [29] also discussed algorithms using fewer writes. Their results are under different assumptions that disallow the use of additional reads, and we discuss how these assumptions affect the algorithms in Appendix E.
Discussion of previous work. We now clarify several possible points of confusion about this paper.
The $k$-d grid computation structure in this paper is similar to the structure in Hong and Kung [56] and some subsequent work in linear algebra (e.g., [9, 2, 59, 10, 36]). Several recent papers on DP algorithms are also based on grid structures (e.g., [31, 60, 77]), but the definitions in those papers differ from the $k$-d grids in this paper. In a $k$-d grid, the dimension of the grid is related to the number of input entries per basic computation unit (formal definition in Section 3), not to the dimension of the input/output arrays. However, in the special case when the number of input entries per basic computation cell equals the dimension of the input/output arrays, the analysis based on $k$-d grids provides the same sequential symmetric cache complexity as previous work. One such example is matrix multiplication [46]; this is also the case for other linear algebra algorithms in the full version of this paper, and for the LWS and Parenthesis recurrences in Sections 7.1 and 7.4. In these cases we still provide new lower and upper asymmetric cache bounds as well as parallel approaches (new span bounds). For other problems (GAP, RNA, protein accordion folding, knapsack), the bounds in the symmetric setting are also improved.
Since we believe the abstraction and definition are new, the proofs of the lower bounds in Section 4 focus on the simplest sequential setting with a limited-size cache, which differs from the parallel or distributed settings with infinite-size local memory discussed in previous work (e.g., [9, 2, 59, 10]). It is interesting future work to extend the results in this paper to those more complicated settings, especially the asymmetric setting.
Some previous work [75, 41] achieves linear span for several problems. We note that it assumes a much stronger model to guarantee the sequential and parallel execution order, so those algorithms need specially designed schedulers [41, 30]. Our algorithms are much simpler and work in the nested-parallel model. Also, all algorithms in this paper are race-free, while the previous algorithms rely heavily on concurrent writes to improve the span. The space usage of our algorithms is discussed in Section 6.1.
The dynamic programming recurrences discussed in this paper have non-local dependencies (definition given in [51]); we point out that they are quite different from problems such as edit distance or stencil computations (e.g., [33, 47, 54, 67]) that have only local dependencies. We do not consider other types of dynamic programming approaches, such as rank convergence or hybrid divide-and-conquer algorithms [69, 70, 35], that cannot guarantee processor- and cache-obliviousness simultaneously.
3 The $k$-d Grid Computation Structure
The $k$-d grid computation structure ($k$-d grid for short) is defined as a $k$-dimensional grid of size $n_1 \times \cdots \times n_k$. Here we consider $k$ to be a small constant greater than 1. The computation requires $k-1$ input arrays and generates one output array. Each array has dimension $k-1$ and is the projection of the grid obtained by removing one of the dimensions. Each cell in the grid represents some computation that requires $k-1$ inputs and generates a temporary value. This temporary value is "added" to the corresponding location in the output array using an associative operation $\oplus$. The inputs of a cell are the projections of the cell obtained by removing each of the first $k-1$ dimensions, and the output is the projection obtained by removing the last dimension. These are referred to as the input and output entries of the cell. Figure 1 illustrates such a computation in 2 and 3 dimensions. This structure (mostly the special 3d case defined below) was used implicitly and explicitly by Hong and Kung [56] and some subsequent works (e.g., [9, 2, 59, 10, 36]). In this paper, we use it as a building block to prove lower bounds and design new algorithms for dynamic programming problems. When analyzing the cache complexity, we assume the input and output entries of a cell must be in the cache when the cell is computed.
We refer to a $k$-d grid as a square grid of size $n$ if it has size $n \times \cdots \times n$ ($k$ times). More concisely, we say a $k$-d grid has size $n$ if it is square with side length $n$.
A formal definition of a square 3d grid of size $n$ is as follows:

$$C_{i,j} \;=\; \bigoplus_{l=1}^{n} f\bigl(A_{i,l},\, B_{l,j},\, i,\, j,\, l\bigr),$$

where $1 \le i, j \le n$. Here $f$ computes a value based on the two inputs and the indices, plus some constant amount of data associated with the indices; we assume that computing $f$ takes unit cost. Each application of $f$ corresponds to a cell, and $A_{i,l}$, $B_{l,j}$, and $C_{i,j}$ are the entries associated with this cell. The sum is based on the associative operator $\oplus$. Similarly, the definition for the 2d case is:

$$C_{i} \;=\; \bigoplus_{l=1}^{n} f\bigl(A_{l},\, i,\, l\bigr),$$

and we can extend the definitions to non-square grids and to larger $k$ accordingly.
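The square 3d grid can be sketched directly in code (our own illustration; the names `grid3d`, `f`, and `oplus` are ours). Instantiating $\oplus = \min$ and $f$ as addition gives the $(\min,+)$ product used in shortest-path computations, while $\oplus = +$ and $f$ as multiplication gives ordinary matrix multiplication:

```python
def grid3d(A, B, n, f, oplus, identity):
    """Compute the square 3d grid C[i][j] = oplus over l of f(A[i][l], B[l][j]).

    f combines the two input entries of a cell; oplus is the associative
    operator that accumulates the cell's temporary value into the output;
    identity is the identity element of oplus.
    """
    C = [[identity] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = identity
            for l in range(n):
                acc = oplus(acc, f(A[i][l], B[l][j]))  # one cell (i, j, l)
            C[i][j] = acc
    return C
```

The loop order here is irrelevant to correctness (the cells of a grid are independent up to the accumulation), which is exactly the freedom the divide-and-conquer algorithms in later sections exploit.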
We allow the output $C$ to be the same array as the input array(s). This is the case for all DP recurrences. In these algorithms, some of the cells are empty to avoid cyclic dependencies; for example, in a 2d grid we may want to restrict the computation to the cells below the diagonal. In such cases, a constant fraction of the grid cells are empty. We call such a grid an $\alpha$-full grid for some constant $\alpha > 0$ if at least an $\alpha$ fraction of the cells are non-empty. All properties we show for a $k$-d grid also hold for the $\alpha$-full case, since the constant $\alpha$ affects neither the lower bounds nor the algorithms.
We now give some examples that match $k$-d grids. Multiplying two matrices of size $n$ by $n$ on a semiring exactly matches a 3d square grid. A corresponding 2d case is a matrix-vector multiplication in which the matrix does not need to be stored. Such applications are common in dynamic programming algorithms. For example, the widely used LWS recurrence (Section 7.1), which computes $D_j = \min_{0 \le i < j} \{D_i + w(i,j)\}$, is a 2d grid, and the associative operator is $\min$. In this case the input is the same array as the output. These are the simple cases, so even without the $k$-d grid abstraction, algorithms for them in the symmetric setting have already been studied in [34, 46].
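As a concrete sketch (ours; the weight function `w` stands in for the problem-specific one), here is the LWS recurrence viewed as a 2d grid: each cell $(i,j)$ with $i < j$ combines the input entry $D_i$ into the output entry $D_j$, and the cells with $i \ge j$ are empty, so the grid is $\alpha$-full for a constant $\alpha$.

```python
def lws(n, w):
    """Solve D[j] = min over 0 <= i < j of (D[i] + w(i, j)), with D[0] = 0.

    Each evaluation of D[i] + w(i, j) is one non-empty cell (i, j) of the
    underlying 2d grid; the associative operator is min.
    """
    D = [0] * (n + 1)
    for j in range(1, n + 1):
        D[j] = min(D[i] + w(i, j) for i in range(j))
    return D
```

Note that the input array and the output array coincide here, which is why the order of computation matters for the recurrence even though it does not for a plain 2d grid.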
However, not all DP recurrences can be viewed as $k$-d grids straightforwardly. As noted above, the key aspect in deciding the dimension of a computation is the number of input entries that each basic cell requires. For example, when multiplying two dense tensors, although each tensor may have multiple dimensions, each multiplication operation is based on only two entries and can be written in the 3d form above, so the computation is a 3d grid. Another example is the RNA recurrence, which computes a 2D array; assuming the weight function can be queried on the fly, the computation is the simplest 2d grid. Although the DP table is two-dimensional, the computation is no harder than the simplest LWS recurrence mentioned in the previous paragraph. Similarly, in the GAP recurrence in Section 7.2, each element in the DP table is computed from many other elements, similarly to matrix multiplication. However, each update requires the value of only one input element, so the computation can be represented by a set of 2d grids, unlike matrix multiplication, which is a 3d grid and uses the values of two input elements in each update. The exact correspondence between the $k$-d grids and the DP recurrences is usually more sophisticated than in the classic linear algebra applications, as shown in Sections 7 and 8 and the appendices. The cache-oblivious algorithms discussed in this paper are based on $k$-d grids with $k = 2$ or $3$, but there are also applications with larger $k$ (e.g., a Nim game with certain rules on multiple piles [27]).
We note that our definition of the $k$-d grid cannot abstract all possible recurrences and computations, but it is sufficient for analyzing the DP recurrences and algorithms shown in Sections 7 and 8 and the appendices. Also, the $k$-d grid is designed to analyze computations with non-local dependencies [51], so it is not useful for problems such as classic edit distance or matrix addition.
4 Lower Bounds
We first discuss lower bounds on the cache complexity of a $k$-d grid computation structure, which set the targets for the algorithms designed in the following sections. In Section 4.1 we show the symmetric cache complexity; this is a direct extension of the classic result by Hong and Kung [56] to an arbitrary dimension. Then in Section 4.2 we discuss the asymmetric cache complexity when writes are more expensive than reads, which is more interesting and requires a more involved analysis.
4.1 Symmetric Cache Complexity
The symmetric cache complexity of a $k$-d grid is simple to analyze, yielding the following result: {theorem}[[56]] The symmetric cache complexity of a $k$-d grid computation structure of size $n$ is $\Theta\bigl(n^k/(B\,M^{1/(k-1)})\bigr)$. The following proof is an extension of the proof by Hong and Kung [56]; we show it here for completeness and to help with understanding the proof in Section 4.2, which has a similar outline.
Proof.
In a $k$-d grid computation structure of size $n$ there are $n^k$ cells. Let us sequentialize these cells in a list and partition the list into blocks of $\Theta(M^{k/(k-1)})$ consecutive cells each, with the constant chosen so that each block requires at least $2M$ entries. The input entries required by a block are the projections of its cells along each of the first $k-1$ dimensions (see Figure 1), and similarly for the output. The Loomis-Whitney inequality [68, 10] implies that the overall number of input and output entries is minimized when the cells form a square $k$-d cuboid, and even then a block of $\Theta(M^{k/(k-1)})$ cells touches $\Omega(M)$ input and output entries. Since only $M$ entries can be held in the cache at the beginning of the computation of a block, the number of cache-line transfers during the computation of each block is $\Omega(M/B)$. Since there are $\Theta(n^k/M^{k/(k-1)})$ such blocks, the cache complexity of the entire computation is $\Omega(n^k/(B\,M^{1/(k-1)}))$. ∎
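Instantiating the bound for the two grid dimensions used most in this paper (a reconstruction consistent with the classic matrix-multiplication case):

```latex
Q(n) \;=\; \Theta\!\left(\frac{n^k}{B\,M^{1/(k-1)}}\right),\qquad
Q_{k=2}(n) \;=\; \Theta\!\left(\frac{n^2}{B\,M}\right),\qquad
Q_{k=3}(n) \;=\; \Theta\!\left(\frac{n^3}{B\sqrt{M}}\right).
```

The $k=3$ case recovers the well-known $\Theta(n^3/(B\sqrt{M}))$ bound for matrix multiplication [56], and the $k=2$ case corresponds to recurrences such as LWS, where a cache-resident block of inputs can be reused against a cache-resident block of outputs.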
Notice that the proof does not assume cache-obliviousness, but the lower bound is asymptotically tight: it is matched by a sequential cache-oblivious algorithm based on $2^k$-way divide-and-conquer [46].
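The matching algorithm can be sketched as follows (our own compact version of the divide-and-conquer scheme of [46] for the 3d grid, assuming for simplicity that $n$ is a power of two): halve every dimension and recurse on the eight sub-grids. Once a subproblem fits in the cache, its entire computation costs $O(\text{size}/B)$ transfers, which gives the leaf-dominated total $O(n^3/(B\sqrt{M}))$.

```python
def co_matmul(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Add the product of the n x n blocks of A (at (ai, aj)) and B (at (bi, bj))
    into the n x n block of C at (ci, cj); n must be a power of two."""
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    for di in (0, h):            # row half of the C block
        for dj in (0, h):        # column half of the C block
            for dk in (0, h):    # the two k-halves update the same C block
                co_matmul(A, B, C,
                          ai + di, aj + dk,
                          bi + dk, bj + dj,
                          ci + di, cj + dj, h)
```

The algorithm never inspects $M$ or $B$; the cache-fitting subproblem size is reached automatically during the recursion, which is exactly the cache-oblivious property.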
4.2 Asymmetric Cache Complexity
We now consider the asymmetric cache complexity of a $k$-d grid computation structure when writes are more expensive than reads. This case is significantly harder than the symmetric setting. Again, for simplicity we first analyze the square grid of size $n$, which can be extended to the more general cases similarly to [46].
Interestingly, there is no specific pattern that a cache-oblivious algorithm has to follow. Some existing algorithms use "buffers" to support cache-obliviousness (e.g., [7]), and many others use a recursive divide-and-conquer framework. For the recursive approaches, when the cache complexity of the computation is not leaf-dominated (as in various sorting algorithms [46, 16]), a larger fan-out in the recursion is preferable. Otherwise, when it is leaf-dominated, existing efficient algorithms all pick a constant fan-out in the recursion in order to reach a base case that fits in the cache with the maximal possible subproblem size. All problems we discuss in this paper are in this category, so we make our analysis under the following constraint. More discussion of this constraint is given in Section 9.
[CBCO paradigm] We say a divide-and-conquer algorithm is under the constant-branching cache-oblivious (CBCO) paradigm if it has an input-value-independent computational DAG such that each task has a constant fan-out of recursive subtasks until the base cases (the constant can depend exponentially on $k$, since we assume $k$ is a constant), and the partition of each task is decided by the ratio of the ranges in all dimensions of the (sub)problem, independently of the cache parameters ($M$ and $B$).
Notice that $\omega$ is a parameter of the main memory, not a cache parameter, so the algorithms can be aware of it. One can define resource-obliviousness [37] so that the value of $\omega$ is not exposed to the algorithm, but this is out of the scope of this paper.
For divide-and-conquer cache-oblivious algorithms, the overall memory footprints at different recursive levels can be root-dominated, balanced, or leaf-dominated. For existing cache-oblivious matrix-multiplication algorithms on symmetric memories that have optimal cache complexity, each task has a constant number of subtasks. This is because the recurrence for the I/O cost of matrix multiplication is leaf-dominated, and constant branching ensures that a subtask fits into the cache once its memory footprint is between $M/c$ and $M$, where $c$ is at most the number of branches. A non-constant branching, on the other hand, can delay this point in the worst case, and thus increase the overall cache complexity. Notice that the CBCO paradigm is not required for root-dominated or balanced divide-and-conquer computations.
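As a concrete instance of a leaf-dominated recurrence (our own illustration), the eight-way cache-oblivious matrix-multiplication algorithm satisfies

```latex
Q(n) \;=\; 8\,Q\!\left(\frac{n}{2}\right) + O\!\left(\frac{n^2}{B}\right),
\qquad
Q\!\left(\Theta(\sqrt{M})\right) \;=\; O\!\left(\frac{M}{B}\right),
```

so the $\Theta\bigl((n/\sqrt{M})^3\bigr)$ base cases contribute the dominant $\Theta\bigl(n^3/(B\sqrt{M})\bigr)$ term; a super-constant fan-out would reach base cases of smaller size and hence increase this leaf cost.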
We now prove the (sequential) lower bound on the asymmetric cache complexity of a d grid under the CBCO paradigm. The constant branching and the partition based on the ratio of the ranges in all dimensions restrict the computation pattern and lead to the “scalefree” property of the cacheoblivious algorithms: the structure or the “shape” of each subproblem in the recursive levels is similar, and only the size varies. The proof references this property when it is used. The CBCO paradigm also restricts the shape of the computation, which is a stronger assumption than the LoomisWhitney inequality used in the previous proof.
The asymmetric cache complexity of d grid is under the CBCO paradigm.
Proof.
We prove the lower bound using the same approach as in Section 4.1—putting all operations (cells) executed by the algorithm into a list and analyzing blocks of cells. The cache can hold entries as temporary space for the computation. For the lower bound, we only consider the computation in each cell, without considering the step of adding the calculated value back into the output array, which only makes the problem easier. Again, when applying the computation of each cell, the input and output entries have to be in the cache.
For a block of cells of size , the cache needs to hold the entries in and corresponding to the cells in this block at least once during the computation. Similar to the symmetric setting discussed above, the number of entries is minimized when the sequence of operations lies within a d cuboid of size whose projections on and are d arrays with sizes and . Namely, the number of entries is at least for the corresponding input or output array.
Note that the input arrays are symmetric to each other with regard to the access cost, but in the asymmetric setting storing the output entries is more expensive, since they have to be written back to the asymmetric memory. As a result, the cache complexity is minimized when , and let us denote , where is the ratio between and the other . Here we assume , since reads are cheaper. Due to the scale-free property that and are arbitrary, should be fixed (within a small constant range) for the entire recursion.
Similar to the analysis for Theorem 4.1, for a block of size , the number of read transfers required is , where is the number of such blocks, and lower-bounds the number of reads per block, because at most entries can be carried over in the cache from the previous block. Similarly, the write cost is . In total, the cost is:
The second step is due to .
The cost decreases as increases, but there are two discontinuity points and . Therefore,
Setting minimizes the cost. ∎
5 A Matching Upper Bound on Asymmetric Memory
In the sequential and symmetric setting, the classic cache-oblivious divide-and-conquer algorithm to compute the d grid (e.g., the 3D case shown in [46]) is optimal. In the asymmetric setting, the proof of Theorem 4.2 indicates that the classic algorithm is not optimal, but off by a factor of . This gap is captured by the balancing factor in the proof, which leads to more cheap reads and fewer expensive writes in each subcomputation.
We now show that the lower bound in Theorem 4.2 is tight, via a (sequential) cache-oblivious algorithm with matching asymmetric cache complexity. The algorithm is given in Algorithm 1, and can be viewed as a variant of the classic approach with minor modifications to how the computation is partitioned. Notice that in lines 1 and 1, “conceptually” means the partitions are used for ease of description; in practice, we can simply pass the ranges of indices of the subtask in the recursion, instead of actually partitioning the arrays.
Compared to the classic approaches (e.g., [46]) that partition the largest input dimension among , the only underlying difference in the new algorithm is in line 1—when partitioning a dimension not related to the output array (lines 1–1), has to be times larger than . This modification introduces an asymmetry between the input size and output size of each subtask, which leads to fewer writes in total and an improvement in cache efficiency.
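As a concrete illustration—not the paper’s Algorithm 1 itself, and with function and parameter names of our own—the modified partition rule can be sketched as follows, where `omega` stands for the assumed read/write cost ratio and `dims[0]` plays the role of a dimension not indexing the output array:

```python
def choose_split_dim(dims, omega):
    """Pick which dimension of a (sub)grid to halve next.

    dims[0]:  size of the input-only dimension (does not index the output).
    dims[1:]: sizes of the dimensions indexing the output array.
    omega:    read/write cost ratio (omega >= 1); a hypothetical parameter.

    Halve the input-only dimension while it is at least omega times the
    largest output dimension, keeping subproblem inputs (cheap reads)
    omega times larger than their outputs (expensive writes)."""
    if dims[0] >= omega * max(dims[1:]):
        return 0
    # otherwise halve the largest output-related dimension
    return 1 + max(range(len(dims) - 1), key=lambda i: dims[1 + i])
```

Under such a rule, the recursion on a square grid shrinks the output dimensions first, until the input dimension is the asymmetry factor times larger, mirroring the balance point in the lower-bound proof.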
For simplicity, we show the asymmetric cache complexity for square grids (i.e., ) and ; the general case can be analyzed similarly to [46].
Algorithm 1 computes the d grid of size with asymmetric cache complexity .
Proof.
We analyze the numbers of reads and writes in Algorithm 1 separately. In the sequential execution of Algorithm 1, each recursive function call only requires extra temporary space. Our analysis also ignores rounding issues, since they do not affect the asymptotic bounds.
Starting from the square grid, the algorithm first partitions the first dimensions (via lines 1 to 1) into subproblems (referred to as second-phase subproblems), each of size , and then partitions dimensions in turn until the base case is reached.
The number of writes of the algorithm (to array ) follows the recurrences:
where is the number of writes of the second-phase subproblems with the size of being . The base case is when . Solving the recurrences gives , and .
We can analyze the reads similarly by defining and . The recurrences are therefore:
and
The difference from the write cost is in the base case, since the input fits into the cache sooner when . Namely, . Solving the recurrences gives and .
Compared to the classic approach, the new algorithm improves the asymmetric cache complexity by a factor of , since the classic algorithm requires reads and writes. Again, here we assume is much larger than ; otherwise, the lower and upper bounds should include for storing the output in memory.
When the size of the matrix multiplication is , Algorithm LABEL:algo:AsymMM has cache complexity
To be verified.
6 Parallelism
We now show how to compute d grids in parallel. The parallel versions of the cache-oblivious algorithms have only polylogarithmic depth, indicating that they are highly parallel.
6.1 The Symmetric Case
We first discuss how to parallelize the classic algorithm on symmetric memory. For a square grid, the algorithm partitions the dimensions in turn until the base case is reached.
Notice that in every consecutive partitions, there are no dependencies in of them, so we can fully parallelize those levels at no additional cost. The only exception is the partition in the th dimension, where both subtasks share the same output array, causing concurrent writes. If these two subtasks are sequentialized (as in [46]), the depth is .
We now introduce the algorithm with logarithmic depth. As just explained, to prevent the two subtasks from modifying the same elements in the output array , our algorithm works as follows when partitioning the th dimension:

1. Allocate two stack-allocated temporary arrays of the same size as the output array before the two recursive function calls.
2. Apply the computation of the d grid in the two subtasks, each using one of the newly allocated output arrays (no concurrency with the other subtask).
3. When both subtasks finish, merge (add) the computed values back in parallel, with work proportional to the output size and logarithmic depth.
4. Deallocate the temporary arrays.
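The four steps above can be rendered, for the simplest case of a 2-d grid computing out[j] as the sum of g(i, j) over i, by the following sketch (names and the thread-pool mechanism are ours; a real implementation would use fork–join parallelism and stack allocation):

```python
from concurrent.futures import ThreadPoolExecutor

def grid_reduce(g, lo, hi, n_out, pool, cutoff=2):
    """Compute out[j] = sum of g(i, j) over i in [lo, hi).

    Splitting the reduction dimension makes both halves target the same
    output, so each half writes into its own freshly allocated buffer
    (steps 1-2) and the buffers are added together afterwards (step 3);
    the buffers die when this call returns (step 4)."""
    if hi - lo <= cutoff:
        out = [0] * n_out
        for i in range(lo, hi):
            for j in range(n_out):
                out[j] += g(i, j)
        return out
    mid = (lo + hi) // 2
    left = pool.submit(grid_reduce, g, lo, mid, n_out, pool, cutoff)
    right = grid_reduce(g, mid, hi, n_out, pool, cutoff)
    # step 3: merge the two temporary outputs (itself parallelizable)
    return [a + b for a, b in zip(left.result(), right)]

with ThreadPoolExecutor(max_workers=8) as pool:
    result = grid_reduce(lambda i, j: 10 * i + j, 0, 8, 2, pool)
    # result == [280, 288]
```

The two recursive calls run concurrently on disjoint buffers, so the computation is race-free, matching the scheme described in the text.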
Notice that the algorithm also works if we allocate temporary space for only one of the subtasks, while the other subtask works on the original output array. This can be an improvement in practice, but in the high-dimensional case () it requires handling complicated details—passing the pointers of the output arrays to descendant nodes, aligning arrays to cache lines, etc. Theoretically, this version does not change the bounds, except for the stack space in Lemma 6.1 when .
We first analyze the cost of square grids of size in the symmetric setting, and will discuss the asymmetric setting later.
The overall stack space for a subtask of size is .
Proof.
The parallel algorithm allocates memory only when partitioning the output (th) dimension. In this case, it allocates space for and computes two subtasks of size , where is the size of the output dimension. This leads to the following recurrence:
The recurrence solves to when , since the recurrence is root-dominated. When , we can apply the version that allocates temporary space for only one subtask, which decreases the constant before to 1 and yields . Note that we only need to analyze one of the branches, since the temporary spaces not allocated by the direct ancestors of this subtask have already been deallocated, and will be reused for later computations in the current branch. ∎
With the lemma, we have the following corollary: {corollary} A subtask of size can be computed within a cache of size .
This corollary indicates that the modified parallel algorithm has the same sequential cache complexity, since it fits into the cache at the same level as the classic algorithm (the only minor difference is that the required cache size increases by a small constant factor). Therefore, we can apply a similar analysis to [46] ( in the paper) to show the following lemma: {lemma} The sequential symmetric cache complexity of the parallel cache-oblivious algorithm to compute a d grid of size is .
Assuming that we can allocate a chunk of memory in constant time, the depth of this approach is simply — levels of recursion, each with depth for the additions [20].
We have shown the parallel depth and symmetric cache complexity. By applying the scheduling theorem in Section 2, we have the following result for parallel symmetric cache complexity. {corollary} The d grid of size can be computed with parallel symmetric cache complexity with private caches, or with a shared cache of size .
Proof.
The total number of steals is upper bounded by [1]. For the private-cache case, the extra I/O cost incurred by a stolen task is no more than the cost to refill the cache from the original thread, which is . Plugging in the depth gives the stated parallel cache complexity. (Discuss the shared-cache case.) ∎
We now analyze the overall space requirement for this algorithm. Lemma 6.1 shows that the extra space required is for sequentially running the parallel algorithm. Naïvely the parallel space requirement is , which can be very large. We now show a better upper bound for the extra space. {lemma} The overall space requirement of the parallel algorithm to compute the d grid is .
Proof.
We analyze the total space allocated across all processors. Lemma 6.1 indicates that if the root of the computation on one processor has an output array of size , then the space requirement for this task is . There are processors in total. There can be at most processors starting their computations with size , with size , and so on, down to processors with size , where . This case maximizes the overall space requirement for processors, which is:
This shows the stated bound. ∎
Combining all results gives the following theorem: {theorem} There exists a cache-oblivious algorithm to compute a d grid of size that requires work, symmetric cache complexity, depth, and main memory size.
Additional space required The following discussion is purely on the practical side and does not affect the theoretical analysis of all the theorems in this paper.
We believe the space requirement for the parallel cache-oblivious algorithm is acceptable, since it is asymptotically the same as that of the most intuitive (non-cache-oblivious) parallel algorithm that partitions the computation into square subtasks, each of size . In practice, it is now easy to fit several terabytes of main memory into a single powerful machine, so the space requirement can usually be satisfied. For example, a Dell PowerEdge R940 has about and its main memory can hold more than integers, while the new NVRAMs will have even more capacity (up to 512GB per DIMM). On such machines, when , the grid needs to contain more than cells to exceed the memory size—such a computation takes too long to run on a single shared-memory machine. For , we need about cells to exceed the main-memory size, which would take weeks to execute even on a highest-end shared-memory machine. Hence, throughout the paper we focus on cache complexity and span. Even if one does want to run such a computation, the following approach slightly changes the algorithm to bound the extra space, as a practical fix.
We can first partition the input dimensions for rounds to bound the largest possible output size by (similar to the case discussed in Section 6.2). Then the overall extra space across all processors is limited to , the same as the input/output size. If needed, the constant in the big-O can also be bounded. Such a change does not affect the cache complexity or the depth, as long as the main-memory size is larger than , where is the cache size. This is because the change of partition order does not affect the recursion depth, and the I/O cost is still dominated by the levels at which the subproblems fit in the cache. In practice, DRAM size is always several orders of magnitude larger than .
6.2 The Asymmetric Case
Algorithm 1 accounts for the read-write asymmetry via some minor changes to the classic cache-oblivious algorithm. Regarding parallelism, the changes in Algorithm 1 only affect the order in which the d grid is partitioned in the recurrence, not the parallel version or the analysis in Section 6.1. As a result, the depth of the parallel variant of Algorithm 1 is also . The extra-space upper bound actually decreases, because the asymmetric algorithm gives higher priority to partitioning the input dimensions, which does not require allocating temporary space. {lemma} The space requirement of Algorithm 1 on processors is .
Proof.
Algorithm 1 first partitions the input dimensions until subtasks are generated; then the algorithm partitions dimensions in turn. If , then each processor requires no more than extra space at any time, so the overall extra space is . Otherwise, the worst case appears when processors work on each of the subtasks. By Lemma 6.1, the extra space is bounded by . Combining the two cases gives the stated bounds. ∎
Lemma 6.2 indicates that Algorithm 1 requires extra space no more than the input/output size asymptotically when , which should always be true in practice.
The challenge arises in scheduling this computation. The scheduling theorem for the asymmetric case [13] constrains the non-leaf stack memory to be of constant size, which conflicts with the parallel version in Section 6.1. This problem can be fixed using Lemma 6.1, which upper-bounds the overall extra memory of one task: the stack-allocated arrays can be moved to the heap. Once a task is stolen, the first allocation reserves a chunk of memory of size on the order of , where is the current output. All subsequent heap-based memory allocations can then be simulated within this chunk. In this manner, the stack memory of each node corresponding to a function call is constant, which allows us to apply the scheduling theorem in [13].
Algorithm 1 with input size requires work, asymmetric cache complexity, and depth to compute a d grid of size .
7 Dynamic Programming Recurrences
In this section we discuss a number of new results on dynamic programming (DP). To show lower and upper bounds on parallelism and cache efficiency in both the symmetric and asymmetric settings, we focus on the specific DP recurrences rather than the problems. We assume each update in the recurrences takes unit cost, just as for the d grid in Section 3.
The goal of this section is to show how the DP recurrences can be viewed as, and decomposed into, d grids. Then the lower and upper bounds discussed in Sections 4 and 5, as well as the analysis of parallelism in Section 6, can be readily applied to the computation of these DP recurrences. When the dimension of the input/output is the same as the number of entries in each grid cell, the sequential and symmetric versions of the algorithms in this section coincide with the existing ones discussed in [46, 34, 36, 31, 77]; the others are new. The asymmetric versions and most parallel versions are also new. We improve the existing results on symmetric/asymmetric cache complexity, as well as on parallel depth.
Symmetric cache complexity We show improved algorithms for a number of problems where the number of entries per cell differs from the dimension of the input/output arrays, namely the GAP recurrence, protein accordion folding, and the RNA recurrence. We show that the previous cache bounds for the GAP recurrence and protein accordion folding are not optimal, and we improve them in Theorems 7.2 and 7.2. For the RNA recurrence, we show an optimal cache complexity of in Theorem 7.2, which improves the best existing result by .
Asymmetric cache complexity By applying the asymmetric version of the d grid computation discussed in Section 5, we give a uniform approach for obtaining write-efficient algorithms for all DP recurrences in this section. We also show the optimality of all these algorithms with regard to asymmetric cache complexity, except for the one for the GAP recurrence.
Parallelism The parallelism of these algorithms is provided by the parallel algorithms discussed in Section 6. Polylogarithmic depth can be achieved in computing the 2-knapsack recurrence, and linear depth for the LWS recurrence and protein accordion folding. Linear depth for LWS can be achieved by previous work [75, 41], but those algorithms are neither race-free nor in the nested-parallel model. Meanwhile, our algorithms are arguably simpler.
7.1 LWS Recurrence
We start with the simple example of the LWS recurrence, for which the optimal sequential upper bound in the symmetric setting is known [34]. We show new results on lower bounds, write-efficient cache-oblivious algorithms, and a new span bound.
The LWS (least-weighted subsequence) recurrence [55] is one of the most commonly used DP recurrences in practice. Given a real-valued function for integers and , for ,
This recurrence is widely used as a textbook algorithm to compute optimal 1D clustering [63], line breaking [64], the longest increasing subsequence, minimum-height B-trees, and many other practical algorithms in molecular biology and geology [50, 51], computational geometry [3], and further applications in [66]. Here we assume that can be computed in constant work from a constant-size input associated with and , which is true for all of these applications. Although special properties of the weight function can lead to specific optimizations, the study of the recurrence itself is interesting, especially regarding cache efficiency and parallelism.
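In its standard form [55], the recurrence reads E(j) = min over 0 ≤ i < j of (E(i) + w(i, j)), with E(0) = 0. As a hedged reference point (naming is ours), a direct quadratic-work evaluation is:

```python
def lws(n, w):
    """Evaluate E(j) = min over 0 <= i < j of E(i) + w(i, j), E(0) = 0.

    w(i, j) is the weight function; each E(j) is a single output entry
    of the underlying grid, updated from all earlier entries E(i)."""
    E = [0] * (n + 1)
    for j in range(1, n + 1):
        E[j] = min(E[i] + w(i, j) for i in range(j))
    return E
```

For example, with the convex weight w(i, j) = (j − i)², the optimum decomposes into unit steps, so E(j) = j.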
We note that the computation of this recurrence is a standard d grid: each cell takes and as input and updates as the output entry, so Theorems 4.1 and 4.2 give lower bounds on the cache complexity of this recurrence (the grid is (1/2)-full).
We now introduce a cache-oblivious implementation that respects the data dependencies. Chowdhury and Ramachandran [34] solve the recurrence with work and symmetric cache complexity. Their algorithm is a simple divide-and-conquer approach, which we describe and extend in terms of d grids. A task of range computes the cells such that . To do so, the algorithm generates two equal-size subtasks and where , solves the first subtask recursively, then computes the cells corresponding to for , and finally solves the second subtask recursively. Note that the middle step also matches a d grid with no dependencies between its cells, which can be solved directly using the algorithms in Section 5. This leads to the following cache complexity and depth:
Here denotes the cost of computing a d grid. The recurrence is root-dominated, with base cases and . This solves to the following theorem.
The LWS recurrence can be computed in work, optimal symmetric and asymmetric cache complexity of and respectively, and depth.
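The divide-and-conquer order just described—solve the left half, apply the dependency-free middle grid, solve the right half—can be sketched as follows (a sequential rendering with our own names; entries other than E[lo] start at infinity and are refined in place):

```python
import math

def lws_dc(E, w, lo, hi):
    """Solve the LWS recurrence on index range [lo, hi), assuming E[lo]
    is already final and E[lo+1..hi-1] hold running minima."""
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    lws_dc(E, w, lo, mid)            # left subtask (finalizes E[lo..mid))
    for j in range(mid, hi):         # middle step: a 2-d grid with
        for i in range(lo, mid):     # no dependencies among its cells
            E[j] = min(E[j], E[i] + w(i, j))
    lws_dc(E, w, mid, hi)            # right subtask

E = [0] + [math.inf] * 4
lws_dc(E, lambda i, j: (j - i) ** 2, 0, 5)   # E becomes [0, 1, 2, 3, 4]
```

Each pair (i, j) is applied exactly once across the recursion, and always after E[i] is final, which is why the middle step can use the grid algorithms directly.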
Notice that Galil and Park [51] noted that a task can be solved in work and depth for , based on recursive matrix multiplication. Since matrix multiplication is a d grid, plugging in the result from Section 8.1 gives a cache complexity of . Thus the cache-oblivious algorithm for the LWS recurrence can switch to this matrix-multiplication-based approach once a task has size at most , which incurs no extra work and decreases the depth, but may potentially increase the I/O cost. More specifically, we have: {corollary} The LWS recurrence can be computed in work,
7.2 GAP Recurrence
We now consider the GAP recurrence, for which both the lower-bound analysis and the new algorithm make use of multiple grid computations. The GAP problem [49, 51] is a generalization of the edit distance problem with many applications in molecular biology, geology, and speech recognition. Given a source string and a target string , besides changing one character in the string, we can apply a sequence of consecutive deletes, corresponding to a gap in , and a sequence of consecutive inserts, corresponding to a gap in . For simplicity we assume here that both strings have length , but the algorithms and analyses can easily be adapted to the more general case. Since the cost of such a gap is not necessarily equal to the sum of the costs of each individual deletion (or insertion) in the gap, we define as the cost of deleting the substring of from the th to the th character, as the corresponding cost of inserting a substring of , and as the cost of changing the th character of to the th character of .
Let be the minimum cost of such a transformation from the prefix of with characters to the prefix of with characters; the recurrence for is:
corresponding to replacing a character, inserting a substring, or deleting a substring. The boundary is set to be , and . The diagonal dependency from does not affect the asymptotic analysis, since it at most doubles the memory footprint, so it does not show up in the following analysis.
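For reference, a direct cubic-work evaluation of the recurrence reads as follows (our own naming; the boundary values here—G[0][0] = 0 and whole-gap costs for empty prefixes—are an assumption consistent with the text):

```python
import math

def gap(n, c, D, I):
    """G[i][j]: min cost to turn the length-i prefix of S into the
    length-j prefix of T. c(i, j): replace S's i-th char by T's j-th;
    D(p, i): delete S[p+1..i] as one gap; I(q, j): insert T[q+1..j]."""
    G = [[math.inf] * (n + 1) for _ in range(n + 1)]
    G[0][0] = 0
    for i in range(1, n + 1):
        G[i][0] = D(0, i)                # delete the whole prefix of S
    for j in range(1, n + 1):
        G[0][j] = I(0, j)                # insert the whole prefix of T
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            G[i][j] = min(
                G[i - 1][j - 1] + c(i, j),                 # replace
                min(G[i][q] + I(q, j) for q in range(j)),  # insert gap
                min(G[p][j] + D(p, i) for p in range(i)),  # delete gap
            )
    return G
```

The second and third arguments of the outer min are the two families of LWS-like recurrences (one per row and one per column) exploited below.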
The best existing algorithms for the GAP recurrence [34, 74] have symmetric cache complexity . This upper bound seems reasonable: to compute , we need as input the two vectors and , which is similar to matrix multiplication and the other algorithms in Section 8. However, as indicated in Section 3, each update in GAP requires only one input entry, while matrix multiplication requires two. Therefore, if we ignore the data dependencies, the first line of the GAP recurrence can be viewed as LWS recurrences, independent of the dimension of (and similarly for the second line). This yields a lower bound on the cache complexity of times that of an LWS recurrence, which is (assuming ). Hence, the gap between the lower and upper bounds is .
We now discuss an I/O-efficient algorithm that narrows this gap. The algorithm is not optimal, but reduces the gap to ; how to remove the low-order term remains an open problem. The new algorithm is similar to Chowdhury and Ramachandran’s approach [34], based on divide-and-conquer over the output . The algorithm recursively partitions into four equal-size quadrants , , and , and first computes recursively. After this is done, it uses the computed values in to update and . Then the algorithm computes and within their own ranges, updates using the results from and , and solves recursively at the end. The high-level idea is shown in Figure 2.
We note that in Steps (b) and (d), the inter-quadrant updates compute LWS recurrences (with no data dependencies), each of size (assuming has size ). Therefore, our new algorithm reorganizes the data layout and the order of computation to take advantage of our I/O-efficient and parallel algorithm on d grids. Since the GAP recurrence has two independent sections, one within a column and the other within a row, we keep two copies of , one in column-major and the other in row-major order. When computing the inter-quadrant updates in Steps (b) and (d), we start parallel tasks, each of size , and compute a d grid on the corresponding row or column, taking the input and output in the appropriate representation. These updates require work and the cache complexity shown in Theorem 7.1. We also need to keep the two copies consistent. After the update of a quadrant or finishes, we apply a matrix transpose [20] to update the other copy of this quadrant, taking as the associative operator , so that the two copies of are consistent before Steps (c) and (e). The cost of the transpose is a lower-order term. For the quadrant , we wait until the two updates from and finish, and then apply the matrix transpose to propagate the values between the two copies. It is easy to check by induction that the values in both copies of a quadrant are up to date at the beginning of each recursion in Steps (c) and (e).
The new algorithm still requires work, since it performs no extra asymptotic work. The cache complexity and depth satisfy:
The coefficients follow from Figure 2. We first discuss the symmetric setting. The base cases are and for . This is a “balanced” recurrence with I/O cost per level for levels, which indicates . The top-level computation is root-dominated, since the overall number of cells in a level decreases by half with every recursion. Therefore, if , then , which is the base-case cost plus the top-level cost. Otherwise, all input/output for each 2-d grid in the inter-quadrant update fits in the cache, so we only pay I/O cost for rounds of recursion, leading to . The asymmetric results follow similarly by plugging in different base cases.
The GAP recurrence can be computed in work, depth, symmetric cache complexity of
and asymmetric cache complexity of
Compared to the previous results [34, 36, 31, 60, 77, 74], the improvement in the symmetric cache complexity is asymptotically (i.e., approaching infinity). For the smaller range of where , the improvement is . (The computation fits entirely in the cache when .)
Similar to the LWS recurrence, Galil and Park [51] indicate that a subtask of size in the recursion can be solved with a matrix-multiplication-based approach, which uses work, depth, and cache complexity by the analysis in Section 5. As a result, if the depth of the previous algorithm is unsatisfactory, we can switch to this version once a subtask has size no more than . {corollary} The GAP recurrence can be computed in work,
Protein accordion folding The recurrence for protein accordion folding [77] is for , with cost to precompute . Although there are some minor differences, from the perspective of the computation structure, the recurrence can essentially be viewed as containing only the first section of the GAP recurrence. As a result, the same lower bounds as for GAP apply to this recurrence.
In terms of the algorithm, we can compute d grids in increasing order of from to , such that the inputs are for and the outputs are for . For short, we refer to one such d grid as a task. However, the input and output arrays are in different dimensions. To handle this, we use a similar method to the GAP algorithm and keep two separate copies of , one in column-major and one in row-major order, used respectively to provide the input and output of each d grid. We apply the transpose in a divide-and-conquer manner—once the first half of the tasks finish, we transpose all computed values from the output matrix to the input matrix (which is a square matrix), and then compute the second half of the tasks. The matrix transposes within the first and second halves are applied recursively on geometrically decreasing sizes. The correctness of this algorithm can be verified by checking the data dependencies: all required values are computed and moved to the correct positions before they are used in further computations.
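The recursive transpose used to reconcile the two copies can be sketched as a standard cache-oblivious transpose [20], with the merge operator applied at the base case (a minimal sketch; names and the base-case threshold are ours):

```python
def transpose_merge(A, B, r0, r1, c0, c1, op=min):
    """Merge A[i][j] into B[j][i] for r0 <= i < r1, c0 <= j < c1,
    recursively splitting the longer side so that every level of the
    memory hierarchy is used well without knowing its parameters."""
    if (r1 - r0) * (c1 - c0) <= 16:       # base case: small tile
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = op(B[j][i], A[i][j])
    elif r1 - r0 >= c1 - c0:
        m = (r0 + r1) // 2
        transpose_merge(A, B, r0, m, c0, c1, op)  # the two halves touch
        transpose_merge(A, B, m, r1, c0, c1, op)  # disjoint parts of B
    else:
        m = (c0 + c1) // 2
        transpose_merge(A, B, r0, r1, c0, m, op)
        transpose_merge(A, B, r0, r1, m, c1, op)
```

The two recursive calls touch disjoint parts of B, so they can also run in parallel, matching the geometrically decreasing transposes described above.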
The cache complexity comes from two subroutines: the computation of the d grids and the matrix transposes. The cost of the d grids is simply upper bounded by times the cost of each task, which gives and for the symmetric and asymmetric cache complexity, and depth. For the matrix transposes, the cost can be verified by the following recurrences:
where indicates the matrix transpose. The base case is and . Applying the bound for matrix transpose [20] proves the following theorem. {theorem} Protein accordion folding can be computed in work, symmetric and asymmetric cache complexity of and
7.3 RNA Recurrence
The RNA problem [51] is a generalization of the GAP problem. In this problem a weight function