Parallel external memory

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine. It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Model

Definition

The PEM model is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of P {\displaystyle P} processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size N {\displaystyle N} and P {\displaystyle P} small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size M {\displaystyle M} which is partitioned in blocks of size B {\displaystyle B}. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size B {\displaystyle B}.

I/O complexity

The complexity measure of the PEM model is the I/O complexity, which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if P {\displaystyle P} processors load parallelly a data block of size B {\displaystyle B} form the main memory into their caches, it is considered as an I/O complexity of O ( 1 ) {\displaystyle O(1)} not O ( P ) {\displaystyle O(P)}. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts occur. Like in the PRAM model, three different variations of this problem are considered:

Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms solve the CREW and EREW problem if P ≤ B {\displaystyle P\leq B} processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of P {\displaystyle P} parallel block transfers. A second approach needs O ( log ⁡ ( P ) ) {\displaystyle O(\log(P))} parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round P {\displaystyle P} processors combine their blocks into P / 2 {\displaystyle P/2} blocks. Then P / 2 {\displaystyle P/2} processors combine the P / 2 {\displaystyle P/2} blocks into P / 4 {\displaystyle P/4}. This procedure is continued until all the data is combined in one block.

Comparison to other models

Model	Multi-core	Cache-aware
Random-access machine (RAM)	No	No
Parallel random-access machine (PRAM)	Yes	No
External memory (EM)	No	Yes
Parallel external memory (PEM)	Yes	Yes

Examples

Multiway partitioning

Let M = { m 1 , . . . , m d − 1 } {\displaystyle M=\{m_{1},...,m_{d-1}\}} be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition of A is a set Π = { A 1 , . . . , A d } {\displaystyle \Pi =\{A_{1},...,A_{d}\}} , where ∪ i = 1 d A i = A {\displaystyle \cup _{i=1}^{d}A_{i}=A} and A i ∩ A j = ∅ {\displaystyle A_{i}\cap A_{j}=\emptyset } for 1 ≤ i < j ≤ d {\displaystyle 1\leq i<j\leq d}. A i {\displaystyle A_{i}} is called the i-th bucket. The number of elements in A i {\displaystyle A_{i}} is greater than m i − 1 {\displaystyle m_{i-1}} and smaller than m i 2 {\displaystyle m_{i}^{2}}. In the following algorithm the input is partitioned into N/P-sized contiguous segments S 1 , . . . , S P {\displaystyle S_{1},...,S_{P}} in main memory. The processor i primarily works on the segment S i {\displaystyle S_{i}}. The multiway partitioning algorithm (PEM_DIST_SORT) uses a PEM prefix sum algorithm to calculate the prefix sum with the optimal O ( N P B + log ⁡ P ) {\displaystyle O\left({\frac {N}{PB}}+\log P\right)} I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.

If the vector of d = O ( M B ) {\displaystyle d=O\left({\frac {M}{B}}\right)} pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with O ( N P B + ⌈ d B ⌉ > log ⁡ ( P ) + d log ⁡ ( B ) ) {\displaystyle O\left({\frac {N}{PB}}+\left\lceil {\frac {d}{B}}\right\rceil >\log(P)+d\log(B)\right)} I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection

The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in O ( log ⁡ N ) {\displaystyle O(\log N)}, and SELECT, which is a cache optimal single-processor selection algorithm.

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:

O ( N P B + log ⁡ ( P B ) ⋅ log ⁡ ( N P ) ) {\displaystyle O\left({\frac {N}{PB}}+\log(PB)\cdot \log({\frac {N}{P}})\right)}

Distribution sort

Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If P = 1 {\displaystyle P=1} the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm is used:

The I/O complexity of PEMDISTSORT is:

O ( ⌈ N P B ⌉ ( log d ⁡ P + log M / B ⁡ N P B ) + f ( N , P , d ) ⋅ log d ⁡ P ) {\displaystyle O\left(\left\lceil {\frac {N}{PB}}\right\rceil \left(\log _{d}P+\log _{M/B}{\frac {N}{PB}}\right)+f(N,P,d)\cdot \log _{d}P\right)}

where

f ( N , P , d ) = O ( log ⁡ P B d log ⁡ N P + ⌈ d B log ⁡ P + d log ⁡ B ⌉ ) {\displaystyle f(N,P,d)=O\left(\log {\frac {PB}{\sqrt {d}}}\log {\frac {N}{P}}+\left\lceil {\frac {\sqrt {d}}{B}}\log P+{\sqrt {d}}\log B\right\rceil \right)}

If the number of processors is chosen that f ( N , P , d ) = O ( ⌈ N P B ⌉ ) {\displaystyle f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)}and M < B O ( 1 ) {\displaystyle M<B^{O(1)}} the I/O complexity is then:

O ( N P B log M / B ⁡ N B ) {\displaystyle O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)}

Other PEM algorithms

PEM Algorithm	I/O complexity	Constraints
Mergesort	O ( N P B log M B ⁡ N B ) = sort P ( N ) {\displaystyle O\left({\frac {N}{PB}}\log _{\frac {M}{B}}{\frac {N}{B}}\right)={\textrm {sort}}_{P}(N)}	P ≤ N B 2 , M = B O ( 1 ) {\displaystyle P\leq {\frac {N}{B^{2}}},M=B^{O(1)}}
List ranking	O ( sort P ( N ) ) {\displaystyle O\left({\textrm {sort}}_{P}(N)\right)}	P ≤ N / B 2 log ⁡ B ⋅ log O ( 1 ) ⁡ N , M = B O ( 1 ) {\displaystyle P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}
Euler tour	O ( sort P ( N ) ) {\displaystyle O\left({\textrm {sort}}_{P}(N)\right)}	P ≤ N B 2 , M = B O ( 1 ) {\displaystyle P\leq {\frac {N}{B^{2}}},M=B^{O(1)}}
Expression tree evaluation	O ( sort P ( N ) ) {\displaystyle O\left({\textrm {sort}}_{P}(N)\right)}	P ≤ N B 2 log ⁡ B ⋅ log O ( 1 ) ⁡ N , M = B O ( 1 ) {\displaystyle P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}
Finding a MST	O ( sort P ( \| V \| ) + sort P ( \| E \| ) log ⁡ \| V \| p B ) {\displaystyle O\left({\textrm {sort}}_{P}(\|V\|)+{\textrm {sort}}_{P}(\|E\|)\log {\tfrac {\|V\|}{pB}}\right)}	p ≤ \| V \| + \| E \| B 2 log ⁡ B ⋅ log O ( 1 ) ⁡ N , M = B O ( 1 ) {\displaystyle p\leq {\frac {\|V\|+\|E\|}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}

Where sort P ( N ) {\displaystyle {\textrm {sort}}_{P}(N)} is the time it takes to sort N items with P processors in the PEM model.