Parallel Algorithms notes from CMSC751 with [http://users.umiacs.umd.edu/~vishkin/index.shtml Uzi Vishkin].<br>
This class is based on his book [[:File:CMSC751_Classnotes.pdf | Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques]]<br>
[http://users.umiacs.umd.edu/~vishkin/TEACHING/ Class Website]<br>
==XMT Language==
XMTC is a single-program multiple-data (SPMD) extension of C.<br>
*Spawn creates threads
*Threads expire at Join
*<code>$</code> represents the number of the thread
*<code>PS Ri Rj</code> is an atomic prefix sum
**Stores Ri + Rj in Ri
**Stores the original value of Ri in Rj
<syntaxhighlight lang="c">
int x = 0;
// Spawn n threads
spawn(0, n-1) {
  int e = 1;
  if (A[$] != 0) {
    // Atomically: e gets the old value of x, and x is incremented by e (1)
    ps(e, x);
  }
}
</syntaxhighlight>
==Models==
===PRAM===
Parallel Random-Access Machine/Model<br>
You're given n synchronous processors, each with local memory and access to a shared memory.<br>
Each processor can write to shared memory, read from shared memory, or do computation in local memory.<br>
You tell each processor what to do at each time step.<br>
====Types of PRAM====
*exclusive-read exclusive-write (EREW)
*concurrent-read exclusive-write (CREW)
*concurrent-read concurrent-write (CRCW)
**Arbitrary CRCW - an arbitrary processor writing succeeds
**Priority CRCW - the lowest numbered processor writing succeeds
**Common CRCW - writing succeeds only if all processors write the same value
====Drawbacks====
*Does not reveal how the algorithm will run on PRAMs with a different number of processors
*Fully specifying allocation requires an unnecessary level of detail
===Work Depth===
You provide a sequence of instructions. At each time step, you specify the number of parallel operations.
====WD-presentation Sufficiency Theorem====
Given an algorithm in WD mode that takes <math>x=x(n)</math> operations and <math>d=d(n)</math> time,
the algorithm can be implemented on any p-processor PRAM in <math>O(x/p + d)</math> time.
;Notes
*Other resources call this Brent's theorem
*<math>x</math> is the work and <math>d</math> is the depth
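The scheduling argument behind the theorem, sketched: if step <math>t</math> of the WD presentation performs <math>x_t</math> operations, then <math>p</math> processors can simulate that step in <math>\lceil x_t/p \rceil</math> rounds, so the total time is
:<math>\sum_{t=1}^{d} \left\lceil \frac{x_t}{p} \right\rceil \leq \sum_{t=1}^{d} \left( \frac{x_t}{p} + 1 \right) = \frac{x}{p} + d.</math>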
===Speedup===
*A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n), the running time of the best serial algorithm.
*A work-optimal parallel algorithm is ''work-time-optimal'' if its time T(n) cannot be improved by another work-optimal algorithm.
*If the serial T(n) is the best known and W(n) matches it, this is called ''linear speedup''
====NC Theory====
Nick's Class Theory
*Good serial algorithm: polynomial time
*Good parallel algorithm: poly-log <math>O(\log^c n)</math> time, polynomially many processors
==Technique: Balanced Binary Trees==
Example Applications:
*Prefix sum
*Largest element
*Nearest-one element
*Compaction
;Inputs
*Array A[1..n] of elements
*Associative binary operation, denoted * (e.g. addition, multiplication, min, max)
;Outputs
*Array B containing B(i) = A(1) * ... * A(i)
;Algorithm
<math>O(n)</math> work and <math>O(\log n)</math> time
<pre>
for i, 1 <= i <= n pardo
  B(0, i) = A(i)
// Summation step up the tree
for h = 1 to log n
  for i, 1 <= i <= n/2^h pardo
    B(h, i) = B(h-1, 2i-1) * B(h-1, 2i)
// Down the tree
for h = log n to 0
  for i, 1 <= i <= n/2^h pardo
    if i == 1
      C(h, i) = B(h, i)
    else if i even
      C(h, i) = C(h+1, i/2)
    else // i odd, 3 <= i
      C(h, i) = C(h+1, (i-1)/2) * B(h, i)
return C(0, :)
</pre>
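Below is a serial C sketch of the two sweeps (my own simulation of the pardo loops; it assumes n is a power of two and takes the operation * to be addition):
<syntaxhighlight lang="c">
/* Serial sketch of the balanced-binary-tree prefix-sum algorithm. */
#include <stdio.h>

#define LOGN 3
#define N (1 << LOGN)

int B[LOGN + 1][N + 1]; /* B[h][i]: sum of the 2^h leaves under node (h,i) */
int C[LOGN + 1][N + 1]; /* C[h][i]: prefix sum ending at node (h,i) */

int main(void) {
    int A[N + 1] = {0, 3, 1, 4, 1, 5, 9, 2, 6}; /* 1-indexed input */
    for (int i = 1; i <= N; i++) B[0][i] = A[i];
    /* Up the tree: each level halves the number of nodes */
    for (int h = 1; h <= LOGN; h++)
        for (int i = 1; i <= (N >> h); i++)
            B[h][i] = B[h - 1][2 * i - 1] + B[h - 1][2 * i];
    /* Down the tree */
    for (int h = LOGN; h >= 0; h--)
        for (int i = 1; i <= (N >> h); i++) {
            if (i == 1)          C[h][i] = B[h][i];
            else if (i % 2 == 0) C[h][i] = C[h + 1][i / 2];
            else                 C[h][i] = C[h + 1][(i - 1) / 2] + B[h][i];
        }
    for (int i = 1; i <= N; i++) printf("%d ", C[0][i]); /* 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}
</syntaxhighlight>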
==Technique: Merge Sorting Cluster==
===Technique: Partitioning===
*Merging: Given two sorted arrays, A and B, create a sorted array C consisting of the elements of A and B
*Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>
**Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.
;Equivalence of Merging and Ranking
*Given an algorithm for merging: if A(i) belongs in C(j), then Rank(A(i), B) = j-i
*Given an algorithm for ranking: A(i) belongs in C(j) where j = Rank(A(i), B) + i
;Naive algorithm
Apply binary search elementwise on a concurrent-read system.<br>
<math>O(\log n)</math> time, <math>O(n\log n)</math> work
<pre>
for 1 <= i <= n pardo
  Rank(A(i), B)
  Rank(B(i), A)
</pre>
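As a concrete illustration of the merging/ranking equivalence, here is a serial C sketch of the naive algorithm (my own code; it simulates the pardo loops and assumes the elements of A and B are pairwise distinct):
<syntaxhighlight lang="c">
/* Merging via ranking: each element is placed at
   (rank in the other array) + (own index). */
#include <stdio.h>

/* Rank(x, A): number of elements of A[0..n-1] that are <= x (0-indexed). */
static int rank_of(int x, const int *A, int n) {
    int lo = 0, hi = n; /* binary search: O(log n) */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (A[mid] <= x) lo = mid + 1; else hi = mid;
    }
    return lo;
}

int main(void) {
    int A[] = {1, 4, 6, 9}, B[] = {2, 3, 5, 7}, C[8];
    int n = 4;
    for (int i = 0; i < n; i++)  /* pardo over A */
        C[i + rank_of(A[i], B, n)] = A[i];
    for (int i = 0; i < n; i++)  /* pardo over B */
        C[i + rank_of(B[i], A, n)] = B[i];
    for (int i = 0; i < 2 * n; i++) printf("%d ", C[i]); /* 1 2 3 4 5 6 7 9 */
    printf("\n");
    return 0;
}
</syntaxhighlight>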
;Partitioning paradigm
Input size: n
*Partition the input into p jobs of size approx. n/p
*Do the small jobs concurrently, using a separate (possibly serial) algorithm for each.
Example: Merging with total work O(n) and total time O(log n)
<pre>
// Partitioning: rank every (n/p)-th element of each array in the other
for 1 <= i <= p pardo
  b(i) = Rank(A((n/p)(i-1)+1), B)
  a(i) = Rank(B((n/p)(i-1)+1), A)
// Each of the 2p selected elements starts a slice.
// Every slice has size <= 2n/p.
// E.g. slice size is 2 log(n) if p = n/log(n).

// Work
for 1 <= i <= p pardo
  merge the slice of A starting at (n/p)(i-1)+1 with B starting at b(i)+1
  merge the slice of B starting at (n/p)(i-1)+1 with A starting at a(i)+1
</pre>
;Notes
*We do not need to compute the slice ends. Just continue merging until we hit an index j with j mod (n/p) == 0 in A or B.
===Technique: Divide and Conquer===
;Merge Sort
Input: Array A[1..n]<br>
Complexity:<br>
Time: <math>T(n) \leq T(n/2) + \alpha \log n</math><br>
Work: <math>W(n) \leq 2 W(n/2) + \beta n</math><br>
Time: <math>O(\log^2 n)</math>. Work: <math>O(n \log n)</math>
<pre>
MergeSort(A, B):
  if n == 1
    return B(1) = A(1)
  call in parallel:
    MergeSort(A[1:n/2], C[1:n/2])
    MergeSort(A[n/2+1:n], C[n/2+1:n])
  Merge C[1:n/2] and C[n/2+1:n] into B using the O(log n) time merging algorithm
</pre>
==Technique: Informal Work-Depth (IWD) and Accelerating Cascades==
===Technique: Accelerating Cascades===
Consider: for a problem of size n, there are two parallel algorithms.<br>
*Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time
*Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.
Assume Algo A is a "reducing algorithm": each iteration reduces an instance to a smaller instance of the same problem.<br>
We start with Algo A until the size is below some threshold, then switch to Algo B.
E.g.
*Algo 1 runs in <math>O(\log n)</math> iterations, each taking <math>O(\log m)</math> time and <math>O(m)</math> work on an instance of size m. Each iteration reduces the size from m to (3/4)m.
*Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.
*Run Algo 1 for <math>O(\log \log n)</math> rounds, then run Algo 2.
**Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>
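To see where the threshold comes from: after <math>k</math> rounds of Algo 1, the instance size is at most <math>(3/4)^k n</math>, so <math>k = O(\log\log n)</math> rounds suffice to reach size <math>n/\log n</math>. Algo 2 on an instance of size <math>n/\log n</math> then takes <math>O(\log n)</math> time and <math>O((n/\log n)\log n) = O(n)</math> work, while the rounds of Algo 1 take <math>O(\log n)</math> time each and geometrically decreasing work totaling <math>O(n)</math>.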
===Problem: Selection===
;Algorithm 1
*Partition the elements into rows of size <math>\log n</math>
*For each row, find the median within the row
*Find the median of medians (MoM) in <math>O(n)</math>
*Put all rows with median <= MoM above and all rows with median >= MoM below
*Now at least <math>m/4</math> elements are <= MoM and at least <math>m/4</math> are >= MoM, so a quarter of the elements can be discarded
**Known as the reducing lemma
This algorithm solves the selection problem in <math>O(\log^2 n)</math> time and <math>O(n)</math> work.
;Accelerating Cascades
*What we have:
**Algorithm 1 has <math>O(\log n)</math> iterations.
::Each iteration reduces an instance of size m in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>
**Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.
*Step 1: Use Algorithm 1 to reduce the instance from size n to <math>n / \log n</math>
*Step 2: Apply Algorithm 2
Total time is <math>O(\log n \log\log n)</math> and total work is <math>O(n)</math>
===Informal Work-Depth (IWD)===
At each time unit there is a ''set'' containing a number of instructions to be performed concurrently.<br>
==Integer Sorting==
There is a theorem that sorting with only comparisons takes at least <math>\Omega(n\log n)</math> operations in the worst case.<br>
Input: Array A[1..n] of integers in the range [0..r-1]<br>
Sorting: rank from smallest to largest<br>
Assume n is divisible by r (e.g. <math>r=\sqrt{n}</math>)
===Algorithm===
*Partition A into n/r subarrays <math>B_1,\ldots,B_{n/r}</math>
**Sort each subarray separately using serial bucket sort (counting sort)
**Compute number(v,s) = the number of elements of value v in <math>B_s</math>, for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>
**Compute serial(i) = the number of elements j in the same block <math>B_s</math> with A(j)=A(i) and <math>j < i</math>
*For each value v, run a prefix sum on number(v,1), ..., number(v,n/r) into ps(v,1), ..., ps(v,n/r)
*Compute prefix sums of cardinality(0), ..., cardinality(r-1) into global_ps(0), ..., global_ps(r-1), where cardinality(v) is the total number of elements with value v
*The rank of element <math>i</math> with value v in block s is <math>1+serial(i)+ps(v,s-1)+global\_ps(v-1)</math>, taking ps(v,0) and global_ps(-1) to be 0
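Here is a serial C sketch of these steps (my own 0-indexed simulation with n=16 and r=4; the per-block counting sort is implicit in the final rank computation):
<syntaxhighlight lang="c">
#include <stdio.h>

#define N 16
#define R 4              /* r = sqrt(n) */
#define BLOCKS (N / R)   /* n/r blocks, each of size r */

int main(void) {
    int A[N] = {3, 1, 0, 2, 2, 2, 1, 0, 3, 3, 0, 1, 2, 0, 1, 3};
    int number[R][BLOCKS] = {0}; /* number[v][s]: count of value v in block s */
    int serial[N];               /* # of earlier equal elements in own block */
    /* Step 1: per-block counts and serial(i) (pardo over blocks) */
    for (int i = 0; i < N; i++) {
        int s = i / R;
        serial[i] = number[A[i]][s]++;
    }
    /* Step 2: prefix sums over blocks, per value (pardo over values) */
    int ps[R][BLOCKS];           /* ps[v][s]: # of value v in blocks 0..s-1 */
    int cardinality[R];
    for (int v = 0; v < R; v++) {
        int sum = 0;
        for (int s = 0; s < BLOCKS; s++) { ps[v][s] = sum; sum += number[v][s]; }
        cardinality[v] = sum;
    }
    /* Step 3: global prefix sums over values */
    int global_ps[R], total = 0;
    for (int v = 0; v < R; v++) { global_ps[v] = total; total += cardinality[v]; }
    /* Step 4: place each element at its rank (pardo over elements) */
    int out[N];
    for (int i = 0; i < N; i++) {
        int v = A[i], s = i / R;
        out[global_ps[v] + ps[v][s] + serial[i]] = A[i];
    }
    for (int i = 0; i < N; i++) printf("%d ", out[i]); /* 0 0 0 0 1 1 ... 3 3 */
    printf("\n");
    return 0;
}
</syntaxhighlight>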
===Complexity===
*Step 1: <math>T=O(r),\;W=O(r)</math> per subarray.
**Total: <math>T=O(r),\; W=O(n)</math>
*Step 2: <math>r</math> computations, each <math>T=O(\log(n/r)),\; W=O(n/r)</math>
**Total: <math>T=O(\log n),\; W=O(n)</math>
*Step 3: <math>T=O(\log r),\; W=O(r)</math>
*Step 4: <math>T=O(1),\; W=O(n)</math>
*Total: <math>T=O(r + \log n),\; W=O(n)</math>
;Notes
*The running time is not poly-log
===Theorems===
*The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work
*The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>
This is radix sort using the basic integer sort (BIS) algorithm:<br>
if your range is 0 to n and your radix is <math>\sqrt{n}</math>, then you will need <math>\log_{\sqrt{n}}(n) = 2</math> rounds.
==2-3 trees; Technique: Pipelining==
Dictionary: Search, insert, delete<br>
Problem: How to parallelize to handle batches of queries<br>
===2-3 tree===
A 2-3 tree is a rooted tree with the following properties:
*Each internal node has 2-3 ordered children
*For any internal node, every directed path to a leaf is the same length
Types of queries:
*''search(a)''
*''insert(a)''
*''delete(a)''
====Serial Algorithms====
*Deletion: discard(a)
**Delete the connection between a and the parent of a
**If the parent now has 1 child
***If the parent's sibling has 2 children, move the child to the sibling and discard(parent)
***If the parent's sibling has 3 children, take one child from the sibling
====Parallel Algorithms====
Assume a concurrent-read model
*Insert: Suppose we want to insert sorted elements <math>c_1,\ldots,c_k</math> into a tree with n elements.
**Insert element <math>c_{k/2}</math>.
**Then insert in parallel <math>(c_1,\ldots,c_{k/2-1})</math> and <math>(c_{k/2+1},\ldots,c_k)</math>
**Time is <math>O(\log k \log n)</math> using <math>k</math> processors.
*Deletion: Suppose we want to delete sorted elements <math>c_1,\ldots,c_k</math>
**Backwards Insertion
**for t=0 to <math>\log k</math>
***if <math>i \equiv 2^t \pmod{2^{t+1}}</math>
****discard(<math>c_i</math>)
**Time is <math>O(\log n \log k)</math> without pipelining
**With pipelining, the complexity is <math>O(\log n + \log k)</math>
===Pipelining===
*There are <math>\log k</math> waves of the absorb procedure.
*Apply pipelining to make the time <math>O(\log n + \log k)</math>
==Maximum Finding==
Given an array A=A(1),...,A(n), find the largest element in the array<br>
===Constant time, <math>O(n^2)</math> Work===
*Compare every pair of elements in A[1..n]
<pre>
for i, 1 <= i <= n pardo
  B(i) = 0
for i,j, 1 <= i < j <= n pardo
  if A(i) <= A(j)
    B(i) = 1
  else
    B(j) = 1
for i, 1 <= i <= n pardo
  if B(i) == 0
    A(i) is the max
</pre>
===<math>O(\log \log n)</math> time and <math>O(n\log \log n)</math> work algorithm===
*Split A into <math>\sqrt{n}</math> subarrays of size <math>\sqrt{n}</math>
*Find the max of each subarray recursively
*Find the max of the <math>\sqrt{n}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work using the constant-time algorithm
Complexity
*<math>T(n) \leq T(\sqrt{n}) + c_1</math>
*<math>W(n) \leq \sqrt{n}\,W(\sqrt{n}) + c_2 n</math>
*<math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>
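To see why <math>T(n) = O(\log \log n)</math>: each level of the recursion replaces <math>n</math> by <math>\sqrt{n}</math>, so after <math>k</math> levels the instance size is <math>n^{1/2^k}</math>, which reaches a constant when <math>2^k \approx \log n</math>, i.e. after <math>k = O(\log \log n)</math> levels of constant time each.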
===<math>O(\log \log n)</math> time and <math>O(n)</math> work===
*Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.
**Apply the serial linear-time algorithm to find the maximum of each block
**<math>O(\log \log n)</math> time and <math>O(n)</math> work.
*Step 2: Apply the doubly-log algorithm to the <math>n / \log \log n</math> maxima.
**<math>O(\log \log n)</math> time and <math>O(n)</math> work.
===Random Sampling===
Maximum finding in <math>O(1)</math> time and <math>O(n)</math> work with very high probability
*Step 1: Fill an auxiliary array B of size <math>n^{7/8}</math> with elements drawn independently at random from A
*Step 2: Find the max <math>m</math> of array B in <math>O(1)</math> time and <math>O(n)</math> work
**Last 3 pulses of the recursive doubly-log time algorithm:
**Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.
**Pulse 2: The <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each. Find the max of each block.
**Pulse 3: Find the max m of the <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.
*Step 3: While there is an element larger than m, throw the elements larger than m into an array of size <math>n^{7/8}</math>.
:Compute the maximum of the new array.
;Complexity
*Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work
*Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work
*Each iteration of Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work
*With high probability, we only need one iteration (see the theorem below), so the time is <math>O(1)</math> and the work is <math>O(n)</math>
====Theorem 8.2====
The algorithm finds the maximum among <math>n</math> elements.
With very high probability it runs in <math>O(1)</math> time and <math>O(n)</math> work.
The probability of not finishing within this time is <math>O(1/n^c)</math> for some positive constant c.
{{hidden | Proof |
See page 58 of the classnotes.<br>
}}
==List Ranking Cluster==
Techniques: Euler tours, pointer jumping, randomized and deterministic symmetry breaking
===Rooting a tree===
Technique: Euler tours
;Inputs
*Vector of vertices, each pointing to an index in the edge vector
*Vector of edges, e.g. (1,2)
*Vector of pointers to identical edges, e.g. index of (1,2) -> index of (2,1)
*A vertex <math>r</math>
;Goal
*Select a direction for each edge to make our graph a tree with root <math>r</math>
;Algorithm
*Step 1: For every edge <math>(u,v)</math>, add two directed edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>
*Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge out of <math>v</math>, <math>next(u \rightarrow v)</math>
**If our edge is 1 -> 2, then the next edge is the edge immediately after 2 -> 1 in the edge array
**If 2 -> 1 is the last edge out of 2, then the next edge is the first edge out of 2
**Now we have an Euler cycle (or Euler tour) among the edges
*Step 3: Set <math>next(u_{r,1} \rightarrow r) = NIL</math>
**The edge <math>u_{r,1} \rightarrow r</math> is the antiparallel twin of the first edge out of <math>r</math>
*Step 4: Apply a list ranking algorithm to the edges.
*Step 5: Of every pair of antiparallel edges, delete the one with the lower ranking
;Notes
*Every step except Step 4 (list ranking) is a local operation taking constant time and linear work
*Requires a list ranking algorithm
====Preorder Numbering====
Define: tree edges point away from the root; non-tree edges point toward the root
*Step 1: For every edge e pardo: if e is a tree edge, distance(e) = 1, else distance(e) = 0
*Step 2: List ranking
*Step 3: For every tree edge e: u->v, preorder(v) = n - distance(e) + 1, where distance(e) is now the result of list ranking
===Technique: Pointer Jumping===
List Ranking<br>
Input: A dense array A. Each element except the last has a pointer to its successor in the list.<br>
Each element, except the first, is the successor of exactly one element.<br>
''rank(i)=0'' for the last element; ''rank(i) = rank(next(i)) + 1'' for the rest<br>
The last element is called the "cashier".<br>
Idea:<br>
At iteration 0, node i points to node i+1<br>
At iteration 1, node i points to node i+2<br>
At iteration 2, node i points to node i+4...
<pre>
for i, 1 <= i <= n pardo:
  if next[i] == NIL then distance[i] = 0 else distance[i] = 1
  for k = 1 to log n:
    if next[i] != NIL:
      distance[i] = distance[i] + distance[next[i]]
      next[i] = next[next[i]]
</pre>
Correctness:<br>
*Each element will eventually point to the cashier
*The distance is always correct:
**If next[i] == NIL, then distance[i] is the distance to the cashier
**Otherwise, distance[i] is <math>2^k</math>
Complexity: <math>O(\log n)</math> time and <math>O(n\log n)</math> work
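Here is a serial C sketch of pointer jumping (my own code; the synchronous pardo semantics are simulated with a snapshot copy so that all updates in a round read the previous round's values):
<syntaxhighlight lang="c">
#include <stdio.h>
#include <string.h>

#define N 8
#define NIL -1

int main(void) {
    /* next[i] is the successor of i; the list here is 0 -> 1 -> ... -> 7,
       so the ranks (distances to the cashier) are 7, 6, ..., 0. */
    int next[N], dist[N];
    for (int i = 0; i < N; i++) next[i] = (i + 1 < N) ? i + 1 : NIL;
    for (int i = 0; i < N; i++) dist[i] = (next[i] == NIL) ? 0 : 1;
    /* log n rounds of jumping */
    for (int k = 0; (1 << k) < N; k++) {
        int next_old[N], dist_old[N];
        memcpy(next_old, next, sizeof next);
        memcpy(dist_old, dist, sizeof dist);
        for (int i = 0; i < N; i++) {       /* pardo */
            if (next_old[i] != NIL) {
                dist[i] = dist_old[i] + dist_old[next_old[i]];
                next[i] = next_old[next_old[i]];
            }
        }
    }
    for (int i = 0; i < N; i++) printf("rank(%d) = %d\n", i, dist[i]);
    return 0;
}
</syntaxhighlight>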
===Work-optimal List Ranking===
*From our list A, pull a subset S with uniform density and sparsity
*Remove S from A, creating a new list <math>A-S</math>.
**For each element i in <math>S</math>, fix the next and distance of prev(i)
**Run compaction
*Recurse on <math>A-S</math> until the size is <math>\leq n/\log n</math>. Then use pointer jumping.
*Add <math>S</math> back into <math>A-S</math>
;Complexity
We get the same complexity for randomized and deterministic list ranking (see below).<br>
Randomized achieves this complexity with high probability.
*<math>O(\log n \log \log n)</math> time
*<math>O(n)</math> work
Using a parallel prefix-sum algorithm which runs in <math>O(\log n/\log \log n)</math> time,
we can get list ranking in <math>O(\log n)</math> time and <math>O(n)</math> work.
===Randomized Symmetry Breaking===
<pre>
for i, 1 <= i <= n pardo
  with equal probability R(i) = HEAD or TAIL
  if R(i) == HEAD and R(next(i)) == TAIL
    s(i) = 0
  else
    s(i) = 1
</pre>
;Notes
*The full randomized list ranking runs in <math>O(\log n \log\log n)</math> time and <math>O(n)</math> work with high probability
===Deterministic Symmetry Breaking===
*Assign each element a unique tag, such as its index in the array
*Consider the tags in binary. For each element i, tag(i) != tag(next(i)).
*Let k be the index of the rightmost bit (where bit 0 is the rightmost) at which tag(i) and tag(next(i)) differ
*Let b be the value of bit k in tag(i)
*Set the new tag to <code>(k<<1) + b</code>, i.e. 2k + b
This algorithm takes <math>O(1)</math> time and <math>O(n)</math> work per iteration.
If the range of tags before an iteration is <math>[0,\ldots,n-1]</math>, then the range of tags after is
<math>[0,\ldots,2\log n - 1]</math>.
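A small C sketch of one tag update (my own code; <code>__builtin_ctz</code> is a GCC/Clang builtin returning the index of the lowest set bit):
<syntaxhighlight lang="c">
#include <stdio.h>

/* New tag is 2k + b, where k is the lowest bit position at which
   tag(i) and tag(next(i)) differ and b is that bit of tag(i). */
static int new_tag(int tag_i, int tag_next) {
    int k = __builtin_ctz(tag_i ^ tag_next); /* rightmost differing bit */
    int b = (tag_i >> k) & 1;
    return (k << 1) + b;
}

int main(void) {
    /* E.g., tags 12 (1100b) and 10 (1010b) first differ at bit 1,
       and bit 1 of 12 is 0, so the new tag is 2*1 + 0 = 2. */
    printf("%d\n", new_tag(12, 10)); /* prints 2 */
    return 0;
}
</syntaxhighlight>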
===r-ruling set===
An r-ruling set is a subset <math>S</math> of a linked list satisfying two properties:
*''Uniform Density'': If node <math>i</math> is not in <math>S</math>, then one of the <math>r</math> nodes following i will be in <math>S</math>
*''Uniform Sparsity'': If node <math>i</math> is in <math>S</math>, then node <math>next(i)</math> is not in <math>S</math>
;Algorithm 1
The following algorithm gives us an r-ruling set with <math>r=2\log n - 1</math>:
#Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,\ldots,2\log n - 1]</math>
#For each element i in parallel, if i is a local maximum (its tag is larger than its neighbors' tags), add it to <math>S</math>
;Optimal-Work 2-Ruling Set
<pre>
Apply one iteration of deterministic coin tossing
Sort the elements by tag using integer sort
Initialize array S with 1s
for k = 0 to 2*log(n)-1:
  for i such that tag(i) == k pardo
    if S(prev(i)) == S(next(i)) == 1:
      S(i) = 0
</pre>
==Tree Contraction==
===Serial Rake===
Consider a node y with left leaf x and right subtree z.<br>
Rake(x) deletes x and y, setting parent(z) to the previous parent of y.
Observation 1: Applying a rake to a binary tree yields a binary tree
===Parallel Rake===
Rake all leaves such that:
*No two leaves have the same parent (stem)
*The parent of a stem cannot be the stem of another leaf
Observation 2: Applying a legal parallel rake to a binary tree yields a binary tree
;Parallel Rake Contraction Scheme
Apply legal parallel rakes until the tree becomes a 3-node binary tree
;Parallel Rake Contraction Algorithm
;Step 1
*Number the leaves according to a DFS using the Euler tour algorithm.<br>
Observation 3:
Consider the leaves whose numbers are odd, L.<br>
Let S consist of the stems of the leaves in L.<br>
Then each node in S can be adjacent to at most one other node in S.<br>
Proof: See the classnotes
;Step 2
*Let <math>S_1</math> be the nodes in <math>S</math> whose parents are not in <math>S</math>.
*Let <math>L_1</math> be the leaves in <math>L</math> whose parents are in <math>S_1</math>.
*Apply parallel rakes to the leaves in <math>L_1</math>, unless the parent is the root.
*Apply parallel rakes to the leaves in <math>L - L_1</math>, unless the parent is the root.
*Repeat Step 2 until the tree is a 3-node binary tree
===Complexity===
Step 2 takes <math>O(\log n)</math> rounds, and each leaf gets raked only once.<br>
In total we take <math>O(\log n)</math> time and <math>O(n)</math> work.
===Evaluating an Arithmetic Expression===
Consider a binary tree where each internal node is an operator,
either addition or multiplication,<br>
and each leaf is a number.<br>
Assign each internal node 4 numbers: a, b, c, d,<br>
initialized to (1,0,1,0).<br>
The internal node's value is <math>(a \cdot l+b) \times (c \cdot r+d)</math>, where <math>\times</math> is the node's operation and l, r are the values of the left and right children.<br>
When we rake a leaf and its parent (stem), we modify these variables of the stem's parent so that the tree's value is preserved.
===Tree Binarization===
;Note: This is not in the classnotes, but has been covered in exams.
To apply the above techniques to general trees or k-ary trees, you should binarize your k-ary tree.<br>
This is done as follows:
*Suppose node <math>x</math> has <math>n > 2</math> children
*Select one child to keep
*Create a node <math>x'</math>
*Move the other <math>n-1</math> children to <math>x'</math>
*Set <math>x'</math> to be a child of <math>x</math>
*Repeat until every node has at most 2 children
Complexity
*The time is <math>O(max\_degree)</math> and the work is <math>O(n)</math>
==Graph Connectivity==
===Preliminaries===
For parallel algorithms, a graph can be represented as an incidence list or an adjacency matrix.
An incidence list is an array of edges ordered by starting point and ending point.
An array of vertices points to the starting index of each vertex's outgoing edges within the edge array.
I.e. edges with index in [v[i], v[i+1]) start at vertex i.
Definitions:
*Pointer Graph
*Supervertices
*The supervertex graph
*Hookings
*Parallel pointer jumping
===A first connectivity algorithm===
The first connectivity algorithm consists of <math>\log n</math> iterations of the following two steps:
*Hookings - each root hooks itself onto the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.
*Parallel pointer jumping - each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.
;Theorems
#The pointer graph always consists of rooted trees.
#After <math>\log n</math> iterations, all vertices are contained in rooted stars that have quit.
#:Each connected component is represented by a single rooted star.
;Complexity
*We need <math>O(\log n)</math> iterations
**Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work
**Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work
*For an adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work, since we have a processor per each of the <math>n^2</math> possible edges.
*For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work, since we have a processor per edge.
===A second connectivity algorithm===
The second connectivity algorithm consists of <math>O(\log n)</math> iterations.<br>
Each iteration takes constant time.
#Probe quitting: each rooted star whose supervertex is not adjacent to any other supervertex quits
#Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf
#*Note that this is not unique. Requires Arbitrary CRCW.
#Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks onto another vertex in another supervertex which is not a leaf
#Parallel pointer jumping: one round of parallel pointer jumping
;Theorems
*The pointer graph always consists of rooted trees.
*For every vertex which is not a leaf, <math>D(v) \leq v</math>
{{hidden | Proof |
*This is true after steps 1, 2, and 4.
**This follows immediately for 1 and 2.
**After one round of pointer jumping (4) following 3, everyone from supervertex <math>r</math> is a leaf due to the claim below.
*Consider a root <math>r</math> which hooks onto someone larger in step 3.
*:After Step 3, all children of <math>r</math> are leaves.
*:I.e. no one else will hook onto <math>r</math> during Step 3.
**If anyone was larger, <math>r</math> would have hooked onto them in Step 2.
**If anyone was smaller, they would have hooked onto <math>r</math> in Step 2.
}}
;Complexity
*Time: <math>O(\log n)</math>
*Work: <math>O((n+m)\log n)</math>
;Notes
*Without step 3, the algorithm will run in <math>O(n)</math> time instead of <math>O(\log n)</math>, but the result will still be valid.
*In step 2, you can hook rooted trees in addition to rooted stars for better empirical performance
{{hidden | Proof |
Let <math>h(T)</math> be the height of tree T, where a two-layer tree has height 1.<br>
Define a single node to have height 1.<br>
Let <math>H</math> be the sum of heights of all trees and <math>\bar{H}</math> be the total height after one iteration.
*Claim: <math>\bar{H} \leq 2H/3</math>
}}
===Minimum Spanning Forest===
====The MSF Theorem====
Let <math>G(V,E)</math> be a weighted graph and let <math>U</math> and <math>V-U</math> be two non-empty subsets of <math>V</math>.
Consider the set <math>H=\{(u,v) \mid u \in U, v \in V-U\}</math> of edges connecting points in <math>U</math> and <math>V-U</math>.
If <math>e</math> is the edge of minimum weight in <math>H</math>, then <math>e</math> is in the MSF of <math>G</math>.
====A first MSF algorithm====
*Sort the edges by weight and assign a processor to each edge. Using a Priority CRCW, the edge with minimum weight takes priority during writes.
Each round consists of hooking and <math>\log n</math> rounds of parallel pointer jumping:
*Each root finds the minimum-weight edge to another supervertex and hooks itself onto that root.
*:This is automatic from the Priority CRCW.
====A second MSF algorithm====
We replace steps 2 and 3 in the connectivity algorithm with the simple rule:
*Hook rooted stars using the edge with the smallest weight.
;Notes
*The edge weights along any path up a tree are monotonically increasing, so there are no cycles.
===Biconnectivity===
This section is not in the textbook. It is from Tarjan and Vishkin (1985)<ref name="tarjan1985biconnectivity">Robert E. Tarjan and Uzi Vishkin. ''An Efficient Parallel Biconnectivity Algorithm'' (1985). SIAM Journal on Computing. DOI: [https://doi.org/10.1137/0214061 10.1137/0214061] [https://epubs.siam.org/doi/10.1137/0214061 https://epubs.siam.org/doi/10.1137/0214061] [https://www.researchgate.net/publication/220617428_An_Efficient_Parallel_Biconnectivity_Algorithm Mirror]</ref>.
A graph is biconnected if for every two vertices <math>v_1</math> and <math>v_2</math>, there is a simple cycle containing <math>v_1</math> and <math>v_2</math>.
Intuitively, this means that for any two vertices, there are at least two vertex-disjoint paths from <math>v_1</math> to <math>v_2</math>. If any vertex and its edges are removed from a biconnected graph, it remains connected.
Every connected graph consists of biconnected components: sets of edges such that every two edges in a set lie on a common simple cycle.
Vertices which connect two biconnected components are called ''articulation points''.
If such a point is removed, the two biconnected components are no longer connected.
;Algorithm
Assume we are given a connected graph <math>G</math>.
#First build a spanning tree <math>T</math>. Record which edges are in the spanning tree. Root the spanning tree.
#Compute the preorder number of every vertex using an Euler tour.
#For every vertex <math>v</math>, calculate the high and low preorder numbers in the subtree <math>T(v)</math>
#Create the auxiliary graph <math>G'' = (V'', E'')</math> as follows:
#*All edges of <math>G</math> are vertices in <math>G''</math>
#*For each edge <math>(v, w)</math> in <math>G - T</math>, add edge <math>((p(v),v),(p(w),w))</math> to <math>G''</math> iff <math>v</math> and <math>w</math> are unrelated in <math>T</math>
#*For each edge <math>(v = p(w), w)</math> in <math>T</math>, add edge <math>((p(v), v), (v, w))</math> to <math>G''</math> iff <math>low(w) < v</math> or <math>high(w) \geq v + size(v)</math>
#Compute the connected components of <math>G''</math>
#Assign the edges of <math>G</math> to biconnected components based on their connected components in <math>G''</math>.
;Complexity
*<math>O(\log n)</math> time
==Tricks==
===Removing duplicates in an array===
Assume you have an Arbitrary CRCW.
#Given array A of size n with entries 0 to n-1
#For each entry pardo
#*Write B[A[i]] = i
#*Only one write will succeed for each unique A[i]
#*Check if B[A[i]] == i; entry i is kept iff its write succeeded
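A serial C sketch of the trick (my own code; serially, "last write wins" plays the role of the arbitrary winning processor):
<syntaxhighlight lang="c">
#include <stdio.h>

#define N 8

int main(void) {
    int A[N] = {3, 1, 3, 0, 1, 4, 3, 0};
    int B[N];
    for (int i = 0; i < N; i++) {      /* pardo: concurrent writes */
        B[A[i]] = i;
    }
    for (int i = 0; i < N; i++) {      /* pardo: check whose write won */
        if (B[A[i]] == i)
            printf("keep A[%d] = %d\n", i, A[i]); /* one index per value */
    }
    return 0;
}
</syntaxhighlight>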
===Compaction of an <math>n^2</math> array with <math>\leq n</math> elements in <math>O(\log n)</math> time===
*Split our <math>n^2</math> array into n subarrays, each of size <math>n</math>
*Run <math>n</math> parallel compactions, one per subarray
*Perform one prefix sum over the sizes of the subarrays
*Place each element at (total size of previous subarrays + its position in its own subarray)
The benefit of this segmented compaction/prefix sum is that, in an iterative algorithm such as parallel BFS where each segment is only processed once, you can get <math>O(n^2)</math> or <math>O(m)</math> total work.
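A serial C sketch of the segmented compaction (my own code; EMPTY marks unoccupied slots, and each row plays the role of one subarray compacted in parallel):
<syntaxhighlight lang="c">
#include <stdio.h>

#define N 4
#define EMPTY -1

int main(void) {
    int A[N][N] = {{EMPTY, 5, EMPTY, 2},
                   {EMPTY, EMPTY, EMPTY, EMPTY},
                   {7, EMPTY, EMPTY, EMPTY},
                   {1, 3, EMPTY, 9}};
    int count[N], offset[N], pos[N][N], out[N * N];
    /* Per-subarray compaction: position of each element within its row */
    for (int i = 0; i < N; i++) {
        count[i] = 0;
        for (int j = 0; j < N; j++)
            if (A[i][j] != EMPTY) pos[i][j] = count[i]++;
    }
    /* Prefix sum over the subarray sizes */
    int sum = 0;
    for (int i = 0; i < N; i++) { offset[i] = sum; sum += count[i]; }
    /* Final placement: previous subarrays' total + position in own subarray */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (A[i][j] != EMPTY) out[offset[i] + pos[i][j]] = A[i][j];
    for (int k = 0; k < sum; k++) printf("%d ", out[k]); /* 5 2 7 1 3 9 */
    printf("\n");
    return 0;
}
</syntaxhighlight>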
==XMT Programming Tips==
I never figured out how to use standard C headers.
*XMT only has int (32-bit) and float (32-bit). If you need a bool type, you will need to define it in a header.
*You cannot call functions from within <code>spawn</code>
Here is my helpers header <code>helpers.h</code>:
{{ hidden | <code>helpers.h</code> |
<syntaxhighlight lang="c">
#ifndef HELPERS_H
#define HELPERS_H

#define true 1
#define false 0
#define bool int
// #define NIL -1

#define max(a, b) (((a) < (b)) ? (b) : (a))
#define min(a, b) (((a) < (b)) ? (a) : (b))

#endif
</syntaxhighlight>
}}
If you prefer to debug in C++ like me, you can use the following code to adapt XMT code to run serially.
Note that this code does not run things asynchronously.
{{hidden | C++ Code |
<syntaxhighlight lang="cpp">
#include <cerrno>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Serial stand-ins for the XMTC primitives.
#define spawn(x, y) for (int $ = x; $ <= y; ++$)
//#define spawn(x, y) for (int $ = y; $ >= x; --$)
#define ps(a, b) \
  {              \
    b += a;      \
    a = b - a;   \
  }
#define psm(a, b) \
  {               \
    b += a;       \
    a = b - a;    \
  }
#define psBaseReg int
#define STRINGIFY2(X) #X
#define STRINGIFY(X) STRINGIFY2(X)

using namespace std;

#define GRAPH_NUM graph0
#define N 19
#define M 36

vector<int> vertices;
vector<int> degrees;
vector<vector<int>> edges;

string readFile(const string& datapath);
vector<string> split(string s, regex r);

int main() {
  vertices.resize(N);
  degrees.resize(N);
  edges = std::vector<std::vector<int>>(M, std::vector<int>(2));

  string datapath = "data/" STRINGIFY(GRAPH_NUM) "/graph.txt";
  string data = readFile(datapath);
  regex section_regex(
      "[\\n\\r][\\n\\r]+|[\\r\\n]0\\s[\\r\\n]+",
      std::regex_constants::ECMAScript | std::regex_constants::icase);
  regex space_regex("[\\n\\s\\r]+", std::regex_constants::ECMAScript |
                                        std::regex_constants::icase);
  vector<string> sections = split(data, section_regex);
  // for (int i = 0; i < sections.size(); i++) {
  //   cout << "Section " << i << " " << sections[i].size() << endl;
  //   cout << sections[i] << endl;
  // }
  return 0;
}

string readFile(const string& datapath) {
  std::ifstream in(datapath, std::ios::in);
  if (in.good()) {
    std::string contents;
    in.seekg(0, std::ios::end);
    contents.resize(static_cast<unsigned int>(in.tellg()));
    in.seekg(0, std::ios::beg);
    in.read(&contents[0], contents.size());
    return contents;
  }
  std::cerr << "Failed to open file: " << datapath << std::endl;
  throw(errno);
}

vector<string> split(string s, regex r) {
  vector<string> splits;
  smatch m;  // match object
  while (regex_search(s, m, r)) {
    int split_on = m.position();          // the match position
    splits.push_back(s.substr(0, split_on));
    s = s.substr(split_on + m.length());  // skip the whole match
  }
  if (!s.empty()) {
    splits.push_back(s);  // there may be one last token at the end
  }
  return splits;
}
</syntaxhighlight>
}}
==Hardware/Architecture==
[http://users.umiacs.umd.edu/~vishkin/XMT/vCarageaLeebook.pdf Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform]

[[File:Pram hardware diagram.jpg | 400px]]
*MTCU - Master Thread Control Unit - runs in serial mode
*TCU Cluster - Thread Control Unit cluster - runs parallel sections
;Procedure
*Spawn broadcasts your spawn block's code to every thread control unit (TCU).
*Data is shared through the interconnection network
;Performance Penalty
<math>\text{Execution Depth} = \text{Computation Depth} + LSRTM \times R + QD</math>
*<math>LSRTM</math> - length of the sequence of round trips to memory
*<math>R</math> - time for a round trip to memory
*<math>QD</math> - queuing delay
**If multiple threads access the same memory module, they are queued at the memory module
==Resources==
*[https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]
===Textbooks===
*[http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]
*[https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569 Introduction to Parallel Algorithms (Joseph JaJa, textbook)]
*[https://www.cs.cmu.edu/~guyb/papers/BM04.pdf Parallel Algorithms by Guy E. Blelloch and Bruce M. Maggs (CMU)]
==References== |