Jump to content

Parallel Algorithms: Difference between revisions

no edit summary
No edit summary
Tag: visualeditor
Line 4: Line 4:
==XMT Language==
==XMT Language==
XMTC is a single-program multiple-data (SPMD) extension of C.<br>
XMTC is a single-program multiple-data (SPMD) extension of C.<br>
* Spawn creates threads
 
* Threads expire at Join
*Spawn creates threads
* <code>$</code> represents the number of the thread
*Threads expire at Join
* <code>PS Ri Rj</code> is an atomic prefix sum
*<code>$</code> represents the number of the thread
** Stores Ri + Rj in Ri
*<code>PS Ri Rj</code> is an atomic prefix sum
** Stores the original value of Ri in Rj
**Stores Ri + Rj in Ri
**Stores the original value of Ri in Rj
<pre>
<pre>
int x = 0;
int x = 0;
Line 32: Line 33:


====Types of PRAM====
====Types of PRAM====
* exclusive-read exclusive-write (EREW)
 
* concurrent-read exclusive-write (CREW)
*exclusive-read exclusive-write (EREW)
* concurrent-read concurrent-write (CRCW)
*concurrent-read exclusive-write (CREW)
** Arbitrary CRCW - an arbitrary processor writing succeeds
*concurrent-read concurrent-write (CRCW)
** Priority CRCW - the lowest numbered processor writing succeeds
**Arbitrary CRCW - an arbitrary processor writing succeeds
** Common CRCW - writing succeeds only if all processors write the same value
**Priority CRCW - the lowest numbered processor writing succeeds
**Common CRCW - writing succeeds only if all processors write the same value


====Drawbacks====
====Drawbacks====
* Does not reveal how the algorithm will run on PRAMs with different number of proc
 
* Fully specifying allocation requires an unnecessary level of detail
*Does not reveal how the algorithm will run on PRAMs with different number of proc
*Fully specifying allocation requires an unnecessary level of detail


===Work Depth===
===Work Depth===
Line 51: Line 54:


;Notes
;Notes
* Other resources call this Brent's theorem
 
* <math>x</math> is the work and <math>d</math> is the depth
*Other resources call this Brent's theorem
*<math>x</math> is the work and <math>d</math> is the depth


===Speedup===
===Speedup===
* A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n).
 
* A work-optimal parallel algorithm is ''work-time-optimal'' if T(n) cannot be improved by another work-optimal algorithm.
*A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n).
* If T(n) is best known and W(n) matches it, this is called ''linear-speedup''
*A work-optimal parallel algorithm is ''work-time-optimal'' if T(n) cannot be improved by another work-optimal algorithm.
*If T(n) is best known and W(n) matches it, this is called ''linear-speedup''


====NC Theory====
====NC Theory====
Nick's Class Theory
Nick's Class Theory
* Good serial algorithms: Poly time
 
* Good parallel algorithm: poly-log <math>O(\log ^c n)</math> time, poly processors
*Good serial algorithms: Poly time
*Good parallel algorithm: poly-log <math>O(\log ^c n)</math> time, poly processors




==Technique: Balanced Binary Trees==
==Technique: Balanced Binary Trees==
Example Applications:  
Example Applications:  
* Prefix sum
 
* Largest element
*Prefix sum
* Nearest-one element
*Largest element
* Compaction
*Nearest-one element
*Compaction


;Inputs
;Inputs
* Array A[1..n] of elements
 
* Associative binary operation, denoted * (e.g. addition, multiplication, min, max)
*Array A[1..n] of elements
*Associative binary operation, denoted * (e.g. addition, multiplication, min, max)


;Outputs
;Outputs
* Array B containing B(i) = A(1) * ... * A(i)
 
*Array B containing B(i) = A(1) * ... * A(i)


;Algorithm
;Algorithm
<math>O(n)</math> work and <math>O(\log n)</math> time
<math>O(n)</math> work and <math>O(\log n)</math> time
<pre>
<pre>
Line 100: Line 110:
==Technique: Merge Sorting Cluster==
==Technique: Merge Sorting Cluster==
===Technique: Partitioning===
===Technique: Partitioning===
* Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B
 
* Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>
*Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B
** Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.
*Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>
**Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.


;Equivalence of Merging and Ranking
;Equivalence of Merging and Ranking
* Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i
 
* Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i
*Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i
*Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i


;Naive algorithm
;Naive algorithm
Apply binary search element wise on concurrent read system.
Apply binary search element wise on concurrent read system.
O(log n) time, O(nlog n) work
O(log n) time, O(nlog n) work
Line 118: Line 131:


;Partitioning paradigm
;Partitioning paradigm
Input size: n
Input size: n
* Partition the input into p jobs of size approx. n/p
 
* Do small jobs concurrently using a separate (possibly serial) algo for each.
*Partition the input into p jobs of size approx. n/p
*Do small jobs concurrently using a separate (possibly serial) algo for each.
 
Example:
Example:
Total work O(n), total time O(log n)
Total work O(n), total time O(log n)
Line 137: Line 153:


;Notes
;Notes
* We do not need to compute A_end or B_end. Just continue merging until we hit some index where (i % x)=0 in A or B.
 
*We do not need to compute A_end or B_end. Just continue merging until we hit some index where (i % x)=0 in A or B.


===Techinque: Divide and Conquer===
===Techinque: Divide and Conquer===


;Merge Sort
;Merge Sort
Input: Array A[1..n]<br>
Input: Array A[1..n]<br>
Complexity:<br>
Complexity:<br>
Line 160: Line 178:
===Technique: Accelerating Cascades===
===Technique: Accelerating Cascades===
Consider: for problem of size n, there are two parallel algorithms.<br>
Consider: for problem of size n, there are two parallel algorithms.<br>
* Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time
 
* Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.
*Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time
*Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.
 
Assume Algo A is a "reducing algorithm"<br>
Assume Algo A is a "reducing algorithm"<br>
We start with Algo A until the size is below some threshold. Then we switch to Algo B.
We start with Algo A until the size is below some threshold. Then we switch to Algo B.


E.g.
E.g.
* Algo 1 runs in <math>O(\log n)</math> iterations each taking <math>O(\log n)</math> time and <math>O(n)</math> work. Each iteration reduces the size to (3/4)m.
 
* Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.
*Algo 1 runs in <math>O(\log n)</math> iterations each taking <math>O(\log n)</math> time and <math>O(n)</math> work. Each iteration reduces the size to (3/4)m.
* Run Algo 1 for <math>O(\log \log n)</math> rounds then run algo 2.
*Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.
** Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>
*Run Algo 1 for <math>O(\log \log n)</math> rounds then run algo 2.
**Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>


===Problem: Selection===
===Problem: Selection===
;Algorithm 1
;Algorithm 1
* Partition elements into rows of <math>\log n</math> size
 
* For each row, find the median within the row
*Partition elements into rows of <math>\log n</math> size
* Find the median of medians (MoM) in <math>O(n)</math>
*For each row, find the median within the row
* Put all rows with median <= MoM above and all rows with median >= Mom below
*Find the median of medians (MoM) in <math>O(n)</math>
* Now <math>m/4</math> elements are smaller and <math>m/4</math> are larger
*Put all rows with median <= MoM above and all rows with median >= Mom below
** Known as the reducing lemma
*Now <math>m/4</math> elements are smaller and <math>m/4</math> are larger
**Known as the reducing lemma


This algorithm solves the selection problem in <math>O(\log^2 n)</math> time and <math>O(n)</math> work.
This algorithm solves the selection problem in <math>O(\log^2 n)</math> time and <math>O(n)</math> work.


; Accelerating Cascades
;Accelerating Cascades
* What we have:
 
** Algorithm 1 has <math>O(\log n)</math> iterations.  
*What we have:
:: Each iteration reduces a size m instance in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>
**Algorithm 1 has <math>O(\log n)</math> iterations.
** Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.
 
* Step 1: Use algo 1 to reduce from n to <math>n / \log n</math>
::Each iteration reduces a size m instance in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>
* Step 2: Apply algorithm 2
 
**Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.
*Step 1: Use algo 1 to reduce from n to <math>n / \log n</math>
*Step 2: Apply algorithm 2


===Informal Work-Depth (IWD)===
===Informal Work-Depth (IWD)===
Line 202: Line 228:


===Algorithm===
===Algorithm===
* Partition A into n/r subarrays <math>B_1,...,B_{n/r}</math>
 
** Sort each subarray separately using serial bucket sort (counting sort)
*Partition A into n/r subarrays <math>B_1,...,B_{n/r}</math>
** Compute number(v,s)  # of elements of value v in <math>B_s</math> for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>
**Sort each subarray separately using serial bucket sort (counting sort)
** Compute serial(i) = # of elements in the block <math>B_s</math> such that A(j)=A(i) and <math> j < i </math> <math>1 \leq j \neq n</math>
**Compute number(v,s)  # of elements of value v in <math>B_s</math> for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>
* Run prefix sum on number(v,1),number(v,2),...,number(v,n/r) into ps(v,1), ps(v,2),..., ps(v,n/r)
**Compute serial(i) = # of elements in the block <math>B_s</math> such that A(j)=A(i) and <math> j < i </math> <math>1 \leq j \neq n</math>
* Compute prefix sums of cardinality(0),.., cardinality(r-1) into global_ps(0),...,global_ps(r-1)
*Run prefix sum on number(v,1),number(v,2),...,number(v,n/r) into ps(v,1), ps(v,2),..., ps(v,n/r)
* The rank of element <math>i</math> is <math>1+serial(i)+ps(v,s-1)+global_ps(v-1)</math>
*Compute prefix sums of cardinality(0),.., cardinality(r-1) into global_ps(0),...,global_ps(r-1)
*The rank of element <math>i</math> is <math>1+serial(i)+ps(v,s-1)+global_ps(v-1)</math>


===Complexity===
===Complexity===
* Step 1:<math>T=O(r),\;W=O(r)</math> per subarray.
 
** Total: <math>T=O(r),\; W=O(n)</math>
*Step 1:<math>T=O(r),\;W=O(r)</math> per subarray.
* Step 2: <math>r</math> computations each <math>T=O(\log(n/r)),\; W=O(n/r)</math>
**Total: <math>T=O(r),\; W=O(n)</math>
** Total <math>T=O(\log n),\; W=O(n)</math>
*Step 2: <math>r</math> computations each <math>T=O(\log(n/r)),\; W=O(n/r)</math>
* Step 3: <math>T=O(\log r),\; W=O(r)</math>
**Total <math>T=O(\log n),\; W=O(n)</math>
* Step 4: <math>T=O(1)\; W=O(n)</math>
*Step 3: <math>T=O(\log r),\; W=O(r)</math>
* Total: <math>T=O(r + \log n),\; W=O(n)</math>
*Step 4: <math>T=O(1)\; W=O(n)</math>
*Total: <math>T=O(r + \log n),\; W=O(n)</math>


;Notes
;Notes
* Running time is not poly-log
 
*Running time is not poly-log


===Theorems===
===Theorems===


* The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work
*The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work
* The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>
*The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>


Radix sort using the basic integer sort (BIS) algorithm.<br>
Radix sort using the basic integer sort (BIS) algorithm.<br>
Line 236: Line 265:
===2-3 tree===
===2-3 tree===
A 2-3 tree is a rooted tree which has the properties:
A 2-3 tree is a rooted tree which has the properties:
* Each node has 2-3 ordered children
 
* For any internal node, every directed path to a leaf is the same length
*Each node has 2-3 ordered children
*For any internal node, every directed path to a leaf is the same length
 
Types of queries:
Types of queries:
* ''search(a)''
 
* ''insert(a)''
*''search(a)''
* ''delete(a)''
*''insert(a)''
*''delete(a)''


====Serial Algorithms====
====Serial Algorithms====
* Deletion: discard(a)
 
** Delete connection between a and parent of a
*Deletion: discard(a)
** If parent has 1 child
**Delete connection between a and parent of a
*** If parent's sibling has 2 children, move child to sibling and discard(parent)
**If parent has 1 child
*** If parent's sibling has 3 children, take one child from sibling
***If parent's sibling has 2 children, move child to sibling and discard(parent)
***If parent's sibling has 3 children, take one child from sibling


====Parallel Algorithms====
====Parallel Algorithms====
Assume concurrent read model
Assume concurrent read model
* Insert: Suppose we want to insert sorted elements <math>c_1,...,c_k</math> into a tree with n elements.
 
** Insert element <math>c_{k/2}</math>.
*Insert: Suppose we want to insert sorted elements <math>c_1,...,c_k</math> into a tree with n elements.
** Then insert in parallel <math>(c_1,...,c_{k/2-1})</math> and <math>(c_{k/2+1},...,c_k)</math>
**Insert element <math>c_{k/2}</math>.
** Time is <math>O(\log k \log n)</math> using <math>k</math> processors.
**Then insert in parallel <math>(c_1,...,c_{k/2-1})</math> and <math>(c_{k/2+1},...,c_k)</math>
* Deletion: Suppose we want to delete sorted elements <math>c_1,...,c_k</math>
**Time is <math>O(\log k \log n)</math> using <math>k</math> processors.
** Backwards Insertion
*Deletion: Suppose we want to delete sorted elements <math>c_1,...,c_k</math>
** for t=0 to <math>\log k</math>
**Backwards Insertion
*** if <math>i \equiv 2^t (\operatorname{mod}2^{t+1})</math>
**for t=0 to <math>\log k</math>
**** discard(c_i)
***if <math>i \equiv 2^t (\operatorname{mod}2^{t+1})</math>
** Time is <math>O(\log n \log k)</math> without pipelining
****discard(c_i)
** With pipelining, complexity is <math>O(\log n + \log k)</math>
**Time is <math>O(\log n \log k)</math> without pipelining
**With pipelining, complexity is <math>O(\log n + \log k)</math>


===Pipelining===
===Pipelining===
* There are <math>\log k</math> waves of the absorb procedure.
 
* Apply pipelining to make the time <math>O(\log n + \log k)</math>
*There are <math>\log k</math> waves of the absorb procedure.
*Apply pipelining to make the time <math>O(\log n + \log k)</math>


==Maximum Finding==
==Maximum Finding==
Line 272: Line 307:


===Constant time, <math>O(n^2)</math> Work===
===Constant time, <math>O(n^2)</math> Work===
* Compare every pair of elements in A[1...n]
 
*Compare every pair of elements in A[1...n]
<pre>
<pre>
for i=1 to n pardo
for i=1 to n pardo
Line 287: Line 323:


===<math>O(\log \log n)</math> time and <math>O(n\log \log n)</math> work algorithm===
===<math>O(\log \log n)</math> time and <math>O(n\log \log n)</math> work algorithm===
* Split A into <math>\sqrt{n}</math> subarrays
 
* Find the max of each subarray recursively
*Split A into <math>\sqrt{n}</math> subarrays
* Find the max of all subarrays in <math>O(n)</math> using the constant time algo
*Find the max of each subarray recursively
*Find the max of all subarrays in <math>O(n)</math> using the constant time algo


Complexity
Complexity
* <math>T(n) \leq T(\sqrt{n}) + c_1</math>
 
* <math>W(n) \leq \sqrt{n}W(\sqrt{n}) + c_2 n</math>
*<math>T(n) \leq T(\sqrt{n}) + c_1</math>
* <math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>
*<math>W(n) \leq \sqrt{n}W(\sqrt{n}) + c_2 n</math>
*<math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>


===<math>O(\log \log n)</math> time and <math>O(n)</math> work===
===<math>O(\log \log n)</math> time and <math>O(n)</math> work===


* Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.
*Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.
** Apply serial linear time algorithm to find maximum of each block
**Apply serial linear time algorithm to find maximum of each block
** <math>O(\log \log n)</math> time and <math>O(n)</math> work.
**<math>O(\log \log n)</math> time and <math>O(n)</math> work.
* Step 2: Apply doubly log algorithm to <math>n / \log \log n</math> maxima.
*Step 2: Apply doubly log algorithm to <math>n / \log \log n</math> maxima.
** <math>O(\log \log n)</math> time and <math>O(n)</math> work.
**<math>O(\log \log n)</math> time and <math>O(n)</math> work.


===Random Sampling===
===Random Sampling===
Maximum finding in <math>O(1)</math> time and <math>O(n)</math> work with very high probability
Maximum finding in <math>O(1)</math> time and <math>O(n)</math> work with very high probability


* Step 1: Using aux array B of size <math>b^{7/8}</math>. Independently fill with random elements from A
*Step 1: Using aux array B of size <math>b^{7/8}</math>. Independently fill with random elements from A
* Step 2: Find the max <math>m</math> in array B in <math>O(1)</math> time and <math>O(n)</math> work
*Step 2: Find the max <math>m</math> in array B in <math>O(1)</math> time and <math>O(n)</math> work
** Last 3 pulses of the recursive doubly-log time algorithm:
**Last 3 pulses of the recursive doubly-log time algorithm:
** Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.
**Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.
** Pulse 2: <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each.
**Pulse 2: <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each.
** Pulse 2: Find the max m of <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.
**Pulse 2: Find the max m of <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.
* Step 3: While there is an element larger than m, throw the new element into an array of size <math>n^{7/8}</math>.
*Step 3: While there is an element larger than m, throw the new element into an array of size <math>n^{7/8}</math>.
: Compute the maximum of the new array.
 
:Compute the maximum of the new array.


;Complexity
;Complexity
* Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work
 
* Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work
*Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work
* Each time Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work
*Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work
* With high probability, we only need one iteration (see theorem below) so the time is <math>O(1)</math> and work is <math>O(n^{7/8})</math>
*Each time Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work
*With high probability, we only need one iteration (see theorem below) so the time is <math>O(1)</math> and work is <math>O(n^{7/8})</math>


====Theorem 8.2====
====Theorem 8.2====
Line 327: Line 367:
The probability of not finishing in the above time is <math>O(1/n^c)</math>.
The probability of not finishing in the above time is <math>O(1/n^c)</math>.


{{ hidden | Proof |
{{hidden | Proof |
See page 58 of the classnotes.<br>
See page 58 of the classnotes.<br>


Line 340: Line 380:


;Inputs
;Inputs
* Vector of vertices which point to an index in edges
 
* Vector of edges e.g. (1,2)
*Vector of vertices which point to an index in edges
* Vector of pointers to identical edges e.g. index of (1,2) -> index of (2,1)
*Vector of edges e.g. (1,2)
* A vertex <math>r</math>
*Vector of pointers to identical edges e.g. index of (1,2) -> index of (2,1)
*A vertex <math>r</math>


;Goal:
;Goal:
* Select a direction for each edge to make our graph a tree with root <math>r</math>
 
*Select a direction for each edge to make our graph a tree with root <math>r</math>


;Algorithm
;Algorithm
* Step 1: For every edge <math>(u,v)</math>, add two edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>
 
* Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge from <math>v</math>, <math>next(u \rightarrow v)</math>
*Step 1: For every edge <math>(u,v)</math>, add two edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>
** If our edge is 1 -> 2, then the next edge is the immediate next edge of edge 2->1 in our edges array  
*Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge from <math>v</math>, <math>next(u \rightarrow v)</math>
** If edge 2 -> 1 is the last edge coming out of 2, then the next edge is the first edge coming out of 2
**If our edge is 1 -> 2, then the next edge is the immediate next edge of edge 2->1 in our edges array
** Now we have an euler cycle (or euler tour) among the edges
**If edge 2 -> 1 is the last edge coming out of 2, then the next edge is the first edge coming out of 2
* Step 3: <math>next(u_{r,1} \rightarrow r) = NIL </math>
**Now we have an euler cycle (or euler tour) among the edges
* Step 4: Apply a list ranking algorithm to the edges.
*Step 3: <math>next(u_{r,1} \rightarrow r) = NIL </math>
* Step 5: Delete the edge between every two vertices with the lower ranking
*Step 4: Apply a list ranking algorithm to the edges.
*Step 5: Delete the edge between every two vertices with the lower ranking


;Notes
;Notes
* Every step except 4, list ranking, is a local operation which is constant time and linear work
 
* Requires a list ranking algorithm
*Every step except 4, list ranking, is a local operation which is constant time and linear work
*Requires a list ranking algorithm


====Preorder Numbering====
====Preorder Numbering====
Define: tree-edges point away from root, non-tree edges point to root
Define: tree-edges point away from root, non-tree edges point to root


* Step 1: For every edge pardo, if e is a tree edge, distance(e) = 1 else distance(e) = 0
*Step 1: For every edge pardo, if e is a tree edge, distance(e) = 1 else distance(e) = 0
* Step 2: List ranking
*Step 2: List ranking
* Step 3: For every edge e: u->v, preorder(v) = n-distance(e) + 1
*Step 3: For every edge e: u->v, preorder(v) = n-distance(e) + 1


===Technique: Pointer Jumping===
===Technique: Pointer Jumping===
Line 389: Line 433:


Correctness: <br>
Correctness: <br>
* Each element will eventually point to the cashier
 
* Distance will always be correct
*Each element will eventually point to the cashier
** If next[i] = NIL then distance is to the cashier
*Distance will always be correct
** Otherwise the distance is <math>2^k</math>
**If next[i] = NIL then distance is to the cashier
**Otherwise the distance is <math>2^k</math>


Complexity: <math>O(\log n)</math> time and <math>O(n\log n)</math> work
Complexity: <math>O(\log n)</math> time and <math>O(n\log n)</math> work


===Work-optimal List Ranking===
===Work-optimal List Ranking===
* From our list A, pull a subset S with uniform density and sparcity
* Remove S from A, creating a new subset <math>A-S</math>.
** For each element i in <math>S</math>, fix the next and distance of prev(i)
** Run compaction
* Recurse on <math>A-S</math> until size is <math>\leq n/\log n</math>. Then use pointer jumping.
* Add <math>S</math> back to <math>A-S</math>


; Complexity
*From our list A, pull a subset S with uniform density and sparcity
*Remove S from A, creating a new subset <math>A-S</math>.
**For each element i in <math>S</math>, fix the next and distance of prev(i)
**Run compaction
*Recurse on <math>A-S</math> until size is <math>\leq n/\log n</math>. Then use pointer jumping.
*Add <math>S</math> back to <math>A-S</math>
 
;Complexity
 
With high probability:
With high probability:
* <math>O(log (n) log log (n)) time</math>
 
* <math>O(n)</math> work
*<math>O(log (n) log log (n)) time</math>
*<math>O(n)</math> work


===Randomized Symmetry Breaking===
===Randomized Symmetry Breaking===
Line 420: Line 468:


===Deterministic Symmetry Breaking===
===Deterministic Symmetry Breaking===
* Assign each element a unique tag, such as the index in the array
 
* Convert the tag to binary. For each element i, tag(i)!=tag(next(i)).
*Assign each element a unique tag, such as the index in the array
* Let k be the index of the rightmost bit (where 0 is the rightmost bit) differing between tag(i) and tag(next(i))
*Convert the tag to binary. For each element i, tag(i)!=tag(next(i)).
* Let b be the bit in tag(i) differing from tag(next(i))
*Let k be the index of the rightmost bit (where 0 is the rightmost bit) differing between tag(i) and tag(next(i))
* Set the new tag to <code>(k<<1) + b</code>
*Let b be the bit in tag(i) differing from tag(next(i))
*Set the new tag to <code>(k<<1) + b</code>


This algorithm takes <math>O(1)</math> time and <math>O(n)</math> work per iteration.
This algorithm takes <math>O(1)</math> time and <math>O(n)</math> work per iteration.
Line 433: Line 482:
===r-ruling set===
===r-ruling set===
An r-ruling set is a subset <math>S</math> of a linked list satisfying two properties:
An r-ruling set is a subset <math>S</math> of a linked list satisfying two properties:
* ''Uniform Density'' If node <math>i</math> is not in S, then one of the <math>r</math> nodes following i will be in <math>S</math>
 
* ''Uniform Sparcity'' If node <math>i</math> is in <math>S</math> then node <math>next(i)</math> is not in <math>S</math>
*''Uniform Density'' If node <math>i</math> is not in S, then one of the <math>r</math> nodes following i will be in <math>S</math>
*''Uniform Sparcity'' If node <math>i</math> is in <math>S</math> then node <math>next(i)</math> is not in <math>S</math>


;Algorithm
;Algorithm
The following algorithm gives us an r-ruling set with <math>r=2*\log n - 1</math>
The following algorithm gives us an r-ruling set with <math>r=2*\log n - 1</math>
# Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,...,2*\log n - 1]</math>
 
# For each element i in parallel, if i is a local maximum then add it to <math>S</math>
#Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,...,2*\log n - 1]</math>
#For each element i in parallel, if i is a local maximum then add it to <math>S</math>


==Tree Contraction==
==Tree Contraction==
Line 453: Line 505:
===Parallel Rake===
===Parallel Rake===
Rake all leaves such that:
Rake all leaves such that:
* No two leaves have the same parent (stem)
 
* The parent of a stem cannot be the stem of another leaf
*No two leaves have the same parent (stem)
*The parent of a stem cannot be the stem of another leaf




Line 460: Line 513:


;Parallel Rake Contraction Scheme
;Parallel Rake Contraction Scheme
Apply legal parallel rakes until the tree becomes a 3-node binary tree
Apply legal parallel rakes until the tree becomes a 3-node binary tree


Line 465: Line 519:


;Step 1
;Step 1
* Number leaves according to a DFS using the Euler tour algorithm.<br>
 
*Number leaves according to a DFS using the Euler tour algorithm.<br>


Observation 3:
Observation 3:
Line 474: Line 529:


;Step 2:
;Step 2:
* Let <math>S_1</math> be nodes in <math>S</math> whose parents are not in <math>S</math>.
 
; Let <math>L_1</math> be leaves in <math>L</math> whose parents are in <math>S_1</math>.
*Let <math>S_1</math> be nodes in <math>S</math> whose parents are not in <math>S</math>.
* Apply parallel rakes to leaves in <math>L_1</math>, unless the parent is the root.
 
* Apply parallel rakes to leaves in <math>L - L_1</math>, unless the parent is the root.
;Let <math>L_1</math> be leaves in <math>L</math> whose parents are in <math>S_1</math>.
* Repeat Step 2 until the tree is a 3-node binary tree
 
*Apply parallel rakes to leaves in <math>L_1</math>, unless the parent is the root.
*Apply parallel rakes to leaves in <math>L - L_1</math>, unless the parent is the root.
*Repeat Step 2 until the tree is a 3-node binary tree


===Complexity===
===Complexity===
Line 502: Line 560:


Definitions:
Definitions:
* Pointer Graph
 
* Supervertices
*Pointer Graph
* The supervertex graph
*Supervertices
* Hookings
*The supervertex graph
* Parallel pointer jumping
*Hookings
*Parallel pointer jumping


===A first connectivity algorithm===
===A first connectivity algorithm===


The first connectivity algorithm consists of <math>\log(n)</math> iterations of the following two steps:
The first connectivity algorithm consists of <math>\log(n)</math> iterations of the following two steps:
* Hookings - Each root hooks itself the the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.
 
* Parallel pointer jumping - Each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.
*Hookings - Each root hooks itself the the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.
*Parallel pointer jumping - Each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.


;Theorems
;Theorems
# The pointer graph always consists of rooted trees.
 
# After <math>\log n</math> iterations, all vertices are contained in rooted stars that quit.
#The pointer graph always consists of rooted trees.
#: Each connected component is represented by a single rooted star.
#After <math>\log n</math> iterations, all vertices are contained in rooted stars that quit.
#:Each connected component is represented by a single rooted star.


;Complexity
;Complexity
* We need <math>O(\log n)</math> iterations
 
** Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work
*We need <math>O(\log n)</math> iterations
** Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work
**Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work
* For adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work since we have a processor per <math>n^2</math> possible edges.
**Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work
* For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work since we have a processor per edge.
*For adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work since we have a processor per <math>n^2</math> possible edges.
*For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work since we have a processor per edge.


===A second connectivity algorithm===
===A second connectivity algorithm===
Line 531: Line 593:
Each iteration takes constant time.
Each iteration takes constant time.


# Probe quitting: each rooted star whose supervertex is not adjacent to any other quits
#Probe quitting: each rooted star whose supervertex is not adjacent to any other quits
# Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf
#Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf
#* Note that this is not unique. Requires arbitrary CRCW.
#*Note that this is not unique. Requires arbitrary CRCW.
# Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks to another vertex in another supervertex which is not a leaf
#Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks to another vertex in another supervertex which is not a leaf
# Parallel pointer jumping: one round of parallel pointer jumping
#Parallel pointer jumping: one round of parallel pointer jumping


;Theorems
;Theorems
* The pointer graph always consists of rooted trees.
 
* For every vertex which is not a leaf, <math>D(v) \leq v</math>
*The pointer graph always consists of rooted trees.
*For every vertex which is not a leaf, <math>D(v) \leq v</math>
{{hidden | Proof |  
{{hidden | Proof |  
* This is true after steps 1, 2, and 4.
* This is true after steps 1, 2, and 4.
Line 553: Line 616:


;Complexity
;Complexity
* Time <math>O(\log n)</math>
 
* Work <math>O((n+m)\log n)</math>
*Time <math>O(\log n)</math>
*Work <math>O((n+m)\log n)</math>


;Notes
;Notes
* Without step 3, the algorithm will run in <math>O(n)</math> time instead of <math>O(\log n)</math> but the result will be valid.
* In step 2, you can hook rooted trees in addition to rooted stars for better empirical performance


{{ hidden | Proof |
*Without step 3, the algorithm will run in <math>O(n)</math> time instead of <math>O(\log n)</math> but the result will be valid.
*In step 2, you can hook rooted trees in addition to rooted stars for better empirical performance
 
{{hidden | Proof |
Let <math>h(T)</math> be the height of tree T where a two-layer tree has height 1.<br>
Let <math>h(T)</math> be the height of tree T where a two-layer tree has height 1.<br>
Define a single node to have height 1.<br>
Define a single node to have height 1.<br>
Line 570: Line 635:
===Minimum Spanning Forest===
===Minimum Spanning Forest===


 
minimum spanning forest
==Hardware/Architecture==
==Hardware/Architecture==
[http://users.umiacs.umd.edu/~vishkin/XMT/vCarageaLeebook.pdf Models for Advancing PRAM and Other Algorithms into Parallel Algorithms for PRAM]
[http://users.umiacs.umd.edu/~vishkin/XMT/vCarageaLeebook.pdf Models for Advancing PRAM and Other Algorithms into Parallel Algorithms for PRAM]
Line 576: Line 641:
[[File:Pram hardware diagram.jpg | 400px]]
[[File:Pram hardware diagram.jpg | 400px]]


* MTCU - Master Thread Control Unit - runs in serial mode
*MTCU - Master Thread Control Unit - runs in serial mode
* TCU Cluster - Thread Control Unit
*TCU Cluster - Thread Control Unit


;Procedure
;Procedure
* Spawn broadcasts your spawn code in the block to every thread control unit (TCU).
 
* Data is shared through the interconnection network
*Spawn broadcasts your spawn code in the block to every thread control unit (TCU).
*Data is shared through the interconnection network


;Performance Penalty
;Performance Penalty
<math>\text{Execution Depth} = \text{Computation Depth} + LSRTM \times R + QD</math>
<math>\text{Execution Depth} = \text{Computation Depth} + LSRTM \times R + QD</math>
* <math>LSTRM</math> - length of sequence of round trips to memory
 
* <math>R</math> - time for round trip to memory
*<math>LSTRM</math> - length of sequence of round trips to memory
* <math>QD</math> - queuing delay
*<math>R</math> - time for round trip to memory
** If multiple threads access the same memory module, they are queued at the memory module
*<math>QD</math> - queuing delay
**If multiple threads access the same memory module, they are queued at the memory module


==Resources==
==Resources==
* [https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]
 
*[https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]


===Textbooks===
===Textbooks===
* [http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]
 
* [https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569?sa-no-redirect=1&pldnSite=1 Introduction to Parallel Algorithms (Joseph Jaja, textbook)]
*[http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]
* [https://www.cs.cmu.edu/~guyb/papers/BM04.pdf Parallel Algorithms by Guy E. Blelloch and Bruce M. Maggs (CMU)]
*[https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569?sa-no-redirect=1&pldnSite=1 Introduction to Parallel Algorithms (Joseph Jaja, textbook)]
*[https://www.cs.cmu.edu/~guyb/papers/BM04.pdf Parallel Algorithms by Guy E. Blelloch and Bruce M. Maggs (CMU)]


==Ignore==
==Ignore==
[[Visible to::users]]
[[Visible to::users]]