Parallel Algorithms: Difference between revisions

(39 intermediate revisions by the same user not shown)

Line 1:

Parallel Algorithms notes from CMSC751 with Uzi Vishkin.

Parallel Algorithms notes from CMSC751 with [http://users.umiacs.umd.edu/~vishkin/index.shtml Uzi Vishkin].

This class is based on his book [[:File:CMSC751_Classnotes.pdf | Thinking in Parallel Some Basic Data-Parallel Algorithms and Techniques]]

This class is based on his book [[:File:CMSC751_Classnotes.pdf | Thinking in Parallel Some Basic Data-Parallel Algorithms and Techniques]]

[http://users.umiacs.umd.edu/~vishkin/TEACHING/ Class Website]

==XMT Language==

XMTC is a single-program multiple-data (SPMD) extension of C.

* Spawn creates threads

* Threads expire at Join

*Spawn creates threads

* <code>$</code> represents the number of the thread

*Threads expire at Join

* <code>PS Ri Rj</code> is an atomic prefix sum

*<code>$</code> represents the number of the thread

** Stores Ri + Rj in Ri

*<code>PS Ri Rj</code> is an atomic prefix sum

** Stores the original value of Ri in Rj

**Stores Ri + Rj in Ri

<~~pre~~>

**Stores the original value of Ri in Rj

int x = 0;

Line 21:

Line 23:

}

</~~pre~~>

</syntaxhighlight>

==Models==

Line 32:

Line 34:

====Types of PRAM====

* exclusive-read exclusive-write (EREW)

* concurrent-read exclusive-write (CREW)

*exclusive-read exclusive-write (EREW)

* concurrent-read concurrent-write (CRCW)

*concurrent-read exclusive-write (CREW)

** Arbitrary CRCW - an arbitrary processor writing succeeds

*concurrent-read concurrent-write (CRCW)

** Priority CRCW - the lowest numbered processor writing succeeds

**Arbitrary CRCW - an arbitrary processor writing succeeds

** Common CRCW - writing succeeds only if all processors write the same value

**Priority CRCW - the lowest numbered processor writing succeeds

**Common CRCW - writing succeeds only if all processors write the same value

====Drawbacks====

* Does not reveal how the algorithm will run on PRAMs with different number of proc

* Fully specifying allocation requires an unnecessary level of detail

*Does not reveal how the algorithm will run on PRAMs with different number of proc

*Fully specifying allocation requires an unnecessary level of detail

===Work Depth===

Line 51:

Line 55:

;Notes

* Other resources call this Brent's theorem

* <math>x</math> is the work and <math>d</math> is the depth

*Other resources call this Brent's theorem

*<math>x</math> is the work and <math>d</math> is the depth

===Speedup===

* A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n).

* A work-optimal parallel algorithm is ''work-time-optimal'' if T(n) cannot be improved by another work-optimal algorithm.

*A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n).

* If T(n) is best known and W(n) matches it, this is called ''linear-speedup''

*A work-optimal parallel algorithm is ''work-time-optimal'' if T(n) cannot be improved by another work-optimal algorithm.

*If T(n) is best known and W(n) matches it, this is called ''linear-speedup''

====NC Theory====

Nick's Class Theory

* Good serial algorithms: Poly time

* Good parallel algorithm: poly-log <math>O(\log ^c n)</math> time, poly processors

*Good serial algorithms: Poly time

*Good parallel algorithm: poly-log <math>O(\log ^c n)</math> time, poly processors

==Technique: Balanced Binary Trees==

Example Applications:

* Prefix sum

* Largest element

*Prefix sum

* Nearest-one element

*Largest element

* Compaction

*Nearest-one element

*Compaction

;Inputs

* Array A[1..n] of elements

* Associative binary operation, denoted * (e.g. addition, multiplication, min, max)

*Array A[1..n] of elements

*Associative binary operation, denoted * (e.g. addition, multiplication, min, max)

;Outputs

* Array B containing B(i) = A(1) * ... * A(i)

*Array B containing B(i) = A(1) * ... * A(i)

;Algorithm

<math>O(n)</math> work and <math>O(\log n)</math> time

<pre>

Line 100:

Line 111:

==Technique: Merge Sorting Cluster==

===Technique: Partitioning===

* Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B

* Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>

*Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B

** Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.

*Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>

**Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.

;Equivalence of Merging and Ranking

* Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i

* Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i

*Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i

*Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i

;Naive algorithm

Apply binary search element wise on concurrent read system.

O(log n) time, O(nlog n) work

Line 118:

Line 132:

;Partitioning paradigm

Input size: n

* Partition the input into p jobs of size approx. n/p

* Do small jobs concurrently using a separate (possibly serial) algo for each.

*Partition the input into p jobs of size approx. n/p

*Do small jobs concurrently using a separate (possibly serial) algo for each.

Example:

Total work O(n), total time O(log n)

Line 137:

Line 154:

;Notes

* We do not need to compute A_end or B_end. Just continue merging until we hit some index where (i % x)=0 in A or B.

*We do not need to compute A_end or B_end. Just continue merging until we hit some index where (i % x)=0 in A or B.

===Techinque: Divide and Conquer===

;Merge Sort

Input: Array A[1..n]

Complexity:

Line 160:

Line 179:

===Technique: Accelerating Cascades===

Consider: for problem of size n, there are two parallel algorithms.

* Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time

* Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.

*Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time

*Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.

Assume Algo A is a "reducing algorithm"

We start with Algo A until the size is below some threshold. Then we switch to Algo B.

E.g.

* Algo 1 runs in <math>O(\log n)</math> iterations each taking <math>O(\log n)</math> time and <math>O(n)</math> work. Each iteration reduces the size to (3/4)m.

* Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.

*Algo 1 runs in <math>O(\log n)</math> iterations each taking <math>O(\log n)</math> time and <math>O(n)</math> work. Each iteration reduces the size to (3/4)m.

* Run Algo 1 for <math>O(\log \log n)</math> rounds then run algo 2.

*Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.

** Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>

*Run Algo 1 for <math>O(\log \log n)</math> rounds then run algo 2.

**Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>

===Problem: Selection===

;Algorithm 1

* Partition elements into rows of <math>\log n</math> size

* For each row, find the median within the row

*Partition elements into rows of <math>\log n</math> size

* Find the median of medians (MoM) in <math>O(n)</math>

*For each row, find the median within the row

* Put all rows with median <= MoM above and all rows with median >= Mom below

*Find the median of medians (MoM) in <math>O(n)</math>

* Now <math>m/4</math> elements are smaller and <math>m/4</math> are larger

*Put all rows with median <= MoM above and all rows with median >= Mom below

** Known as the reducing lemma

*Now <math>m/4</math> elements are smaller and <math>m/4</math> are larger

**Known as the reducing lemma

This algorithm solves the selection problem in <math>O(\log^2 n)</math> time and <math>O(n)</math> work.

; Accelerating Cascades

;Accelerating Cascades

* What we have:

** Algorithm 1 has <math>O(\log n)</math> iterations.

*What we have:

:: Each iteration reduces a size m instance in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>

**Algorithm 1 has <math>O(\log n)</math> iterations.

** Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.

::Each iteration reduces a size m instance in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>

* Step 1: Use algo 1 to reduce from n to <math>n / \log n</math>

**Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.

* Step 2: Apply algorithm 2

*Step 1: Use algo 1 to reduce from n to <math>n / \log n</math>

*Step 2: Apply algorithm 2

Total time is <math>O(\log n \log\log n)</math> and total work is <math>O(n)</math>

===Informal Work-Depth (IWD)===

Line 202:

Line 230:

===Algorithm===

* Partition A into n/r subarrays <math>B_1,...,B_{n/r}</math>

** Sort each subarray separately using serial bucket sort (counting sort)

*Partition A into n/r subarrays <math>B_1,...,B_{n/r}</math>

** Compute number(v,s) # of elements of value v in <math>B_s</math> for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>

**Sort each subarray separately using serial bucket sort (counting sort)

** Compute serial(i) = # of elements in the block <math>B_s</math> such that A(j)=A(i) and <math> j <math>1 \leq j \neq n</math>

**Compute number(v,s) # of elements of value v in <math>B_s</math> for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>

* Run prefix sum on number(v,1),number(v,2),...,number(v,n/r) into ps(v,1), ps(v,2),..., ps(v,n/r)

**Compute serial(i) = # of elements in the block <math>B_s</math> such that A(j)=A(i) and <math> j <math>1 \leq j \neq n</math>

* Compute prefix sums of cardinality(0),.., cardinality(r-1) into global_ps(0),...,global_ps(r-1)

*Run prefix sum on number(v,1),number(v,2),...,number(v,n/r) into ps(v,1), ps(v,2),..., ps(v,n/r)

* The rank of element <math>i</math> is <math>1+serial(i)+ps(v,s-1)+global_ps(v-1)</math>

*Compute prefix sums of cardinality(0),.., cardinality(r-1) into global_ps(0),...,global_ps(r-1)

*The rank of element <math>i</math> is <math>1+serial(i)+ps(v,s-1)+global_ps(v-1)</math>

===Complexity===

* Step 1:<math>T=O(r),\;W=O(r)</math> per subarray.

** Total: <math>T=O(r),\; W=O(n)</math>

*Step 1:<math>T=O(r),\;W=O(r)</math> per subarray.

* Step 2: <math>r</math> computations each <math>T=O(\log(n/r)),\; W=O(n/r)</math>

**Total: <math>T=O(r),\; W=O(n)</math>

** Total <math>T=O(\log n),\; W=O(n)</math>

*Step 2: <math>r</math> computations each <math>T=O(\log(n/r)),\; W=O(n/r)</math>

* Step 3: <math>T=O(\log r),\; W=O(r)</math>

**Total <math>T=O(\log n),\; W=O(n)</math>

* Step 4: <math>T=O(1)\; W=O(n)</math>

*Step 3: <math>T=O(\log r),\; W=O(r)</math>

* Total: <math>T=O(r + \log n),\; W=O(n)</math>

*Step 4: <math>T=O(1)\; W=O(n)</math>

*Total: <math>T=O(r + \log n),\; W=O(n)</math>

;Notes

* Running time is not poly-log

*Running time is not poly-log

===Theorems===

* The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work

*The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work

* The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>

*The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>

Radix sort using the basic integer sort (BIS) algorithm.

If your range is 0 to n and and your radix is <math>\sqrt{n}</math> then you will need <math>log_{\sqrt{n}}(r) = 2</math> rounds.

If your range is 0 to n and and your radix is <math>\sqrt{n}</math> then you will need <math>\log_{\sqrt{n}}(r) = 2</math> rounds.

==2-3 trees; Technique: Pipelining==

Line 236:

Line 267:

===2-3 tree===

A 2-3 tree is a rooted tree which has the properties:

* Each node has 2-3 ordered children

* For any internal node, every directed path to a leaf is the same length

*Each node has 2-3 ordered children

*For any internal node, every directed path to a leaf is the same length

Types of queries:

* ''search(a)''

* ''insert(a)''

*''search(a)''

* ''delete(a)''

*''insert(a)''

*''delete(a)''

====Serial Algorithms====

* Deletion: discard(a)

** Delete connection between a and parent of a

*Deletion: discard(a)

** If parent has 1 child

**Delete connection between a and parent of a

*** If parent's sibling has 2 children, move child to sibling and discard(parent)

**If parent has 1 child

*** If parent's sibling has 3 children, take one child from sibling

***If parent's sibling has 2 children, move child to sibling and discard(parent)

***If parent's sibling has 3 children, take one child from sibling

====Parallel Algorithms====

Assume concurrent read model

* Insert: Suppose we want to insert sorted elements <math>c_1,...,c_k</math> into a tree with n elements.

** Insert element <math>c_{k/2}</math>.

*Insert: Suppose we want to insert sorted elements <math>c_1,...,c_k</math> into a tree with n elements.

** Then insert in parallel <math>(c_1,...,c_{k/2-1})</math> and <math>(c_{k/2+1},...,c_k)</math>

**Insert element <math>c_{k/2}</math>.

** Time is <math>O(\log k \log n)</math> using <math>k</math> processors.

**Then insert in parallel <math>(c_1,...,c_{k/2-1})</math> and <math>(c_{k/2+1},...,c_k)</math>

* Deletion: Suppose we want to delete sorted elements <math>c_1,...,c_k</math>

**Time is <math>O(\log k \log n)</math> using <math>k</math> processors.

** Backwards Insertion

*Deletion: Suppose we want to delete sorted elements <math>c_1,...,c_k</math>

** for t=0 to <math>\log k</math>

**Backwards Insertion

*** if <math>i \equiv 2^t (\operatorname{mod}2^{t+1})</math>

**for t=0 to <math>\log k</math>

**** discard(c_i)

***if <math>i \equiv 2^t (\operatorname{mod}2^{t+1})</math>

** Time is <math>O(\log n \log k)</math> without pipelining

****discard(c_i)

** With pipelining, complexity is <math>O(\log n + \log k)</math>

**Time is <math>O(\log n \log k)</math> without pipelining

**With pipelining, complexity is <math>O(\log n + \log k)</math>

===Pipelining===

* There are <math>\log k</math> waves of the absorb procedure.

* Apply pipelining to make the time <math>O(\log n + \log k)</math>

*There are <math>\log k</math> waves of the absorb procedure.

*Apply pipelining to make the time <math>O(\log n + \log k)</math>

==Maximum Finding==

Line 272:

Line 309:

===Constant time, <math>O(n^2)</math> Work===

* Compare every pair of elements in A[1...n]

*Compare every pair of elements in A[1...n]

<pre>

for i=1 to n pardo

Line 287:

Line 325:

===<math>O(\log \log n)</math> time and <math>O(n\log \log n)</math> work algorithm===

* Split A into <math>\sqrt{n}</math> subarrays

* Find the max of each subarray recursively

*Split A into <math>\sqrt{n}</math> subarrays

* Find the max of all subarrays in <math>O(n)</math> using the constant time algo

*Find the max of each subarray recursively

*Find the max of all subarrays in <math>O(n)</math> using the constant time algo

Complexity

* <math>T(n) \leq T(\sqrt{n}) + c_1</math>

* <math>W(n) \leq \sqrt{n}W(\sqrt{n}) + c_2 n</math>

*<math>T(n) \leq T(\sqrt{n}) + c_1</math>

* <math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>

*<math>W(n) \leq \sqrt{n}W(\sqrt{n}) + c_2 n</math>

*<math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>

===<math>O(\log \log n)</math> time and <math>O(n)</math> work===

* Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.

*Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.

** Apply serial linear time algorithm to find maximum of each block

**Apply serial linear time algorithm to find maximum of each block

** <math>O(\log \log n)</math> time and <math>O(n)</math> work.

**<math>O(\log \log n)</math> time and <math>O(n)</math> work.

* Step 2: Apply doubly log algorithm to <math>n / \log \log n</math> maxima.

*Step 2: Apply doubly log algorithm to <math>n / \log \log n</math> maxima.

** <math>O(\log \log n)</math> time and <math>O(n)</math> work.

**<math>O(\log \log n)</math> time and <math>O(n)</math> work.

===Random Sampling===

Maximum finding in <math>O(1)</math> time and <math>O(n)</math> work with very high probability

* Step 1: Using aux array B of size <math>b^{7/8}</math>. Independently fill with random elements from A

*Step 1: Using aux array B of size <math>b^{7/8}</math>. Independently fill with random elements from A

* Step 2: Find the max <math>m</math> in array B in <math>O(1)</math> time and <math>O(n)</math> work

*Step 2: Find the max <math>m</math> in array B in <math>O(1)</math> time and <math>O(n)</math> work

** Last 3 pulses of the recursive doubly-log time algorithm:

**Last 3 pulses of the recursive doubly-log time algorithm:

** Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.

**Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.

** Pulse 2: <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each.

**Pulse 2: <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each.

** Pulse 2: Find the max m of <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.

**Pulse 2: Find the max m of <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.

* Step 3: While there is an element larger than m, throw the new element into an array of size <math>n^{7/8}</math>.

*Step 3: While there is an element larger than m, throw the new element into an array of size <math>n^{7/8}</math>.

: Compute the maximum of the new array.

:Compute the maximum of the new array.

;Complexity

* Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work

* Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work

*Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work

* Each time Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work

*Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work

* With high probability, we only need one iteration (see theorem below) so the time is <math>O(1)</math> and work is <math>O(n^{7/8})</math>

*Each time Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work

*With high probability, we only need one iteration (see theorem below) so the time is <math>O(1)</math> and work is <math>O(n^{7/8})</math>

====Theorem 8.2====

Line 327:

Line 369:

The probability of not finishing in the above time is <math>O(1/n^c)</math>.

{{ hidden | Proof |

{{hidden | Proof |

See page 58 of the classnotes.

Line 340:

Line 382:

;Inputs

* Vector of vertices which point to an index in edges

* Vector of edges e.g. (1,2)

*Vector of vertices which point to an index in edges

* Vector of pointers to identical edges e.g. index of (1,2) -> index of (2,1)

*Vector of edges e.g. (1,2)

* A vertex <math>r</math>

*Vector of pointers to identical edges e.g. index of (1,2) -> index of (2,1)

*A vertex <math>r</math>

;Goal:

* Select a direction for each edge to make our graph a tree with root <math>r</math>

*Select a direction for each edge to make our graph a tree with root <math>r</math>

;Algorithm

* Step 1: For every edge <math>(u,v)</math>, add two edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>

* Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge from <math>v</math>, <math>next(u \rightarrow v)</math>

*Step 1: For every edge <math>(u,v)</math>, add two edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>

** If our edge is 1 -> 2, then the next edge is the immediate next edge of edge 2->1 in our edges array

*Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge from <math>v</math>, <math>next(u \rightarrow v)</math>

** If edge 2 -> 1 is the last edge coming out of 2, then the next edge is the first edge coming out of 2

**If our edge is 1 -> 2, then the next edge is the immediate next edge of edge 2->1 in our edges array

** Now we have an euler cycle (or euler tour) among the edges

**If edge 2 -> 1 is the last edge coming out of 2, then the next edge is the first edge coming out of 2

* Step 3: <math>next(u_{r,1} \rightarrow r) = NIL </math>

**Now we have an euler cycle (or euler tour) among the edges

* Step 4: Apply a list ranking algorithm to the edges.

*Step 3: <math>next(u_{r,1} \rightarrow r) = NIL </math>

* Step 5: Delete the edge between every two vertices with the lower ranking

** The edge <math>u_{r,1} \rightarrow r</math> is the opposite edge of the first edge coming out of <math>r</math>

*Step 4: Apply a list ranking algorithm to the edges.

*Step 5: Delete the edge between every two vertices with the lower ranking

;Notes

* Every step except 4, list ranking, is a local operation which is constant time and linear work

* Requires a list ranking algorithm

*Every step except 4, list ranking, is a local operation which is constant time and linear work

*Requires a list ranking algorithm

====Preorder Numbering====

Define: tree-edges point away from root, non-tree edges point to root

* Step 1: For every edge pardo, if e is a tree edge, distance(e) = 1 else distance(e) = 0

*Step 1: For every edge pardo, if e is a tree edge, distance(e) = 1 else distance(e) = 0

* Step 2: List ranking

*Step 2: List ranking

* Step 3: For every edge e: u->v, preorder(v) = n-distance(e) + 1

*Step 3: For every edge e: u->v, preorder(v) = n-distance(e) + 1

===Technique: Pointer Jumping===

Line 389:

Line 436:

Correctness:

* Each element will eventually point to the cashier

* Distance will always be correct

*Each element will eventually point to the cashier

** If next[i] = NIL then distance is to the cashier

*Distance will always be correct

** Otherwise the distance is <math>2^k</math>

**If next[i] = NIL then distance is to the cashier

**Otherwise the distance is <math>2^k</math>

Complexity: <math>O(\log n)</math> time and <math>O(n\log n)</math> work

===Work-optimal List Ranking===

* From our list A, pull a subset S with uniform density and sparcity

* Remove S from A, creating a new subset <math>A-S</math>.

** For each element i in <math>S</math>, fix the next and distance of prev(i)

** Run compaction

* Recurse on <math>A-S</math> until size is <math>\leq n/\log n</math>. Then use pointer jumping.

* Add <math>S</math> back to <math>A-S</math>

; Complexity

*From our list A, pull a subset S with uniform density and sparcity

~~With~~ high probability:

*Remove S from A, creating a new subset <math>A-S</math>.

* <math>O(log (n) log log (n)) time</math>

**For each element i in <math>S</math>, fix the next and distance of prev(i)

* <math>O(n)</math> work

**Run compaction

*Recurse on <math>A-S</math> until size is <math>\leq n/\log n</math>. Then use pointer jumping.

*Add <math>S</math> back to <math>A-S</math>

;Complexity

We get the same complexity for randomized and deterministic list ranking (see below)

Randomized has the complexity with high probability.

*<math>O(\log (n) \log \log (n))</math> time

*<math>O(n)</math> work

Using a parallel prefix-sum algorith which runs in <math>O(\log n/\log \log n)</math> time,

we can get list ranking in <math>O(\log n)</math> time and <math>O(n)</math> work.

===Randomized Symmetry Breaking===

Line 418:

Line 472:

s(i) = 1

</pre>

;Notes

* The full randomized list ranking works in <math>O(\log n \log\log n)</math> time and <math>O(n)</math> work with high probability

===Deterministic Symmetry Breaking===

* Assign each element a unique tag, such as the index in the array

* Convert the tag to binary. For each element i, tag(i)!=tag(next(i)).

*Assign each element a unique tag, such as the index in the array

* Let k be the index of the rightmost bit (where 0 is the rightmost bit) differing between tag(i) and tag(next(i))

*Convert the tag to binary. For each element i, tag(i)!=tag(next(i)).

* Let b be the bit in tag(i) differing from tag(next(i))

*Let k be the index of the rightmost bit (where 0 is the rightmost bit) differing between tag(i) and tag(next(i))

* Set the new tag to <code>(k<<1) + b</code>

*Let b be the bit in tag(i) differing from tag(next(i))

*Set the new tag to <code>(k<<1) + b</code>

This algorithm takes <math>O(1)</math> time and <math>O(n)</math> work per iteration.

Line 433:

Line 491:

===r-ruling set===

An r-ruling set is a subset <math>S</math> of a linked list satisfying two properties:

* ''Uniform Density'' If node <math>i</math> is not in S, then one of the <math>r</math> nodes following i will be in <math>S</math>

* ''Uniform Sparcity'' If node <math>i</math> is in <math>S</math> then node <math>next(i)</math> is not in <math>S</math>

;Algorithm

*''Uniform Density'' If node <math>i</math> is not in S, then one of the <math>r</math> nodes following i will be in <math>S</math>

*''Uniform Sparcity'' If node <math>i</math> is in <math>S</math> then node <math>next(i)</math> is not in <math>S</math>

;Algorithm 1

The following algorithm gives us an r-ruling set with <math>r=2*\log n - 1</math>

# Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,...,2*\log n - 1]</math>

# For each element i in parallel, if i is a local maximum then add it to <math>S</math>

#Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,...,2*\log n - 1]</math>

#For each element i in parallel, if i is a local maximum then add it to <math>S</math>

;Optimal-Work 2-Ruling set

<pre>

Apply one iteration of deterministic coin tossing

Sort elements by tags

Initialize array S with 1s

for k = 0 to 2*log(n)-1:

for i such that tag(i) = k pardo

if S(pre(i)) = S(next(i)) = 1:

S(i) = 0

</pre>

==Tree Contraction==

Line 453:

Line 526:

===Parallel Rake===

Rake all leaves such that:

* No two leaves have the same parent (stem)

* The parent of a stem cannot be the stem of another leaf

*No two leaves have the same parent (stem)

*The parent of a stem cannot be the stem of another leaf

Line 460:

Line 534:

;Parallel Rake Contraction Scheme

Apply legal parallel rakes until the tree becomes a 3-node binary tree

;Parallel Rake Contraction Algorithm

Step 1: Number ~~nodes~~ according to a DFS using the Euler tour algorithm.

;Step 1

*Number leaves according to a DFS using the Euler tour algorithm.

Observation 3:

Line 472:

Line 549:

Proof: See the classnotes

Step 2:

;Step 2:

* Let <math>S_1</math> be nodes in <math>S</math> whose parents are not in <math>S</math>.

; Let <math>L_1</math> be leaves in <math>L</math> whose parents are in <math>S_1</math>.

* Let <math>L_1</math> be leaves in <math>L</math> whose parents are in <math>S_1</math>.

* Apply parallel rakes to leaves in <math>L_1</math>, unless the parent is the root.

* Apply parallel rakes to leaves in <math>L - L_1</math>, unless the parent is the root.

Line 491:

Line 569:

The internal node will be <math>(a*l+b) \times (c*r+d)</math> where <math>\times</math> is either addition or multiplication.

If we delete one leaf and it's parent (stem), we will modify these variables of the stem's parent.

===Tree Binarization===

;Note: This is not in the classnotes, but has been covered in exams.

To apply the above techniques to general trees or k-ary trees, you should binarize your k-ary tree.

This is done as follows:

* Suppose node <math>x</math> has <math>n > 2</math> children

* Select one child to keep

* Create a node <math>x'</math>

* Move all <math>n-1</math> children to <math>x'</math>

* Set child <math>x'</math> to be a child of <math>x</math>

Complexity

* The time is <math>O(max\_degree)</math> and the work is <math>O(n)</math>

==Graph Connectivity==

Line 501:

Line 593:

Definitions:

* Pointer Graph

* Supervertices

*Pointer Graph

* The supervertex graph

*Supervertices

* Hookings

*The supervertex graph

* Parallel pointer jumping

*Hookings

*Parallel pointer jumping

===A first connectivity algorithm===

The first connectivity algorithm consists of <math>\log(n)</math> iterations of the following two steps:

* Hookings - Each root hooks itself the the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.

* Parallel pointer jumping - Each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.

*Hookings - Each root hooks itself the the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.

*Parallel pointer jumping - Each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.

;Theorems

# The pointer graph always consists of rooted trees.

# After <math>\log n</math> iterations, all vertices are contained in rooted stars that quit.

#The pointer graph always consists of rooted trees.

#: Each connected component is represented by a single rooted star.

#After <math>\log n</math> iterations, all vertices are contained in rooted stars that quit.

#:Each connected component is represented by a single rooted star.

;Complexity

* We need <math>O(\log n)</math> iterations

** Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work

*We need <math>O(\log n)</math> iterations

** Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work

**Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work

* For adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work since we have a processor per <math>n^2</math> possible edges.

**Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work

* For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work since we have a processor per edge.

*For adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work since we have a processor per <math>n^2</math> possible edges.

*For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work since we have a processor per edge.

===A second connectivity algorithm===

Line 530:

Line 626:

Each iteration takes constant time.

# Probe quitting: each rooted star whose supervertex is not adjacent to any other quits

#Probe quitting: each rooted star whose supervertex is not adjacent to any other quits

# Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf

#Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf

#* Note that this is not unique. Requires arbitrary CRCW.

#*Note that this is not unique. Requires arbitrary CRCW.

# Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks to another vertex in another supervertex which is not a leaf

#Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks to another vertex in another supervertex which is not a leaf

# Parallel pointer jumping: one round of parallel pointer jumping

#Parallel pointer jumping: one round of parallel pointer jumping

;Theorems

*The pointer graph always consists of rooted trees.

*For every vertex which is not a leaf, <math>D(v) \leq v</math>

{{hidden | Proof |

* This is true after steps 1, 2, and 4.

** This follows immediately for 1 and 2.

** After one round of pointer jumping (4) after 3, everyone from supervertex <math>r</math> is a leaf due to the claim below.

* Consider a root <math>r</math> which hooks on someone larger in step 3.

*: After Step 3, all children of <math>r</math> are leaves.

*: I.e. no one else will hook to <math>r</math> during Step 3.

** If anyone was larger, <math>r</math> would have hooked on them in Step 2.

** If anyone was smaller, they would have hooked on <math>r</math> in Step 2.

}}

;Complexity

*Time <math>O(\log n)</math>

*Work <math>O((n+m)\log n)</math>

;Notes

*Without step 3, the algorithm will run in <math>O(n)</math> time instead of <math>O(\log n)</math> but the result will be valid.

*In step 2, you can hook rooted trees in addition to rooted stars for better empirical performance

{{hidden | Proof |

Let <math>h(T)</math> be the height of tree T where a two-layer tree has height 1.

Define a single node to have height 1.

Let <math>H</math> be the sum of heights of all trees and <math>\bar{H}</math> be the total height after one iteration.

* Claim: <math>\bar{H} \leq 2*H/3</math>

}}

===Minimum Spanning Forest===

====The MSF Theorem====

Let <math>G(V,E)</math> be a weighted graph and let <math>U</math> and <math>V-U</math> be two non-empty subsets of <math>V</math>.

Consider set <math>H=\{(u,v) \mid u \in U, v \in V\}</math> of edges connecting points in <math>U</math> and <math>V</math>.

If <math>e</math> is the edge of minimum weight in <math>H</math>, then <math>e</math> is in the MSF of <math>G</math>

====A first MSF algorithm====

* Sort the edges by weight. Assign a processor to each edge. Using a priority CRCW, the edge with minimum weight will take priority during writes.

Each round consists of hooking and <math>\log n</math> rounds of parallel pointer jumping

* Each root find the minimum weight edge to another supervertex and hooks itself to that root.

*: This is automatic from the priority CRCW.

====A second MSF algorithm====

We replace steps 2 and 3 in the connectivity algorithm with the simple rule:

* We hook rooted stars using the edge with the smallest weight.

;Notes

* The progression of edge weights up the tree are monotonically increasing so there are no cycles

===Biconnectivity===

This section is not it the textbook. It is from Tarjan and Vishkin (1985)<ref name="tarjan1985biconnectivity">Robert E. Tarjan, and Uzi Vishkin. ''An Efficient Parallel Biconnectivity Algorithm'' (1985). SIAM Journal on Computing. DOI: [https://doi.org/10.1137/0214061 10.1137/0214061] [https://epubs.siam.org/doi/10.1137/0214061 https://epubs.siam.org/doi/10.1137/0214061] [https://www.researchgate.net/publication/220617428_An_Efficient_Parallel_Biconnectivity_Algorithm Mirror]</ref>.

A graph is biconnected if for every two vertices <math>v_1</math> and <math>v_2</math>, there is a simple cycle containing <math>v_1</math> and <math>v_2</math>.

Intuitively this means that for any two vertices, there are at least two paths from <math>v_1</math> to <math>v_2</math>. If any vertex and its edges are removed from a biconnected graph, it still remains connected.

Every connected graph consists of biconnected components. These are sets of edges such that every two edges from a set lie on a simple cycle.

Vertices which connect two biconnected components are called ''articulation points''.

If these points are removed, the two biconnected components are no longer connected.

;Algorithm

Assume we are given a connected graph <math>G</math>.

# First build a spanning tree <math>T</math>. Record which edges are in the spanning tree. Root the spanning tree.

# Compute the preorder number of all edges using an euler tour.

# For every vertex <math>v</math>, calculate the hi and low preorder numbers in the subtree <math>T(v)</math>

# Create the auxiliary graph <math>G'' = (V'', E'')</math> as follows:

#* All edges of <math>G</math> are vertices in <math>G''</math>

#* For each edge <math>(v, w)</math> in <math>G - T</math>, add edge <math> ((p(v),v),(p(w),w))</math> to <math>G''</math> iff <math>v</math> and <math>w</math> are unrelated in <math>T</math>

#* For each edge <math>(v = p(w), w)</math> in <math>T</math>, add edge <math>((p(v), v), (v, w))</math> to <math>G''</math> iff <math>low(w) < v</math> or <math>high(w) \geq v + size(v)</math>

# Compute the connected components of <math>G''</math>

# Assign edges based on their connected components.

;Complexity

* <math>O(\log n)</math> time

==Tricks==

===Removing duplicates in an array===

Assume you have an Arbitrary CRCW

# Given array A of size n with entries 0 to n-1

# For each entry pardo

#* Write B[A[i]] == i

#* Only one write will succeed for each unique A[i]

#* Check if B[A[i]] == i

===Compaction of an <math>n^2</math> array with <math>\leq n</math> elements in <math>O(\log n)</math> time===

* Split our <math>n^2</math> array to n subarrays, each of size <math>n</math>

* Run <math>n</math> parallel compactions on each subarray

* Perform one compaction on the size of each subarray

* Compact all elements using (total size of prev subarrays + position in current subarray)

The benefit of this segmented compaction/prefix sum is that if you only do each segment once in an iterative algorithm such as parallel BFS,

you can get <math>O(n^2)</math> or <math>O(m)</math> work.

==XMT Programming Tips==

I never figured out how to use standard c headers.

* XMT only has int (32 bit) and float (32 bit). If you need a bool type, you will need to define it in a header.

* You cannot call functions from within <code>spawn</code>

Here is my helpers header <code>helpers.h</code>

{{ hidden | <code>helpers.h</code> |

#ifndef HELPERS_H

#define HELPERS_H

#define true 1

#define false 0

#define bool int

// #define NIL -1

#define max(a, b) ((((a) < (b)) ? (b) : (a)))

#define min(a, b) ((((a) < (b)) ? (a) : (b)))

#endif

</syntaxhighlight>

}}

If you prefer to debug in C++ like me, you can use the following code to adapt running the XMT code:

This code does not run things asynchronously.

{{hidden | C++ Code |

#define spawn(x, y) for (int $ = x; $ <= y; ++$)

//#define spawn(x, y) for (int $ = y; $ >= x; --$)

#define ps(a, b) \

{ \

b += a; \

a = b - a; \

}

#define psm(a, b) \

{ \

b += a; \

a = b - a; \

}

#define psBaseReg int

#define STRINGIFY2(X) #X

#define STRINGIFY(X) STRINGIFY2(X)

using namespace std;

#define GRAPH_NUM graph0

#define N 19

#define M 36

vector<int> vertices;

vector<int> degrees;

vector<vector<int>> edges;

int main() {

vertices.resize(N);

degrees.resize(N);

edges = std::vector<std::vector<int>>(M, std::vector<int>(2));

string datapath = "data/" STRINGIFY(GRAPH_NUM) "/graph.txt";

string data = readFile(datapath);

regex section_regex(

"[\\n\\r][\\n\\r]+|[\\r\\n]0\\s[\\r\\n]+",

std::regex_constants::ECMAScript | std::regex_constants::icase);

regex space_regex("[\\n\\s\\r]+", std::regex_constants::ECMAScript |

std::regex_constants::icase);

vector<string> sections = split(data, section_regex);

// for (int i = 0; i < sections.size(); i++) {

// cout << "Section " << i << " " << sections[i].size() << endl;

// cout << sections[i] << endl;

//}

return 0;

}

string readFile(const string& datapath) {

std::ifstream in(datapath, std::ios::in);

string data;

if (in.good()) {

std::string contents;

in.seekg(0, std::ios::end);

contents.resize(static_cast<unsigned int>(in.tellg()));

in.seekg(0, std::ios::beg);

in.read(&contents[0], contents.size());

return contents;

}

std::cerr << "Failed to open file: " << datapath << std::endl;

throw(errno);

}

vector<string> split(string s, regex r) {

vector<string> splits;

smatch m; // <-- need a match object

while (regex_search(s, m, r)) // <-- use it here to get the match

{

int split_on = m.position(); // <-- use the match position

splits.push_back(s.substr(0, split_on));

s = s.substr(split_on + m.length()); // <-- also, skip the whole match

}

if (!s.empty()) {

splits.push_back(s); // and there may be one last token at the end

}

return splits;

}

</syntaxhighlight>

}}

==Hardware/Architecture==

Line 548:

Line 850:

[[File:Pram hardware diagram.jpg | 400px]]

* MTCU - Master Thread Control Unit - runs in serial mode

*MTCU - Master Thread Control Unit - runs in serial mode

* TCU Cluster - Thread Control Unit

*TCU Cluster - Thread Control Unit

;Procedure

* Spawn broadcasts your spawn code in the block to every thread control unit (TCU).

* Data is shared through the interconnection network

*Spawn broadcasts your spawn code in the block to every thread control unit (TCU).

*Data is shared through the interconnection network

;Performance Penalty

<math>\text{Execution Depth} = \text{Computation Depth} + LSRTM \times R + QD</math>

* <math>LSTRM</math> - length of sequence of round trips to memory

* <math>R</math> - time for round trip to memory

*<math>LSTRM</math> - length of sequence of round trips to memory

* <math>QD</math> - queuing delay

*<math>R</math> - time for round trip to memory

** If multiple threads access the same memory module, they are queued at the memory module

*<math>QD</math> - queuing delay

**If multiple threads access the same memory module, they are queued at the memory module

==Resources==

* [https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]

*[https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]

===Textbooks===

* [http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]

* [https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569?sa-no-redirect=1&pldnSite=1 Introduction to Parallel Algorithms (Joseph Jaja, textbook)]

==~~Ignore==~~

*[http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]

[~~[Visible to:~~:~~users]~~]

*[https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569?sa-no-redirect=1&pldnSite=1 Introduction to Parallel Algorithms (Joseph Jaja, textbook)]

*[https://www.cs.cmu.edu/~guyb/papers/BM04.pdf Parallel Algorithms by Guy E. Blelloch and Bruce M. Maggs (CMU)]

==References==

@@ Line 1: / Line 1: @@
-Parallel Algorithms notes from CMSC751 with Uzi Vishkin.<br>
+Parallel Algorithms notes from CMSC751 with [http://users.umiacs.umd.edu/~vishkin/index.shtml Uzi Vishkin].<br>
-This class is based on his book [[:File:CMSC751_Classnotes.pdf | Thinking in Parallel Some Basic Data-Parallel Algorithms and Techniques]]
+This class is based on his book [[:File:CMSC751_Classnotes.pdf | Thinking in Parallel Some Basic Data-Parallel Algorithms and Techniques]]<br>
+[http://users.umiacs.umd.edu/~vishkin/TEACHING/ Class Website]<br>
 ==XMT Language==
 XMTC is a single-program multiple-data (SPMD) extension of C.<br>
-* Spawn creates threads
-* Threads expire at Join
+*Spawn creates threads
-* <code>$</code> represents the number of the thread
+*Threads expire at Join
-* <code>PS Ri Rj</code> is an atomic prefix sum
+*<code>$</code> represents the number of the thread
-** Stores Ri + Rj in Ri
+*<code>PS Ri Rj</code> is an atomic prefix sum
-** Stores the original value of Ri in Rj
+**Stores Ri + Rj in Ri
-<pre>
+**Stores the original value of Ri in Rj
+<syntaxhighlight lang="c">
 int x = 0;
@@ Line 21: / Line 23: @@
    }
 }
-</pre>
+</syntaxhighlight>
 ==Models==
@@ Line 32: / Line 34: @@
 ====Types of PRAM====
-* exclusive-read exclusive-write (EREW)
-* concurrent-read exclusive-write (CREW)
+*exclusive-read exclusive-write (EREW)
-* concurrent-read concurrent-write (CRCW)
+*concurrent-read exclusive-write (CREW)
-** Arbitrary CRCW - an arbitrary processor writing succeeds
+*concurrent-read concurrent-write (CRCW)
-** Priority CRCW - the lowest numbered processor writing succeeds
+**Arbitrary CRCW - an arbitrary processor writing succeeds
-** Common CRCW - writing succeeds only if all processors write the same value
+**Priority CRCW - the lowest numbered processor writing succeeds
+**Common CRCW - writing succeeds only if all processors write the same value
 ====Drawbacks====
-* Does not reveal how the algorithm will run on PRAMs with different number of proc
-* Fully specifying allocation requires an unnecessary level of detail
+*Does not reveal how the algorithm will run on PRAMs with different number of proc
+*Fully specifying allocation requires an unnecessary level of detail
 ===Work Depth===
@@ Line 51: / Line 55: @@
 ;Notes
-* Other resources call this Brent's theorem
-* <math>x</math> is the work and <math>d</math> is the depth
+*Other resources call this Brent's theorem
+*<math>x</math> is the work and <math>d</math> is the depth
 ===Speedup===
-* A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n).
-* A work-optimal parallel algorithm is ''work-time-optimal'' if T(n) cannot be improved by another work-optimal algorithm.
+*A parallel algorithm is ''work-optimal'' if W(n) grows asymptotically the same as T(n).
-* If T(n) is best known and W(n) matches it, this is called ''linear-speedup''
+*A work-optimal parallel algorithm is ''work-time-optimal'' if T(n) cannot be improved by another work-optimal algorithm.
+*If T(n) is best known and W(n) matches it, this is called ''linear-speedup''
 ====NC Theory====
 Nick's Class Theory
-* Good serial algorithms: Poly time
-* Good parallel algorithm: poly-log <math>O(\log ^c n)</math> time, poly processors
+*Good serial algorithms: Poly time
+*Good parallel algorithm: poly-log <math>O(\log ^c n)</math> time, poly processors
 ==Technique: Balanced Binary Trees==
 Example Applications:
-* Prefix sum
-* Largest element
+*Prefix sum
-* Nearest-one element
+*Largest element
-* Compaction
+*Nearest-one element
+*Compaction
 ;Inputs
-* Array A[1..n] of elements
-* Associative binary operation, denoted * (e.g. addition, multiplication, min, max)
+*Array A[1..n] of elements
+*Associative binary operation, denoted * (e.g. addition, multiplication, min, max)
 ;Outputs
-* Array B containing B(i) = A(1) * ... * A(i)
+*Array B containing B(i) = A(1) * ... * A(i)
 ;Algorithm
 <math>O(n)</math> work and <math>O(\log n)</math> time
 <pre>
@@ Line 100: / Line 111: @@
 ==Technique: Merge Sorting Cluster==
 ===Technique: Partitioning===
-* Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B
-* Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>
+*Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B
-** Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.
+*Ranking: Given an element x, find <math>Rank(x, A) = i</math> defined such that <math>A(i) \leq x \leq A(i+1)</math>
+**Note the book sometimes uses <math>Rank(i,B)</math> to denote <math>Rank(A(i), B)</math>.
 ;Equivalence of Merging and Ranking
-* Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i
-* Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i
+*Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i
+*Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i
 ;Naive algorithm
 Apply binary search element wise on concurrent read system.
 O(log n) time, O(nlog n) work
@@ Line 118: / Line 132: @@
 ;Partitioning paradigm
 Input size: n
-* Partition the input into p jobs of size approx. n/p
-* Do small jobs concurrently using a separate (possibly serial) algo for each.
+*Partition the input into p jobs of size approx. n/p
+*Do small jobs concurrently using a separate (possibly serial) algo for each.
 Example:
 Total work O(n), total time O(log n)
@@ Line 137: / Line 154: @@
 ;Notes
-* We do not need to compute A_end or B_end. Just continue merging until we hit some index where (i % x)=0 in A or B.
+*We do not need to compute A_end or B_end. Just continue merging until we hit some index where (i % x)=0 in A or B.
 ===Techinque: Divide and Conquer===
 ;Merge Sort
 Input: Array A[1..n]<br>
 Complexity:<br>
@@ Line 160: / Line 179: @@
 ===Technique: Accelerating Cascades===
 Consider: for problem of size n, there are two parallel algorithms.<br>
-* Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time
-* Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.
+*Algo A: <math>W_1(n)</math> work, <math>T_1(n)</math> time
+*Algo B: Work <math>W_2(n) > W_1(n)</math>. Time <math>T_2(n) < T_1(n)</math>. Faster but less efficient.
 Assume Algo A is a "reducing algorithm"<br>
 We start with Algo A until the size is below some threshold. Then we switch to Algo B.
 E.g.
-* Algo 1 runs in <math>O(\log n)</math> iterations each taking <math>O(\log n)</math> time and <math>O(n)</math> work. Each iteration reduces the size to (3/4)m.
-* Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.
+*Algo 1 runs in <math>O(\log n)</math> iterations each taking <math>O(\log n)</math> time and <math>O(n)</math> work. Each iteration reduces the size to (3/4)m.
-* Run Algo 1 for <math>O(\log \log n)</math> rounds then run algo 2.
+*Algo 2 runs in <math>O(\log n)</math> time and <math>O(n\log n)</math> work.
-** Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>
+*Run Algo 1 for <math>O(\log \log n)</math> rounds then run algo 2.
+**Total time is <math>O(\log n \log \log n)</math> and work is <math>O(n)</math>
 ===Problem: Selection===
 ;Algorithm 1
-* Partition elements into rows of <math>\log n</math> size
-* For each row, find the median within the row
+*Partition elements into rows of <math>\log n</math> size
-* Find the median of medians (MoM) in <math>O(n)</math>
+*For each row, find the median within the row
-* Put all rows with median <= MoM above and all rows with median >= Mom below
+*Find the median of medians (MoM) in <math>O(n)</math>
-* Now <math>m/4</math> elements are smaller and <math>m/4</math> are larger
+*Put all rows with median <= MoM above and all rows with median >= Mom below
-** Known as the reducing lemma
+*Now <math>m/4</math> elements are smaller and <math>m/4</math> are larger
+**Known as the reducing lemma
 This algorithm solves the selection problem in <math>O(\log^2 n)</math> time and <math>O(n)</math> work.
-; Accelerating Cascades
+;Accelerating Cascades
-* What we have:
-** Algorithm 1 has <math>O(\log n)</math> iterations.
+*What we have:
-:: Each iteration reduces a size m instance in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>
+**Algorithm 1 has <math>O(\log n)</math> iterations.
-** Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.
+::Each iteration reduces a size m instance in <math>O(\log m)</math> time and <math>O(m)</math> work to an instance of size <math>\leq 3m/4</math>
-* Step 1: Use algo 1 to reduce from n to <math>n / \log n</math>
+**Algorithm 2 runs in <math>O(\log n)</math> time and <math>O(n \log n)</math> work.
-* Step 2: Apply algorithm 2
+*Step 1: Use algo 1 to reduce from n to <math>n / \log n</math>
+*Step 2: Apply algorithm 2
+Total time is <math>O(\log n \log\log n)</math> and total work is <math>O(n)</math>
 ===Informal Work-Depth (IWD)===
@@ Line 202: / Line 230: @@
 ===Algorithm===
-* Partition A into n/r subarrays <math>B_1,...,B_{n/r}</math>
-** Sort each subarray separately using serial bucket sort (counting sort)
+*Partition A into n/r subarrays <math>B_1,...,B_{n/r}</math>
-** Compute number(v,s)  # of elements of value v in <math>B_s</math> for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>
+**Sort each subarray separately using serial bucket sort (counting sort)
-** Compute serial(i) = # of elements in the block <math>B_s</math> such that A(j)=A(i) and <math> j < i </math> <math>1 \leq j \neq n</math>
+**Compute number(v,s)  # of elements of value v in <math>B_s</math> for <math>0\leq v \leq r-1</math> and <math>1 \leq s \leq n/r</math>
-* Run prefix sum on number(v,1),number(v,2),...,number(v,n/r) into ps(v,1), ps(v,2),..., ps(v,n/r)
+**Compute serial(i) = # of elements in the block <math>B_s</math> such that A(j)=A(i) and <math> j < i </math> <math>1 \leq j \neq n</math>
-* Compute prefix sums of cardinality(0),.., cardinality(r-1) into global_ps(0),...,global_ps(r-1)
+*Run prefix sum on number(v,1),number(v,2),...,number(v,n/r) into ps(v,1), ps(v,2),..., ps(v,n/r)
-* The rank of element <math>i</math> is <math>1+serial(i)+ps(v,s-1)+global_ps(v-1)</math>
+*Compute prefix sums of cardinality(0),.., cardinality(r-1) into global_ps(0),...,global_ps(r-1)
+*The rank of element <math>i</math> is <math>1+serial(i)+ps(v,s-1)+global_ps(v-1)</math>
 ===Complexity===
-* Step 1:<math>T=O(r),\;W=O(r)</math> per subarray.
-** Total: <math>T=O(r),\; W=O(n)</math>
+*Step 1:<math>T=O(r),\;W=O(r)</math> per subarray.
-* Step 2: <math>r</math> computations each <math>T=O(\log(n/r)),\; W=O(n/r)</math>
+**Total: <math>T=O(r),\; W=O(n)</math>
-** Total <math>T=O(\log n),\; W=O(n)</math>
+*Step 2: <math>r</math> computations each <math>T=O(\log(n/r)),\; W=O(n/r)</math>
-* Step 3: <math>T=O(\log r),\; W=O(r)</math>
+**Total <math>T=O(\log n),\; W=O(n)</math>
-* Step 4: <math>T=O(1)\; W=O(n)</math>
+*Step 3: <math>T=O(\log r),\; W=O(r)</math>
-* Total: <math>T=O(r + \log n),\; W=O(n)</math>
+*Step 4: <math>T=O(1)\; W=O(n)</math>
+*Total: <math>T=O(r + \log n),\; W=O(n)</math>
 ;Notes
-* Running time is not poly-log
+*Running time is not poly-log
 ===Theorems===
-* The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work
+*The integer sorting algorithm runs in <math>O(r+\log n)</math> time and <math>O(n)</math> work
-* The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>
+*The integer sorting algorithm can be applied to run in time <math>O(k(r^{1/k}+\log n))</math> and work <math>O(kn)</math>
 Radix sort using the basic integer sort (BIS) algorithm.<br>
-If your range is 0 to n and and your radix is <math>\sqrt{n}</math> then you will need <math>log_{\sqrt{n}}(r) = 2</math> rounds.
+If your range is 0 to n and and your radix is <math>\sqrt{n}</math> then you will need <math>\log_{\sqrt{n}}(r) = 2</math> rounds.
 ==2-3 trees; Technique: Pipelining==
@@ Line 236: / Line 267: @@
 ===2-3 tree===
 A 2-3 tree is a rooted tree which has the properties:
-* Each node has 2-3 ordered children
-* For any internal node, every directed path to a leaf is the same length
+*Each node has 2-3 ordered children
+*For any internal node, every directed path to a leaf is the same length
 Types of queries:
-* ''search(a)''
-* ''insert(a)''
+*''search(a)''
-* ''delete(a)''
+*''insert(a)''
+*''delete(a)''
 ====Serial Algorithms====
-* Deletion: discard(a)
-** Delete connection between a and parent of a
+*Deletion: discard(a)
-** If parent has 1 child
+**Delete connection between a and parent of a
-*** If parent's sibling has 2 children, move child to sibling and discard(parent)
+**If parent has 1 child
-*** If parent's sibling has 3 children, take one child from sibling
+***If parent's sibling has 2 children, move child to sibling and discard(parent)
+***If parent's sibling has 3 children, take one child from sibling
 ====Parallel Algorithms====
 Assume concurrent read model
-* Insert: Suppose we want to insert sorted elements <math>c_1,...,c_k</math> into a tree with n elements.
-** Insert element <math>c_{k/2}</math>.
+*Insert: Suppose we want to insert sorted elements <math>c_1,...,c_k</math> into a tree with n elements.
-** Then insert in parallel <math>(c_1,...,c_{k/2-1})</math> and <math>(c_{k/2+1},...,c_k)</math>
+**Insert element <math>c_{k/2}</math>.
-** Time is <math>O(\log k \log n)</math> using <math>k</math> processors.
+**Then insert in parallel <math>(c_1,...,c_{k/2-1})</math> and <math>(c_{k/2+1},...,c_k)</math>
-* Deletion: Suppose we want to delete sorted elements <math>c_1,...,c_k</math>
+**Time is <math>O(\log k \log n)</math> using <math>k</math> processors.
-** Backwards Insertion
+*Deletion: Suppose we want to delete sorted elements <math>c_1,...,c_k</math>
-** for t=0 to <math>\log k</math>
+**Backwards Insertion
-*** if <math>i \equiv 2^t (\operatorname{mod}2^{t+1})</math>
+**for t=0 to <math>\log k</math>
-**** discard(c_i)
+***if <math>i \equiv 2^t (\operatorname{mod}2^{t+1})</math>
-** Time is <math>O(\log n \log k)</math> without pipelining
+****discard(c_i)
-** With pipelining, complexity is <math>O(\log n + \log k)</math>
+**Time is <math>O(\log n \log k)</math> without pipelining
+**With pipelining, complexity is <math>O(\log n + \log k)</math>
 ===Pipelining===
-* There are <math>\log k</math> waves of the absorb procedure.
-* Apply pipelining to make the time <math>O(\log n + \log k)</math>
+*There are <math>\log k</math> waves of the absorb procedure.
+*Apply pipelining to make the time <math>O(\log n + \log k)</math>
 ==Maximum Finding==
@@ Line 272: / Line 309: @@
 ===Constant time, <math>O(n^2)</math> Work===
-* Compare every pair of elements in A[1...n]
+*Compare every pair of elements in A[1...n]
 <pre>
 for i=1 to n pardo
@@ Line 287: / Line 325: @@
 ===<math>O(\log \log n)</math> time and <math>O(n\log \log n)</math> work algorithm===
-* Split A into <math>\sqrt{n}</math> subarrays
-* Find the max of each subarray recursively
+*Split A into <math>\sqrt{n}</math> subarrays
-* Find the max of all subarrays in <math>O(n)</math> using the constant time algo
+*Find the max of each subarray recursively
+*Find the max of all subarrays in <math>O(n)</math> using the constant time algo
 Complexity
-* <math>T(n) \leq T(\sqrt{n}) + c_1</math>
-* <math>W(n) \leq \sqrt{n}W(\sqrt{n}) + c_2 n</math>
+*<math>T(n) \leq T(\sqrt{n}) + c_1</math>
-* <math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>
+*<math>W(n) \leq \sqrt{n}W(\sqrt{n}) + c_2 n</math>
+*<math>T(n) = O(\log \log n)</math>, <math>W(n) = O(n\log \log n)</math>
 ===<math>O(\log \log n)</math> time and <math>O(n)</math> work===
-* Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.
+*Step 1: Partition into blocks of size <math>\log \log n</math>. Then we have <math>n/ \log \log n</math> blocks.
-** Apply serial linear time algorithm to find maximum of each block
+**Apply serial linear time algorithm to find maximum of each block
-** <math>O(\log \log n)</math> time and <math>O(n)</math> work.
+**<math>O(\log \log n)</math> time and <math>O(n)</math> work.
-* Step 2: Apply doubly log algorithm to <math>n / \log \log n</math> maxima.
+*Step 2: Apply doubly log algorithm to <math>n / \log \log n</math> maxima.
-** <math>O(\log \log n)</math> time and <math>O(n)</math> work.
+**<math>O(\log \log n)</math> time and <math>O(n)</math> work.
 ===Random Sampling===
 Maximum finding in <math>O(1)</math> time and <math>O(n)</math> work with very high probability
-* Step 1: Using aux array B of size <math>b^{7/8}</math>. Independently fill with random elements from A
+*Step 1: Using aux array B of size <math>b^{7/8}</math>. Independently fill with random elements from A
-* Step 2: Find the max <math>m</math> in array B in <math>O(1)</math> time and <math>O(n)</math> work
+*Step 2: Find the max <math>m</math> in array B in <math>O(1)</math> time and <math>O(n)</math> work
-** Last 3 pulses of the recursive doubly-log time algorithm:
+**Last 3 pulses of the recursive doubly-log time algorithm:
-** Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.
+**Pulse 1: B is partitioned into <math>n^{3/4}</math> blocks of size <math>n^{1/8}</math> each. Find the max of each block.
-** Pulse 2: <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each.
+**Pulse 2: <math>n^{3/4}</math> maxima are partitioned into <math>n^{1/2}</math> blocks of size <math>n^{1/4}</math> each.
-** Pulse 2: Find the max m of <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.
+**Pulse 2: Find the max m of <math>n^{1/2}</math> maxima in <math>O(1)</math> time and <math>O(n)</math> work.
-* Step 3: While there is an element larger than m, throw the new element into an array of size <math>n^{7/8}</math>.
+*Step 3: While there is an element larger than m, throw the new element into an array of size <math>n^{7/8}</math>.
-: Compute the maximum of the new array.
+:Compute the maximum of the new array.
 ;Complexity
-* Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work
-* Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work
+*Step 1 takes <math>O(1)</math> time and <math>O(n^{7/8})</math> work
-* Each time Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work
+*Step 2 takes <math>O(1)</math> time and <math>O(n)</math> work
-* With high probability, we only need one iteration (see theorem below) so the time is <math>O(1)</math> and work is <math>O(n^{7/8})</math>
+*Each time Step 3 takes <math>O(1)</math> time and <math>O(n)</math> work
+*With high probability, we only need one iteration (see theorem below) so the time is <math>O(1)</math> and work is <math>O(n^{7/8})</math>
 ====Theorem 8.2====
@@ Line 327: / Line 369: @@
 The probability of not finishing in the above time is <math>O(1/n^c)</math>.
-{{ hidden | Proof |
+{{hidden | Proof |
 See page 58 of the classnotes.<br>
@@ Line 340: / Line 382: @@
 ;Inputs
-* Vector of vertices which point to an index in edges
-* Vector of edges e.g. (1,2)
+*Vector of vertices which point to an index in edges
-* Vector of pointers to identical edges e.g. index of (1,2) -> index of (2,1)
+*Vector of edges e.g. (1,2)
-* A vertex <math>r</math>
+*Vector of pointers to identical edges e.g. index of (1,2) -> index of (2,1)
+*A vertex <math>r</math>
 ;Goal:
-* Select a direction for each edge to make our graph a tree with root <math>r</math>
+*Select a direction for each edge to make our graph a tree with root <math>r</math>
 ;Algorithm
-* Step 1: For every edge <math>(u,v)</math>, add two edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>
-* Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge from <math>v</math>, <math>next(u \rightarrow v)</math>
+*Step 1: For every edge <math>(u,v)</math>, add two edges <math>u \rightarrow v</math> and <math>v \rightarrow u</math>
-** If our edge is 1 -> 2, then the next edge is the immediate next edge of edge 2->1 in our edges array
+*Step 2: For every directed edge <math>u \rightarrow v</math>, set a pointer to another directed edge from <math>v</math>, <math>next(u \rightarrow v)</math>
-** If edge 2 -> 1 is the last edge coming out of 2, then the next edge is the first edge coming out of 2
+**If our edge is 1 -> 2, then the next edge is the immediate next edge of edge 2->1 in our edges array
-** Now we have an euler cycle (or euler tour) among the edges
+**If edge 2 -> 1 is the last edge coming out of 2, then the next edge is the first edge coming out of 2
-* Step 3: <math>next(u_{r,1} \rightarrow r) = NIL </math>
+**Now we have an euler cycle (or euler tour) among the edges
-* Step 4: Apply a list ranking algorithm to the edges.
+*Step 3: <math>next(u_{r,1} \rightarrow r) = NIL </math>
-* Step 5: Delete the edge between every two vertices with the lower ranking
+** The edge <math>u_{r,1} \rightarrow r</math> is the opposite edge of the first edge coming out of <math>r</math>
+*Step 4: Apply a list ranking algorithm to the edges.
+*Step 5: Delete the edge between every two vertices with the lower ranking
 ;Notes
-* Every step except 4, list ranking, is a local operation which is constant time and linear work
-* Requires a list ranking algorithm
+*Every step except 4, list ranking, is a local operation which is constant time and linear work
+*Requires a list ranking algorithm
 ====Preorder Numbering====
 Define: tree-edges point away from root, non-tree edges point to root
-* Step 1: For every edge pardo, if e is a tree edge, distance(e) = 1 else distance(e) = 0
+*Step 1: For every edge pardo, if e is a tree edge, distance(e) = 1 else distance(e) = 0
-* Step 2: List ranking
+*Step 2: List ranking
-* Step 3: For every edge e: u->v, preorder(v) = n-distance(e) + 1
+*Step 3: For every edge e: u->v, preorder(v) = n-distance(e) + 1
 ===Technique: Pointer Jumping===
@@ Line 389: / Line 436: @@
 Correctness: <br>
-* Each element will eventually point to the cashier
-* Distance will always be correct
+*Each element will eventually point to the cashier
-** If next[i] = NIL then distance is to the cashier
+*Distance will always be correct
-** Otherwise the distance is <math>2^k</math>
+**If next[i] = NIL then distance is to the cashier
+**Otherwise the distance is <math>2^k</math>
 Complexity: <math>O(\log n)</math> time and <math>O(n\log n)</math> work
 ===Work-optimal List Ranking===
-* From our list A, pull a subset S with uniform density and sparcity
-* Remove S from A, creating a new subset <math>A-S</math>.
-** For each element i in <math>S</math>, fix the next and distance of prev(i)
-** Run compaction
-* Recurse on <math>A-S</math> until size is <math>\leq n/\log n</math>. Then use pointer jumping.
-* Add <math>S</math> back to <math>A-S</math>
-; Complexity
+*From our list A, pull a subset S with uniform density and sparcity
-With high probability:
+*Remove S from A, creating a new subset <math>A-S</math>.
-* <math>O(log (n) log log (n)) time</math>
+**For each element i in <math>S</math>, fix the next and distance of prev(i)
-* <math>O(n)</math> work
+**Run compaction
+*Recurse on <math>A-S</math> until size is <math>\leq n/\log n</math>. Then use pointer jumping.
+*Add <math>S</math> back to <math>A-S</math>
+;Complexity
+We get the same complexity for randomized and deterministic list ranking (see below)<br>
+Randomized has the complexity with high probability.
+*<math>O(\log (n) \log \log (n))</math> time
+*<math>O(n)</math> work
+Using a parallel prefix-sum algorith which runs in <math>O(\log n/\log \log n)</math> time,
+we can get list ranking in <math>O(\log n)</math> time and <math>O(n)</math> work.
 ===Randomized Symmetry Breaking===
@@ Line 418: / Line 472: @@
      s(i) = 1
 </pre>
+;Notes
+* The full randomized list ranking works in <math>O(\log n \log\log n)</math> time and <math>O(n)</math> work with high probability
 ===Deterministic Symmetry Breaking===
-* Assign each element a unique tag, such as the index in the array
-* Convert the tag to binary. For each element i, tag(i)!=tag(next(i)).
+*Assign each element a unique tag, such as the index in the array
-* Let k be the index of the rightmost bit (where 0 is the rightmost bit) differing between tag(i) and tag(next(i))
+*Convert the tag to binary. For each element i, tag(i)!=tag(next(i)).
-* Let b be the bit in tag(i) differing from tag(next(i))
+*Let k be the index of the rightmost bit (where 0 is the rightmost bit) differing between tag(i) and tag(next(i))
-* Set the new tag to <code>(k<<1) + b</code>
+*Let b be the bit in tag(i) differing from tag(next(i))
+*Set the new tag to <code>(k<<1) + b</code>
 This algorithm takes <math>O(1)</math> time and <math>O(n)</math> work per iteration.
@@ Line 433: / Line 491: @@
 ===r-ruling set===
 An r-ruling set is a subset <math>S</math> of a linked list satisfying two properties:
-* ''Uniform Density'' If node <math>i</math> is not in S, then one of the <math>r</math> nodes following i will be in <math>S</math>
-* ''Uniform Sparcity'' If node <math>i</math> is in <math>S</math> then node <math>next(i)</math> is not in <math>S</math>
-;Algorithm
+*''Uniform Density'' If node <math>i</math> is not in S, then one of the <math>r</math> nodes following i will be in <math>S</math>
+*''Uniform Sparcity'' If node <math>i</math> is in <math>S</math> then node <math>next(i)</math> is not in <math>S</math>
+;Algorithm 1
 The following algorithm gives us an r-ruling set with <math>r=2*\log n - 1</math>
-# Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,...,2*\log n - 1]</math>
-# For each element i in parallel, if i is a local maximum then add it to <math>S</math>
+#Apply the deterministic coin tossing algorithm to give each element a tag in <math>[0,...,2*\log n - 1]</math>
+#For each element i in parallel, if i is a local maximum then add it to <math>S</math>
+;Optimal-Work 2-Ruling set
+<pre>
+Apply one iteration of deterministic coin tossing
+Sort elements by tags
+Initialize array S with 1s
+for k = 0 to 2*log(n)-1:
+  for i such that tag(i) = k pardo
+    if S(pre(i)) = S(next(i)) = 1:
+      S(i) = 0
+</pre>
 ==Tree Contraction==
@@ Line 453: / Line 526: @@
 ===Parallel Rake===
 Rake all leaves such that:
-* No two leaves have the same parent (stem)
-* The parent of a stem cannot be the stem of another leaf
+*No two leaves have the same parent (stem)
+*The parent of a stem cannot be the stem of another leaf
@@ Line 460: / Line 534: @@
 ;Parallel Rake Contraction Scheme
 Apply legal parallel rakes until the tree becomes a 3-node binary tree
 ;Parallel Rake Contraction Algorithm
-Step 1: Number nodes according to a DFS using the Euler tour algorithm.<br>
+;Step 1
+*Number leaves according to a DFS using the Euler tour algorithm.<br>
 Observation 3:
@@ Line 472: / Line 549: @@
 Proof: See the classnotes
-Step 2:
+;Step 2:
 * Let <math>S_1</math> be nodes in <math>S</math> whose parents are not in <math>S</math>.
-; Let <math>L_1</math> be leaves in <math>L</math> whose parents are in <math>S_1</math>.
+* Let <math>L_1</math> be leaves in <math>L</math> whose parents are in <math>S_1</math>.
 * Apply parallel rakes to leaves in <math>L_1</math>, unless the parent is the root.
 * Apply parallel rakes to leaves in <math>L - L_1</math>, unless the parent is the root.
@@ Line 491: / Line 569: @@
 The internal node will be <math>(a*l+b) \times (c*r+d)</math> where <math>\times</math> is either addition or multiplication.<br>
 If we delete one leaf and it's parent (stem), we will modify these variables of the stem's parent.
+===Tree Binarization===
+;Note: This is not in the classnotes, but has been covered in exams.
+To apply the above techniques to general trees or k-ary trees, you should binarize your k-ary tree.<br>
+This is done as follows:
+* Suppose node <math>x</math> has <math>n > 2</math> children
+* Select one child to keep
+* Create a node <math>x'</math>
+* Move all <math>n-1</math> children to <math>x'</math>
+* Set child <math>x'</math> to be a child of <math>x</math>
+Complexity
+* The time is <math>O(max\_degree)</math> and the work is <math>O(n)</math>
 ==Graph Connectivity==
@@ Line 501: / Line 593: @@
 Definitions:
-* Pointer Graph
-* Supervertices
+*Pointer Graph
-* The supervertex graph
+*Supervertices
-* Hookings
+*The supervertex graph
-* Parallel pointer jumping
+*Hookings
+*Parallel pointer jumping
 ===A first connectivity algorithm===
 The first connectivity algorithm consists of <math>\log(n)</math> iterations of the following two steps:
-* Hookings - Each root hooks itself the the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.
-* Parallel pointer jumping - Each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.
+*Hookings - Each root hooks itself the the minimal adjacent root. If there are no adjacent stars, then this rooted star quits.
+*Parallel pointer jumping - Each vertex performs pointer jumping until every vertex points to the root, forming a rooted star.
 ;Theorems
-# The pointer graph always consists of rooted trees.
-# After <math>\log n</math> iterations, all vertices are contained in rooted stars that quit.
+#The pointer graph always consists of rooted trees.
-#: Each connected component is represented by a single rooted star.
+#After <math>\log n</math> iterations, all vertices are contained in rooted stars that quit.
+#:Each connected component is represented by a single rooted star.
 ;Complexity
-* We need <math>O(\log n)</math> iterations
-** Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work
+*We need <math>O(\log n)</math> iterations
-** Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work
+**Hookings take <math>O(1)</math> time and <math>O(n+m)</math> work
-* For adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work since we have a processor per <math>n^2</math> possible edges.
+**Pointer jumping takes <math>O(\log n)</math> time and <math>O(n\log n)</math> work
-* For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work since we have a processor per edge.
+*For adjacency matrix, in total we need <math>O(\log^2 n)</math> time and <math>O(n^2\log n)</math> work since we have a processor per <math>n^2</math> possible edges.
+*For incidence lists, we need <math>O(\log^2 n)</math> time and <math>O(n\log^2 n + m\log n)</math> work since we have a processor per edge.
 ===A second connectivity algorithm===
@@ Line 530: / Line 626: @@
 Each iteration takes constant time.
-# Probe quitting: each rooted star whose supervertex is not adjacent to any other quits
+#Probe quitting: each rooted star whose supervertex is not adjacent to any other quits
-# Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf
+#Hooking on smaller: each rooted star hooks onto a smaller vertex of another supervertex which is not a leaf
-#* Note that this is not unique. Requires arbitrary CRCW.
+#*Note that this is not unique. Requires arbitrary CRCW.
-# Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks to another vertex in another supervertex which is not a leaf
+#Hooking non-hooked-upon: every rooted star not hooked upon in step 2 hooks to another vertex in another supervertex which is not a leaf
-# Parallel pointer jumping: one round of parallel pointer jumping
+#Parallel pointer jumping: one round of parallel pointer jumping
 ;Theorems
+*The pointer graph always consists of rooted trees.
+*For every vertex which is not a leaf, <math>D(v) \leq v</math>
+{{hidden | Proof |
+* This is true after steps 1, 2, and 4.
+** This follows immediately for 1 and 2.
+** After one round of pointer jumping (4) after 3, everyone from supervertex <math>r</math> is a leaf due to the claim below.
+* Consider a root <math>r</math> which hooks on someone larger in step 3.
+*: After Step 3, all  children of <math>r</math> are leaves.
+*: I.e. no one else will hook to <math>r</math> during Step 3.
+** If anyone was larger, <math>r</math> would have hooked on them in Step 2.
+** If anyone was smaller, they would have hooked on <math>r</math> in Step 2.
+}}
 ;Complexity
+*Time <math>O(\log n)</math>
+*Work <math>O((n+m)\log n)</math>
+;Notes
+*Without step 3, the algorithm will run in <math>O(n)</math> time instead of <math>O(\log n)</math> but the result will be valid.
+*In step 2, you can hook rooted trees in addition to rooted stars for better empirical performance
+{{hidden | Proof |
+Let <math>h(T)</math> be the height of tree T where a two-layer tree has height 1.<br>
+Define a single node to have height 1.<br>
+Let <math>H</math> be the sum of heights of all trees and <math>\bar{H}</math> be the total height after one iteration.
+* Claim: <math>\bar{H} \leq 2*H/3</math>
+}}
 ===Minimum Spanning Forest===
+====The MSF Theorem====
+Let <math>G(V,E)</math> be a weighted graph and let <math>U</math> and <math>V-U</math> be two non-empty subsets of <math>V</math>.
+Consider set <math>H=\{(u,v) \mid u \in U, v \in V\}</math> of edges connecting points in <math>U</math> and <math>V</math>.
+If <math>e</math> is the edge of minimum weight in <math>H</math>, then <math>e</math> is in the MSF of <math>G</math>
+====A first MSF algorithm====
+* Sort the edges by weight. Assign a processor to each edge. Using a priority CRCW, the edge with minimum weight will take priority during writes.
+Each round consists of hooking and <math>\log n</math> rounds of parallel pointer jumping
+* Each root find the minimum weight edge to another supervertex and hooks itself to that root.
+*: This is automatic from the priority CRCW.
+====A second MSF algorithm====
+We replace steps 2 and 3 in the connectivity algorithm with the simple rule:
+* We hook rooted stars using the edge with the smallest weight.
+;Notes
+* The progression of edge weights up the tree are monotonically increasing so there are no cycles
+===Biconnectivity===
+This section is not it the textbook. It is from Tarjan and Vishkin (1985)<ref name="tarjan1985biconnectivity">Robert E. Tarjan, and Uzi Vishkin. ''An Efficient Parallel Biconnectivity Algorithm'' (1985). SIAM Journal on Computing. DOI: [https://doi.org/10.1137/0214061 10.1137/0214061] [https://epubs.siam.org/doi/10.1137/0214061 https://epubs.siam.org/doi/10.1137/0214061] [https://www.researchgate.net/publication/220617428_An_Efficient_Parallel_Biconnectivity_Algorithm Mirror]</ref>.
+A graph is biconnected if for every two vertices <math>v_1</math> and <math>v_2</math>, there is a simple cycle containing <math>v_1</math> and <math>v_2</math>.
+Intuitively this means that for any two vertices, there are at least two paths from <math>v_1</math> to <math>v_2</math>. If any vertex and its edges are removed from a biconnected graph, it still remains connected.
+Every connected graph consists of biconnected components. These are sets of edges such that every two edges from a set lie on a simple cycle.
+Vertices which connect two biconnected components are called ''articulation points''.
+If these points are removed, the two biconnected components are no longer connected.
+;Algorithm
+Assume we are given a connected graph <math>G</math>.
+# First build a spanning tree <math>T</math>. Record which edges are in the spanning tree. Root the spanning tree.
+# Compute the preorder number of all edges using an euler tour.
+# For every vertex <math>v</math>, calculate the hi and low preorder numbers in the subtree <math>T(v)</math>
+# Create the auxiliary graph <math>G'' = (V'', E'')</math> as follows:
+#* All edges of <math>G</math> are vertices in  <math>G''</math>
+#* For each edge <math>(v, w)</math> in <math>G - T</math>, add edge <math> ((p(v),v),(p(w),w))</math> to <math>G''</math> iff <math>v</math> and <math>w</math> are unrelated in <math>T</math>
+#* For each edge <math>(v = p(w), w)</math> in <math>T</math>, add edge <math>((p(v), v), (v, w))</math> to <math>G''</math> iff <math>low(w) < v</math> or <math>high(w) \geq v + size(v)</math>
+# Compute the connected components of <math>G''</math>
+# Assign edges based on their connected components.
+;Complexity
+* <math>O(\log n)</math> time
+==Tricks==
+===Removing duplicates in an array===
+Assume you have an Arbitrary CRCW
+# Given array A of size n with entries 0 to n-1
+# For each entry pardo
+#* Write B[A[i]] == i
+#* Only one write will succeed for each unique A[i]
+#* Check if B[A[i]] == i
+===Compaction of an <math>n^2</math> array with <math>\leq n</math> elements in <math>O(\log n)</math> time===
+* Split our <math>n^2</math> array to n subarrays, each of size <math>n</math>
+* Run <math>n</math> parallel compactions on each subarray
+* Perform one compaction on the size of each subarray
+* Compact all elements using (total size of prev subarrays + position in current subarray)
+The benefit of this segmented compaction/prefix sum is that if you only do each segment once in an iterative algorithm such as parallel BFS,
+you can get <math>O(n^2)</math> or <math>O(m)</math> work.
+==XMT Programming Tips==
+I never figured out how to use standard c headers.
+* XMT only has int (32 bit) and float (32 bit). If you need a bool type, you will need to define it in a header.
+* You cannot call functions from within <code>spawn</code>
+Here is my helpers header <code>helpers.h</code>
+{{ hidden | <code>helpers.h</code> |
+<syntaxhighlight lang="c">
+#ifndef HELPERS_H
+#define HELPERS_H
+#define true 1
+#define false 0
+#define bool int
+// #define NIL -1
+#define max(a, b) ((((a) < (b)) ? (b) : (a)))
+#define min(a, b) ((((a) < (b)) ? (a) : (b)))
+#endif
+</syntaxhighlight>
+}}
+If you prefer to debug in C++ like me, you can use the following code to adapt running the XMT code:
+This code does not run things asynchronously.
+{{hidden | C++ Code |
+<syntaxhighlight lang="cpp">
+#define spawn(x, y) for (int $ = x; $ <= y; ++$)
+//#define spawn(x, y) for (int $ = y; $ >= x; --$)
+#define ps(a, b) \
+  {              \
+    b += a;      \
+    a = b - a;   \
+  }
+#define psm(a, b) \
+  {               \
+    b += a;       \
+    a = b - a;    \
+  }
+#define psBaseReg int
+#define STRINGIFY2(X) #X
+#define STRINGIFY(X) STRINGIFY2(X)
+using namespace std;
+#define GRAPH_NUM graph0
+#define N 19
+#define M 36
+vector<int> vertices;
+vector<int> degrees;
+vector<vector<int>> edges;
+int main() {
+  vertices.resize(N);
+  degrees.resize(N);
+  edges = std::vector<std::vector<int>>(M, std::vector<int>(2));
+  string datapath = "data/" STRINGIFY(GRAPH_NUM) "/graph.txt";
+  string data = readFile(datapath);
+  regex section_regex(
+      "[\\n\\r][\\n\\r]+|[\\r\\n]0\\s[\\r\\n]+",
+      std::regex_constants::ECMAScript | std::regex_constants::icase);
+  regex space_regex("[\\n\\s\\r]+", std::regex_constants::ECMAScript |
+                                        std::regex_constants::icase);
+  vector<string> sections = split(data, section_regex);
+  // for (int i = 0; i < sections.size(); i++) {
+  //  cout << "Section " << i << " " << sections[i].size() << endl;
+  //  cout << sections[i] << endl;
+  //}
+  return 0;
+}
+string readFile(const string& datapath) {
+  std::ifstream in(datapath, std::ios::in);
+  string data;
+  if (in.good()) {
+    std::string contents;
+    in.seekg(0, std::ios::end);
+    contents.resize(static_cast<unsigned int>(in.tellg()));
+    in.seekg(0, std::ios::beg);
+    in.read(&contents[0], contents.size());
+    return contents;
+  }
+  std::cerr << "Failed to open file: " << datapath << std::endl;
+  throw(errno);
+}
+vector<string> split(string s, regex r) {
+  vector<string> splits;
+  smatch m;  // <-- need a match object
+  while (regex_search(s, m, r))  // <-- use it here to get the match
+  {
+    int split_on = m.position();  // <-- use the match position
+    splits.push_back(s.substr(0, split_on));
+    s = s.substr(split_on + m.length());  // <-- also, skip the whole match
+  }
+  if (!s.empty()) {
+    splits.push_back(s);  // and there may be one last token at the end
+  }
+  return splits;
+}
+</syntaxhighlight>
+}}
 ==Hardware/Architecture==
@@ Line 548: / Line 850: @@
 [[File:Pram hardware diagram.jpg | 400px]]
-* MTCU - Master Thread Control Unit - runs in serial mode
+*MTCU - Master Thread Control Unit - runs in serial mode
-* TCU Cluster - Thread Control Unit
+*TCU Cluster - Thread Control Unit
 ;Procedure
-* Spawn broadcasts your spawn code in the block to every thread control unit (TCU).
-* Data is shared through the interconnection network
+*Spawn broadcasts your spawn code in the block to every thread control unit (TCU).
+*Data is shared through the interconnection network
 ;Performance Penalty
 <math>\text{Execution Depth} = \text{Computation Depth} + LSRTM \times R + QD</math>
-* <math>LSTRM</math> - length of sequence of round trips to memory
-* <math>R</math> - time for round trip to memory
+*<math>LSTRM</math> - length of sequence of round trips to memory
-* <math>QD</math> - queuing delay
+*<math>R</math> - time for round trip to memory
-** If multiple threads access the same memory module, they are queued at the memory module
+*<math>QD</math> - queuing delay
+**If multiple threads access the same memory module, they are queued at the memory module
 ==Resources==
-* [https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]
+*[https://stanford.edu/~rezab/dao/ Stanford CME323 Distributed Algorithms Lectures 1-3]
 ===Textbooks===
-* [http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]
-* [https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569?sa-no-redirect=1&pldnSite=1 Introduction to Parallel Algorithms (Joseph Jaja, textbook)]
-==Ignore==
+*[http://users.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf Uzi Vishkin ENEE651/CMSC751 Classnotes]
-[[Visible to::users]]
+*[https://www.amazon.com/Introduction-Parallel-Algorithms-Joseph-JaJa/dp/0201548569?sa-no-redirect=1&pldnSite=1 Introduction to Parallel Algorithms (Joseph Jaja, textbook)]
+*[https://www.cs.cmu.edu/~guyb/papers/BM04.pdf Parallel Algorithms by Guy E. Blelloch and Bruce M. Maggs (CMU)]
+==References==