Parallel Algorithms

From David's Wiki

Parallel Algorithms notes from CMSC751 with Uzi Vishkin.
This class is based on his book Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques.

XMT Language

XMTC is a single-program multiple-data (SPMD) extension of C.

  • Spawn creates threads
  • Threads expire at Join
  • $ represents the number of the thread
  • PS Ri Rj is an atomic prefix-sum instruction
    • Stores Ri + Rj in Rj
    • Stores the original value of Rj in Ri
int x = 0;

// Spawn n threads
spawn(0, n-1) {
  int e = 1;
  if (A[$] != 0) {
    // e gets the old value of x; x is incremented by e (= 1)
    ps(e,x);
  }
}
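The `ps` instruction acts like an atomic fetch-and-add, so the spawn block above compacts the nonzero entries of A. A minimal serial Python simulation of that idiom (function name is mine; on real XMT the threads run concurrently, so the order in which slots are grabbed is arbitrary):

```python
# Simulate the XMTC idiom above: compact the nonzero elements of A.
# ps(e, x) is modeled as fetch-and-add: e receives the old value of x,
# and x is incremented by e's previous value (here always 1).
def compact_nonzero(A):
    x = 0                      # shared counter
    out = [None] * len(A)
    for i in range(len(A)):    # each iteration stands in for one thread
        e = 1
        if A[i] != 0:
            e, x = x, x + e    # ps(e, x): e = old x; x += 1
            out[e] = A[i]
    return out[:x]

print(compact_nonzero([5, 0, 3, 0, 7]))  # [5, 3, 7]
```

The serial loop preserves input order; on a real machine any permutation of the nonzero elements may result.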

Models

PRAM

Parallel Random-Access Machine/Model
You're given n synchronous processors each with local memory and access to a shared memory.
Each processor can write to shared memory, read from shared memory, or do computation in local memory.
You tell each processor what to do at each time step.

Types of PRAM

  • exclusive-read exclusive-write (EREW)
  • concurrent-read exclusive-write (CREW)
  • concurrent-read concurrent-write (CRCW)
    • Arbitrary CRCW - an arbitrary processor writing succeeds
    • Priority CRCW - the lowest numbered processor writing succeeds
    • Common CRCW - writing succeeds only if all processors write the same value

Drawbacks

  • Does not reveal how the algorithm will run on PRAMs with a different number of processors
  • Fully specifying allocation requires an unnecessary level of detail

Work Depth

You provide a sequence of instructions. At each time step, you specify the number of parallel operations.

WD-presentation Sufficiency Theorem

Given an algorithm in WD mode that takes \(\displaystyle x=x(n)\) operations and \(\displaystyle d=d(n)\) time, the algorithm can be implemented in any p-processor PRAM with \(\displaystyle O(x/p + d)\) time.

Notes
  • Other resources call this Brent's theorem
  • \(\displaystyle x\) is the work and \(\displaystyle d\) is the depth
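For example (a standard sanity check, not from the notes): summing n numbers with a balanced binary tree is a WD algorithm with \(\displaystyle x = O(n)\) operations and \(\displaystyle d = O(\log n)\) time, so on \(\displaystyle p = n/\log n\) processors the theorem gives

\(\displaystyle O(x/p + d) = O\left(\frac{n}{n/\log n} + \log n\right) = O(\log n)\)

time, i.e. \(\displaystyle O(n)\) work — optimal.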

Speedup

  • A parallel algorithm is work-optimal if its work W(n) grows asymptotically the same as the running time of the best serial algorithm.
  • A work-optimal parallel algorithm is work-time-optimal if its time T(n) cannot be improved by any other work-optimal algorithm.
  • A work-optimal algorithm achieves linear speedup: with p processors it runs \(\displaystyle \Theta(p)\) times faster than the best serial algorithm.

NC Theory

Nick's Class Theory

  • Good serial algorithms: Poly time
  • Good parallel algorithm: poly-log \(\displaystyle O(\log ^c n)\) time, poly processors


Technique: Balanced Binary Trees

Example Applications:

  • Prefix sum
  • Largest element
  • Nearest-one element
  • Compaction
Inputs
  • Array A[1..n] of elements
  • Associative binary operation, denoted * (e.g. addition, multiplication, min, max)
Outputs
  • Array B containing B(i) = A(1) * ... * A(i)
Algorithm

\(\displaystyle O(n)\) work and \(\displaystyle O(\log n)\) time

for i, 1 <= i <= n pardo
  B(0, i) = A(i)
// Summation step up the tree
for h = 1 to log n
  for i, 1 <= i <= n/2^h pardo
    B(h, i) = B(h-1, 2i-1) * B(h-1, 2i)
// Down the tree
for h = log n to 0
  for i, 1 <= i <= n/2^h pardo
    if i == 1
      C(h, i) = B(h, i)
    if i even
      C(h, i) = C(h+1, i/2)
    if i odd, i >= 3
      C(h, i) = C(h+1, (i-1)/2) * B(h, i)
return C(0, 1..n)
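The balanced-tree scheme can be checked with a short serial simulation (a Python sketch; each pardo loop becomes an ordinary loop, and the associative * is a parameter):

```python
import math

# Non-recursive balanced-tree prefix sums following the scheme above.
def tree_prefix_sums(A, op=lambda a, b: a + b):
    n = len(A)                      # assumed to be a power of two
    logn = int(math.log2(n))
    B = {(0, i): A[i - 1] for i in range(1, n + 1)}
    C = {}
    # Up the tree: B(h, i) = B(h-1, 2i-1) * B(h-1, 2i)
    for h in range(1, logn + 1):
        for i in range(1, n // 2**h + 1):
            B[(h, i)] = op(B[(h - 1, 2 * i - 1)], B[(h - 1, 2 * i)])
    # Down the tree
    for h in range(logn, -1, -1):
        for i in range(1, n // 2**h + 1):
            if i == 1:
                C[(h, i)] = B[(h, i)]
            elif i % 2 == 0:
                C[(h, i)] = C[(h + 1, i // 2)]
            else:                   # odd i >= 3
                C[(h, i)] = op(C[(h + 1, (i - 1) // 2)], B[(h, i)])
    return [C[(0, i)] for i in range(1, n + 1)]

print(tree_prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```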

Technique: Merge Sorting Cluster

Technique: Partitioning

Merging: Given two sorted arrays, A and B, create a sorted array C consisting of elements in A and B
Ranking: Given an element x, find Rank(x, A) = i such that \(\displaystyle A(i) \leq x \lt A(i+1)\), i.e. the number of elements of A that are at most x (with sentinels \(\displaystyle A(0)=-\infty\), \(\displaystyle A(n+1)=\infty\))

Equivalence of Merging and Ranking
  • Given an algorithm for merging, we know A(i) belongs in C(j) so Rank(A(i), B) = j-i
  • Given an algorithm for ranking, we get A(i) belongs in C(j) where j = Rank(A(i), B) + i
Naive algorithm

Apply binary search elementwise on a concurrent-read system: O(log n) time, O(n log n) work.

for 1 <= i <= n pardo
  Rank(A(i), B)
  Rank(B(i), A)
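A serial sketch of the ranking-based merge (Python; bisect_left and bisect_right are used for the two directions so that equal elements do not collide):

```python
import bisect

# Naive merge by ranking: each element finds its rank in the other
# array by binary search, giving its final position in C directly.
# On a CREW PRAM: O(log n) time with 2n processors, O(n log n) work.
def merge_by_ranking(A, B):
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):              # pardo in the PRAM version
        C[i + bisect.bisect_left(B, x)] = x
    for j, y in enumerate(B):              # pardo in the PRAM version
        C[j + bisect.bisect_right(A, y)] = y
    return C

print(merge_by_ranking([1, 4, 6], [2, 3, 5]))  # [1, 2, 3, 4, 5, 6]
```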
Partitioning paradigm

Input size: n

  • Partition the input into p jobs of size approx. n/p
  • Do small jobs concurrently using a separate (possibly serial) algo for each.

Example: Total work O(n), total time O(log n)

// Partitioning
for 1 <= i <= p pardo
  b(i) = Rank(A((n/p)(i-1)+1), B)
  a(i) = Rank(B((n/p)(i-1)+1), A)
// Total slice size is <= 2n/p. Total 2p slices.
// E.g. slice size is 2 * log(n) if p=n/log(n).
// Work
for 1 <= i <= p pardo
  k = ceil(b(i)/(n/p)) * (n/p)
  merge slice A[(n/p)(i-1)+1:min(a(i), (n/p)i+1)] and B[b(i):min(b(i+1), k)]
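The partitioning paradigm for merging might be sketched as follows (Python; the splitter bookkeeping differs slightly from the pseudocode above, and the small merges run serially here rather than concurrently):

```python
import bisect
from heapq import merge as serial_merge

# Partitioning paradigm: rank every (n/p)-th element of A in B and vice
# versa; the resulting cut points split the problem into ~2p small
# independent merges, each doable by a separate (serial) algorithm.
def partitioned_merge(A, B, p):
    step = max(1, (len(A) + len(B)) // (2 * p))
    cuts = {(0, 0), (len(A), len(B))}
    for i in range(0, len(A), step):
        cuts.add((i, bisect.bisect_left(B, A[i])))
    for j in range(0, len(B), step):
        cuts.add((bisect.bisect_right(A, B[j]), j))
    cuts = sorted(cuts)
    C = []
    for (a0, b0), (a1, b1) in zip(cuts, cuts[1:]):
        C.extend(serial_merge(A[a0:a1], B[b0:b1]))  # one small job
    return C

print(partitioned_merge([1, 4, 6, 8], [2, 3, 5, 7], p=2))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

Using bisect_left for A's splitters and bisect_right for B's keeps the cut pairs monotone in both coordinates even with duplicate values.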

Technique: Divide and Conquer

Merge Sort

Input: Array A[1..n]
Complexity:
Time: \(\displaystyle T(n) \leq T(n/2) + \alpha \log n\)
Work: \(\displaystyle W(n) \leq 2 W(n/2) + \beta n\)
Time \(\displaystyle O(\log^2 n)\). Work: \(\displaystyle O(n \log n)\)

MergeSort(A, B):
  if n == 1
    return B(1) = A(1)
  call in parallel:
    MergeSort(A[1:n/2], C[1:n/2])
    MergeSort(A[n/2+1:n], C[n/2+1:n])
  Merge C[1:n/2] and C[n/2+1:n] into B using the O(log n) time algorithm
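A serial Python sketch of the recursion (the two recursive calls are independent, so on a PRAM they run in parallel; the combine step stands in for the O(log n)-time parallel merge):

```python
# Parallel merge sort sketch. The two recursive calls below are
# independent and would be issued "in parallel" on a PRAM.
def merge_sort(A):
    n = len(A)
    if n == 1:
        return A[:]
    left = merge_sort(A[:n // 2])
    right = merge_sort(A[n // 2:])
    # Serial stand-in for the O(log n) parallel merge
    C, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            C.append(left[i]); i += 1
        else:
            C.append(right[j]); j += 1
    return C + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```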

Technique: Informal Work-Depth (IWD) and Accelerating Cascades

Technique: Accelerating Cascades

Consider: for problem of size n, there are two parallel algorithms.

  • Algo A: \(\displaystyle W_1(n)\) work, \(\displaystyle T_1(n)\) time
  • Algo B: Work \(\displaystyle W_2(n) \gt W_1(n)\). Time \(\displaystyle T_2(n) \lt T_1(n)\). Faster but less efficient.

Assume Algo A is a "reducing algorithm": it reduces an instance of the problem to a smaller instance of the same problem.
We start with Algo A until the instance size falls below some threshold, then switch to the faster Algo B.

Problem: Selection

Algorithm 1
  • Partition elements into rows of \(\displaystyle \log n\) size
  • For each row, find the median within the row
  • Find the median of medians (MoM) in \(\displaystyle O(n)\)
  • Move all rows whose median is <= MoM above and all rows whose median is >= MoM below
  • Now at least \(\displaystyle m/4\) elements are <= MoM and at least \(\displaystyle m/4\) are >= MoM, so a quarter of the elements can always be discarded, leaving an instance of size \(\displaystyle \leq 3m/4\)
    • Known as the reducing lemma

This algorithm solves the selection problem in \(\displaystyle O(\log^2 n)\) time and \(\displaystyle O(n)\) work.

Accelerating Cascades
  • What we have:
    • Algorithm 1 has \(\displaystyle O(\log n)\) iterations. Each iteration reduces a size-m instance in \(\displaystyle O(\log m)\) time and \(\displaystyle O(m)\) work to an instance of size \(\displaystyle \leq 3m/4\)
    • Algorithm 2 runs in \(\displaystyle O(\log n)\) time and \(\displaystyle O(n \log n)\) work.
  • Step 1: Use algo 1 for \(\displaystyle O(\log\log n)\) rounds to reduce the size from n to \(\displaystyle n / \log n\)
  • Step 2: Apply algorithm 2, taking \(\displaystyle O(\log n)\) time and \(\displaystyle O(n)\) work
  • Total: \(\displaystyle O(\log n \log\log n)\) time and \(\displaystyle O(n)\) work
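A serial sketch of the cascade (Python; the reducing step partitions around the median of row medians rather than shuffling rows as described above, and Algorithm 2 is replaced by plain sorting once the instance is small):

```python
import math

# Accelerating-cascades selection: apply the reducing step until the
# instance has at most ~n/log n elements, then switch to the faster,
# less work-efficient algorithm (sorting stands in for Algorithm 2).
def select(A, k):                   # k-th smallest, 1-indexed
    n = len(A)
    threshold = max(4, n // max(1, int(math.log2(n))))
    while len(A) > threshold:
        r = max(1, int(math.log2(len(A))))
        rows = [A[i:i + r] for i in range(0, len(A), r)]
        medians = [sorted(row)[len(row) // 2] for row in rows]
        mom = sorted(medians)[len(medians) // 2]   # median of medians
        lo = [x for x in A if x < mom]
        eq = [x for x in A if x == mom]
        hi = [x for x in A if x > mom]
        if k <= len(lo):                  # answer lies in the small side
            A = lo
        elif k <= len(lo) + len(eq):
            return mom
        else:
            k -= len(lo) + len(eq)
            A = hi
    return sorted(A)[k - 1]              # "Algorithm 2" on a small instance

print(select([9, 1, 8, 2, 7, 3, 6, 4, 5], 5))  # 5
```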

Informal Work-Depth (IWD)

At each time unit there is a set containing a number of instructions to be performed concurrently.

Integer Sorting

There is a theorem that sorting with only comparisons requires \(\displaystyle \Omega(n\log n)\) operations in the worst case.

Input: Array A[1..n] of integers in the range [0..r-1]
Sorting: rank from smallest to largest
Assume n is divisible by r (e.g. \(\displaystyle r=\sqrt{n}\))

Algorithm
  • Partition A into n/r blocks \(\displaystyle B_1,...,B_{n/r}\), each of size r
    • Sort each block separately using serial counting sort
    • Compute number(v,s) = # of elements of value v in \(\displaystyle B_s\), for \(\displaystyle 0\leq v \leq r-1\) and \(\displaystyle 1 \leq s \leq n/r\)
    • Compute serial(i) = # of elements j in the same block as i with A(j) = A(i) and \(\displaystyle j \lt i\)
  • For each value v, run a prefix sum on number(v,1), number(v,2), ..., number(v,n/r) into ps(v,1), ps(v,2), ..., ps(v,n/r)
  • Compute prefix sums of cardinality(0), ..., cardinality(r-1) into global_ps(0), ..., global_ps(r-1), where cardinality(v) = ps(v,n/r) is the total # of elements of value v
  • The rank of element \(\displaystyle i\), of value v in block \(\displaystyle B_s\), is \(\displaystyle 1+\operatorname{serial}(i)+\operatorname{ps}(v,s-1)+\operatorname{global\_ps}(v-1)\)
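A serial sketch of the whole pipeline (Python; ranks are 0-indexed here instead of the 1-indexed formula above, and the per-block counting also produces serial(i) directly):

```python
# Integer sorting by blocks: count values per block, prefix-sum the
# counts across blocks and across values, then place each element at
# serial(i) + ps(v, s-1) + global_ps(v-1)  (0-indexed).
def integer_sort(A, r):
    n = len(A)                      # assumes r divides n, values in [0, r-1]
    nblocks = n // r
    number = [[0] * nblocks for _ in range(r)]   # number[v][s]
    serial = [0] * n     # rank of A[i] among equal values in its block
    for s in range(nblocks):
        for i in range(s * r, (s + 1) * r):
            serial[i] = number[A[i]][s]
            number[A[i]][s] += 1
    # ps[v][s]: count of value v in blocks before block s
    ps = [[0] * (nblocks + 1) for _ in range(r)]
    for v in range(r):
        for s in range(nblocks):
            ps[v][s + 1] = ps[v][s] + number[v][s]
    # global_ps[v]: count of all elements with value < v
    global_ps = [0] * (r + 1)
    for v in range(r):
        global_ps[v + 1] = global_ps[v] + ps[v][nblocks]
    out = [None] * n
    for i in range(n):
        v, s = A[i], i // r
        out[serial[i] + ps[v][s] + global_ps[v]] = A[i]
    return out

print(integer_sort([2, 0, 1, 2, 0, 1, 1, 0, 2], 3))
# [0, 0, 0, 1, 1, 1, 2, 2, 2]
```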
Notes
  • Running time is not poly-log
