Apache Beam

From David's Wiki
Revision as of 03:20, 3 August 2021 by David (talk | contribs) (Created page with "[https://beam.apache.org/ Apache Beam] is a library for building parallel data pipelines.<br> Such pipelines are executed on a runner such as Apache Spark. Apache Beam is orig...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
\( \newcommand{\P}[]{\unicode{xB6}} \newcommand{\AA}[]{\unicode{x212B}} \newcommand{\empty}[]{\emptyset} \newcommand{\O}[]{\emptyset} \newcommand{\Alpha}[]{Α} \newcommand{\Beta}[]{Β} \newcommand{\Epsilon}[]{Ε} \newcommand{\Iota}[]{Ι} \newcommand{\Kappa}[]{Κ} \newcommand{\Rho}[]{Ρ} \newcommand{\Tau}[]{Τ} \newcommand{\Zeta}[]{Ζ} \newcommand{\Mu}[]{\unicode{x039C}} \newcommand{\Chi}[]{Χ} \newcommand{\Eta}[]{\unicode{x0397}} \newcommand{\Nu}[]{\unicode{x039D}} \newcommand{\Omicron}[]{\unicode{x039F}} \DeclareMathOperator{\sgn}{sgn} \def\oiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x222F}\,}{\unicode{x222F}}{\unicode{x222F}}{\unicode{x222F}}}\,}\nolimits} \def\oiiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x2230}\,}{\unicode{x2230}}{\unicode{x2230}}{\unicode{x2230}}}\,}\nolimits} \)

Apache Beam is a library for building parallel data pipelines.
Such pipelines are executed on a runner such as Apache Spark. Apache Beam is originally developed by Google.

Usage

Programming guide, examples in Python.

Background

Data are referred to as PCollection

Create

Map

ParDo

Pardo allows you to pass in a function and generate multiple items.
If you are yielding many items though, you should do a beam.Reshuffle() afterwards to split and get more parallelism.