Apache Beam

From David's Wiki
==Administration==
How to set up Apache Beam running on Flink and Kubernetes.
===Resources===
* [https://python.plainenglish.io/apache-beam-flink-cluster-kubernetes-python-a1965f37b7cb Beam+Flink+Kubernetes+Python]
* [https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/master/examples/beam/with_job_server flink on k8s yaml]
* [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#getting-started Flink on native kubernetes]

Revision as of 15:55, 20 January 2022

Apache Beam is a library for building parallel data pipelines.
Such pipelines are executed on a runner such as Apache Flink. Apache Beam was originally developed by Google.

Usage

See the Beam programming guide. The examples here are in Python.

Background

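Data are represented as a PCollection, an immutable and possibly distributed collection of elements that each pipeline step reads from or produces.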

Create

Map

ParDo

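ParDo applies a function (a DoFn) to each element, and that function may emit zero, one, or many output elements.
If each input yields many outputs, add a beam.Reshuffle() afterwards so the outputs are redistributed across workers and the downstream steps can run with more parallelism.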

Administration

How to set up Apache Beam running on Flink and Kubernetes.

Resources