Apache Beam
===Resources===
* [https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md Apache Beam Python Jobs with Flink K8s operator]
** [https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/master/examples/beam/with_job_server flink on k8s yaml]
* [https://python.plainenglish.io/apache-beam-flink-cluster-kubernetes-python-a1965f37b7cb Beam+Flink+Kubernetes+Python]
* [https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#getting-started Flink on native kubernetes]
Latest revision as of 18:12, 19 July 2022
Apache Beam is a library for building parallel data pipelines.
Such pipelines are executed on a runner such as Apache Flink. Apache Beam was originally developed by Google.
Usage
See the Beam programming guide and the examples in Python.
Background
A dataset in a pipeline is represented as a PCollection.
Create
Map
ParDo
ParDo allows you to pass in a function that generates zero, one, or multiple output items for each input item.
If each input yields many items, apply beam.Reshuffle() afterwards to redistribute the outputs and get more parallelism.
GroupByKey
Administration
How to set up Apache Beam running on Flink and Kubernetes.