Samples showing how to create and run an Apache Beam template on Google Cloud Dataflow.
Follow the Getting started with Google Cloud Dataflow page, and make sure you have a Google Cloud project with billing enabled and a service account JSON key set up in your GOOGLE_APPLICATION_CREDENTIALS environment variable.
Additionally, for this sample you need the following:
- Create a Cloud Storage bucket.
    export BUCKET=your-gcs-bucket
    gsutil mb gs://$BUCKET
- Clone the java-docs-samples repository.
    git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
- Navigate to the sample code directory.
    cd java-docs-samples/dataflow/templates
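Before moving on, the prerequisites above can be sanity-checked from the shell. The following is a minimal sketch, not part of the sample itself: `check_prereqs` is a hypothetical helper name, and it only verifies that GOOGLE_APPLICATION_CREDENTIALS points to a readable file and that BUCKET is set. It does not validate the key or confirm that the bucket exists.

```shell
# Hypothetical helper (not part of the sample): sanity-check the prerequisites.
check_prereqs() {
  # The credentials variable must be set and point to a readable key file.
  if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
    echo "Cannot read key file: $GOOGLE_APPLICATION_CREDENTIALS" >&2
    return 1
  fi
  # The bucket name is used by the later commands in this sample.
  if [ -z "${BUCKET:-}" ]; then
    echo "BUCKET is not set" >&2
    return 1
  fi
  echo "Prerequisites look OK"
}
```

Calling `check_prereqs` after defining it prints a single confirmation line on success, or an error on stderr with a non-zero exit status otherwise.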
The following sample creates a WordCount Dataflow template showcasing different uses of ValueProviders.
Make sure you have the following variables set up:
export PROJECT=$(gcloud config get-value project)
export BUCKET=your-gcs-bucket
export TEMPLATE_LOCATION=gs://$BUCKET/samples/dataflow/templates/WordCount
Then create the template in the desired Cloud Storage location.
# Create the template.
mvn compile exec:java \
-Dexec.mainClass=com.example.dataflow.templates.WordCount \
-Dexec.args="\
--isCaseSensitive=false \
--project=$PROJECT \
--templateLocation=$TEMPLATE_LOCATION \
--runner=DataflowRunner"
# Upload the metadata file.
gsutil cp WordCount_metadata "$TEMPLATE_LOCATION"_metadata
For more information, see Creating templates.
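For orientation, a template metadata file is a JSON document that names the template and describes its runtime parameters. The sketch below writes an illustrative file to a scratch path so that the sample's real WordCount_metadata file is left untouched. The parameter names `inputFile` and `outputBucket` come from the run command in this sample; the description, labels, and help text are assumptions, and the file shipped with the sample may differ.

```shell
# Illustrative only: write an example metadata file to a scratch location.
# EXAMPLE_METADATA is a hypothetical path, not one used by the sample.
EXAMPLE_METADATA="${TMPDIR:-/tmp}/WordCount_metadata.example"

cat > "$EXAMPLE_METADATA" <<'EOF'
{
  "name": "WordCount",
  "description": "Counts the words in a text file.",
  "parameters": [
    {
      "name": "inputFile",
      "label": "Input file",
      "helpText": "Cloud Storage path of the text file to read."
    },
    {
      "name": "outputBucket",
      "label": "Output bucket",
      "helpText": "Cloud Storage bucket that receives the results."
    }
  ]
}
EOF

echo "Wrote $EXAMPLE_METADATA"
```

The metadata file is stored next to the template under the same name with a `_metadata` suffix, which is what the `gsutil cp` command above does.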
Finally, you can run the template via gcloud or through the GCP Console create Dataflow job page.
export JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')
export INPUT=gs://apache-beam-samples/shakespeare/kinglear.txt
gcloud dataflow jobs run $JOB_NAME \
--gcs-location $TEMPLATE_LOCATION \
--parameters inputFile=$INPUT,outputBucket=$BUCKET
For more information, see Executing templates.
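The `--parameters` value is a comma-separated list of key=value pairs. When a template takes more than a couple of parameters, a small helper can keep the command readable. This is a sketch only: `join_params` is a hypothetical function name, and it does not handle values that themselves contain commas.

```shell
# Hypothetical helper: join key=value arguments with commas.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_params() (
  IFS=,
  # "$*" joins all arguments using the first character of IFS.
  printf '%s\n' "$*"
)

# Example use (requires gcloud and the variables exported in this sample):
# gcloud dataflow jobs run "$JOB_NAME" \
#   --gcs-location "$TEMPLATE_LOCATION" \
#   --parameters "$(join_params inputFile="$INPUT" outputBucket="$BUCKET")"
```

Quoting the command substitution keeps the joined string as a single argument even if a value contains spaces.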
You can check your submitted jobs in the GCP Console Dataflow page.
To avoid incurring charges to your GCP account for the resources used:
# Remove only the files created by this sample.
gsutil -m rm -rf "$TEMPLATE_LOCATION*"
gsutil -m rm -rf "gs://$BUCKET/samples/dataflow/wordcount/"
# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET