Google Cloud Dataflow

 Unified stream and batch data processing that's serverless, fast, and cost-effective.


Benefits 

Streaming data analytics with speed

Dataflow enables fast, simplified streaming data pipeline development with lower data latency.

Simplify operations and management

Allow teams to focus on programming instead of managing server clusters as Dataflow’s serverless approach removes operational overhead from data engineering workloads.

Reduce total cost of ownership

Resource autoscaling paired with cost-optimized batch processing capabilities means Dataflow offers virtually limitless capacity to manage your seasonal and spiky workloads without overspending.


Key Features

Streaming EngineStreaming Engine separates compute from state storage and moves parts of pipeline execution out of the worker VMs and into the Dataflow service back end, significantly improving autoscaling and data latency.
AutoscalingAutoscaling lets the Dataflow service automatically choose the appropriate number of worker instances required to run your job. The Dataflow service may also dynamically reallocate more workers or fewer workers during runtime to account for the characteristics of your job.
Dataflow ShuffleService-based Dataflow Shuffle moves the shuffle operation, used for grouping and joining data, out of the worker VMs and into the Dataflow service back end for batch pipelines. Batch pipelines scale seamlessly, without any tuning required, into hundreds of terabytes.
Dataflow SQLDataflow SQL lets you use your SQL skills to develop streaming Dataflow pipelines right from the BigQuery web UI. You can join streaming data from Pub/Sub with files in Cloud Storage or tables in BigQuery, write results into BigQuery, and build real-time dashboards using Google Sheets or other BI tools.
Flexible Resource Scheduling (FlexRS)Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible virtual machine (VM) instances and regular VMs. 
Dataflow templatesDataflow templates allow you to easily share your pipelines with team members and across your organization or take advantage of many Google-provided templates to implement simple but useful data processing tasks. With Flex Templates, you can create a template out of any Dataflow pipeline.
Notebooks integrationIteratively build pipelines from the ground up with AI Platform Notebooks and deploy with the Dataflow runner. Author Apache Beam pipelines step by step by inspecting pipeline graphs in a read-eval-print-loop (REPL) workflow. Available through Google’s AI Platform, Notebooks allows you to write pipelines in an intuitive environment with the latest data science and machine learning frameworks.
Inline monitoringDataflow inline monitoring lets you directly access job metrics to help with troubleshooting batch and streaming pipelines. You can access monitoring charts at both the step and worker level visibility and set alerts for conditions such as stale data and high system latency.
Customer-managed encryption keysYou can create a batch or streaming pipeline that is protected with a customer-managed encryption key (CMEK) or access CMEK-protected data in sources and sinks.
Dataflow VPC Service ControlsDataflow’s integration with VPC Service Controls provides additional security for your data processing environment by improving your ability to mitigate the risk of data exfiltration.
Private IPsTurning off public IPs allows you to better secure your data processing infrastructure. By not using public IP addresses for your Dataflow workers, you also lower the number of public IP addresses you consume against your Google Cloud project quota.

Comments

Popular posts from this blog

SQL basic interview question

gsutil Vs Storage Transfer Service Vs Transfer Appliance