Dataproc

A faster, easier, more cost-effective way to run Apache Spark and Apache Hadoop

Try It Free

Cloud-native Apache Hadoop & Apache Spark

Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days now complete in seconds or minutes instead, and you pay only for the resources you use (with per-second billing). Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics, and machine learning.

Fast & Scalable Data Processing

Create Dataproc clusters quickly and resize them at any time—from three to hundreds of nodes—so you don't have to worry about your data pipelines outgrowing your clusters. You have more time to focus on insights, with less time lost to infrastructure—each cluster action takes less than 90 seconds on average.

Affordable Pricing

Following Google Cloud Platform pricing principles, Dataproc has a low-cost, easy-to-understand price structure based on actual use, measured by the second. Dataproc clusters can also include lower-cost preemptible instances and benefit from committed use discounts and sustained use discounts, giving you powerful clusters at an even lower total cost.

Open Source Ecosystem

You can use Spark and Hadoop tools, libraries, and documentation with Dataproc. Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive, so you can get started without the need to learn new tools or APIs, and move existing projects or ETL pipelines without redevelopment.

Dataproc Features

Dataproc is a managed Apache Spark and Apache Hadoop service that is fast, easy to use, and low cost.

Automated Cluster Management: Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Dataproc clusters are stable, scalable, and speedy.
Resizable Clusters: Create and scale clusters quickly with various virtual machine types, disk sizes, number of nodes, and networking options.
Autoscaling Clusters: Dataproc Autoscaling provides a mechanism for automating cluster resource management, and enables automatic addition and subtraction of cluster workers (nodes).
Cloud Integrated: Built-in integration with Cloud Storage, BigQuery, Bigtable, Cloud Logging, Cloud Monitoring, and AI Hub, giving you a complete and robust data platform.
Versioning: Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
Highly Available: Run clusters in high availability mode with multiple master nodes, and set jobs to restart on failure to ensure your clusters and jobs are highly available.
Enterprise Security: When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. GCP and Dataproc also offer additional security features that help protect your data. The GCP-specific security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and Customer Managed Encryption Keys (CMEK).
Cluster Scheduled Deletion: To help avoid incurring charges for an inactive cluster, you can use Dataproc's scheduled deletion, which provides options to delete a cluster after a specified cluster idle period, at a specified future time, or after a specified time period.
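
The three scheduled deletion options described above can be modeled roughly as follows. This is a minimal sketch in plain Python, not the Dataproc API: the function `should_delete` and its parameter names (`max_idle`, `expiration_time`, `max_age`) are hypothetical, and it assumes the "specified time period" is measured from cluster creation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_delete(
    now: datetime,
    created_at: datetime,
    last_active_at: datetime,
    max_idle: Optional[timedelta] = None,
    expiration_time: Optional[datetime] = None,
    max_age: Optional[timedelta] = None,
) -> bool:
    """Return True if any configured deletion condition has been met.

    All names here are hypothetical and chosen only for this sketch.
    """
    # Option 1: the cluster has been idle for at least the allowed period.
    if max_idle is not None and now - last_active_at >= max_idle:
        return True
    # Option 2: a fixed deletion time has arrived.
    if expiration_time is not None and now >= expiration_time:
        return True
    # Option 3: the cluster has existed for the allowed time period
    # (assumed here to start at cluster creation).
    if max_age is not None and now - created_at >= max_age:
        return True
    return False


created = datetime(2020, 3, 24, 9, 0, tzinfo=timezone.utc)
last_job = created + timedelta(minutes=20)
now = created + timedelta(hours=1)

# Idle for 40 minutes against a 30-minute idle limit.
idle_hit = should_delete(now, created, last_job, max_idle=timedelta(minutes=30))
# One hour old against an eight-hour age limit.
age_hit = should_delete(now, created, last_job, max_age=timedelta(hours=8))
print(idle_hit, age_hit)
```

In this model the conditions are independent, so a cluster is removed as soon as the first configured condition is satisfied.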

Automatic or Manual Configuration: Dataproc automatically configures hardware and software, but also gives you manual control.
Developer Tools: Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.
Initialization Actions: Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
Optional Components: Use optional components to install and configure additional components on the cluster. Optional components are integrated with Dataproc components, and offer fully configured environments for Zeppelin, Druid, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.
Custom Images: Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.
Flexible Virtual Machines: Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs.
Component Gateway and Notebook Access: Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster.
Workflow Templates: Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A Workflow Template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.
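
To illustrate what "a graph of jobs" means in a workflow template, here is a small sketch using only the Python standard library (`graphlib`, Python 3.9 or later). It does not call Dataproc; the job names and the `execution_order` helper are hypothetical and serve only to show how dependencies between jobs determine a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a job, and each value is the set of
# jobs that must finish before that job can start.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}


def execution_order(graph):
    """Return one valid order in which the jobs could run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
print(order)
```

Jobs with no dependency on each other, such as `aggregate` and `train_model` here, could run in parallel. A graph that contains a cycle has no valid order, and `TopologicalSorter` raises `graphlib.CycleError` in that case.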

Dataproc Pricing

Dataproc incurs a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster¹.

Machine Type	Virtual CPUs	Price
Standard Machines	1-64	$0.010 - $0.640
High Memory Machines	2-64	$0.020 - $0.640
High CPU Machines	2-64	$0.020 - $0.640
Custom Machines	Based on vCPU and memory usage	$0.010 / vCPU hour

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

¹Dataproc incurs a small incremental fee per virtual CPU in the Compute Engine instances used in your cluster while the cluster is operational. Other resources used by Dataproc, including Compute Engine network, BigQuery, and Cloud Bigtable, are billed as they are consumed. For detailed pricing information, view the pricing guide.
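
As a rough worked example of the fee described above, the sketch below multiplies vCPUs by running time at the listed $0.010 per vCPU hour. The function name `dataproc_fee` and the sample cluster sizes are hypothetical, and the result covers only the incremental Dataproc fee, not Compute Engine or any other resources, which are billed separately.

```python
# Dataproc fee in USD per vCPU per hour, taken from the pricing table above.
DATAPROC_FEE_PER_VCPU_HOUR = 0.010


def dataproc_fee(vcpus_per_node: int, nodes: int, seconds: float) -> float:
    """Estimate the Dataproc fee only, excluding Compute Engine and other charges."""
    total_vcpus = vcpus_per_node * nodes
    hours = seconds / 3600  # usage is measured by the second
    return total_vcpus * hours * DATAPROC_FEE_PER_VCPU_HOUR


# One 64-vCPU machine for one hour: the top of the listed price range.
single_node_fee = dataproc_fee(vcpus_per_node=64, nodes=1, seconds=3600)

# A hypothetical 10-node cluster of 4-vCPU machines running for 15 minutes.
short_job_fee = dataproc_fee(vcpus_per_node=4, nodes=10, seconds=15 * 60)

print(f"${single_node_fee:.3f}")  # $0.640
print(f"${short_job_fee:.3f}")    # $0.100
```

The first result matches the upper end of the price ranges in the table, which is consistent with the fee scaling linearly with the number of vCPUs.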

Featured Blogs

Read the latest blogs to better understand open source data processing in the cloud.

Fastest track to Apache Hadoop and Spark success: using job-scoped clusters on cloud-native architecture

A combination of rapid startup time, per-second billing, and cloud-native architecture is transformative for operators.

Read blog

10 tips for building long-running clusters using Dataproc

On the Dataproc team, we’ve worked with countless customers who are creating clusters for their particular use cases. However, not all Hadoop and Spark workloads are appropriately served by an ephemeral job-scoped cluster model. Our goal on the Dataproc team is to make sure every customer’s use case can be addressed. To that end, we’re excited to share these tips and recommendations for using Dataproc in a non-ephemeral model.

Run Apache Spark and Apache Hadoop workloads with the flexibility and predictability of Dataproc

Recent market changes underscore the benefits of running cloud-native managed Spark and Hadoop services, such as Dataproc, which help to reduce uncertainty and provide flexibility.

Read blog

Easier integration with Apache Spark and Hadoop via Google Dataproc Job IDs and Labels

Many users are unaware that the user-specified Job IDs feature and a design pattern based on Dataproc labels can be helpful in development. In this blog post, we’ll provide some best practices for using them to integrate your apps with the service in a highly productive way.

Read blog

HDFS vs. Cloud Storage: Pros, cons and migration tips

HDFS was once the quintessential component of the Hadoop stack. So the yellow elephant in the room here is: Can HDFS really be a dying technology if Apache Hadoop and Apache Spark continue to be widely used? This blog explores that topic.

New open-source tools in Dataproc process data at cloud scale

In this post, we’ll give you a whirlwind tour of the most recent Dataproc features announced at Cloud Next 2019. Everything listed here is publicly available today and ready for you to try.

Read blog

Extending the SQL capabilities of your Dataproc cluster with the Presto optional component

A Presto query can efficiently process data from multiple sources such as HDFS, Cloud Storage, MySQL, Cassandra, or even Kafka. It’s a well-supported methodology that runs federated queries and makes a great tool for ad hoc analysis that requires linking disparate systems. In this post, we provide an example using publicly available Chicago taxi data.

Read blog

7 best practices for running Dataproc in production

We’ve put together the top seven best practices to help you develop highly reliable and stable production processes that use Dataproc. These will help you process data faster to get better insights and outcomes.

Learn more

Help for slow Hadoop/Spark jobs on Google Cloud: 10 questions to ask about your Hadoop and Spark cluster performance

A single, pointed question faces many first-time Hadoop-in-the-cloud users: Is this the performance I should expect? Customers sometimes deploy a proof-of-concept or move their first data set over to Cloud Storage, and then kick off their Hadoop and Spark jobs only to find performance far below their expectations. Almost always, it’s possible to quickly and efficiently restore performance to your deployment.

New report examines the economic value of Dataproc’s managed Spark and Hadoop solution

ESG recently published a blog and an Economic Value Validation (EVV) report commissioned by Google, which examines the value delivered by Dataproc. According to Mike Leone, senior analyst with ESG, comparing an on-premises Hadoop and Spark environment against hosting the same infrastructure in Dataproc “highlighted a 57 percent cost savings when leveraging Google Dataproc compared to an on-premises environment, and a 32 percent cost savings compared to Amazon EMR.”

Read blog

Introducing advanced security options for Dataproc, now generally available

Dataproc’s new security configurations give you the best of two worlds: access to modern, best-in-class security features and infrastructure, and the familiar controls you’ve already developed for your Hadoop and Spark environments.

Read blog

SparkR job types in Dataproc

Using GCP for R lets you avoid the infrastructure barriers that used to impose limits on understanding your data, such as choosing which datasets to sample because of compute or data size limits. With GCP, you can build large-scale models to analyze datasets of sizes that previously would have required huge upfront investments in high-performance computing infrastructure.

Read blog

Highlights from Next ’19

Watch how customers use Dataproc to lower costs and make data-driven decisions in their organizations.

Dataproc's Newest Features

Watch video

How Customers Are Migrating Hadoop to Google Cloud Platform

Watch video

Democratizing Dataproc

Watch video

Get started

Learn and build

New to GCP? Get started with any GCP product for free with a $300 credit.

Try free

Need more help?

Our experts will help you build the right solution or find the right partner for your needs.

Products listed on this page are in alpha, beta, or early access. For more information on our product launch stages, see here.

Cloud AI products comply with the SLA policies listed here. They may offer different latency or availability guarantees from other Google Cloud services.

archive.today webpage capture, saved 24 Mar 2020 21:41:45 UTC from host cloud.google.com

Getting Started Guide

Dataproc Docs

Open Source Connectors

Initialization Actions

Video: Dataproc Intro

Tutorial: Spark ML and BigQuery
