Dataproc

Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at scale, integrated with Google Cloud, at a fraction of the cost.

  • Open: Run open source data analytics at scale, with enterprise grade security

  • Flexible: Use serverless, or manage clusters on Compute Engine and Kubernetes

  • Intelligent: Enable data users through integrations with Vertex AI, BigQuery, and Dataplex 

  • Secure: Configure advanced security such as Kerberos, Apache Ranger, and Personal Cluster Authentication

  • Cost-effective: Realize 54% lower TCO compared to on-prem data lakes with per-second pricing

Benefits

Modernize your open source data processing

Whether you need VMs or Kubernetes, extra memory for Presto, or even GPUs, Dataproc can help accelerate your data and analytics processing through on-demand purpose-built or serverless environments.

Intelligent and seamless OSS for data science

Enable data scientists and data analysts to seamlessly perform data science jobs through native integrations with BigQuery, Dataplex, and Vertex AI.

Advanced security, compliance, and governance

Enforce fine-grained row- and column-level access controls with Dataproc, BigLake, and Dataplex. Manage and enforce user authorization and authentication using existing Kerberos and Apache Ranger policies.

Use cases

Move your Hadoop and Spark clusters to the cloud

Enterprises are migrating their existing on-premises Apache Hadoop and Spark clusters over to Dataproc to manage costs and unlock the power of elastic scale. With Dataproc, enterprises get a fully managed, purpose-built cluster that can autoscale to support any data or analytics processing job. 
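
To make the migration path concrete, here is a minimal sketch of recreating a small cluster shape with the google-cloud-dataproc Python client. The project ID, region, cluster name, machine type, and node counts below are illustrative placeholders, not recommendations.

# Minimal sketch: create a Dataproc cluster with the Python client library
# (pip install google-cloud-dataproc). All values below are placeholders.
from google.cloud import dataproc_v1

project_id = "your-project-id"           # placeholder
region = "us-central1"                   # placeholder
cluster_name = "migrated-spark-cluster"  # placeholder

# The cluster controller API is regional, so point the client at the region.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 5, "machine_type_uri": "n2-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() waits for it.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
created = operation.result()
print(f"Cluster created: {created.cluster_name}")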

All features


Serverless Spark: Deploy Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning (a submission sketch follows this list).
Resizable clusters: Create and scale clusters quickly with various virtual machine types, disk sizes, numbers of nodes, and networking options.
Autoscaling clusters: Dataproc autoscaling provides a mechanism for automating cluster resource management and enables automatic addition and removal of cluster workers (nodes).
Cloud integrated: Built-in integration with Cloud Storage, BigQuery, Dataplex, Vertex AI, Composer, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, giving you a more complete and robust data platform.
Versioning: Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
Cluster scheduled deletion: To help avoid incurring charges for an inactive cluster, you can use Dataproc's scheduled deletion, which can delete a cluster after a specified idle period, at a specified future time, or after a specified duration.
Automatic or manual configuration: Dataproc automatically configures hardware and software but also gives you manual control.
Developer tools: Multiple ways to manage a cluster, including an easy-to-use web UI, the Cloud SDK, RESTful APIs, and SSH access.
Initialization actions: Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
Optional components: Use optional components to install and configure additional software on the cluster. Optional components are integrated with Dataproc components and offer fully configured environments for Zeppelin, Presto, and other open source software related to the Apache Hadoop and Apache Spark ecosystem.
Custom containers and images: Dataproc serverless Spark can be provisioned with custom Docker containers. Dataproc clusters can be provisioned with a custom image that includes your pre-installed Linux operating system packages.
Flexible virtual machines: Clusters can use custom machine types and preemptible virtual machines to make them the perfect size for your needs.
Component Gateway and notebook access: Dataproc Component Gateway enables secure, one-click access to Dataproc default and optional component web interfaces running on the cluster.
Workflow templates: Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A workflow template is a reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.
Automated policy management: Standardize security, cost, and infrastructure policies across a fleet of clusters. You can create policies for resource management, security, or networking at the project level, and make it easy for users to pick up the correct images, components, metastore, and other peripheral services, so you can manage your fleet of clusters and serverless Spark policies over time.
Smart alerts: Dataproc recommended alerts come pre-configured with adjustable thresholds, so you can be notified about idle or runaway clusters and jobs, overutilized clusters, and more. You can further customize these alerts and build advanced cluster and job management on top of them, making it easier to manage a fleet at scale.
Dataproc Metastore: Fully managed, highly available Hive Metastore (HMS) with fine-grained access control and integration with BigQuery metastore, Dataplex, and Data Catalog.
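
As an illustration of the Serverless Spark feature referenced above, the following is a minimal sketch of submitting a PySpark batch with the google-cloud-dataproc Python client. The project ID, region, batch ID, and Cloud Storage path are placeholders.

# Minimal sketch: submit a serverless Spark workload as a Dataproc batch.
# Values below are placeholders; the PySpark script must already exist
# at the given Cloud Storage path.
from google.cloud import dataproc_v1

project_id = "your-project-id"   # placeholder
region = "us-central1"           # placeholder
batch_id = "example-batch-001"   # placeholder

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = {
    "pyspark_batch": {
        "main_python_file_uri": "gs://your-bucket/path/to/job.py"  # placeholder
    }
}

# create_batch returns a long-running operation; result() waits for completion.
operation = client.create_batch(
    request={
        "parent": f"projects/{project_id}/locations/{region}",
        "batch": batch,
        "batch_id": batch_id,
    }
)
result = operation.result()
print(f"Batch {result.name} finished in state {result.state.name}")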

Pricing


Dataproc pricing is based on the number of vCPUs and the duration of time that they run. While pricing shows an hourly rate, we charge down to the second, so you only pay for what you use.

Example: a 6-node cluster (1 main + 5 workers) with 4 vCPUs per node, running for 2 hours, would cost $0.48. Dataproc charge = # of vCPUs * hours * Dataproc price = 24 * 2 * $0.01 = $0.48
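
The same arithmetic in Python, as a quick sanity check. The $0.01 per vCPU-hour rate is taken from the example above and covers only the Dataproc charge; the underlying Compute Engine resources are billed separately.

# Dataproc charge from the worked example above (illustrative rate only).
nodes = 6                    # 1 main + 5 workers
vcpus_per_node = 4
hours = 2
price_per_vcpu_hour = 0.01   # example rate from the text above

charge = nodes * vcpus_per_node * hours * price_per_vcpu_hour
print(f"${charge:.2f}")      # -> $0.48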

Please see the pricing page for details.

Partners


Dataproc integrates with key partners to complement your existing investments and skill sets. 
