Deploy a Scala Spark job on GCP Dataproc with IntelliJ
In this article, we will use IntelliJ to create, compile, and deploy a Scala Spark job on GCP Dataproc.
Step 1: Install and Set Up IntelliJ with the Scala Plugin
- Install IntelliJ: If you don’t have this IDE installed, download it from JetBrains’ website.
- Install the Scala Plugin:
  - Open IntelliJ IDEA.
  - Go to File > Settings > Plugins.
  - Search for Scala and install the plugin.
Step 2: Create a New Scala SBT Project in IntelliJ
- Open IntelliJ IDEA and select New Project.
- Choose Scala on the left sidebar.
- Name your project (e.g., CustomerDataproc) and choose a location.
- For Project SDK, select your JDK (I use JDK 23 in this example). Note that Spark 3.5 officially supports Java 8, 11, and 17, so if you build with a newer JDK, make sure the compiled bytecode targets a Java version that the Dataproc cluster's JVM can run.
- Select sbt as the build system, then click Create.
Step 3: Configure build.sbt for Spark Dependencies
Once your project is created, open the build.sbt file in the project folder and add the necessary Spark dependencies:
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.12.18"
ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1"
ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"
lazy val root = (project in file("."))
  .settings(
    name := "CustomerDataproc",
    idePackagePrefix := Some("org.henri")
  )
Note: The spark-core and spark-sql versions should match the Spark version of the Dataproc image. In this example, we use Spark 3.5.1, which matches the 2.2.39-debian12 Dataproc image. To ensure Scala version compatibility, we choose Scala 2.12.18, which is compatible with Spark 3.5.1.
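If you prefer a smaller JAR, one option is to rely on the Spark installation that the Dataproc cluster already provides and mark the Spark artifacts with the "provided" scope so they are not bundled. The snippet below is a minimal sketch of that variation (same versions as above); keep in mind that provided dependencies are not on the classpath when you run the object directly from the IDE unless your run configuration includes them.
// Alternative dependency declaration: Spark comes from the Dataproc runtime,
// so it is left out of the packaged JAR ("provided" scope).
ThisBuild / libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.5.1" % "provided"
)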
Step 4: Write Your Spark Job Code
1. In the Project view, navigate to src/main/scala.
2. Right-click on the scala folder and select New > Scala Class. In the Create New Scala Class window, choose Object and name it DataprocJob.
3. Write your Spark job code in this file. For example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJob {
  def main(args: Array[String]): Unit = {
    // Initialize the Spark session
    val spark = SparkSession.builder
      .appName("CustDataApp")
      .getOrCreate()

    val inputPath = "gs://h_customer_data/inputData/custData.csv"
    val outputPath = "gs://h_customer_data/outputData/average_purchase_by_gender_scala"

    // Read the CSV file into a DataFrame
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(inputPath)

    // Filter out customers under 18 years old
    val filteredDf = df.filter(df("age") >= 18)

    // Group by gender and calculate the average purchase amount
    val resultDf = filteredDf.groupBy("gender")
      .agg(avg("purchase_amount").as("average_purchase_amount"))

    // Write the results back to GCS
    resultDf.write
      .option("header", "true")
      .csv(outputPath)

    spark.stop()
  }
}
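Before packaging, it can be useful to smoke-test the same transformation locally, without a cluster. The sketch below is a minimal variation of the job for that purpose; the local[*] master and the data/custData.csv path are illustrative assumptions rather than part of the Dataproc setup, and they presume a small sample CSV with the same age, gender, and purchase_amount columns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJobLocalCheck {
  def main(args: Array[String]): Unit = {
    // Run Spark in-process so no cluster or GCS access is needed
    val spark = SparkSession.builder
      .appName("CustDataAppLocalCheck")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical local sample file with the same header as the GCS input
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/custData.csv")

    // Same logic as the Dataproc job: adults only, average purchase by gender
    df.filter(col("age") >= 18)
      .groupBy("gender")
      .agg(avg("purchase_amount").as("average_purchase_amount"))
      .show()

    spark.stop()
  }
}
If the aggregated table prints as expected here, the cluster run should behave the same way, with only the input and output paths changing.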
Step 5: Compile and Package the Project as a JAR
- In the top menu, go to View > Tool Windows > sbt to open the sbt tool window.
- In the sbt tool window, click Refresh to load the new dependencies.
- To package the project, expand the sbt tasks section, navigate to package, and double-click package to start the process.
After successful packaging, the JAR will be created in target/scala-2.12/. You should see a file like customerdataproc_2.12-0.1.0-SNAPSHOT.jar.
Step 6: Upload the JAR to Google Cloud Storage
- Open Terminal in IntelliJ: Go to View > Tool Windows > Terminal.
- Then, use the gsutil command to upload the JAR file to your bucket on Google Cloud Storage:
gsutil cp target/scala-2.12/customerdataproc_2.12-0.1.0-SNAPSHOT.jar gs://h_customer_data/jars/
Note: If you get an error like “gsutil: command not found”, the gsutil tool is either not installed or not on your system’s PATH. To fix this, follow the Google Cloud SDK installation guide to install the SDK, then run “gcloud init” to initialize it. Follow the prompts to authenticate, select your Google Cloud project, and set the default configurations. Finally, add gsutil to your PATH.
Step 7: Submit the Job to Google Cloud Dataproc
Once the JAR is uploaded, you can submit the job from the terminal in IntelliJ or Cloud Shell:
gcloud dataproc jobs submit spark \
--cluster=h-my-dataproc-cluster \
--region=europe-west9 \
--jars=gs://h_customer_data/jars/customerdataproc_2.12-0.1.0-SNAPSHOT.jar \
--class=DataprocJob
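One detail worth double-checking before you submit: --class must be the fully qualified name of the main object. Because build.sbt sets idePackagePrefix to org.henri, IntelliJ may have created the source file with a package clause like the sketch below (the prefix is hidden in the Project view); in that case, submit with --class=org.henri.DataprocJob instead of --class=DataprocJob.
// If the idePackagePrefix was applied, the file from Step 4 starts like this,
// and the compiled class is org.henri.DataprocJob:
package org.henri

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJob {
  // ... same body as in Step 4
}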
Step 8: Verify Job Execution on Dataproc
After submission, you can monitor the job in the Google Cloud Console by navigating to Dataproc > Jobs. Logs and job statuses will be available there to help you troubleshoot or confirm successful execution.
By following the guide in this article, we can use IntelliJ for end-to-end development, packaging, and deployment of a Spark job to Dataproc.
Tip: To visualize the CSV result data in your bucket (e.g., gs://h_customer_data/outputData/average_purchase_by_gender_scala) as a bar chart in Looker Studio, follow these steps:
Step 1: Make Your CSV Result Data Available in BigQuery
Looker Studio can connect to BigQuery, so first make your CSV result available in BigQuery. To do this, create a dataset, then create a table in that dataset from your bucket (select Google Cloud Storage as the source).
Step 2: Connect BigQuery to Looker Studio
Open Looker Studio at https://lookerstudio.google.com, click “+ Create”, and select Data Source. Search for and select BigQuery as the connector. In the Data Source view, select the project, dataset, and table, then click Connect. On the next page, click Create Report.
Step 3: Create a Bar Chart in Looker Studio
In Looker Studio, from the menu bar, click Add a chart > Bar chart and choose a style. To configure the chart, in the Setup tab, set Dimension to gender and Metric to average_purchase_amount (the column produced by the Spark job). You can customize the chart from the Style tab to change colors, add labels, and format the chart as needed.