
Deploy a Scala Spark job on GCP Dataproc with IntelliJ

Nov 11, 2024 · 5 min read

In this article, we will use IntelliJ to create, compile, and deploy a Scala Spark job on GCP Dataproc.

Photo by Rui Alves on Unsplash

Step 1: Install and Set Up IntelliJ with the Scala Plugin

  1. Install IntelliJ: If you don’t have this IDE installed, download it from JetBrains’ website.
  2. Install the Scala Plugin:
    - Open IntelliJ IDEA.
    - Go to File > Settings > Plugins.
    - Search for Scala and install the plugin.
Figure 1: Install Scala plugin on IntelliJ.

Step 2: Create a New Scala SBT Project in IntelliJ

  1. Open IntelliJ IDEA and select New Project.
  2. Choose Scala on the left sidebar.
  3. Name your project (e.g., CustomerDataproc) and choose a location.
  4. For Project SDK, select your JDK (I use JDK 23 in this example).
  5. Select sbt as the build system, then click Create.
Figure 2: Create a new Scala SBT project.

Step 3: Configure build.sbt for Spark Dependencies

Once your project is created, open the build.sbt file in the project folder and add the necessary Spark dependencies:

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.12.18"

ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1"
ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"

lazy val root = (project in file("."))
.settings(
name := "CustomerDataproc",
idePackagePrefix := Some("org.henri")
)

Note: The spark-core and spark-sql versions should match the Spark version shipped with your Dataproc image. In this example, we use Spark 3.5.1, which matches the 2.2.39-debian12 Dataproc image, and Scala 2.12.18, which is compatible with Spark 3.5.1.
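Because the Dataproc cluster already provides Spark on its classpath at runtime, you can optionally mark the Spark dependencies as provided. This is a minimal sketch of the alternative build.sbt lines; it only really matters if you later build a fat JAR with sbt-assembly, since the plain package task used below does not bundle dependencies anyway:

ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1" % "provided"
ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"

If you use the provided scope and still want to run the job from IntelliJ, enable the option to include dependencies with Provided scope in the run configuration.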

Step 4: Write Your Spark Job Code

  1. In the Project view, navigate to src/main/scala.
  2. Right-click on scala, select New > Scala Class. In the Create New Scala Class window, choose Object and name it DataprocJob.

Figure 3: Create a Scala object.

  3. Write your Spark job code in this file. For example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJob {
  def main(args: Array[String]): Unit = {
    // Init Spark session
    val spark = SparkSession.builder
      .appName("CustDataApp")
      .getOrCreate()

    val inputPath = "gs://h_customer_data/inputData/custData.csv"
    val outputPath = "gs://h_customer_data/outputData/average_purchase_by_gender_scala"

    // Read CSV file and save data into DataFrame
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(inputPath)

    // Filter and transform data by removing any customers under 18 years old
    val filteredDf = df.filter(df("age") >= 18)

    // Group by gender and calculate average purchase amount
    val resultDf = filteredDf.groupBy("gender")
      .agg(avg("purchase_amount").as("average_purchase_amount"))

    // Write results back to GCS
    resultDf.write
      .option("header", "true")
      .csv(outputPath)

    spark.stop()
  }
}
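Before submitting to Dataproc, it can help to sanity-check the logic with a local Spark run. The sketch below is only an illustration: it assumes a hypothetical local copy of the input file at data/custData.csv and sets the master to local[*], neither of which is part of the deployed job. Keep in mind that Spark 3.5 officially supports Java 8, 11, and 17, so if a local run fails on a newer JDK, try JDK 17.

// Hypothetical local smoke test; not part of the deployed job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJobLocalTest {
  def main(args: Array[String]): Unit = {
    // Run Spark in-process on all local cores
    val spark = SparkSession.builder
      .appName("CustDataAppLocal")
      .master("local[*]")
      .getOrCreate()

    // Assumed local copy of the CSV (gs:// paths need the GCS connector locally)
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/custData.csv")

    // Same transformation as the Dataproc job, printed instead of written
    df.filter(df("age") >= 18)
      .groupBy("gender")
      .agg(avg("purchase_amount").as("average_purchase_amount"))
      .show()

    spark.stop()
  }
}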

Step 5: Compile and Package the Project as a JAR

  1. In the top menu, go to View > Tool Windows > sbt to open the sbt tool window.
  2. In the sbt tool window, click Refresh to load the new dependencies.
  3. To package the project, expand the sbt tasks section and navigate to package. Then double-click package to start the build.

After successful packaging, the JAR will be created in target/scala-2.12/. You should see a file like customerdataproc_2.12-0.1.0-SNAPSHOT.jar.

Figure 4: Package the project as a jar file.
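If you prefer the command line, the same build can be run from the IntelliJ terminal, and you can list the JAR contents to confirm the compiled class is inside (this assumes the JAR name above):

sbt clean package
jar tf target/scala-2.12/customerdataproc_2.12-0.1.0-SNAPSHOT.jar

Look for DataprocJob.class in the listing (or org/henri/DataprocJob.class if the package prefix from build.sbt was applied to your file).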

Step 6: Upload the JAR to Google Cloud Storage

  1. Open Terminal in IntelliJ: Go to View > Tool Windows > Terminal.
  2. Then, you can use the gsutil command to upload the JAR file to your bucket on Google Cloud Storage:
gsutil cp target/scala-2.12/customerdataproc_2.12-0.1.0-SNAPSHOT.jar gs://h_customer_data/jars/

Note: If you get an error like “gsutil: command not found”, the gsutil tool is either not installed or not on your system’s PATH. To fix it, follow the Google Cloud SDK installation guide to install the SDK, then run gcloud init to initialize it. Follow the prompts to authenticate, select your Google Cloud project, and set the default configuration. Finally, add gsutil to your PATH.

Figure 5: Add gsutil to your PATH on Windows.
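Recent versions of the Google Cloud SDK also ship the gcloud storage commands, which can be used instead of gsutil; either way, you can list the bucket afterwards to confirm the upload:

gcloud storage cp target/scala-2.12/customerdataproc_2.12-0.1.0-SNAPSHOT.jar gs://h_customer_data/jars/
gcloud storage ls gs://h_customer_data/jars/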

Step 7: Submit the Job to Google Cloud Dataproc

Once the JAR is uploaded, you can submit the job from the terminal in IntelliJ or Cloud Shell:

gcloud dataproc jobs submit spark \
--cluster=h-my-dataproc-cluster \
--region=europe-west9 \
--jars=gs://h_customer_data/jars/customerdataproc_2.12-0.1.0-SNAPSHOT.jar \
--class=DataprocJob
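
Note that --class must be the fully qualified name of the main object. The job code above declares no package, so DataprocJob works as-is; but if IntelliJ added the org.henri package prefix configured in build.sbt to your file, submit with the qualified name instead:

gcloud dataproc jobs submit spark \
--cluster=h-my-dataproc-cluster \
--region=europe-west9 \
--jars=gs://h_customer_data/jars/customerdataproc_2.12-0.1.0-SNAPSHOT.jar \
--class=org.henri.DataprocJob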

Step 8: Verify Job Execution on Dataproc

After submission, you can monitor the job in the Google Cloud Console by navigating to Dataproc > Jobs. Logs and job statuses will be available there to help you troubleshoot or confirm successful execution.

Figure 6: See job logs from Dataproc.
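You can also check the job and its output from the terminal; this sketch assumes the cluster, region, and output path used earlier:

gcloud dataproc jobs list --region=europe-west9 --cluster=h-my-dataproc-cluster
gsutil cat "gs://h_customer_data/outputData/average_purchase_by_gender_scala/part-*.csv"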

By following this guide, we can use IntelliJ for the end-to-end development, packaging, and deployment of a Spark job on Dataproc.

Tip: To visualize the CSV result data in your bucket (e.g., gs://h_customer_data/outputData/average_purchase_by_gender_scala) as a bar chart in Looker Studio, follow these steps:

Step 1: Make Your CSV Result Data Available in BigQuery
Looker Studio can connect to BigQuery, so first load your CSV result into BigQuery: create a dataset, then create a table in that dataset from your bucket (select Google Cloud Storage as the source).
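The same step can be done from the command line with the bq tool. The dataset and table names below (customer_dataset, average_purchase_by_gender) are hypothetical, and the wildcard picks up the part files that Spark wrote:

bq mk customer_dataset
bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
customer_dataset.average_purchase_by_gender \
"gs://h_customer_data/outputData/average_purchase_by_gender_scala/part-*.csv"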

Step 2: Connect BigQuery to Looker Studio
Open Looker Studio at https://lookerstudio.google.com, click “+ Create”, and select Data Source.

Figure 7: Create a Data Source.

Then, search for and select BigQuery as the connector. In the Data Source view, select the project, dataset, and table, then click Connect. On the next page, click Create Report.

Figure 8: Configure Data Source Connection.

Step 3: Create a Bar Chart in Looker Studio
In Looker Studio, from the menu bar, click Add a chart > Bar chart and choose a style. To configure the chart, on the Setup tab, set Dimension to gender and Metric to average_purchase_amount. You can customize the chart on the Style tab to change the colors, add labels, and format it as needed.

Figure 9: Create and configure bar chart.

Written by Henri TO, PhD

Data Engineer | MLOps | GCP Specialist
