Deploy a Scala Spark job on GCP Dataproc with IntelliJ
In this article, we will use IntelliJ to create, compile, and deploy a Scala Spark job on GCP Dataproc.
Step 1: Install and Set Up IntelliJ with the Scala Plugin
- Install IntelliJ: If you don’t have this IDE installed, download it from JetBrains’ website.
- Install the Scala Plugin:
  - Open IntelliJ IDEA.
  - Go to File > Settings > Plugins.
  - Search for Scala and install the plugin.
Step 2: Create a New Scala SBT Project in IntelliJ
- Open IntelliJ IDEA and select New Project.
- Choose Scala on the left sidebar.
- Name your project (e.g., CustomerDataproc) and choose a location.
- For Project SDK, select your JDK (I use JDK 23 in this example). Note that Spark 3.5 officially supports Java 8, 11, and 17, so if you build with a newer JDK, make sure the compiled bytecode targets a Java version that the Dataproc cluster's JVM can run.
- Select sbt as the build system, then click Create.
Step 3: Configure build.sbt for Spark Dependencies
Once your project is created, open the build.sbt file in the project folder and add the necessary Spark dependencies:
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.12.18"
ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1"
ThisBuild / libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"
lazy val root = (project in file("."))
  .settings(
    name := "CustomerDataproc",
    idePackagePrefix := Some("org.henri")
  )
Note: The spark-core and spark-sql versions should match the Spark version of the Dataproc image. In this example, we use Spark 3.5.1, which matches the 2.2.39-debian12 Dataproc image. To ensure Scala version compatibility, we choose Scala 2.12.18, which is compatible with Spark 3.5.1.
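If you prefer a smaller JAR, one option is to rely on the Spark installation that the Dataproc cluster already provides and mark the Spark artifacts with the "provided" scope so they are not bundled. The snippet below is a minimal sketch of that variation (same versions as above); keep in mind that provided dependencies are not on the classpath when you run the object directly from the IDE unless your run configuration includes them.
// Alternative dependency declaration: Spark comes from the Dataproc runtime,
// so it is left out of the packaged JAR ("provided" scope).
ThisBuild / libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.5.1" % "provided"
)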
Step 4: Write Your Spark Job Code
1. In the Project view, navigate to src/main/scala.
2. Right-click on the scala folder and select New > Scala Class. In the Create New Scala Class window, choose Object and name it DataprocJob.
3. Write your Spark job code in this file. For example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJob {
  def main(args: Array[String]): Unit = {
    // Initialize the Spark session
    val spark = SparkSession.builder
      .appName("CustDataApp")
      .getOrCreate()

    val inputPath = "gs://h_customer_data/inputData/custData.csv"
    val outputPath = "gs://h_customer_data/outputData/average_purchase_by_gender_scala"

    // Read the CSV file into a DataFrame
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(inputPath)

    // Filter out customers under 18 years old
    val filteredDf = df.filter(df("age") >= 18)

    // Group by gender and calculate the average purchase amount
    val resultDf = filteredDf.groupBy("gender")
      .agg(avg("purchase_amount").as("average_purchase_amount"))

    // Write the results back to GCS
    resultDf.write
      .option("header", "true")
      .csv(outputPath)

    spark.stop()
  }
}
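Before packaging, it can be useful to smoke-test the same transformation locally, without a cluster. The sketch below is a minimal variation of the job for that purpose; the local[*] master and the data/custData.csv path are illustrative assumptions rather than part of the Dataproc setup, and they presume a small sample CSV with the same age, gender, and purchase_amount columns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJobLocalCheck {
  def main(args: Array[String]): Unit = {
    // Run Spark in-process so no cluster or GCS access is needed
    val spark = SparkSession.builder
      .appName("CustDataAppLocalCheck")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical local sample file with the same header as the GCS input
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/custData.csv")

    // Same logic as the Dataproc job: adults only, average purchase by gender
    df.filter(col("age") >= 18)
      .groupBy("gender")
      .agg(avg("purchase_amount").as("average_purchase_amount"))
      .show()

    spark.stop()
  }
}
If the aggregated table prints as expected here, the cluster run should behave the same way, with only the input and output paths changing.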
Step 5: Compile and Package the Project as a JAR
- In the top menu, go to View > Tool Windows > sbt to open the sbt tool window.
- In the sbt tool window, click Refresh to load the new dependencies.
- To package the project, expand the sbt tasks section, navigate to package, and double-click package to start the process.
After successful packaging, the JAR will be created in target/scala-2.12/. You should see a file like customerdataproc_2.12-0.1.0-SNAPSHOT.jar.
Step 6: Upload the JAR to Google Cloud Storage
- Open Terminal in IntelliJ: Go to View > Tool Windows > Terminal.
- Then, use the gsutil command to upload the JAR file to your bucket on Google Cloud Storage:
gsutil cp target/scala-2.12/customerdataproc_2.12-0.1.0-SNAPSHOT.jar gs://h_customer_data/jars/
Note: If you get an error like “gsutil: command not found”, the gsutil tool is either not installed or not on your system’s PATH. To fix this, follow the Google Cloud SDK installation guide to install the SDK, then run “gcloud init” to initialize it. Follow the prompts to authenticate, select your Google Cloud project, and set the default configurations. Finally, add gsutil to your PATH.
Step 7: Submit the Job to Google Cloud Dataproc
Once the JAR is uploaded, you can submit the job from the terminal in IntelliJ or Cloud Shell:
gcloud dataproc jobs submit spark \
--cluster=h-my-dataproc-cluster \
--region=europe-west9 \
--jars=gs://h_customer_data/jars/customerdataproc_2.12-0.1.0-SNAPSHOT.jar \
--class=DataprocJob
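One detail worth double-checking before you submit: --class must be the fully qualified name of the main object. Because build.sbt sets idePackagePrefix to org.henri, IntelliJ may have created the source file with a package clause like the sketch below (the prefix is hidden in the Project view); in that case, submit with --class=org.henri.DataprocJob instead of --class=DataprocJob.
// If the idePackagePrefix was applied, the file from Step 4 starts like this,
// and the compiled class is org.henri.DataprocJob:
package org.henri

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataprocJob {
  // ... same body as in Step 4
}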
Step 8: Verify Job Execution on Dataproc
After submission, you can monitor the job in the Google Cloud Console by navigating to Dataproc > Jobs. Logs and job statuses will be available there to help you troubleshoot or confirm successful execution.
By following the guide in this article, we can use IntelliJ for end-to-end development, packaging, and deployment of a Spark job to Dataproc.
Tip: To visualize the CSV result data in your bucket (e.g., gs://h_customer_data/outputData/average_purchase_by_gender_scala) as a bar chart in Looker Studio, follow these steps:
Step 1: Make Your CSV Result Data Available in BigQuery
Looker Studio can connect to BigQuery, so first make your CSV result available in BigQuery. To do this, create a dataset, then create a table in that dataset from your bucket (select Google Cloud Storage as the source).
Step 2: Connect BigQuery to Looker Studio
Open Looker Studio at https://lookerstudio.google.com, click “+ Create”, and select Data Source. Search for and select BigQuery as the connector. In the Data Source view, select the project, dataset, and table, then click Connect. On the next page, click Create Report.
Step 3: Create a Bar Chart in Looker Studio
In Looker Studio, from the menu bar, click Add a chart > Bar chart and choose a style. To configure the chart, in the Setup tab, set Dimension to gender and Metric to average_purchase_amount (the column produced by the Spark job). You can customize the chart from the Style tab to change colors, add labels, and format the chart as needed.