This document shows you how to configure Filestore shared storage for Slurm jobs on Google Kubernetes Engine (GKE). Shared storage is essential for Slurm clusters because it helps ensure that the Slurm login and worker nodes can access the same configuration files, scripts, and job data.
Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Cloud Filestore API and the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Set up the cluster
To use shared storage with Slurm, you need a GKE cluster with the Slurm Operator add-on for GKE enabled. In this section, you set up a GKE cluster. You also set up OS Login for secure access and consistent user identification across the cluster.
Grant the roles/file.editor role to the service account that your cluster uses. If your nodes use the Compute Engine default service account, run the following command:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com \
        --role roles/file.editor

Replace the following:
- PROJECT_ID: your project ID.
- PROJECT_NUMBER: your project number.
Create a GKE cluster with the Slurm Operator add-on for GKE and OS Login enabled. To create the cluster, complete the steps in Deploy Slurm on GKE.
Enable the Filestore CSI driver on the cluster:

    gcloud container clusters update CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION \
        --update-addons=GcpFilestoreCsiDriver=ENABLED

Replace the following:
- CLUSTER_NAME: the name of the cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
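The role-granting step in this section needs the project number, which is different from the project ID. As a minimal sketch, the following helper builds the member string for the Compute Engine default service account from a project number. The function name default_compute_member is hypothetical and isn't part of the gcloud CLI; the commented usage assumes that the gcloud CLI is installed and authenticated.

```shell
# Hypothetical helper (not part of the gcloud CLI): prints the IAM member
# string for the Compute Engine default service account of a project.
# Argument 1: the numeric project number.
default_compute_member() {
  printf 'serviceAccount:%s-compute@developer.gserviceaccount.com\n' "$1"
}

# Example usage (assumes an authenticated gcloud CLI):
#   PROJECT_NUMBER="$(gcloud projects describe PROJECT_ID \
#     --format='value(projectNumber)')"
#   gcloud projects add-iam-policy-binding PROJECT_ID \
#     --member "$(default_compute_member "$PROJECT_NUMBER")" \
#     --role roles/file.editor
```

Looking up the project number with gcloud projects describe avoids copying it by hand from the console. If your nodes use a custom service account, pass that account to --member instead.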
Configure Filestore
To use Filestore as shared storage for your Slurm cluster, the Filestore CSI driver must be enabled on the cluster, which you did in the previous section. You then create a Filestore instance, a PersistentVolume (PV), and a PersistentVolumeClaim (PVC).
Create a Filestore instance:

    gcloud filestore instances create INSTANCE_NAME \
        --zone=INSTANCE_ZONE \
        --tier=BASIC_HDD \
        --file-share=name="SHARE_NAME",capacity=1TB \
        --network=name="default"

Replace the following:
- INSTANCE_NAME: the name of your new instance.
- INSTANCE_ZONE: the zone of your Filestore instance. If you're using a regional cluster, use a zone in its region. If you're using a zonal cluster, use the same zone as the cluster.
- SHARE_NAME: the name of the file share.
Note the IP address of the instance. You need it for the INSTANCE_IP placeholder in the manifest:

    gcloud filestore instances describe INSTANCE_NAME \
        --zone=INSTANCE_ZONE \
        --format="table(networks[0].ipAddresses[0])"

The output includes the value for the INSTANCE_IP placeholder. You use this value in the manifest in the following step.

To define the PV and PVC, create a manifest file named filestore-pvc.yaml:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: slurm-filestore-pv
    spec:
      storageClassName: ""
      capacity:
        storage: 1Ti
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      volumeMode: Filesystem
      csi:
        driver: filestore.csi.storage.gke.io
        volumeHandle: "modeInstance/INSTANCE_ZONE/INSTANCE_NAME/SHARE_NAME"
        volumeAttributes:
          ip: INSTANCE_IP
          volume: SHARE_NAME
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: slurm-shared-storage-pvc
      namespace: slurm
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: ""
      resources:
        requests:
          storage: 1Ti
      volumeName: slurm-filestore-pv

Replace the following:
- INSTANCE_ZONE: the zone of your Filestore instance.
- INSTANCE_NAME: the name of your Filestore instance.
- SHARE_NAME: the name of the share on the Filestore instance.
- INSTANCE_IP: the IP address of the Filestore instance that you retrieved earlier.
Apply the manifest:
kubectl apply -f filestore-pvc.yaml
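If you prefer to script the placeholder replacement in filestore-pvc.yaml instead of editing the file by hand, the following is one possible sketch. It assumes that the manifest still contains the literal placeholder names from this document. The function name fill_placeholders and the output file name filestore-pvc.rendered.yaml are hypothetical.

```shell
# Hypothetical helper: reads a manifest template on standard input and
# writes it to standard output with the four placeholders replaced.
# Arguments: 1 = instance zone, 2 = instance name, 3 = share name,
#            4 = instance IP address.
fill_placeholders() {
  sed -e "s|INSTANCE_ZONE|$1|g" \
      -e "s|INSTANCE_NAME|$2|g" \
      -e "s|SHARE_NAME|$3|g" \
      -e "s|INSTANCE_IP|$4|g"
}

# Example usage (assumes an authenticated gcloud CLI; the value() format
# prints the bare IP address without a table header):
#   INSTANCE_IP="$(gcloud filestore instances describe INSTANCE_NAME \
#     --zone=INSTANCE_ZONE --format='value(networks[0].ipAddresses[0])')"
#   fill_placeholders INSTANCE_ZONE INSTANCE_NAME SHARE_NAME "$INSTANCE_IP" \
#     < filestore-pvc.yaml > filestore-pvc.rendered.yaml
#   kubectl apply -f filestore-pvc.rendered.yaml
```

The sketch uses | as the sed delimiter so that it doesn't clash with the / characters in the volumeHandle value. It doesn't escape special characters in the arguments, which is acceptable for zone names, instance names, share names, and IP addresses.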
Use shared storage in Slurm configurations
After creating your PVC, you can configure the Slurm Operator resources to mount the shared storage.
Configure Slurm login node and worker nodes
Find an available image tag:
In the Google Cloud console, go to the Artifact Registry repository page that includes the slinky/slurmd package. Note one of the image tag values, for example 25.11-ubuntu24.04-gke.4. You use this tag in the IMAGE_TAG placeholder in the following configuration file.
Save the following configuration to a file named values.yaml:

    controller:
      slurmctld:
        image:
          repository: gcr.io/gke-release/slinky/slurmctld
          tag: IMAGE_TAG
      reconfigure:
        image:
          repository: gcr.io/gke-release/slinky/slurmctld
          tag: IMAGE_TAG
    restapi:
      replicas: 1
      slurmrestd:
        image:
          repository: gcr.io/gke-release/slinky/slurmrestd
          tag: IMAGE_TAG
    nodesets:
      slinky:
        replicas: 1
        slurmd:
          image:
            repository: gcr.io/gke-release/slinky/slurmd
            tag: IMAGE_TAG
          volumeMounts:
            - name: home-vol
              mountPath: /home
        podSpec:
          nodeSelector:
            cloud.google.com/gke-nodepool: NODE_POOL_NAME
          volumes:
            - name: home-vol
              persistentVolumeClaim:
                claimName: slurm-shared-storage-pvc
    loginsets:
      slinky:
        enabled: true
        replicas: 1
        login:
          image:
            repository: gcr.io/gke-release/slinky/login
            tag: IMAGE_TAG
          volumeMounts:
            - name: home-vol
              mountPath: /home
        podSpec:
          volumes:
            - name: home-vol
              persistentVolumeClaim:
                claimName: slurm-shared-storage-pvc

Replace the following:
- IMAGE_TAG: the tag that you noted in the previous step.
- NODE_POOL_NAME: the name of the node pool where you want to deploy the Slurm worker Pods.
Upgrade the Slurm chart by using the values.yaml file:

    helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
        --version 1.0.2 \
        --namespace=slurm \
        -f values.yaml
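A placeholder that's left unreplaced in values.yaml or filestore-pvc.yaml typically surfaces only later, for example as an image pull failure. As a minimal sketch, the following helper searches a file for the placeholder names that this document uses. The function name find_placeholders is hypothetical.

```shell
# Hypothetical helper: prints every line of the given file that still
# contains one of this document's placeholder names, prefixed with its
# line number. Exit status is 0 if a placeholder was found, 1 if none.
find_placeholders() {
  grep -nE 'IMAGE_TAG|NODE_POOL_NAME|INSTANCE_(ZONE|NAME|IP)|SHARE_NAME' "$1"
}

# Example usage before running helm upgrade:
#   if find_placeholders values.yaml; then
#     echo "Replace the placeholders listed above first." >&2
#   fi
```

The check relies on the exit status of grep, so it reports success when a placeholder is present, not when the file is clean.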
Verify shared storage
To verify that shared storage is mounted correctly, follow these steps:
Check that the PVCs and PVs are bound:

    kubectl get pvc -n slurm

The output should show the status of all PVCs as Bound to their PVs.

Connect to the Slurm login node by completing the steps in Configure OS Login.

On the login node, check the mount paths:

    df -h /home

Check the mount paths on the worker nodes:

    srun -N 1 df -h /home
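To automate the first verification step, for example in a setup script, you can check the STATUS column of the kubectl output instead of reading it by eye. The following is a sketch under the assumption that the input comes from kubectl get pvc with the --no-headers flag, where the status is the second column. The function name all_pvcs_bound is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pvc --no-headers` output on
# standard input. Exit status is 0 only if at least one PVC is listed and
# every listed PVC has the status Bound; otherwise it is 1.
all_pvcs_bound() {
  awk '$2 != "Bound" { bad = 1 } END { exit (bad || NR == 0) }'
}

# Example usage (assumes kubectl is configured for your cluster):
#   if kubectl get pvc -n slurm --no-headers | all_pvcs_bound; then
#     echo "All PVCs are bound."
#   else
#     echo "At least one PVC is not bound, or no PVCs exist." >&2
#   fi
```

Treating empty input as a failure is a deliberate choice in this sketch: it catches the case where the manifest was applied to the wrong namespace and no PVC exists in the slurm namespace.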
Clean up
Clean up the Slurm cluster and its resources by following the directions in the Clean up section of Deploy a Slurm cluster on GKE.
Delete the Filestore instance:

    gcloud filestore instances delete INSTANCE_NAME --zone=INSTANCE_ZONE --quiet

Replace INSTANCE_NAME with the name of your Filestore instance and INSTANCE_ZONE with its zone.
What's next
- Learn more about Filestore on GKE.