Learn how we improved our deployment techniques using Terraform and ArgoCD to manage infrastructure and continuous service deployment. Understand the challenges and solutions we implemented to enhance our Kubernetes deployment processes.
This article is the first in a series aimed at sharing and explaining the technical efforts we are implementing so that you can draw inspiration from them or discuss them with us.
Introduction
This blog article is intended for technical readers with a basic understanding of Kubernetes (though nothing prevents you from reading it and looking things up as you go).
The objective of this article is to explain how we've enhanced our deployment techniques by implementing best practices for infrastructure reinstallation and continuous service deployment, leveraging Terraform and ArgoCD effectively.
This article will explain all our considerations to arrive at this architecture diagram:
Context
We have always favored infrastructure-as-code over manual administration via graphical interfaces, particularly for its reproducibility and maintenance properties.
ArgoCD and Terraform are two perfect candidates for applying these principles:
- Terraform for all projects to be installed in a "one-shot" manner
- ArgoCD for projects that evolve continuously
Here is the initial configuration we had before generalizing ArgoCD to our technical services:
A Terraform project manages the operations to be performed only once:
- The creation of the cluster (Azure Kubernetes Service)
- The installation of Helm Charts:
- The CNI
- The observability stack (Prometheus for metrics and Loki for logs)
- The HTTP stack (Ingress and Cert-Manager)
- Our secrets stack
- The installation of ArgoCD and the initialization of its applications
Subsequently, ArgoCD will take care of continuously deploying our applications.
A CNI (Container Network Interface) enables Kubernetes networking capabilities.
Our Issues
Several problems arise with our current setup:
- How to avoid errors during the installation of the Terraform project?
- How to properly update the installed charts?
- How (and why) to persist IP addresses through cluster recreations?
- How to persist logs during cluster recreations?
- How to make this transition without recreating a cluster?
We will address these questions in dedicated sections.
1. Terraform
How to avoid errors during the installation of the Terraform project?
To begin, Terraform is a tool that allows you to shape infrastructure on a cloud provider using code.
It is very useful when you want to reproduce identical infrastructure and avoid forgetting things (especially when recreating it months or years later).
I will say Terraform in this article, but we use "OpenTofu," an open-source fork. To find out why, I refer you to this section of their FAQ: OpenTofu FAQ.
Terraform is primarily intended to create and maintain infrastructure rather than software (though it is perfectly capable of doing so).
Another issue with Terraform is that it must save the state of its installation (in a file called terraform.tfstate), and this state can be cumbersome to maintain or share with a team for future development or maintenance.
There are solutions for saving and sharing this state with a team (for example: HashiCorp, GitLab-managed Terraform state, or an S3 bucket), but these options are too sophisticated for our use case.
Another point is that Terraform is very sensitive to errors; it will stop as soon as it encounters one. The later an error occurs in the installation process, the longer it will take to restart all previous steps.
For all these reasons, we want Terraform to handle only the cloud infrastructure (machines, network) and the initial installation of our continuous deployment software (ArgoCD).
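To give an idea of what this initialization can look like, here is a minimal sketch of a bootstrap "Application" (the app-of-apps pattern) that Terraform could apply once so that ArgoCD then manages everything else. The Application resource itself is explained in the next section, and the repository URL and path below are hypothetical, not the ones we actually use:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: infra
  source:
    repoURL: https://gitlab.example.com/neomi/argocd-apps.git   # hypothetical repository
    targetRevision: main
    path: applications   # hypothetical path containing the other Application manifests
  destination:
    server: "https://kubernetes.default.svc"
    namespace: argocd
  syncPolicy:
    automated:
      prune: true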
If we revisit our diagram, it would look like this:
2. ArgoCD and Helm Charts
How to properly update the installed charts?
New problems (and thus new solutions): how do we install all the Helm Charts that Terraform was handling?
For this, we quickly turned to the ArgoCD documentation and realized that it supports the (continuous) deployment of Helm projects 🥳
Before continuing, and for those who are not familiar with ArgoCD, here is how it works: you create a Kubernetes resource called "Application" (grouped into "projects") that defines its type (YAML file, Kustomization, Helm Chart, or others), where ArgoCD should find it, and its configuration (auto-synchronization, automatic pruning of undefined resources, etc.). ArgoCD then regularly synchronizes these applications.
Here is the structure of an application, for example K8TZ (a Kubernetes utility that allows setting the timezone of pods):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8tz
  namespace: argocd
spec:
  project: infra
  source:
    chart: k8tz
    repoURL: https://k8tz.github.io/k8tz/
    targetRevision: 0.18.0
    helm:
      releaseName: k8tz
      valuesObject:
        namespace: k8tz
        injectionStrategy: initContainer
        timezone: Europe/Paris
        injectAll: false
  destination:
    server: "https://kubernetes.default.svc"
    namespace: k8tz
  syncPolicy:
    automated: {}
Breaking it down, here is the information for ArgoCD:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8tz
  namespace: argocd
spec:
  project: infra
  source:
    ...
  destination:
    server: "https://kubernetes.default.svc"
    namespace: k8tz
  syncPolicy:
    automated: {}
And here is the information for the Helm Chart:
chart: k8tz
repoURL: https://k8tz.github.io/k8tz/
targetRevision: 0.18.0
helm:
  releaseName: k8tz
  valuesObject:
    namespace: k8tz
    injectionStrategy: initContainer
    timezone: Europe/Paris
    injectAll: false
The valuesObject contains the values.yaml usually passed to Helm with the -f option.
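To make the link concrete, the valuesObject above is equivalent to this standalone values.yaml that you would pass with helm install -f values.yaml:
# values.yaml equivalent to the valuesObject above
namespace: k8tz
injectionStrategy: initContainer
timezone: Europe/Paris
injectAll: false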
The advantage of this type of installation is that it can be easily adapted to different environments (thanks to overlays), for example, with a patch:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8tz
  namespace: argocd
spec:
  source:
    targetRevision: 0.17.2
    helm:
      valuesObject:
        timezone: Europe/Berlin
Here, we change the version of the Helm Chart and override the timezone to Berlin.
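As a sketch of how such an overlay can be wired together with Kustomize (the directory layout here is an assumption, not necessarily the one used in the repo below):
# overlays/dev/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base             # contains the full k8tz Application shown above
patches:
  - path: k8tz-patch.yaml  # the patch overriding targetRevision and the timezone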
You can find the complete example in this repo: neomi-article-argocd
3. IPs
How (and why) to persist IP addresses through cluster recreations?
Let's start with the "Why" by asking the question: What is the use of IP addresses?
On Azure, by default, the cluster is created with an IP that it uses to make outbound requests over the Internet.
Additionally, we create another IP for Ingress requests to add a level of security (since this IP will be dedicated to receiving HTTP/s traffic).
The IP we are interested in here is the second one, since Cloudflare traffic must be directed to it. Persisting it outside the cluster avoids adding a new step in Terraform (and, above all, avoids DNS modifications) whenever the cluster is recreated.
For this, what we can do is create a resource group (for example: neomi-ips):
az group create --name neomi-ips --location francecentral
Then, we can create a "Public IP Address":
az network public-ip create \
--name neomi-ip-dev \
--resource-group neomi-ips \
--allocation-method Static \
--ddos-protection-mode Disabled \
--dns-name neomi-ip-dev \
--location francecentral \
--sku Standard \
--tier Regional \
--version IPv4
After that, we can simply edit the Kubernetes Service of our Ingress controller:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: neomi-ips
  name: ingress-nginx-controller
  namespace: ingress
spec:
  loadBalancerIP: [previously created IP]
  ports:
    - appProtocol: http
      name: http
      nodePort: 30209
      port: 80
      protocol: TCP
      targetPort: http
    - appProtocol: https
      name: https
      nodePort: 31433
      port: 443
      protocol: TCP
      targetPort: https
  type: LoadBalancer
(The best approach is to persist this IP in an overlay that patches a deployment, for example, a Helm Chart managed by ArgoCD 😜)
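As a sketch, such an overlay patch on an ingress-nginx Application could look like this (the value paths follow the ingress-nginx Helm Chart; the IP itself stays a placeholder):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx
  namespace: argocd
spec:
  source:
    helm:
      valuesObject:
        controller:
          service:
            loadBalancerIP: "[previously created IP]"
            annotations:
              service.beta.kubernetes.io/azure-load-balancer-resource-group: neomi-ips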
Thanks to this, we can add a DNS record in Cloudflare like this:
(Target: [DNS name given at IP creation].[IP location].cloudapp.azure.com)
Why use a CNAME? If the IP is ever modified, Azure updates this DNS name to point to the new IP, so we won't have to manually edit the record in Cloudflare.
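As a purely hypothetical illustration (the domain and names below are placeholders, not ours), the Cloudflare record could look like:
# Hypothetical Cloudflare CNAME record
type: CNAME
name: app.example.com
target: neomi-ip-dev.francecentral.cloudapp.azure.com   # [DNS name].[location].cloudapp.azure.com
proxied: true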
4. Persist Our Logs
How to persist logs during cluster recreations?
Before starting this section, let's describe our constraints:
- We want to use storage that can easily be backed up to another cloud provider
- We need storage that supports encryption with a customer key
- This storage should not reside in the cluster's resource group (which may be destroyed during cluster recreation)
- Be able to instruct Loki to use this storage
The answer to these criteria is:
- Create a disk with an encryption option in a new resource group
- Create a new resource group for Loki disks for each environment
- Change the Loki Helm deployment
For the first two points, we can simply create a new resource group:
az group create --name neomi-loki-disks --location francecentral
And create a new disk:
az disk create --name neomi-loki-dev \
--resource-group neomi-loki-disks \
--disk-encryption-set [your encryption set] \
--encryption-type EncryptionAtRestWithCustomerKey \
--location francecentral \
--os-type Linux \
--size-gb 200 \
--sku Premium_LRS \
--tier P15
Now, let's take our Loki Helm Chart deployment:
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2024-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage:
    type: 'filesystem'
singleBinary: # We use a small deployment rather than a scalable infrastructure
  replicas: 1
  persistence:
    enabled: false # We disable it to be able to mount our own disk
  extraVolumes:
    # Here we specify that we are using a disk created by us
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
  extraVolumeMounts:
    # Here we specify the disk mount
    - name: storage
      mountPath: /var/loki
# We reduce the default allocated resources so that the deployment does not request (CPU and memory) too large a share of the machine
chunksCache:
  allocatedMemory: 1000
resultsCache:
  allocatedMemory: 1000
And to complete the deployment, we need to create the persistent volume and its claim:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
  name: pv-loki
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/[subscription id]/resourceGroups/[resource group name]/providers/Microsoft.Compute/disks/[disk name]
    volumeAttributes:
      fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-loki-0
  namespace: loki
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  volumeName: pv-loki
  storageClassName: managed-csi
Thus, our disk will not be deleted across cluster recreations and will be automatically attached to the Loki pod when the cluster is recreated!
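For completeness, in our ArgoCD setup these Loki values would typically be carried by an Application similar to the k8tz one. Here is a sketch (the chart version shown is an assumption; pin the release you actually use):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
spec:
  project: infra
  source:
    chart: loki
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 6.6.2   # assumed version
    helm:
      releaseName: loki
      valuesObject:
        # ... the values shown above ...
  destination:
    server: "https://kubernetes.default.svc"
    namespace: loki
  syncPolicy:
    automated: {}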
5. On-the-Fly Maintenance
How to make this transition without recreating a cluster?
One of our constraints after all these operations was not having to recreate our clusters to apply these changes.
The advantage is that we limited this maintenance to the Helm Charts, so we can simply uninstall them and make sure that the namespaces they used are deleted.
# We delete ArgoCD to prevent it from creating resources that we will intentionally delete
kubectl delete namespace argocd --cascade
# We delete the Helm Charts by deleting the NS
kubectl delete namespace cert-manager --cascade
kubectl delete namespace monitoring --cascade
kubectl delete namespace loki --cascade
# A specific use case for K8TZ is to also delete the Helm Chart as it deploys resources outside its namespace
kubectl delete namespace k8tz --cascade
helm -n default delete k8tz
# And for ingress, we will also delete its ValidatingWebhookConfigurations
kubectl delete namespace ingress --cascade
kubectl delete -A validatingwebhookconfigurations.admissionregistration.k8s.io ingress-nginx-admission
At this point, the cluster becomes inaccessible to clients, so we need to hurry to reinstantiate ArgoCD!
And the loop is complete!
ArgoCD will create all the resources we deleted and ensure their proper deployment.
I will not show the installation and initialization script for ArgoCD because there is nothing interesting in it. You can find everything you need in their Getting Started guide.
BONUS: Fully Automate Helm Chart Updates
How can a CI application notify us of new versions?
As explained earlier, we use declarative definitions to specify the desired state of our cluster. To store all this, we use a GitLab repository.
We can therefore create a step in our CI to run Renovate.
Renovate is a program that reads Git repositories, scans their dependencies, and, if it finds new versions, creates a Merge Request (Pull Request on GitHub).
I will not go through all the steps of instantiating and running a CI and Renovate.
Here are the resources for:
- Instantiating Renovate: Renovate Runner
- Configuring Renovate to read your ArgoCD Helm declarations: Renovate ArgoCD
- And don't forget to schedule the pipeline: Renovate Scheduling
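For reference, here is a minimal sketch of what such a scheduled GitLab CI job can look like (the image tag, token variable, and repository path are assumptions to adapt to your setup):
# .gitlab-ci.yml (sketch)
renovate:
  image: renovate/renovate:latest               # pin a specific tag in practice
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run only from the scheduled pipeline
  variables:
    RENOVATE_PLATFORM: gitlab
    RENOVATE_ENDPOINT: $CI_API_V4_URL
    RENOVATE_TOKEN: $RENOVATE_BOT_TOKEN         # bot/project access token stored in CI/CD variables
  script:
    - renovate $CI_PROJECT_PATH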
Conclusion
This transition now allows us to deploy continuously (during the day and without service interruptions for our clients). It unifies our technical stack and makes it easier to understand, so new team members can quickly grasp how our clusters work.
If you enjoyed this article, you can follow us to be the first to know when we publish our future articles.
Acknowledgments
Thanks to @louisneomi, @nabil_y, Camille Vauchel and Xavier Laurent for reviewing the article and their advice.