Learn how we improved our deployment techniques using Terraform and ArgoCD to manage infrastructure and continuous service deployment. Understand the challenges and solutions we implemented to enhance our Kubernetes deployment processes.
This article is the first in a series aimed at sharing and explaining the technical efforts we are implementing so that you can draw inspiration from them or discuss them with us.
Introduction
This blog article is intended for technical readers with a basic understanding of Kubernetes (though nothing prevents you from reading it and looking things up as you go).
The objective of this article is to explain how we've enhanced our deployment techniques by implementing best practices for infrastructure reinstallation and continuous service deployment, leveraging Terraform and ArgoCD effectively.
This article will explain all our considerations to arrive at this architecture diagram:
Context
We have always favored infrastructure-as-code over manual administration via graphical interfaces, particularly for its reproducibility and maintenance properties.
ArgoCD and Terraform are two perfect candidates for applying these principles:
- Terraform for all projects to be installed in a "one-shot" manner
- ArgoCD for projects that evolve continuously
Here is the initial configuration we had before generalizing ArgoCD to our technical services:
A Terraform project manages the operations to be performed only once:
- The creation of the cluster (Azure Kubernetes Service)
- The installation of Helm Charts:
- The CNI
- The observability stack (Prometheus for metrics and Loki for logs)
- The HTTP stack (Ingress and Cert-Manager)
- Our secrets stack
- The installation of ArgoCD and the initialization of its applications
Subsequently, ArgoCD will take care of continuously deploying our applications.
A CNI (Container Network Interface) enables Kubernetes networking capabilities.
Our Issues
Several problems arise with our current setup:
- How to avoid errors during the installation of the Terraform project?
- How to properly update the installed charts?
- How (and why) to persist IP addresses through cluster recreations?
- How to persist logs during cluster recreations?
- How to make this transition without recreating a cluster?
We will address these questions in dedicated sections.
1. Terraform
How to avoid errors during the installation of the Terraform project?
To begin, Terraform is a tool that allows you to shape infrastructure on a cloud provider using code.
It is very useful when you want to reproduce identical infrastructure and avoid forgetting things (especially when recreating it months or years later).
I will say Terraform in this article, but we use "OpenTofu," an open-source fork. To find out why, I refer you to this section of their FAQ: OpenTofu FAQ.
Terraform is primarily intended to create and maintain infrastructure rather than software (though it is perfectly capable of doing so).
Another issue with Terraform is that it must save the state of its installation (in a file called terraform.tfstate), and this state can be cumbersome to maintain or share with a team for future development or maintenance.
There are solutions for saving and sharing this state with a team (for example: HashiCorp, GitLab-managed Terraform state, or an S3 bucket), but these options are too sophisticated for our use case.
Another point is that Terraform is very sensitive to errors; it will stop as soon as it encounters one. The later an error occurs in the installation process, the longer it will take to restart all previous steps.
For all these reasons, we want Terraform to handle only the cloud infrastructure (machines, network) and the initial installation of our continuous deployment software (ArgoCD).
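To give an idea of what this initialization can look like, here is a minimal sketch of a bootstrap "Application" (the app-of-apps pattern) that Terraform could apply once so that ArgoCD then manages everything else. The Application resource itself is explained in the next section, and the repository URL and path below are hypothetical, not the ones we actually use:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: infra
  source:
    repoURL: https://gitlab.example.com/neomi/argocd-apps.git   # hypothetical repository
    targetRevision: main
    path: applications   # hypothetical path containing the other Application manifests
  destination:
    server: "https://kubernetes.default.svc"
    namespace: argocd
  syncPolicy:
    automated:
      prune: true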
If we revisit our diagram, it would look like this:
2. ArgoCD and Helm Charts
How to properly update the installed charts?
New problems (and thus new solutions): how do we install all the Helm Charts that Terraform was handling?
For this, we quickly turned to the ArgoCD documentation and realized that it supports the (continuous) deployment of Helm projects 🥳
Before continuing, and for those who are not familiar with ArgoCD, here is how it works: you create a Kubernetes resource called "Application" (grouped into "projects") that defines its type (YAML file, Kustomization, Helm Chart, or others), where ArgoCD should find it, and its configuration (auto-synchronization, automatic pruning of undefined resources, etc.). ArgoCD then regularly synchronizes these applications.
Here is the structure of an application, for example K8TZ (a Kubernetes utility that allows setting the timezone of pods):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8tz
  namespace: argocd
spec:
  project: infra
  source:
    chart: k8tz
    repoURL: https://k8tz.github.io/k8tz/
    targetRevision: 0.18.0
    helm:
      releaseName: k8tz
      valuesObject:
        namespace: k8tz
        injectionStrategy: initContainer
        timezone: Europe/Paris
        injectAll: false
  destination:
    server: "https://kubernetes.default.svc"
    namespace: k8tz
  syncPolicy:
    automated: {}
Breaking it down, here is the information for ArgoCD:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8tz
  namespace: argocd
spec:
  project: infra
  source:
    ...
  destination:
    server: "https://kubernetes.default.svc"
    namespace: k8tz
  syncPolicy:
    automated: {}
And here is the information for the Helm Chart:
chart: k8tz
repoURL: https://k8tz.github.io/k8tz/
targetRevision: 0.18.0
helm:
  releaseName: k8tz
  valuesObject:
    namespace: k8tz
    injectionStrategy: initContainer
    timezone: Europe/Paris
    injectAll: false
The valuesObject contains the values.yaml usually passed to Helm with the -f option.
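To make the link concrete, the valuesObject above is equivalent to this standalone values.yaml that you would pass with helm install -f values.yaml:
# values.yaml equivalent to the valuesObject above
namespace: k8tz
injectionStrategy: initContainer
timezone: Europe/Paris
injectAll: false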
The advantage of this type of installation is that it can be easily adapted to different environments (thanks to overlays), for example, with a patch:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8tz
  namespace: argocd
spec:
  source:
    targetRevision: 0.17.2
    helm:
      valuesObject:
        timezone: Europe/Berlin
Here, we change the version of the Helm Chart and override the timezone to Berlin.
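As a sketch of how such an overlay can be wired together with Kustomize (the directory layout here is an assumption, not necessarily the one used in the repo below):
# overlays/dev/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base             # contains the full k8tz Application shown above
patches:
  - path: k8tz-patch.yaml  # the patch overriding targetRevision and the timezone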
You can find the complete example in this repo: neomi-article-argocd
3. IPs
How (and why) to persist IP addresses through cluster recreations?
Let's start with the "Why" by asking the question: What is the use of IP addresses?
On Azure, by default, the cluster is created with an IP that it uses to make outbound requests over the Internet.
Additionally, we create another IP for Ingress requests to add a level of security (since this IP will be dedicated to receiving HTTP/s traffic).
The IP we are interested in here is the second one, since Cloudflare traffic must be directed to it. Persisting it outside the cluster avoids adding a new step in Terraform (and, above all, avoids DNS modifications) whenever the cluster is recreated.
For this, what we can do is create a resource group (for example: neomi-ips):
az group create --name neomi-ips --location francecentral
Then, we can create a "Public IP Address":
az network public-ip create \
--name neomi-ip-dev \
--resource-group neomi-ips \
--allocation-method Static \
--ddos-protection-mode Disabled \
--dns-name neomi-ip-dev \
--location francecentral \
--sku Standard \
--tier Regional \
--version IPv4
After that, we can simply edit the Kubernetes Service of our Ingress controller:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: neomi-ips
  name: ingress-nginx-controller
  namespace: ingress
spec:
  loadBalancerIP: [previously created IP]
  ports:
    - appProtocol: http
      name: http
      nodePort: 30209
      port: 80
      protocol: TCP
      targetPort: http
    - appProtocol: https
      name: https
      nodePort: 31433
      port: 443
      protocol: TCP
      targetPort: https
  type: LoadBalancer
(The best approach is to persist this IP in an overlay that patches a deployment, for example, a Helm Chart managed by ArgoCD 😜)
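As a sketch, such an overlay patch on an ingress-nginx Application could look like this (the value paths follow the ingress-nginx Helm Chart; the IP itself stays a placeholder):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx
  namespace: argocd
spec:
  source:
    helm:
      valuesObject:
        controller:
          service:
            loadBalancerIP: "[previously created IP]"
            annotations:
              service.beta.kubernetes.io/azure-load-balancer-resource-group: neomi-ips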
Thanks to this, we can add a DNS record in Cloudflare like this:
(Target: [DNS name given at IP creation].[IP location].cloudapp.azure.com)
Why use a CNAME? If the IP is ever modified, Azure updates this DNS name to point to the new IP, so we won't have to manually edit the record in Cloudflare.
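As a purely hypothetical illustration (the domain and names below are placeholders, not ours), the Cloudflare record could look like:
# Hypothetical Cloudflare CNAME record
type: CNAME
name: app.example.com
target: neomi-ip-dev.francecentral.cloudapp.azure.com   # [DNS name].[location].cloudapp.azure.com
proxied: true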
4. Persist Our Logs
How to persist logs during cluster recreations?
Before starting this section, let's describe our constraints:
- We want to use storage that can easily be backed up to another cloud provider
- We need storage that supports encryption with a customer key
- This storage should not reside in the cluster's resource group (which may be destroyed during cluster recreation)
- Be able to instruct Loki to use this storage
The answer to these criteria is:
- Create a disk with an encryption option in a new resource group
- Create a new resource group for Loki disks for each environment
- Change the Loki Helm deployment
For the first two points, we can simply create a new resource group:
az group create --name neomi-loki-disks --location francecentral
And create a new disk:
az disk create --name neomi-loki-dev \
--resource-group neomi-loki-disks \
--disk-encryption-set [your encryption set] \
--encryption-type EncryptionAtRestWithCustomerKey \
--location francecentral \
--os-type Linux \
--size-gb 200 \
--sku Premium_LRS \
--tier P15
Now, let's take our Loki Helm Chart deployment:
loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2024-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  storage:
    type: 'filesystem'
singleBinary: # We use a small deployment rather than a scalable infrastructure
  replicas: 1
  persistence:
    enabled: false # We disable it to be able to mount our own disk
  extraVolumes:
    # Here we specify that we are using a disk created by us
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
  extraVolumeMounts:
    # Here we specify the disk mount
    - name: storage
      mountPath: /var/loki
# We reduce the default allocated resources so that the deployment does not request (CPU and memory) too large a share of the machine
chunksCache:
  allocatedMemory: 1000
resultsCache:
  allocatedMemory: 1000
And to complete the deployment, we need to create the persistent volume and its claim:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
  name: pv-loki
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/[subscription id]/resourceGroups/[resource group name]/providers/Microsoft.Compute/disks/[disk name]
    volumeAttributes:
      fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-loki-0
  namespace: loki
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  volumeName: pv-loki
  storageClassName: managed-csi
Thus, our disk will not be deleted across cluster recreations and will be automatically attached to the Loki pod when the cluster is recreated!
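For completeness, in our ArgoCD setup these Loki values would typically be carried by an Application similar to the k8tz one. Here is a sketch (the chart version shown is an assumption; pin the release you actually use):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
spec:
  project: infra
  source:
    chart: loki
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 6.6.2   # assumed version
    helm:
      releaseName: loki
      valuesObject:
        # ... the values shown above ...
  destination:
    server: "https://kubernetes.default.svc"
    namespace: loki
  syncPolicy:
    automated: {}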
5. On-the-Fly Maintenance
How to make this transition without recreating a cluster?
One of our constraints after all these operations was not having to recreate our clusters to apply these changes.
The advantage is that we limited this maintenance to the Helm Charts, so we can simply uninstall them and make sure that the namespaces they used are deleted.
# We delete ArgoCD to prevent it from creating resources that we will intentionally delete
kubectl delete namespace argocd --cascade
# We delete the Helm Charts by deleting the NS
kubectl delete namespace cert-manager --cascade
kubectl delete namespace monitoring --cascade
kubectl delete namespace loki --cascade
# A specific use case for K8TZ is to also delete the Helm Chart as it deploys resources outside its namespace
kubectl delete namespace k8tz --cascade
helm -n default delete k8tz
# And for ingress, we will also delete its ValidatingWebhookConfigurations
kubectl delete namespace ingress --cascade
kubectl delete -A validatingwebhookconfigurations.admissionregistration.k8s.io ingress-nginx-admission
At this point, the cluster becomes inaccessible to clients, so we need to hurry to reinstantiate ArgoCD!
And the loop is complete!
ArgoCD will create all the resources we deleted and ensure their proper deployment.
I will not show the installation and initialization script for ArgoCD because there is nothing interesting in it. You can find everything you need in their Getting Started guide.
BONUS: Fully Automate Helm Chart Updates
How can a CI application notify us of new versions?
As explained earlier, we use declarative definitions to specify the desired state of our cluster. To store all this, we use a GitLab repository.
We can therefore create a step in our CI to run Renovate.
Renovate is a program that reads Git repositories, scans their dependencies, and, if it finds new versions, creates a Merge Request (Pull Request on GitHub).
I will not go through all the steps of instantiating and running a CI and Renovate.
Here are the resources for:
- Instantiating Renovate: Renovate Runner
- Configuring Renovate to read your ArgoCD Helm declarations: Renovate ArgoCD
- And don't forget to schedule the pipeline: Renovate Scheduling
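For reference, here is a minimal sketch of what such a scheduled GitLab CI job can look like (the image tag, token variable, and repository path are assumptions to adapt to your setup):
# .gitlab-ci.yml (sketch)
renovate:
  image: renovate/renovate:latest               # pin a specific tag in practice
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run only from the scheduled pipeline
  variables:
    RENOVATE_PLATFORM: gitlab
    RENOVATE_ENDPOINT: $CI_API_V4_URL
    RENOVATE_TOKEN: $RENOVATE_BOT_TOKEN         # bot/project access token stored in CI/CD variables
  script:
    - renovate $CI_PROJECT_PATH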
Conclusion
This transition now allows us to deploy continuously (during the day and without service interruptions for our clients). It unifies our technical stack and makes it easier to understand, so new team members can quickly grasp how our clusters work.
If you enjoyed this article, you can follow us to be the first to know when we publish our future articles.
Acknowledgments
Thanks to @louisneomi, @nabil_y, Camille Vauchel and Xavier Laurent for reviewing the article and their advice.