Containers allow applications to run consistently across many different development environments, because a single container encapsulates everything needed to run an application. Container technologies have exploded in popularity in recent years, leading to diverse use cases as well as new and unexpected challenges. This Zone offers insights into how teams can solve these challenges through its coverage of container performance, Kubernetes, testing, container orchestration, the use of microservices to build and deploy containers, and more.
Networking’s Open Source Era Is Just Getting Started
Death by a Thousand YAMLs: Surviving Kubernetes Tool Sprawl
Running AI agents locally feels simple until you try it: dependencies break, configs drift, and your laptop slows to a crawl. An agent isn’t one process — it’s usually a mix of a language model, a database, and a frontend. Managing these by hand means juggling installs, versions, and ports. Docker Compose changes that. You can now define these services in a single YAML file and run them together as one app. Compose even supports declaring AI models directly with the models element. With one command — docker compose up — your full agent stack runs locally. But local machines hit limits fast. Small models like DistilGPT-2 run on CPUs, but bigger ones like LLaMA-2 need GPUs. Most laptops don’t have that kind of power. Docker Offload bridges this gap. It runs the same stack in the cloud on GPU-backed hosts, using the same YAML file and the same commands. This tutorial walks through: Defining an AI agent with ComposeRunning it locally for fast iterationOffloading the same setup to cloud GPUs for scale The result: local iteration, cloud execution — without rewriting configs. Why Agents + Docker AI agents aren’t monoliths. They’re composite apps that bundle services such as: Language model (LLM or fine-tuned API)Vector database for long-term memory and embeddingsFrontend/UI for user interactionOptional monitoring, cache, or file storage Traditionally, you’d set these up manually: Postgres installed locally, Python for the LLM, Node.js for the UI. Each piece required configs, version checks, and separate commands. When one broke, the whole system failed. Docker Compose fixes this. Instead of manual installs, you describe services in a single YAML file. Compose launches containers, wires them together, and keeps your stack reproducible. There are also options such as Kubernetes, HashiCorp Nomad, or even raw Docker commands, but all options have a trade-off. Kubernetes can scale to support large-scale production applications, providing sophisticated scheduling, autoscaling, and service discovery capabilities. Nomad is a more basic alternative to Kubernetes that is very friendly to multi-cloud deployments. Raw Docker commands provide a level of control that is hard to manage when managing more than a few services. Conversely, Docker Compose targets developers expressing the need to iterate fast and have a lightweight orchestration. It balances the requirements of just containers with full Kubernetes, and thus it is suitable for local development and early prototyping. Still, laptops have limits. CPUs can handle small models but not the heavier workloads. That’s where Docker Offload enters. It extends the same Compose workflow into the cloud, moving the heavy lifting to GPU servers. Figure 1: Local vs. Cloud workflow with Docker Offload AI agent services (LLM, database, frontend) run locally with Docker Compose. With docker offload up, the same services move to GPU-backed cloud servers, using the same YAML file. Define the Agent With Compose Step 1: Create a compose.yaml File YAML services: llm: image: ghcr.io/langchain/langgraph:latest ports: - "8080:8080" db: image: postgres:15 environment: POSTGRES_PASSWORD: secret ui: build: ./frontend ports: - "3000:3000" This file describes three services: llm: Runs a language model server on port 8080. You could replace this with another image, such as Hugging Face’s text-generation-inference.db: Runs Postgres 15 with an environment variable for the password. 
Using environment variables avoids hardcoding sensitive data.ui: Builds a custom frontend from your local ./frontend directory. It exposes port 3000 for web access. For more advanced setups, your compose.yaml can include features like multi-stage builds, health checks, or GPU requirements. Here’s an example: YAML services: llm: build: context: ./llm-service dockerfile: Dockerfile deploy: resources: reservations: devices: - driver: nvidia count: 1 capabilities: [gpu] ports: - "8080:8080" healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8080/health"] interval: 30s retries: 3 db: image: postgres:15 environment: POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} ui: build: ./frontend ports: - "3000:3000" In this configuration: Multi-stage builds reduce image size by separating build tools from the final runtime.GPU requirements ensure the service runs on a node with NVIDIA GPUs when offloaded.Health checks allow Docker (and Offload) to detect when a service is ready. Step 2: Run the Stack PowerShell docker compose up Compose builds and starts all three services. Containers are networked together automatically. Expected output from docker compose ps: PowerShell NAME IMAGE PORTS agent-llm ghcr.io/langchain/langgraph 0.0.0.0:8080->8080/tcp agent-db postgres:15 5432/tcp agent-ui frontend:latest 0.0.0.0:3000->3000/tcp Now open http://localhost:3000 to see the UI talking to the LLM and database. You can use docker compose ps to check running services and Docker Compose logs to see real-time logs for debugging. Figure 2: Compose stack for AI agent (LLM + DB + UI) A compose.yaml defines all agent components: LLM, database, and frontend. Docker Compose connects them automatically, making the stack reproducible across laptops and the cloud. Offload to the Cloud Once your local laptop hits its limit, shift to the cloud with Docker Offload. Step 1: Install the Extension PowerShell docker extension install offload Step 2: Start the Stack in the Cloud PowerShell docker offload up That’s it. Your YAML doesn’t change. Your commands don’t change. Only the runtime location does. Step 3: Verify PowerShell docker offload ps This shows which services are running remotely. Meanwhile, your local terminal still streams logs so you can debug without switching tools. Other useful commands: docker offload status – Check if your deployment is healthy.docker offload stop – Shut down cloud containers when done.docker offload logs <service> – View logs for a specific container. You can use .dockerignore to reduce build context, especially when sending files to the cloud. Figure 3: Dev → Cloud GPU Offload → Full agent workflow The workflow for scaling AI agents is straightforward. A developer tests locally with docker compose up. When more power is needed, docker offload up sends the same stack to the cloud. Containers run remotely on GPUs, but logs and results stream back to the local machine for debugging. Real-World Scaling Example Let’s say you’re building a research assistant chatbot. Local testing: Model: DistilGPT-2 (lightweight, CPU-friendly)Database: PostgresUI: simple React appRun with docker compose up This setup is fine for testing flows, building the frontend, and validating prompts. Scaling to cloud: Replace the model service with LLaMA-2-13B or Falcon for better answers.Add a vector database like Weaviate or Chroma for semantic memory.Run with docker offload up Now your agent can handle larger queries and store context efficiently. 
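To make the scaling step concrete, here is a minimal, hedged sketch of what the scaled-up compose.yaml might look like. The serving image, model reference, and vector store (vllm/vllm-openai, meta-llama/Llama-2-13b-chat-hf, chromadb/chroma) are illustrative assumptions rather than part of the tutorial; swap in whatever model server and memory store you actually use. The GPU reservation follows the same deploy.resources pattern shown in the advanced example above.

YAML
services:
  llm:
    image: vllm/vllm-openai:latest            # assumption: any GPU-capable serving image works here
    command: ["--model", "meta-llama/Llama-2-13b-chat-hf"]   # hypothetical model choice
    ports:
      - "8080:8000"                           # vLLM's OpenAI-compatible server listens on 8000 (assumption)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]             # satisfied in the cloud via Docker Offload
  vectordb:
    image: chromadb/chroma:latest             # assumption: Chroma as the vector store; Weaviate works similarly
    ports:
      - "8001:8000"
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  ui:
    build: ./frontend
    ports:
      - "3000:3000"

Only the llm and vectordb services differ from the local stack; the db and ui definitions carry over unchanged.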
The frontend doesn’t care if the model is local or cloud-based — it just connects to the same service port. This workflow matches how most teams build: fast iteration locally, scale in the cloud when ready for heavier testing or deployment.

Advantages and Trade-Offs

Figure 4: Visual comparison of Local Docker Compose vs. Docker Offload

The same compose.yaml defines both environments. Locally, agents run on CPUs with minimal cost and latency. With Offload, the same config shifts to GPU-backed cloud servers, enabling scale but adding cost and latency.

Advantages

One config: Same YAML works everywhere
Simple commands: docker compose up vs. docker offload up
Cloud GPUs: Access powerful hardware without setting up infra
Unified debugging: Logs stream to the local terminal for easy monitoring

Trade-Offs

Latency: Cloud adds round trips. A 50ms local API call may take 150–200ms remotely, depending on network conditions. This matters for latency-sensitive apps like chatbots.
Cost: GPU time is expensive. A standard AWS P4d.24xlarge (8×A100) costs about $32.77/hour, or $4.10 per GPU/hour. On GCP, an A100-80 GB instance is approximately $6.25/hour, while high-end H100-equipped VMs can reach $88.49/hour. Spot instances, when available, can offer 60–91% discounts, cutting costs significantly for batch jobs or CI pipelines.
Coverage: Offload supports limited backends today, though integrations are expanding. Enterprises should check which providers are supported.
Security implications: Offloading workloads means your model, data, and configs execute on remote infrastructure. Businesses must consider data in transit (TLS), data at rest, and access controls. Some industries may also need to verify HIPAA, PCI DSS, or GDPR compliance before offloading workloads.
Network and firewall settings: Offload requires outbound access to Docker’s cloud endpoints. In enterprises with restricted egress policies or firewalls, security teams may need to open specific ports or allowlist Offload domains.

Best Practices

To get the most out of Compose + Offload:

Properly manage secrets: Avoid hardcoding sensitive values in compose.yaml; store them in .env files or Docker secrets instead. This prevents inadvertent leaks in version control.
Pin image versions: Avoid using :latest tags, as they can pull unexpected updates. Pin versions like :1.2.0 for stability and reproducibility.
Scan images for vulnerabilities: Use docker scout cves to scan images before offloading. Catching issues early helps avoid deploying insecure builds.
Optimize builds with multi-stage: Multi-stage builds and .dockerignore files keep images slim, saving both storage and bandwidth during cloud offload.
Add health checks: Health checks let Docker and Offload know when a service is ready, improving resilience in larger stacks.

YAML
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  retries: 3

Monitor usage: Use docker offload status and docker offload logs to track GPU consumption, and stop idle workloads to avoid unnecessary costs.
Version control your YAML: Commit your Compose files to Git so the entire team runs the same stack consistently.

These practices reduce surprises and make scaling smoother.

Conclusion

AI agents are multi-service apps. Running them locally works for small tests, but scaling requires more power. Docker Compose defines the stack once. Docker Offload runs the same setup on GPUs in the cloud.
This workflow — local iteration, cloud execution — means you can build and test quickly, then scale up without friction. As Docker expands AI features, Compose and Offload are becoming the natural choice for developers building AI-native apps. If you’re experimenting with agents, start with Compose on your laptop, then offload when you need more processing power. The change is smooth, and the payoff is quicker builds with fewer wasted iterations.
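As a footnote to the "properly manage secrets" best practice above, here is a minimal sketch of keeping credentials out of compose.yaml using standard Compose variable interpolation and env_file. The variable names are illustrative assumptions, not part of the tutorial.

YAML
# .env (kept out of version control via .gitignore) would contain lines such as:
#   POSTGRES_PASSWORD=change-me
#   OPENAI_API_KEY=...          # illustrative; any secret your agent needs
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}   # interpolated from .env when Compose parses the file
  llm:
    image: ghcr.io/langchain/langgraph:latest
    env_file:
      - .env                                    # injects all variables from .env into the container

Because interpolation happens when Compose reads the file, the same approach should carry over unchanged to a run started with docker offload up.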
The Jenkins pipeline below automates the secure management of Kubernetes sealed secrets across multiple environments and clusters, including AKS (Non-Production), GKE (Production Cluster 1), and EKS (Production Cluster 2). It dynamically adapts based on the selected environment, processes secrets in parallel for scalability, and ensures secure storage of credentials and artifacts. With features like dynamic cluster mapping, parallel execution, and post-build artifact archiving, the pipeline is optimized for efficiency, security, and flexibility in a multi-cloud Kubernetes landscape. Key Features and Workflow Dynamic Cluster Selection Based on the ENVIRONMENT parameter, the pipeline dynamically determines the target clusters:Non-Production: Targets the AKS cluster using the Stage credential.Production: Targets both GKE and EKS clusters with Production_1 and Production_2 credentials respectively. Parallel Processing The Process Clusters stage executes cluster-specific workflows in parallel, significantly reducing runtime for multi-cluster operations. For example:In Production, the pipeline simultaneously processes GKE and EKS clusters.In Non-Production, only the AKS cluster is processed. Secure Sealed Secrets Workflow Decodes the Base64-encoded Secrets.yaml file.Fetches the public certificate from the Sealed Secrets controller.Encrypts the secrets for the respective cluster and namespace.Generates sealed-secrets.yaml artifacts. Dynamic and Reusable Pipeline The cluster list and credentials are dynamically configured, making the pipeline adaptable for additional clusters or environments with minimal changes. Post-Build Artifact Management Artifacts for each cluster, including sealed-secrets.yaml and metadata files (README.txt), are archived and made accessible in Jenkins UI for easy retrieval. Parallel Execution Logic The pipeline uses Groovy’s parallel directive to process clusters concurrently: Cluster Mapping The ENVIRONMENT parameter determines the cluster list:Non-Production: Includes only the AKS cluster.Production: Includes both GKE and EKS clusters. Parallel Stage Creation For each cluster: A separate parallel stage is defined dynamically with cluster-specific names, credentials, and directories.Each stage independently fetches certificates and generates sealed secrets. Execution The parallel block runs all stages concurrently, optimizing execution time. Scenario 1: Non-Production (AKS) Selected environment: Non-Production.The pipeline:Processes the AKS cluster only.Generates sealed secrets for AKS.Archives artifacts for the AKS environment. Scenario 2: Production (GKE and EKS) Selected environment: Production.The pipeline:Processes both GKE and EKS clusters simultaneously.Generates separate sealed secrets for each cluster.Archives artifacts for both GKE and EKS. Detailed Explanation of the Jenkins Pipeline Script This Jenkins pipeline script automates the process of managing Kubernetes sealed secrets in a multi-cloud environment consisting of AKS, GKE, and EKS clusters. Below is a detailed step-by-step explanation of how the script functions. 
Plain Text parameters { string(name: 'NAMESPACE', defaultValue: 'default', description: 'Kubernetes namespace for the sealed secret') choice( name: 'ENVIRONMENT', choices: ['Non-Production', 'Production'], description: 'Select the target environment' ) base64File(name: 'SECRETS_YAML', description: 'Upload Secrets.yaml file to apply to the cluster') booleanParam(name: 'STORE_CERT', defaultValue: true, description: 'Store the public certificate for future use') } NAMESPACE: Specifies the target namespace in Kubernetes where the sealed secrets will be applied.ENVIRONMENT: Determines whether the pipeline operates on Non-Production (AKS) or Production (GKE and EKS).SECRETS_YAML: Accepts the Base64-encoded YAML file containing the sensitive data to be sealed.STORE_CERT: A flag indicating whether the public certificate used for sealing secrets should be archived for future use. Environment Variables Plain Text environment { WORK_DIR = '/tmp/jenkins-k8s-apply' CONTROLLER_NAMESPACE = 'kube-system' CONTROLLER_NAME = 'sealed-secrets' CERT_FILE = 'sealed-secrets-cert.pem' DOCKER_IMAGE = 'docker-dind-kube-secret' ARTIFACTS_DIR = 'sealed-secrets-artifacts' } WORK_DIR: Temporary workspace for processing files during the pipeline execution.CONTROLLER_NAMESPACE and CONTROLLER_NAME: Define the location and name of the Sealed Secrets controller in the Kubernetes cluster.CERT_FILE: Name of the public certificate file used for sealing secrets.DOCKER_IMAGE: Docker image containing the necessary tools for processing secrets (e.g., kubeseal).ARTIFACTS_DIR: Directory where artifacts (sealed secrets and metadata) are stored. Environment Setup Plain Text stage('Environment Setup') { steps { script { echo "Selected Environment: ${params.ENVIRONMENT}" def clusters = [] if (params.ENVIRONMENT == 'Production') { clusters = [ [id: 'prod-cluster-1', name: 'Production Cluster 1', credentialId: 'Production_1'], [id: 'prod-cluster-2', name: 'Production Cluster 2', credentialId: 'Production_2'] ] } else { clusters = [ [id: 'non-prod-cluster', name: 'Non-Production Cluster', credentialId: 'Stage'] ] } env.CLUSTER_IDS = clusters.collect { it.id }.join(',') clusters.each { cluster -> env["CLUSTER_${cluster.id}_NAME"] = cluster.name env["CLUSTER_${cluster.id}_CRED"] = cluster.credentialId } echo "Number of target clusters: ${clusters.size()}" clusters.each { cluster -> echo "Cluster: ${cluster.name} (${cluster.id})" } } } } Defines the clusters based on the ENVIRONMENT parameter: Non-production: Targets only the AKS cluster.Production: Targets GKE and EKS clusters.Stores cluster information (IDs, names, and credentials) in environment variables for dynamic referencing. Prepare Workspace Plain Text stage('Prepare Workspace') { steps { script { sh """ mkdir -p ${WORK_DIR} mkdir -p ${WORKSPACE}/${ARTIFACTS_DIR} rm -f ${WORK_DIR}/* || true rm -rf ${WORKSPACE}/${ARTIFACTS_DIR}/* || true """ if (params.ENVIRONMENT == 'Non-Production') { sh "rm -rf ${WORKSPACE}/${ARTIFACTS_DIR}/prod-*" } else { sh "rm -rf ${WORKSPACE}/${ARTIFACTS_DIR}/non-prod-*" } if (params.SECRETS_YAML) { writeFile file: "${WORK_DIR}/secrets.yaml.b64", text: params.SECRETS_YAML sh """ base64 --decode < ${WORK_DIR}/secrets.yaml.b64 > ${WORK_DIR}/secrets.yaml """ } else { error "SECRETS_YAML parameter is not provided" } } } } Creates temporary directories for processing secrets and cleaning up old artifacts.Decodes the uploaded Base64 Secrets.yaml file and prepares it for processing. 
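For context, the uploaded SECRETS_YAML is simply an ordinary Kubernetes Secret manifest that kubeseal encrypts later in the pipeline. A minimal, hypothetical example of what the decoded secrets.yaml might contain (names and values are illustrative, not taken from the pipeline above):

YAML
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials        # illustrative name
  namespace: default           # should match the NAMESPACE parameter
type: Opaque
stringData:
  DB_USER: app_user            # placeholder values; kubeseal encrypts these per cluster
  DB_PASSWORD: change-me

The Prepare Workspace stage decodes the Base64-encoded upload back into plain YAML like this before sealing.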
Process Clusters

Plain Text
stage('Process Clusters') { steps { script { def clusterIds = env.CLUSTER_IDS.split(',') def parallelStages = [:] clusterIds.each { clusterId -> def clusterName = env["CLUSTER_${clusterId}_NAME"] def credentialId = env["CLUSTER_${clusterId}_CRED"] parallelStages[clusterName] = { stage("Process ${clusterName}") { withCredentials([file(credentialsId: credentialId, variable: 'KUBECONFIG')]) { def clusterWorkDir = "${WORK_DIR}/${clusterId}" def clusterArtifactsDir = "${WORKSPACE}/${ARTIFACTS_DIR}/${clusterId}" sh """ mkdir -p ${clusterWorkDir} mkdir -p ${clusterArtifactsDir} cp ${WORK_DIR}/secrets.yaml ${clusterWorkDir}/ """ sh """ docker run --rm \ -v \${KUBECONFIG}:/tmp/kubeconfig \ -v ${clusterWorkDir}/secrets.yaml:/tmp/secrets.yaml \ -e KUBECONFIG=/tmp/kubeconfig \ --name dind-service-${clusterId} \ ${DOCKER_IMAGE} kubeseal \ --controller-name=${CONTROLLER_NAME} \ --controller-namespace=${CONTROLLER_NAMESPACE} \ --kubeconfig=/tmp/kubeconfig \ --fetch-cert > ${clusterWorkDir}/${CERT_FILE} """ sh """ docker run --rm \ -v \${KUBECONFIG}:/tmp/kubeconfig \ -v ${clusterWorkDir}/secrets.yaml:/tmp/secrets.yaml \ -v ${clusterWorkDir}/${CERT_FILE}:/tmp/${CERT_FILE} \ -e KUBECONFIG=/tmp/kubeconfig \ --name dind-service-${clusterId} \ ${DOCKER_IMAGE} sh -c "kubeseal \ --controller-name=${CONTROLLER_NAME} \ --controller-namespace=${CONTROLLER_NAMESPACE} \ --format yaml \ --cert /tmp/${CERT_FILE} \ --namespace=${params.NAMESPACE} \ < /tmp/secrets.yaml" > ${clusterArtifactsDir}/sealed-secrets.yaml """ sh """ echo "Generated on: \$(date)" > ${clusterArtifactsDir}/README.txt echo "Cluster: ${clusterName}" >> ${clusterArtifactsDir}/README.txt """ } } } } parallel parallelStages } } }

Dynamically creates parallel stages for each cluster:
Fetches cluster-specific certificates using kubeseal.
Encrypts the secrets for the target namespace.
Executes all cluster stages concurrently to optimize time.

Post-Build Actions

Plain Text
post { always { sh "rm -rf ${WORK_DIR}" archiveArtifacts artifacts: "${ARTIFACTS_DIR}/*/**", fingerprint: true } success { echo "Pipeline completed successfully!" } failure { echo "Pipeline failed. Check the logs for details." } }

Cleans up temporary files after execution.
Archives the generated artifacts (sealed-secrets.yaml and README.txt) for future reference.

Key Advantages

Dynamic environment setup: Adjusts automatically based on the selected environment.
Parallel processing: Reduces runtime by concurrently processing clusters.
Multi-cloud compatibility: Handles AKS, GKE, and EKS seamlessly.
Secure operations: Protects sensitive data using Kubernetes Sealed Secrets.

This detailed explanation aligns the script with the discussion, showcasing its robust and dynamic capabilities for managing secrets across diverse Kubernetes clusters.
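To round out the picture, the sealed-secrets.yaml artifact each parallel stage archives is a SealedSecret custom resource. A hedged sketch of its general shape (the encrypted values are placeholders, and exact fields depend on your kubeseal and controller versions):

YAML
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: app-credentials
  namespace: default
spec:
  encryptedData:
    DB_USER: AgBy3i...           # placeholder ciphertext produced with that cluster's public cert
    DB_PASSWORD: AgCtr8...       # only the Sealed Secrets controller in that cluster can decrypt it
  template:
    metadata:
      name: app-credentials
      namespace: default

Applying this manifest in the matching cluster and namespace recreates the original Secret, and committing it to Git is safe because only that cluster's controller holds the private key.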
Building on the platform we started in an earlier article, here we’re going to learn how to extend it and create a platform abstraction for provisioning an AWS EKS cluster. EKS is AWS’s managed Kubernetes offering.

Quick Refresher

Crossplane is a Kubernetes CRD-based add-on that abstracts cloud implementations and lets us manage infrastructure as code.

Prerequisites

Set up Docker Kubernetes.
Follow the Crossplane installation based on the previous article.
Follow the provider configuration based on the previous article.
Apply all the network YAMLs from the previous article (including the updated network composition discussed later). This will create the necessary network resources for the EKS cluster.

Some Plumbing

When creating an EKS cluster, AWS needs to:

Spin up the control plane (managed by AWS)
Attach security groups
Configure networking (ENIs, etc.)
Access the VPC and subnets
Manage API endpoints
Interact with other AWS services (e.g., CloudWatch for logging, Route53)

To do this securely, AWS requires an IAM role that it can assume. We create that role here and reference it during cluster creation; details are provided below. Without this role, you'll get errors like "access denied" when creating the cluster.

Steps to Create the AWS IAM Role

Log in to the AWS Console and go to the IAM page.
In the left sidebar, click Roles, then click Create Role.
Choose AWS service as the trusted entity type.
Select the EKS use case and choose EKS Cluster.
Attach the following policies: AmazonEKSClusterPolicy, AmazonEKSServicePolicy, AmazonEC2FullAccess, AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, AmazonEKS_CNI_Policy.
Provide the name eks-crossplane-cluster and optionally add tags.

Since we'll also create NodeGroups, which require additional permissions, for simplicity I'm granting the Crossplane user (created in the previous article) permission to PassRole for the Crossplane cluster role. This permission allows the user to tell AWS services (EKS) to assume the Crossplane cluster role on its behalf. Basically, this user can say, "Hey, EKS service, create a node group and use this role when doing it." To accomplish this, add the following inline policy to the Crossplane user:

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::914797696655:role/eks-crossplane-cluster"
    }
  ]
}

Note: Typically, to follow the principle of least privilege, you should separate roles with distinct policies: a control plane role with EKS admin permissions and a node role with permissions for node group creation.

In the previous article, I had created only one subnet in the network composition, but the EKS control plane requires at least two AZs, with one subnet per AZ. You should modify the network composition from the previous article to add another subnet. To do so, just add the following to the network composition YAML, and don't forget to apply the composition and claim to re-create the network.
YAML - name: subnet-b base: apiVersion: ec2.aws.upbound.io/v1beta1 kind: Subnet spec: forProvider: cidrBlock: 10.0.2.0/24 availabilityZone: us-east-1b mapPublicIpOnLaunch: true region: us-east-1 providerConfigRef: name: default patches: - fromFieldPath: status.vpcId toFieldPath: spec.forProvider.vpcId type: FromCompositeFieldPath - fromFieldPath: spec.claimRef.name toFieldPath: spec.forProvider.tags.Name type: FromCompositeFieldPath transforms: - type: string string: fmt: "%s-subnet-b" - fromFieldPath: status.atProvider.id toFieldPath: status.subnetIds[1] type: ToCompositeFieldPath We will also need a provider to support EKS resource creation, to create the necessary provider, save the following content into .yaml file. YAML apiVersion: pkg.crossplane.io/v1 kind: Provider metadata: name: provider-aws spec: package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2 controllerConfigRef: name: default And apply using: YAML kubectl apply -f <your-file-name>.yaml Crossplane Composite Resource Definition (XRD) Below, we’re going to build a Composite Resource Definition for the EKS cluster. Before diving in, one thing to note: If you’ve already created the network resources using the previous article, you may have noticed that the network composition includes a field that places the subnet ID into the composition resource’s status, specifically under status.subnetIds[0]. This value comes from the cloud's Subnet resource and is needed by other XCluster compositions. By placing it in the status field, the network composition makes it possible for other Crossplane compositions to reference and use it. Similar to what we did for network creation in the previous article, we’re going to create a Crossplane XRD, a Crossplane Composition, and finally a Claim that will result in the creation of an EKS cluster. At the end, I’ve included a table that serves as an analogy to help illustrate the relationship between the Composite Resource Definition (XRD), Composite Resource (XR), Composition, and Claim. To create an EKS XRD, save the following content into .yaml file: YAML apiVersion: apiextensions.crossplane.io/v1 kind: CompositeResourceDefinition metadata: name: xclusters.aws.platformref.crossplane.io spec: group: aws.platformref.crossplane.io names: kind: XCluster plural: xclusters claimNames: kind: Cluster plural: clusters versions: - name: v1alpha1 served: true referenceable: true schema: openAPIV3Schema: type: object required: - spec properties: spec: type: object required: - parameters properties: parameters: type: object required: - region - roleArn - networkRef properties: region: type: string description: AWS region to deploy the EKS cluster in. roleArn: type: string description: IAM role ARN for the EKS control plane. networkRef: type: object description: Reference to a pre-created XNetwork. required: - name properties: name: type: string status: type: object properties: network: type: object required: - subnetIds properties: subnetIds: type: array items: type: string And apply using: YAML kubectl apply -f <your-file-name>.yaml Crossplane Composition Composition is the implementation; it tells Crossplane how to build all the underlying resources (Control Plane, NodeGroup). 
To create an EKS composition, save the below content into a .yaml file: YAML apiVersion: apiextensions.crossplane.io/v1 kind: Composition metadata: name: cluster.aws.platformref.crossplane.io spec: compositeTypeRef: apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: XCluster resources: - name: network base: apiVersion: aws.platformref.crossplane.io/v1alpha1 kind: XNetwork patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.networkRef.name toFieldPath: metadata.name - type: ToCompositeFieldPath fromFieldPath: status.subnetIds toFieldPath: status.network.subnetIds - type: ToCompositeFieldPath fromFieldPath: status.subnetIds[0] toFieldPath: status.network.subnetIds[0] readinessChecks: - type: None - name: eks base: apiVersion: eks.aws.crossplane.io/v1beta1 kind: Cluster spec: forProvider: region: us-east-1 roleArn: "" resourcesVpcConfig: subnetIds: [] endpointPrivateAccess: true endpointPublicAccess: true providerConfigRef: name: default patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.region toFieldPath: spec.forProvider.region - type: FromCompositeFieldPath fromFieldPath: spec.parameters.roleArn toFieldPath: spec.forProvider.roleArn - type: FromCompositeFieldPath fromFieldPath: status.network.subnetIds toFieldPath: spec.forProvider.resourcesVpcConfig.subnetIds - name: nodegroup base: apiVersion: eks.aws.crossplane.io/v1alpha1 kind: NodeGroup spec: forProvider: region: us-east-1 clusterNameSelector: matchControllerRef: true nodeRole: "" subnets: [] scalingConfig: desiredSize: 2 maxSize: 3 minSize: 1 instanceTypes: - t3.medium amiType: AL2_x86_64 diskSize: 20 providerConfigRef: name: default patches: - type: FromCompositeFieldPath fromFieldPath: spec.parameters.region toFieldPath: spec.forProvider.region - type: FromCompositeFieldPath fromFieldPath: spec.parameters.roleArn toFieldPath: spec.forProvider.nodeRole - type: FromCompositeFieldPath fromFieldPath: status.network.subnetIds toFieldPath: spec.forProvider.subnets And apply using: YAML kubectl apply -f <your-file-name>.yaml Claim I'm taking the liberty to explain the claim in more detail here. First, it's important to note that a claim is an entirely optional entity in Crossplane. It is essentially a Kubernetes Custom Resource Definition (CRD) that the platform team can expose to application developers as a self-service interface for requesting infrastructure, such as an EKS cluster. Think of it as an API payload: a lightweight, developer-friendly abstraction layer. In the earlier CompositeResourceDefinition (XRD), we created the Kind XCluster. But by using a claim, application developers can interact with a much simpler and more intuitive CRD like Cluster instead of XCluster. For simplicity, I have referenced the XNetwork composition name directly instead of the Network claim resource name. Crossplane creates the XNetwork resource and appends random characters to the claim name when naming it. As an additional step, you'll need to retrieve the actual XNetwork name from the Kubernetes API and use it here. While there are ways to automate this process, I’m keeping it simple here, let me know via comments if there are interest and I write more about how to automate that. To create a claim, save the content below into a .yaml file. Please note the roleArn being referenced in this, that is the role I had mentioned earlier, AWS uses it to create other resources. 
YAML
apiVersion: aws.platformref.crossplane.io/v1alpha1
kind: Cluster
metadata:
  name: demo-cluster
  namespace: default
spec:
  parameters:
    region: us-east-1
    roleArn: arn:aws:iam::914797696655:role/eks-crossplane-cluster
    networkRef:
      name: crossplane-demo-network-jpv49 # <important> this is how the EKS composition refers to the network created earlier; the random suffix "jpv49" comes from the generated XNetwork name

And apply using:

YAML
kubectl apply -f <your-file-name>.yaml

After this, you should see an EKS cluster in your AWS console; make sure you are looking in the correct region. If there are any issues, look for error logs on the composite and managed resources. You can view them using:

YAML
# to get XCluster detail; look for reconciliation errors or messages — you will also find references to the managed resources
k get XCluster demo-cluster -o yaml
# to check the status of a managed resource, for example:
k get Cluster.eks.aws.crossplane.io

As I mentioned before, below is a table where I attempt to provide another analogy for the various components used in Crossplane:

| Component | Analogy |
| --- | --- |
| XRD | The interface, or blueprint, for a product; defines what knobs users can turn |
| XR (XCluster) | A specific product instance with user-provided values |
| Composition | The function that implements all the details of the product |
| Claim | A customer-friendly interface for ordering the product, or an API payload |

Patch

I also want to explain an important concept we've used in our Composition: patching. You may have noticed the patches field in the .yaml blocks. In Crossplane, a composite resource is the high-level abstraction we define — in our case, that's XCluster. Managed resources are the actual cloud resources Crossplane provisions on our behalf — for example, the AWS EKS Cluster and NodeGroup. A patch in a Crossplane Composition is a way to copy or transform data from/to the composite resource (XCluster) to/from the managed resources (Cluster, NodeGroup, etc.). Patching allows us to map values like region, roleArn, and names from the high-level composite to the actual underlying infrastructure — ensuring that developer inputs (or platform-defined parameters) flow all the way down to the cloud resources.

Conclusion

Using Crossplane, you can build powerful abstractions that shield developers from the complexities of infrastructure, allowing them to focus on writing application code. These abstractions can also be made cloud-agnostic, enabling benefits like portability, cost optimization, resilience and redundancy, and greater standardization.
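To recap the patching concept with a concrete fragment, here is a small annotated sketch in the same style as the composition above (a fragment, not a complete resource):

YAML
# Copies the region the developer set on the claim/composite (XCluster)
# down to the managed EKS Cluster resource that Crossplane creates.
patches:
  - type: FromCompositeFieldPath              # direction: composite -> managed resource
    fromFieldPath: spec.parameters.region     # value on XCluster
    toFieldPath: spec.forProvider.region      # field on the eks.aws.crossplane.io Cluster
# A ToCompositeFieldPath patch does the reverse, surfacing values such as
# status.subnetIds from a managed resource back onto the composite's status.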
Kubernetes has become the industry-standard platform for container orchestration, offering automated deployment, scaling, and management of containerized applications. Its ability to efficiently utilize resources, abstract infrastructure complexities, and provide robust enterprise features makes it essential for modern application infrastructure. While Kubernetes can run on-premises, deploying on AWS provides significant advantages, including on-demand scaling, cost optimization, and integration with AWS services for security, monitoring, and operations. With multi-AZ high availability and a global presence in 32 regions, AWS delivers the reliability needed for mission-critical applications. Once you have decided to run your Kubernetes workload on AWS, the big question is, what are the available options, and which is the right one for me? This blog will focus on these exact questions and provide the insights to help you make the right choice 1. Amazon Elastic Kubernetes Service (EKS) Amazon EKS is a managed Kubernetes service that handles the control plane operations across three AWS Availability Zones with a 99.5% availability SLA for the Kubernetes API server. This managed approach allows you to focus on applications rather than infrastructure management while providing seamless integration with AWS services like ELB, IAM, EBS, and EFS. For the data plane, EKS offers multiple options: EC2-based self-managed node groups (you manage the infrastructure)EC2-based managed node groups (AWS handles provisioning and lifecycle)AWS Fargate for a serverless experience (no node management required) When to Choose Amazon EKS You want a fully managed Kubernetes control plane and minimal operational overhead.You need integration with other workloads running in the AWS cloud.You need enterprise-grade security and compliance.You prefer a pay-as-you-go model.Scaling is a priority. 2. Red Hat OpenShift Service on AWS (ROSA) ROSA combines Red Hat's enterprise Kubernetes platform with AWS infrastructure. It provides automated installation, upgrades, and lifecycle management with joint support from Red Hat and AWS. The service offers a 99.95% uptime SLA for the OpenShift API server, with Red Hat managing the platform, including security patches and updates. Worker nodes run on EC2 instances and integrate with both the OpenShift control plane and AWS services. ROSA includes built-in developer services such as CI/CD pipelines, container registry, and serverless capabilities. When to Choose Red Hat OpenShift Service You have existing OpenShift investments or expertise.You need enterprise-grade support for both platform and infrastructure.You require features such as integrated CI/CD, security features like image scanning, etc.You want the benefits of OpenShift's developer experience while leveraging AWS infrastructure and services. 3. VMware Tanzu on AWS For organizations heavily invested in VMware and seeking a hybrid cloud strategy, Tanzu on AWS provides consistent Kubernetes management across on-premises and AWS environments. Tanzu provides automated provisioning, scaling, and lifecycle management. VMware handles platform maintenance, including security updates and version upgrades. Tanzu leverages EC2 instances for worker nodes managed through Tanzu Mission Control or kubectl. It also provides native AWS service integration. 
When to Choose VMware Tanzu You have existing VMware investments or are pursuing a multi-cloud strategy.You need consistent Kubernetes management across hybrid environments.You require enterprise governance, security, and compliance features.You want VMware's application platform capabilities while utilizing AWS infrastructure. 4. EKS Anywhere on AWS What if you want to have the native EKS experience but need a hybrid setup with certain workloads running on-premises and the rest on AWS? EKS Anywhere extends Amazon EKS to on-premises infrastructure while maintaining consistency with cloud-based EKS. It implements the same Kubernetes distribution as EKS with automated deployment capabilities and lifecycle management tools. While AWS provides support options, customers manage their own infrastructure and availability requirements. EKS Anywhere supports various infrastructure platforms, including VMware vSphere and bare metal servers, and includes tools for monitoring, GitOps-based deployment, and an optional container registry. When to Choose EKS Anywhere You need to run Kubernetes workloads on-premises while maintaining operational consistency with EKS in the cloud.You have data sovereignty, latency, or regulatory requirements that necessitate on-premises infrastructure.You prefer the familiar EKS experience and tooling across all environments.You are implementing a hybrid cloud strategy and need consistent management across both environments. 5. Self-Managed Kubernetes on EC2 This option provides complete control by letting you install, configure, and operate the entire Kubernetes platform on EC2 instances. You have full responsibility for cluster deployment, upgrades, scaling, maintenance, high availability, and security. Both control plane and worker nodes run on EC2 instances that you select and configure. Despite requiring more operational effort, this approach enables full AWS service integration through APIs and SDKs. Deployment can leverage tools like kops or kubeadm. When to Choose Self-Managed Kubernetes on EC2 You require complete control over Kubernetes configurationsYou have specific security or compliance requirements that require customized deployments or specialized Kubernetes distributionsYour team has strong Kubernetes expertise and operational capabilities.You want to avoid the additional management fee associated with EKS. 6. Amazon EKS Distro (EKS-D) EKS-D is the open-source version of the Kubernetes distribution used in Amazon EKS. It provides the same binaries, configuration, and security patches as EKS, ensuring compatibility and consistency. However, you are responsible for the installation, operation, and maintenance of both the control plane and worker nodes. While AWS provides regular updates aligned with the EKS release schedule, since you are technically not running your workloads on AWS, you must implement these updates yourself without AWS SLA guarantees. EKS-D can be used with various third-party management solutions or AWS's open-source tools. When to Choose Amazon EKS Distro You want to use the same Kubernetes distribution as EKS but need to run it on non-AWS infrastructure.You require a consistent, reliable Kubernetes foundation across heterogeneous environments.You have the operational expertise to manage Kubernetes clusters yourself.You need specific deployment configurations not supported by EKS or EKS Anywhere. Making the Right Choice As you saw, there are multiple ways to deploy your Kubernetes workloads entirely on AWS or adopt a hybrid approach. 
The choice ultimately depends on a variety of factors such as:

Operational aspects
Cost and expertise
Features and integration requirements
Use case alignment
Security and compliance

To make this decision easier, below is a decision matrix that evaluates the different choices across the various factors mentioned above. Based on your unique circumstances, you can score each of the choices, which will help you pick the right approach for your Kubernetes workload.

Operational Aspects

| Aspect | EKS on AWS | ROSA | Tanzu on AWS | EKS Anywhere | Self-managed K8s | EKS Distro |
| --- | --- | --- | --- | --- | --- | --- |
| Management Overhead | Low | Low | Medium | Medium | High | High |
| Control Plane Management | AWS Managed | Red Hat Managed | VMware Managed | Self-managed | Self-managed | Self-managed |
| Infrastructure Management | Optional¹ | AWS Managed | VMware Managed | Customer | Customer | Customer |
| Primary Support | AWS | Red Hat + AWS | VMware + AWS | AWS² | None³ | Community |

Notes: ¹ Through managed node groups ² For EKS components only ³ Unless separate support contract

Cost and Expertise

| Aspect | EKS on AWS | ROSA | Tanzu on AWS | EKS Anywhere | Self-managed K8s | EKS Distro |
| --- | --- | --- | --- | --- | --- | --- |
| Cost Structure | Control plane + compute | Premium with licensing | Highest (VMware licensing) | Infrastructure + support | Compute only | Infrastructure only |
| Required Skills | AWS + K8s | OpenShift + AWS | VMware + K8s + AWS | K8s + Infrastructure | Deep K8s | Deep K8s + Distribution |
| Learning Curve | Moderate | Moderate-High | High | High | Very High | Very High |
| Operational Team Size | Small | Small | Medium | Medium-Large | Large | Large |

Features and Integration Requirements

| Aspect | EKS on AWS | ROSA | Tanzu on AWS | EKS Anywhere | Self-managed K8s | EKS Distro |
| --- | --- | --- | --- | --- | --- | --- |
| AWS Service Integration | Native | Good | Good | Limited | Manual | Basic |
| Marketplace Integration | Full | OpenShift + AWS | VMware + AWS | Limited | Manual | Limited |
| Custom Configuration | Limited | Moderate | Moderate | High | Full | Full |
| Automation Capabilities | High | High | High | Moderate | Manual | Manual |

Use Case Alignment

| Solution | Best For | Key Differentiator | Common Use Cases |
| --- | --- | --- | --- |
| EKS on AWS | Cloud-native workloads | AWS integration | Modern applications, microservices |
| ROSA | Enterprise OpenShift users | Red Hat tooling | Traditional enterprise workloads |
| Tanzu on AWS | VMware shops | VMware consistency | VMware modernization |
| EKS Anywhere | Hybrid/Edge needs | On-prem consistency | Edge computing, hybrid deployments |
| Self-managed K8s | Complete control needs | Full customization | Specialized requirements |
| EKS Distro | Multi-cloud needs | AWS alignment | Custom infrastructure |

Security and Compliance

| Aspect | EKS on AWS | ROSA | Tanzu on AWS | EKS Anywhere | Self-managed K8s | EKS Distro |
| --- | --- | --- | --- | --- | --- | --- |
| Built-in Security | High | High | High | Moderate | Manual | Manual |
| Compliance Certifications | AWS | AWS + Red Hat | AWS + VMware | Varies | DIY | DIY |
| Update Management | Automated | Automated | Automated | Manual | Manual | Manual |
| Security Responsibility | Shared | Shared | Shared | Customer | Customer | Customer |
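If Amazon EKS ends up being the right fit, a quick way to see the data-plane options from section 1 side by side is an eksctl cluster config. A minimal, hedged sketch (the cluster name, region, and sizes are assumptions):

YAML
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-eks              # illustrative name
  region: us-east-1
managedNodeGroups:            # EC2-based managed node group: AWS handles provisioning and lifecycle
  - name: general
    instanceType: t3.medium
    desiredCapacity: 2
    minSize: 1
    maxSize: 3
fargateProfiles:              # serverless data plane: no nodes to manage
  - name: serverless-apps
    selectors:
      - namespace: serverless

Running eksctl create cluster -f against a file like this provisions the control plane plus both data-plane flavors; self-managed node groups would use the nodeGroups key instead.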
In modern DevOps workflows, Jenkins is a cornerstone of continuous integration and continuous deployment (CI/CD) pipelines. Because of its flexibility and wide range of available plugins, it is indispensable for automating build, test, and deployment processes. AWS, in turn, provides Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) as powerful managed services for deploying and managing containerized applications. This article explores how Jenkins can be effectively integrated with both ECS and EKS clusters to optimize the CI/CD process.

Why Use Jenkins With ECS/EKS Clusters?

Scalability

One of the advantages of using AWS ECS and EKS for Jenkins deployments is dynamic scaling. Both services allow Jenkins to handle CI/CD workloads that vary over time: when the build queue grows, Jenkins can provision additional agents on its own as demand dictates. This keeps pipeline execution running smoothly even during peak conditions, for instance when large projects or many builds are triggered together. The ability to scale up and down based on workload needs ensures that Jenkins can handle both small-scale tasks and large, resource-intensive operations without interruption, making the system efficient and reliable in fluctuating environments.

Flexibility

Both ECS and EKS provide a great deal of flexibility in resource allocation. With either service, Jenkins can dynamically provision agents based on demand, using only the resources that are actually required at any given time. On ECS, agents can run as Fargate tasks; on EKS, they run as Kubernetes pods. This is dynamic provisioning in Jenkins: resources are allocated exactly when required and released as soon as they are no longer in use, optimizing the overall infrastructure. Reducing waste through on-demand scaling keeps Jenkins running efficiently, able to scale quickly to pipeline demands while keeping costs under control.

Cost Efficiency

Cost efficiency is a major benefit of using Jenkins with ECS, especially with Fargate. Fargate is a serverless compute engine that lets users run containers without managing the underlying infrastructure. In conventional environments, managing and scaling infrastructure manually is resource-intensive and expensive; with Fargate, you pay only for what is consumed. This pay-as-you-go model is most useful for teams whose workloads fluctuate and who need flexibility and scalability without continuous manual intervention, making it cost-effective for dynamic, high-performance CI/CD environments.

AWS ECS and EKS are well suited for Jenkins deployments because of the scalability and flexibility they offer. Dynamically scaling to workload demand ensures smooth execution during peak times while optimizing resource utilization. Teams can significantly reduce operating costs and improve overall infrastructure efficiency by leveraging on-demand agent provisioning and Fargate's pay-as-you-go model. These benefits make ECS and EKS a robust foundation for maintaining high-performance, cost-effective Jenkins pipelines as development environments fluctuate with dynamic workloads.
Architecture Overview Infrastructure Components A Jenkins deployment on AWS relies on several key infrastructure components that work together to create an efficient, scalable, and reliable CI/CD pipeline. Below is an in-depth breakdown of each component that plays a vital role in the architecture. Jenkins Master The Jenkins Master is the central control unit of the Jenkins deployment. It is responsible for orchestrating the entire build process, managing the job queue, and scheduling the execution of tasks. In a containerized setup, the Jenkins Master runs within a container, typically deployed on AWS ECS or EKS. This containerized deployment ensures that Jenkins is scalable, isolated from other processes, and can be easily managed. The Jenkins Master also manages communication with Jenkins agents, dispatching them tasks for execution. The containerized nature of the master allows for easy updates, scaling, and management as it is isolated from the underlying infrastructure. Jenkins Agents Jenkins Agents are the worker nodes responsible for executing the build and test tasks. They are provisioned dynamically, based on the workload, and can be scaled up or down depending on the build queue. For example, when there is a high demand for builds, Jenkins will automatically spin up new agents to ensure timely execution. Conversely, when the demand decreases, agents are terminated, allowing resources to be freed. In a cloud-based deployment using AWS ECS or EKS, Jenkins agents are containerized and can be run as ECS tasks or Kubernetes pods. This dynamic provisioning of agents allows for efficient resource usage, ensuring that Jenkins always has the necessary compute power for any given workload. Persistent Storage Persistent storage is crucial for maintaining the Jenkins state, logs, build artifacts, and configuration data. Since Jenkins needs to retain historical build data and logs, it's essential to use a reliable and scalable storage solution. AWS Elastic File System (EFS) and Amazon S3 are commonly used to provide this persistence. AWS EFS is a scalable, shared file storage service that can be accessed by multiple instances, making it ideal for Jenkins master and agents that require shared access to files and artifacts. On the other hand, Amazon S3 is used to store static files, including logs, build artifacts, and backups. Both EFS and S3 ensure data integrity and availability, even during scaling operations or node failures. Monitoring To ensure that the Jenkins deployment is running smoothly, it is crucial to have robust monitoring in place. AWS CloudWatch is a powerful tool that allows for the aggregation of logs and tracking the real-time performance of Jenkins. CloudWatch can collect logs from Jenkins, including build logs, system logs, and agent activity, helping to identify issues and bottlenecks in the pipeline. Additionally, CloudWatch allows for performance metrics such as CPU usage, memory consumption, and network traffic to be monitored, which helps in proactive resource management. By setting up CloudWatch alarms, teams can be alerted when thresholds are exceeded, ensuring quick responses to potential issues. This level of visibility and monitoring ensures that Jenkins workflows remain efficient, reliable, and responsive to changes in workload. Together, these infrastructure components form a robust and scalable Jenkins architecture on AWS. 
The containerized Jenkins master, dynamic agent provisioning, persistent storage solutions, and integrated monitoring with CloudWatch all work in unison to create an efficient CI/CD pipeline capable of scaling with demand while maintaining high performance, reliability, and cost efficiency. This architecture makes Jenkins on AWS a powerful solution for modern DevOps workflows, where flexibility and automation are key to successful software delivery. Comparing ECS and EKS for Jenkins Deployment Choosing an appropriate container orchestration platform for deploying Jenkins will bring huge differences in the efficiency and management of your workflow. This comparison highlights the strengths of AWS ECS and EKS to help you decide which platform aligns best with your deployment needs. If comparing ECS with EKS to deploy Jenkins, the first one would work for smaller systems, whereas, if combined with Fargate, it even offers the possibility to do serverless deployment without any kind of infrastructure to manage. At the same time, EKS gives a more controlling perspective through its Kubernetes-based orchestration and thus could fit better in the case of complex workflows or multi-environment deployments within continuous integration or deployment. Setting Up Jenkins on ECS Step 1: Infrastructure Setup Set up the infrastructure that would be necessary to set up Jenkins on ECS. Now, create an ECS cluster with the appropriate networking configurations, including VPCs, subnets, and security groups. Next, define task definitions with the specification of container configuration and IAM roles required by your deployment. Step 2: Deploy Jenkins Master Once the infrastructure is available, containerize Jenkins and deploy it as an ECS service using Fargate; at this stage, create some task definitions for the Jenkins master, which will define configuration for container images, CPU, and memory, but also IAM roles, which will be applied to grant the required permissions to Jenkins. Step 3: Dynamic Agent Provisioning With regard to resource optimization, dynamic building agents can be provisioned using the Jenkins ECS plugin. It manages the ECS tasks as Jenkins agents, and based on this logic, the agents would spin only when needed, automatically terminating at the end of the task to make the entire process smoother. The following walkthrough provides detailed steps for setting up Jenkins on ECS, using AWS services like Fargate, and the Jenkins ECS plugin to simplify your continuous integration/continuous deployment pipelines. You will have a scalable setup with lesser infrastructure management and better resource utilization; hence, this is going to be a pretty robust solution for modern development workflows. Deploying Jenkins on EKS Step 1: EKS Cluster Configuration To deploy Jenkins on EKS, begin by setting up an EKS cluster with essential configurations. Create Kubernetes namespaces and define RBAC policies to manage access and permissions effectively. Additionally, configure networking settings to ensure secure communication within the cluster and with external resources. Step 2: Jenkins Deployment To install Jenkins on Kubernetes, use Helm charts, which will make the process quite easy with predefined templates. The preconstructed templates allow for the easy creation of Jenkins master and agent pods, along with Persistent Volume Claims storing Jenkins data. Because of its modularity and ease of use, Helm is an excellent choice for deploying Jenkins in a Kubernetes environment. 
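As a concrete starting point for the Helm-based install just described, here is a hedged sketch of a values override for the community jenkinsci/jenkins chart. The key names follow that chart's controller/persistence layout as I understand it, and the storage class and resource sizes are assumptions; verify against the chart version you actually use.

YAML
# values-jenkins.yaml — used with something like: helm install jenkins jenkins/jenkins -f values-jenkins.yaml
controller:
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
  serviceType: ClusterIP        # front with an Ingress or LoadBalancer as needed
  installPlugins:
    - kubernetes                # dynamic agent pods on EKS; pin versions (plugin:version) in practice
    - workflow-aggregator
    - configuration-as-code
persistence:
  enabled: true
  storageClass: gp3             # assumption: an EBS-backed StorageClass exists in the cluster
  size: 20Gi

The kubernetes plugin entry is what lets the controller schedule agent pods on the same EKS cluster, matching the dynamic-agent pattern described above.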
Step 3: Persistent Storage and Logging

Store Jenkins data using AWS Elastic Block Store (EBS) or Amazon S3, ensuring it persists across restarts and is monitored for efficiency. For logs, set up AWS CloudWatch to collect and visualize Jenkins output, which makes debugging easier and lets you monitor CI/CD workflows effectively.

Running Jenkins on EKS lets you leverage Kubernetes for scale and resilience in your CI/CD pipelines. You get full control over orchestration, easy integration with AWS services for storage and monitoring, and a flexible platform for managing complex deployment scenarios. Properly configured, and with the right tools such as Helm, you can count on a reliable and efficient Jenkins environment tuned to developers' needs.

Best Practices for Jenkins on ECS/EKS

Optimizing a Jenkins deployment on AWS involves improving resource efficiency, strengthening security, and putting robust monitoring and debugging in place. With this fine-tuning, you can create a resilient and cost-effective CI/CD environment that supports your development workflows effectively.

Optimizing resource usage: Enable auto-scaling policies for ECS and EKS agents so they scale up and down with the workload. Use Fargate Spot instances where possible so agents can run at lower cost during off-peak hours, reducing operational spend without compromising performance.
Enhancing security: Strengthen Jenkins security by integrating role-based access control from your EKS setup, and store credentials and other sensitive data encrypted in AWS Secrets Manager rather than in Jenkins configuration, so important configuration is kept confidential.

Real-World Use Cases

The Cerebro platform was developed by Expedia Group and represents a huge leap in how the company manages databases at scale. As a DBaaS platform, it enables rapid provisioning and efficient management of databases across Expedia's infrastructure. It is built for seamless integration with Expedia's wider technology ecosystem, making database management fast, consistent, and held to a high bar of performance and security. One of Cerebro's primary building blocks is its extensive use of several AWS services to scale and meet Expedia's varied requirements. Amazon EC2 provides scalable compute so the overall system can absorb a wide array of workloads whenever necessary. For storage, Amazon DynamoDB offers a fully managed, high-performance, flexible NoSQL database suited to workloads that require fast and consistent access to data. Amazon Aurora, a relational database service, provides high performance along with automated backups, scaling, and fault tolerance, making it well suited to Expedia's transaction-intensive operations. AWS Glue also plays an important role in automating workflows and processing data by handling the ETL cycle of data ingestion, allowing Expedia to process large datasets and run analytics without building complicated infrastructure. Additionally, Amazon ECS orchestrates and manages containerized applications, so Expedia can easily run microservices and distributed applications.
Another major element is Amazon CloudWatch, which will enable the monitoring of databases, applications, and infrastructure performance in real time. It integrates very well with Cerebro to provide insight into the health of the platform and ensures that any potential issues are identified and fixed quickly. Cerebro was designed with governance and best practices in mind to help standardize database management across the full tech stack at Expedia. By forcing operational standards consistently, it ensures best practices for security, performance, and consistency of data, thereby improving overall reliability and performance of the platform. These enable Expedia Group to reduce operational overhead and decrease the costs associated with managing databases. Adopting AWS' cloud technologies, operational flexibility in things such as rapidly scaling services during periods of high load created by peak travel seasons became much easier to implement. Such functionalities as dynamically provisioned resources, on-demand scalability for applications, and pay-only-for-what-you-use infrastructure have led to significant financial benefits. By finally harnessing the powerful infrastructure of AWS with Cerebro, Expedia is setting up for continued innovation and growth in one of the most competitive online travel industries, where speed and operating efficiency will determine the winners. Challenges and Solutions Slower Read/Write Operations With EFS As you understand by now, AWS Elastic File System (EFS) provides scalable and reliable shared storage for Jenkins. The disadvantage, however, compared to local or block storage, is slower read/write operations, which causes performance problems when workflows require more access to storage. For mitigating this, the combination of EFS with Elastic Block Store (EBS) may be used. For highly IOPS-dependent build processes, perform storage-intensive operations of temporary build files and highly frequently accessed data on EBS for low-latency access, and perform less time-sensitive things, such as logs and backups, on EFS. The frequency of direct EFS access may be further reduced by implementing a caching mechanism using ElastiCache or the natively supported caching of artifacts by Jenkins itself for better performance. Running Docker in Docker (DinD) The traditional way of running DinD Jobs is hard to operate within a containerized environment. Jenkins controller/agent running as Docker containers needs access to the Docker socket on the host. That was less viable since Docker was getting deprecated as a runtime in Kubernetes and also discouraged in the modern setup. The solution would be to use other tools instead of DinD, such as Kaniko, Buildah, or Podman, for the tasks that would normally require DinD. These tools are designed in a containerized environment and, therefore, do not need a Docker runtime, hence working nicely with Kubernetes' CRI. In addition, they even provide additional security due to the reduced possibility of exposing the Docker socket to the containerized environment. Performance Bottlenecks One of the common challenges in Jenkins deployments is performance bottlenecks, especially as workloads increase. A potential bottleneck can occur when the Jenkins master node is overwhelmed by high traffic or large numbers of concurrent builds. To mitigate this, you can use load balancing for Jenkins master nodes, distributing the load across multiple instances to ensure that no single node becomes a point of failure. 
Additionally, optimizing agent configurations is crucial to avoid resource exhaustion. This includes adjusting the CPU, memory, and disk space allocated to Jenkins agents to match the workloads they handle, as well as enabling dynamic provisioning so agents spin up when needed and scale down during idle periods.

Agent Management

Efficient agent management is critical for maintaining a smooth CI/CD pipeline. Jenkins plugins such as the ECS plugin for Amazon ECS or the Kubernetes plugin for EKS can streamline agent lifecycle management. These plugins automate provisioning, scaling, and terminating Jenkins agents based on the workload. With the ECS plugin, for example, Jenkins can automatically launch ECS tasks as build agents and terminate them when no longer needed, optimizing resource usage and minimizing costs. Similarly, the Kubernetes plugin can manage agent pods dynamically, ensuring that only the necessary number of pods is running at any given time based on build queue demand (a minimal Helm values sketch for agent sizing and persistence follows the conclusion below).

Conclusion

Integrating Jenkins with AWS ECS or EKS streamlines CI/CD workflows with scalable, flexible, and cost-efficient solutions. ECS allows easy deployment of Jenkins on Fargate, eliminating infrastructure management, while EKS provides Kubernetes-level control for complex setups. The benefits include dynamic scaling for fluctuating workloads, on-demand resource use to cut costs, and secure operations with features like role-based access and encrypted credentials. With AWS tools like CloudWatch for monitoring and EFS for storage, this setup ensures reliability and performance. By adopting AWS-managed services, teams can build a robust, scalable Jenkins infrastructure that accelerates software delivery.
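To tie the storage and agent guidance in this article to something runnable, the sketch below shows Helm values for the community Jenkins chart (jenkins/jenkins) on EKS. It is a minimal sketch, not a drop-in configuration: the key names follow the chart's documented layout at the time of writing (verify them against your chart version), and the gp3 storage class, resource sizes, and idle window are assumptions to adjust for your cluster.

YAML
# values.yaml (sketch): EBS-backed persistence plus right-sized dynamic agents
controller:
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
persistence:
  enabled: true
  storageClass: gp3      # assumes an EBS-backed StorageClass named gp3 exists
  size: 20Gi
agent:
  resources:
    requests:
      cpu: "250m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
  idleMinutes: 5         # retire agents after idle periods
  containerCap: 10       # cap concurrent dynamic agents

Applied with helm upgrade --install jenkins jenkins/jenkins -f values.yaml (assuming the chart repository is registered under the jenkins alias), this keeps the controller's home directory on an EBS volume while the Kubernetes plugin provisions and retires agent pods on demand.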
ChartMuseum is an open-source, self-hosted Helm Chart repository server that enables users to store and manage Helm charts efficiently. Helm is the standard package manager for Kubernetes, allowing developers to deploy applications seamlessly. While Helm provides public repositories like Artifact Hub, organizations often require private and secure repositories for managing their Helm charts internally. ChartMuseum fills this gap by offering a lightweight and flexible solution. ChartMuseum provides a robust API that allows users to interact with it programmatically, making it an essential tool for automated CI/CD pipelines. It is written in Go and can be deployed as a standalone binary, within a container, or as a Kubernetes deployment. How ChartMuseum Works ChartMuseum acts as an HTTP server that exposes endpoints to upload, retrieve, and manage Helm charts. It supports multiple storage backends, allowing organizations to choose the best option based on their infrastructure. The Helm CLI can interact with ChartMuseum just like any other Helm repository. The core functionalities of ChartMuseum include: Chart Uploading: Users can push Helm charts to the repository using HTTP POST requests.Chart Indexing: ChartMuseum automatically updates the repository index when new charts are uploaded.Chart Retrieval: Users can fetch charts using Helm commands.Authentication & Authorization: Supports authentication methods like Basic Auth, JWT, and OAuth.Multi-Tenant Support: Allows hosting multiple chart repositories within a single instance. Advantages of ChartMuseum Over Other Chart Storage Platforms Self-hosted and Secure: Unlike public Helm repositories such as Artifact Hub, ChartMuseum allows organizations to keep their charts within their infrastructure, providing better security and compliance control.Lightweight and Easy to Deploy: ChartMuseum is designed as a lightweight server that can be deployed as a Kubernetes pod, Docker container, or standalone binary, making it extremely flexible.Multiple Storage Backend Support: ChartMuseum supports a variety of storage backends, including local file systems, AWS S3, Google Cloud Storage, Azure Blob Storage, and more, providing flexibility to users.API-driven Architecture: ChartMuseum provides a RESTful API for managing Helm charts, making it easy to integrate into CI/CD pipelines and automated workflows.Integration with Kubernetes Workflows: Since ChartMuseum is built with Kubernetes in mind, it integrates well with Kubernetes-native tools and workflows.Multi-tenancy and Authentication: ChartMuseum supports authentication mechanisms such as Basic Auth and can be combined with an NGINX ingress for added security and multi-tenant capabilities.Cost-effective: Unlike some commercial Helm chart repositories that require licensing fees, ChartMuseum is open-source and free to use.Community Support and Open Source Contributions: Being open-source, ChartMuseum is actively maintained by the community, ensuring that it is regularly updated with new features and bug fixes. 
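Since ChartMuseum can run as a standalone container and supports a local filesystem backend, the smallest possible setup is a single Compose service. The following is a minimal sketch rather than an official example: the image tag, credentials, and the ./charts host path are illustrative, while the environment variable names follow ChartMuseum's documented configuration.

YAML
# compose.yaml (sketch): ChartMuseum with local filesystem storage and Basic Auth
services:
  chartmuseum:
    image: ghcr.io/helm/chartmuseum:v0.16.2   # pick a current release tag
    ports:
      - "8080:8080"
    environment:
      STORAGE: local
      STORAGE_LOCAL_ROOTDIR: /charts
      BASIC_AUTH_USER: admin
      BASIC_AUTH_PASS: password
      DISABLE_API: "false"
    volumes:
      - ./charts:/charts

Swapping STORAGE to amazon, google, or microsoft (with the matching bucket variables) is how the alternative storage backends mentioned above are selected.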
ChartMuseum vs JFrog Artifactory Simple Setup & Deployment: ChartMuseum is a lightweight server that can be deployed quickly in Kubernetes using a Helm chart, whereas JFrog Artifactory requires more complex configurations and additional dependencies.Minimal Resource Consumption: ChartMuseum runs efficiently with minimal memory and CPU usage, while Artifactory is a heavier solution that requires more system resources.Easier Authentication: ChartMuseum supports straightforward authentication methods like Basic Auth and JWT, while JFrog Artifactory requires detailed role-based access control (RBAC) configurations.Direct API Access: ChartMuseum provides a simple RESTful API for pushing, pulling, and managing charts, making automation easier, while JFrog Artifactory’s API is more complex and geared towards enterprise use cases.No Licensing Costs: Unlike JFrog Artifactory, which requires a paid subscription for advanced features, ChartMuseum is completely free and open-source, making it cost-effective for organizations.Kubernetes-Native Integration: ChartMuseum is designed with Kubernetes in mind, making it a seamless fit for Helm-based deployments without requiring additional plugins or connectors. Deploying ChartMuseum on Kubernetes Let’s deploy ChartMuseum in a Kubernetes cluster using the official Helm chart. Prerequisites Ensure you have the following installed: kubectlHelmKubernetes cluster Installing ChartMuseum Using Helm To enable authentication, we configure ChartMuseum to use Basic Auth and JWT. Run the following command to install ChartMuseum with authentication: helm repo add chartmuseum https://chartmuseum.github.io/charts helm repo update helm install my-chartmuseum chartmuseum/chartmuseum \ --set env.open.DISABLE_API=false \ --set env.open.BASIC_AUTH_USER=admin \ --set env.open.BASIC_AUTH_PASS=password \ --set env.open.AUTH_ANONYMOUS_GET=false This command: Enables authentication with a username (admin) and password (password).Disables anonymous access to prevent unauthorized pulls. Check Running ChartMuseum Pods kubectl get pods -l app.kubernetes.io/name=chartmuseum Internal Access to ChartMuseum To ensure that ChartMuseum is only accessible within the Kubernetes cluster and not exposed externally, create a ClusterIP service: kubectl expose deployment my-chartmuseum --type=ClusterIP --name=chartmuseum-service Adding ChartMuseum as a Helm Repo helm repo add my-chartmuseum http://chartmuseum-service.default.svc.cluster.local --username admin --password password helm repo update Pushing Charts to ChartMuseum To push a chart, first package it: helm package my-chart Now, push it using Basic Auth: curl -u admin:password --data-binary "@my-chart-0.1.0.tgz" http://chartmuseum-service.default.svc.cluster.local/api/charts Enabling JWT Authentication To enhance security, JWT authentication can be enabled by setting an environment variable. 
Modify your deployment to include: env: - name: AUTH_REALM value: "chartmuseum" - name: AUTH_SECRET value: "mysecretkey" - name: AUTH_ISSUER value: "myissuer" To authenticate with JWT, generate a token and use it while pushing or pulling charts: export TOKEN="$(echo '{"iss":"myissuer"}' | openssl dgst -sha256 -hmac "mysecretkey" -binary | base64)" Push a chart using JWT authentication: curl -H "Authorization: Bearer $TOKEN" --data-binary "@my-chart-0.1.0.tgz" http://chartmuseum-service.default.svc.cluster.local/api/charts Installing a Chart from ChartMuseum To install a chart: helm install my-release my-chartmuseum/my-chart --username admin --password password For JWT authentication: helm install my-release my-chartmuseum/my-chart --set global.imagePullSecrets[0].name=jwt-secret Deploying an Application Using a Helm Chart from ChartMuseum Example: Deploying a Nginx Application Assuming that we have pushed an Nginx Helm chart to ChartMuseum, we can deploy it as follows: helm install my-nginx my-chartmuseum/nginx --set service.type=ClusterIP --set replicas=2 --username admin --password password For JWT authentication: helm install my-nginx my-chartmuseum/nginx --set global.imagePullSecrets[0].name=jwt-secret Verifying the Deployment kubectl get deployments kubectl get pods -l app=my-nginx Automations Supported by ChartMuseum ChartMuseum supports several automation features: Automated Chart IndexingWebhook IntegrationCI/CD IntegrationStorage Backend AutomationAuthentication & Authorization (Basic Auth, JWT)API-driven Management References ChartMuseum GitHub RepositoryChartMuseum DocumentationHelm Documentation Now you’re ready to manage and secure your own Helm charts with ChartMuseum in Kubernetes!
Enterprise cloud architecture demands sophisticated orchestration of infrastructure, configuration, and workload management across diverse computing platforms. The traditional approach of manual provisioning and siloed tool adoption has become a bottleneck for organizations seeking cloud-native agility while maintaining operational excellence. This article explores the strategic integration of three complementary automation technologies: Terraform for infrastructure provisioning, Ansible for configuration management, and HashiCorp Nomad, a lightweight workload orchestrator that manages application deployment, scaling, and scheduling across diverse infrastructure environments with minimal operational overhead.

Unlike monolithic solutions, this ecosystem approach leverages specialized tools that excel in their respective domains while maintaining platform-agnostic capabilities across AWS, Azure, Google Cloud, IBM Cloud, and hybrid environments. The convergence of Infrastructure as Code (IaC) principles with flexible orchestration platforms enables enterprises to achieve unprecedented consistency, scalability, and operational efficiency. By adopting compute-platform-agnostic strategies, organizations reduce vendor lock-in while optimizing for specific workload requirements across their multi-cloud infrastructure.

The strategic approach is to use Terraform for Day 0 infrastructure creation, Ansible for Day 1+ configuration management and ongoing maintenance, and Nomad for Day 2+ application orchestration and workload management across your enterprise platform.

Strategic Tool Positioning and Enterprise Value

Core Technology Comparison

Tool | Primary Domain | Enterprise Value Proposition | Strategic Use Cases
Terraform | Infrastructure provisioning | Declarative infrastructure definition with state management | Cloud resource provisioning, network topology design, and multi-cloud consistency
Ansible | Configuration management | Agentless automation with an extensive ecosystem | OS hardening, application deployment, compliance enforcement
Nomad | Workload orchestration | Lightweight, flexible scheduling across diverse workloads | Container orchestration, batch processing, service mesh integration

Architecture Decision Framework

Operational complexity: Terraform's declarative approach eliminates configuration drift at the infrastructure layer, while Ansible ensures consistent system-level configuration. Nomad provides simplified orchestration without the operational overhead of more complex platforms.
Multi-cloud strategy: All three tools support cloud-agnostic deployments, enabling organizations to implement true multi-cloud architectures without platform-specific automation lock-in.
Team structure alignment: This toolkit naturally distributes responsibilities: infrastructure teams own Terraform modules, system administrators manage Ansible playbooks, and application teams define Nomad job specifications.

Infrastructure Provisioning Excellence With Terraform

Platform Agnostic Infrastructure Patterns

Terraform's provider ecosystem enables consistent infrastructure patterns across cloud platforms. Organizations can define standardized network topologies, security policies, and resource configurations that adapt to platform-specific implementations while maintaining architectural consistency.

Figure: Enterprise cloud architecture

Network Architecture Standardization

Enterprise applications require sophisticated network segmentation regardless of cloud provider.
Terraform modules can abstract platform differences while implementing consistent security boundaries.

Resource Lifecycle Management

Complex enterprise applications often span multiple clouds for disaster recovery or cost optimization. Terraform's dependency resolution ensures coordinated provisioning across heterogeneous environments.

Governance Integration

Policy-as-code frameworks like Sentinel or Open Policy Agent integrate with Terraform to enforce compliance requirements automatically, regardless of the target platform.

Plain Text
# Enterprise VPC Foundation
resource "ibm_is_vpc" "enterprise_vpc" {
  name = var.environment_name
  tags = local.common_tags
}

# Multi-tier subnet architecture
resource "ibm_is_subnet" "application_tiers" {
  for_each = var.subnet_configuration

  name            = "${var.environment_name}-${each.key}-subnet"
  vpc             = ibm_is_vpc.enterprise_vpc.id
  zone            = each.value.zone
  ipv4_cidr_block = each.value.cidr
}

Configuration Management With Ansible

Universal System Configuration

Ansible's agentless architecture and extensive module library make it ideal for managing diverse enterprise environments spanning traditional servers, containers, network devices, and cloud services across any compute platform.

Security Baseline Enforcement

Enterprise security policies must apply consistently across all compute platforms. Ansible playbooks codify security hardening procedures that adapt to platform-specific requirements while maintaining security standards.

Application Runtime Standardization

Complex enterprise applications require specific configurations regardless of the deployment target. Ansible ensures runtime environments meet application requirements across diverse platforms.

Compliance Automation

Regulatory requirements often mandate specific system configurations. Ansible automates compliance verification and remediation across heterogeneous infrastructure.

YAML
# Platform agnostic security hardening
- name: Enterprise Security Baseline
  hosts: all
  become: yes
  tasks:
    - name: Configure security policies
      include_tasks: "security/{{ ansible_os_family | lower }}.yml"

    - name: Apply compliance settings
      include_role:
        name: "compliance.{{ compliance_framework }}"

Workload Orchestration With HashiCorp Nomad

Introduction to Simplified Enterprise Orchestration

HashiCorp Nomad addresses enterprise workload management through a fundamentally different approach than complex orchestration platforms. While maintaining enterprise-grade features, Nomad prioritizes operational simplicity and workload diversity support.

Nomad vs. Kubernetes: Strategic Comparison

Aspect | Nomad | Kubernetes
Architecture | Simple, single binary (servers, clients) | Complex, modular (many components: API server, etcd)
Workload types | Containers, VMs, executables, legacy apps | Primarily containers (extensions for VMs)
Setup and management | Fast, easy, minimal dependencies | Steep learning curve, many moving parts
Resource use | Lightweight, cost-effective, performant | Heavier, optimized for large-scale clusters
Service discovery | Integrates with Consul (external) | Built-in (CoreDNS, Services)
Secrets management | Vault integration (external) | Built-in
Ecosystem | Focused integration with HashiCorp tools | Massive, broad, numerous plugins/tools
Scalability | 10,000+ nodes, 2M+ tasks | Up to 5,000 nodes, 300K containers per cluster
Platform support | Platform-agnostic, any OS, any cloud | Linux-first (Windows nodes supported), mostly cloud-native

Summary of Key Points

Nomad's simplicity means it can be quickly deployed and managed by smaller teams.
It is perfect for enterprises that want orchestration with minimal operational complexity, regardless of the underlying compute platform.
Kubernetes offers unparalleled power for container-centric workflows, especially where advanced networking, multi-cluster, and ecosystem features are critical.
Nomad is better for diverse workload environments, enabling side-by-side deployment of containers, legacy binaries, and VMs. Kubernetes usually requires "containerizing everything" or using third-party plugins to manage non-container workloads.
Operational efficiency: Nomad uses fewer resources, is easier to upgrade, and requires less expertise to operate. Kubernetes offers enhanced power but demands dedicated platform engineering.

Enterprise Workload Management Advantages

Workload Diversity

Unlike Kubernetes's container-centric approach, Nomad orchestrates containers, traditional applications, batch jobs, and system services within a unified scheduling framework. This flexibility proves crucial for enterprises with diverse application portfolios.

Operational Simplicity

Kubernetes complexity often becomes an operational bottleneck in enterprise environments. Nomad's streamlined architecture reduces operational burden while delivering enterprise features like multi-region federation and comprehensive security integration.

Platform Flexibility

Nomad runs consistently across any compute platform, enabling true workload portability without platform-specific orchestration dependencies.

Resource Efficiency

Advanced bin-packing algorithms and flexible resource constraints optimize infrastructure utilization across diverse workload types and compute platforms.

Plain Text
# Multi-workload orchestration example
job "enterprise_workloads" {
  datacenters = ["aws-east", "azure-west", "on-premise"]

  group "web_services" {
    count = 3
    task "api" {
      driver = "docker"   # Container workload
    }
  }

  group "batch_processing" {
    count = 1
    task "data_processor" {
      driver = "exec"     # Traditional binary execution
    }
  }
}

Integration Architecture and Workflow

Unified Automation Pipeline Design

Enterprise success requires these tools to operate as an integrated ecosystem rather than isolated solutions. Effective integration leverages each tool's strengths while maintaining clear responsibility boundaries.

Phase | Primary Tool | Key Activities | Integration Points
Planning | Terraform | Cross-platform resource planning | Generate inventory for Ansible
Provisioning | Terraform | Infrastructure creation across clouds | Trigger configuration management
Configuration | Ansible | Universal system setup | Prepare orchestration targets
Deployment | Nomad | Multi-platform workload scheduling | Integrate with load balancers
Operations | All tools | Coordinated lifecycle management | Unified monitoring and alerting

Platform Agnostic Pipeline Benefits

Vendor independence: Organizations avoid platform-specific automation dependencies, enabling strategic cloud provider decisions based on business requirements rather than technical constraints.
Consistent operations: Identical automation patterns apply across different cloud platforms, reducing operational complexity and training requirements.
Cost optimization: Platform flexibility enables workload placement optimization based on cost, performance, or regulatory requirements.
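The pipeline phases above hand off from one tool to the next: Terraform emits an inventory, Ansible configures the hosts, and Nomad schedules the workloads. The article does not prescribe a CI system, so the following is only a hedged sketch in GitHub Actions syntax; the directory names, the ansible_inventory Terraform output, and the Nomad job file path are illustrative assumptions.

YAML
# Sketch: one pipeline driving Terraform (Day 0), Ansible (Day 1+), and Nomad (Day 2+)
name: provision-configure-deploy
on:
  push:
    branches: [main]
jobs:
  platform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Provision infrastructure
        run: |
          terraform -chdir=infra init -input=false
          terraform -chdir=infra apply -auto-approve
      - name: Generate Ansible inventory from Terraform outputs
        run: terraform -chdir=infra output -raw ansible_inventory > inventory.ini
      - name: Configure systems
        run: ansible-playbook -i inventory.ini playbooks/baseline.yml
      - name: Deploy workloads
        env:
          NOMAD_ADDR: ${{ secrets.NOMAD_ADDR }}
        run: nomad job run jobs/enterprise_workloads.nomad

The same sequence could run in GitLab CI, Jenkins, or any other runner; what matters is that each stage consumes the previous stage's output rather than duplicating its responsibilities.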
Enterprise Implementation Best Practices

Define environments and resources in Terraform for repeatability and version control.
Automate configuration with Ansible to keep systems secure and up to date.
Use Nomad for portable, scalable workload orchestration across compute platforms and cloud boundaries.
Integrate monitoring and logging using observability tools; Nomad and Kubernetes both work well with Prometheus/Grafana (a minimal Prometheus scrape sketch for Nomad appears just before the Performance and Scalability section below).
Plan for disaster recovery, security, and compliance: use Vault for secrets, define security groups in infrastructure code, and automate backups.

Figure: Continuous integration flow

Organizational Excellence

Cross-platform expertise: Teams develop transferable skills focused on automation principles rather than platform-specific implementations, improving organizational agility and reducing vendor dependency.
Governance framework: Enterprise policies apply consistently across all platforms through code-driven enforcement, ensuring compliance regardless of deployment target.
Security integration: Identity management, secrets handling, and network security policies maintain consistency across heterogeneous environments.

Technical Excellence Patterns

Modularity: Reusable components adapt to different platforms while maintaining functional consistency, reducing development effort and improving maintainability.
Testing strategy: Automation validation must work across multiple platforms, requiring comprehensive testing approaches that verify both platform-specific implementations and cross-platform consistency.
Monitoring integration: Unified observability across diverse platforms provides consistent operational visibility regardless of underlying infrastructure.

Security and Compliance Considerations

Platform Agnostic Security

Enterprise security requirements must apply consistently across all compute platforms. This automation ecosystem enables security policy implementation that adapts to platform capabilities while maintaining security standards.

Identity integration: Authentication and authorization policies integrate with enterprise identity providers regardless of the target platform.
Network security: Security group policies and network segmentation rules translate appropriately across different cloud networking models.
Compliance automation: Regulatory requirements implementation adapts to platform-specific capabilities while maintaining compliance objectives.

Cost Optimization and Resource Efficiency

Multi-Platform Cost Strategy

Platform-agnostic automation enables sophisticated cost optimization strategies that leverage pricing differences and feature variations across cloud providers.

Workload placement: Applications can be deployed on optimal platforms based on cost, performance, and regulatory requirements without automation rework.
Resource right-sizing: Consistent resource allocation policies apply across platforms while adapting to platform-specific instance types and pricing models.
Environment management: Automated environment provisioning and deprovisioning work identically across platforms, eliminating resource waste.
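The monitoring best practice above pairs Nomad with Prometheus and Grafana. As one hedged illustration (not part of the original article's configuration), Nomad agents can expose Prometheus-format metrics on their HTTP API once telemetry is enabled, and Prometheus only needs a small scrape job to collect them; the target address below is an assumption.

YAML
# prometheus.yml fragment (sketch): scrape Nomad agent telemetry
# Assumes the Nomad agent config sets: telemetry { prometheus_metrics = true }
scrape_configs:
  - job_name: nomad
    metrics_path: /v1/metrics
    params:
      format: ["prometheus"]
    static_configs:
      - targets: ["nomad-server.example.internal:4646"]  # illustrative address

Grafana can then chart scheduler and allocation metrics from this job alongside infrastructure metrics, providing the unified observability described in the integration table earlier.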
Performance and Scalability Enterprise Scale Considerations Geographic distribution: Workloads can be distributed across multiple cloud providers and regions based on performance requirements rather than automation limitations.Disaster recovery: Cross-platform capabilities enable sophisticated disaster recovery strategies that span multiple cloud providers.Capacity management: Dynamic scaling policies adapt to platform-specific capabilities while maintaining consistent application behavior. Future-Proofing Strategy Technology Evolution Adaptation Platform-agnostic automation approaches provide flexibility to adopt new cloud services and technologies without wholesale automation replacement. Innovation adoption: New platform capabilities can be integrated into existing automation workflows without disrupting operational patterns.Vendor negotiation: Reduced vendor lock-in improves negotiating position with cloud providers and enables strategic platform decisions.Skill investment: Team capabilities focus on transferable automation principles rather than platform-specific knowledge that may become obsolete. Conclusion The strategic integration of Terraform, Ansible, and HashiCorp Nomad represents a maturation of enterprise cloud automation that prioritizes operational excellence over technological complexity. By adopting platform-agnostic approaches, organizations achieve true cloud flexibility while maintaining operational discipline. The choice of Nomad over Kubernetes reflects enterprise priorities of operational simplicity and workload diversity over container-centric complexity. This decision enables organizations to orchestrate their complete application portfolio through unified platforms while avoiding the operational overhead associated with more complex orchestration systems. Enterprise success with cloud automation is measured by business outcomes rather than technological sophistication. This toolkit provides the foundation for achieving improved agility, reduced operational risk, and enhanced innovation capacity while maintaining the governance and compliance requirements essential for regulated environments. The platform-agnostic approach enables organizations to optimize their cloud strategies based on business requirements rather than technical constraints. This flexibility becomes a strategic asset that supports sustainable growth and competitive advantage in an increasingly digital business environment, while providing the operational foundation necessary for long-term success across diverse computing platforms.
Selenium WebDriver, Selenium Grid 4, Jenkins, and Docker Compose are popular and well-known tools. When combined, these tools are a powerful combination for web automation testing. The combination of these tools can help us set up an on-demand local infrastructure, enabling us to spin up the environment as needed for running our web automation tests at scale. Consider a scenario where we need to run multiple web automation tests on different browsers to verify the functionality and stability of the web application. Combining Selenium Grid 4 with Docker Compose can help set up browsers with a single command, allowing us to perform the required test execution smoothly with Jenkins Jobs. Prerequisites The following applications should be installed on the local machine: DockerJenkins Docker can be easily installed by downloading the application and following the instructions provided on its official website. The Jenkins, Selenium Grid 4, and the OWASP Juice Shop website (application under test) will be set up using Docker Compose. Docker Container networking is required so the containers can connect and communicate with each other. For this, we will be using the Docker Network, so the containers running Jenkins, Selenium Grid 4, and the OWASP Juice Shop website can communicate seamlessly. Next, we will install and set up Jenkins using Docker Compose. After installing Jenkins, we will set up a Jenkins Agent with Docker Compose. Docker Network Container networking refers to the capability of containers to connect and communicate with one another, as well as with non-Docker workloads. By default, networking is enabled for containers, allowing them to initiate outgoing connections. Let’s create a custom Docker network by running the following command in the terminal: Plain Text docker network create selenium-jenkins-network Decoding the Command docker network create – This command tells Docker to create a new custom network.selenium-jenkins-network – This is the name we have assigned to the custom network. (Any name can be assigned here.) We can verify if the network is created successfully by running the following command: Plain Text docker network ls With the output, it can be confirmed that the custom network selenium-jenkins-network is created successfully. Setting Up Jenkins With Docker Compose We will need a complete Jenkins setup before proceeding to create a Jenkins job to run the Selenium tests. The following is the Docker Compose file that has the setup for Jenkins and a Jenkins Agent: YAML # docker-compose.yaml version: '3.8' services: jenkins: image: jenkins/jenkins:lts privileged: true user: root ports: - 8080:8080 - 50000:50000 container_name: jenkins volumes: - /Users/faisalkhatri/jenkins_compose/jenkins_configuration:/var/jenkins_home - /var/run/docker.sock:/var/run/docker.sock networks: - selenium-jenkins-network agent: image: jenkins/ssh-agent:latest-jdk21 privileged: true user: root container_name: agent expose: - 22 environment: - JENKINS_AGENT_SSH_PUBKEY=ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEhDGcRRY470bLQLigEKzTMvDL7zICF5CI1MAAc6PC5v [email protected] networks: - selenium-jenkins-network networks: selenium-jenkins-network: external: true Let us focus on the network part in this file, as it is important in the context of this blog. Details about other fields and values in the Docker Compose can be learnt by going through the previous blogs on Jenkins and Jenkins Agent setup. 
YAML
networks:
  selenium-jenkins-network:
    external: true

The networks key defines the network that the services in this Docker Compose file will use. selenium-jenkins-network is the name of the network that will be referenced. Setting external: true tells Docker not to create this network but to connect to the existing custom network named selenium-jenkins-network. (We already created this custom network in the earlier step.) The other network-related update in the Docker Compose file is to add a networks key with the custom network name to each service.

YAML
services:
  jenkins:
    # ..
    networks:
      - selenium-jenkins-network
  agent:
    # ..
    networks:
      - selenium-jenkins-network

This network mapping is also important because Jenkins and the agent run in separate containers, so each must know which network to connect to.

Starting Jenkins

Jenkins and its agent can be started by navigating to the folder where the docker-compose.yaml file is available and running the following command in the terminal:

Plain Text
docker compose up -d

Next, open a browser and navigate to "http://localhost:8080" to confirm that Jenkins is up and running. As Jenkins has already been installed and configured, the login screen is now displayed. If you are running it for the first time, I would recommend checking the following blogs to set up Jenkins and the Jenkins Agent:

How to Install and Set Up Jenkins with Docker Compose
How to add a Jenkins Agent with Docker Compose

Setting Up Selenium Grid 4 With Docker Compose

Selenium Grid 4 enables parallel execution of Selenium WebDriver tests across multiple platforms and browsers, including different browser versions. The following Docker Compose file will be used to start Selenium Grid 4:

YAML
version: "3"
services:
  chrome:
    image: selenium/node-chromium:latest
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_INSTANCES=4
      - SE_NODE_MAX_SESSIONS=4
      - SE_NODE_SESSION_TIMEOUT=180
    networks:
      - selenium-jenkins-network

  firefox:
    image: selenium/node-firefox:latest
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_INSTANCES=1
      - SE_NODE_MAX_SESSIONS=1
      - SE_NODE_SESSION_TIMEOUT=180
    networks:
      - selenium-jenkins-network

  selenium-hub:
    image: selenium/hub:latest
    container_name: selenium-hub
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"
    networks:
      - selenium-jenkins-network

networks:
  selenium-jenkins-network:
    external: true

It should be noted that the Docker network details must be updated in this file so the Selenium Grid 4 containers can connect with the Jenkins container for seamless execution of Selenium tests. The networks key at the end tells Docker to connect to the already created custom selenium-jenkins-network. Similarly, the chrome, firefox, and selenium-hub services each have a networks key, so all of these services connect to the same network.

Starting Selenium Grid 4

The following command should be run from a new terminal in the folder where the Selenium Grid 4 Docker Compose file is saved:

Plain Text
docker compose -f docker-compose-v3-seleniumgrid.yml up -d

The -f <filename> part of the above command is optional. I have multiple Docker Compose files in the same folder, so I have set different file names and am using the -f flag.
Open a browser and navigate to "http://localhost:4444" to confirm that Selenium Grid 4 is up and running with four instances of the Chrome browser and one instance of the Firefox browser. Check out How to setup Selenium Grid 4 with Docker Compose for a detailed understanding of Selenium Grid 4 and its components.

Docker Container Logs

As all of the applications are running with Docker Compose, it is necessary to know how to monitor the containers and check the logs in case anything goes wrong. We started multiple containers, each serving a different purpose. Jenkins runs in one container, and its agent runs in another. Similarly, for Selenium Grid 4, the Selenium hub, Chrome, and Firefox run in different containers. To check the logs for a specific container, we need the name of the container along with the Docker Compose file name. The following command can be run from the terminal to check the names of the Docker containers currently running:

Plain Text
docker container ls

However, this command provides all the details, such as the container ID, command, status, creation time, ports, and so on. To get only the names of the containers, we can run the following command:

Plain Text
docker container ls --format "{{.Names}}"

We now have the names of all the containers that are currently active and running. Let's check the logs generated for the Docker Compose file "docker-compose-v3-seleniumgrid.yml", which was used to start Selenium Grid 4. The following command should be run:

Plain Text
docker compose -f docker-compose-v3-seleniumgrid.yml logs

This command will show the logs of all services that were initialized using the specified Docker Compose file. If we need to check the logs for a specific service, we can add the <service-name> at the end:

Plain Text
docker compose -f docker-compose-v3-seleniumgrid.yml logs selenium-hub

We can also run the container in interactive mode to get more details related to the Selenium Grid status. The following command will run the container in interactive mode:

Plain Text
docker exec -it <container name> /bin/bash

For example, if we need to run the "selenium-hub" container in interactive mode, the following command should be used:

Plain Text
docker exec -it selenium-hub /bin/bash

It will start an interactive session in the "selenium-hub" container. We can run the following curl command inside the "selenium-hub" container to check its status:

Plain Text
curl http://selenium-hub:4444/status

This command outputs details about every registered Node. For every Node, the status includes information regarding Node availability, sessions, and slots. Check this link for more information related to the Selenium Grid 4 Endpoints.

Docker Network Logs

After Selenium Grid 4 and Jenkins are up and running, the Docker network can be inspected by running the following command in the terminal:

Plain Text
docker network inspect selenium-jenkins-network

The output of this command displays detailed information about the selenium-jenkins-network, as shown in the screenshot below. The output shows that the containers for Jenkins, the Jenkins Agent, the Selenium hub, Chrome, and Firefox are all connected to the Docker network.

Application Under Test

The Registration and Login screens of the OWASP Juice Shop demo website are under test.
We will be running this website on the local machine with the following Docker Compose file named “ docker-compose-v3-juiceshop.yml”: YAML version: "3" services: juice-shop: image: bkimminich/juice-shop ports: - 3000:3000 In a new terminal screen, run the following command to start the website: Plain Text docker-compose -f docker-compose-v3-juiceshop.yml up -d After the command is successfully executed, open a browser and navigate to http://localhost:3000 to verify that the website is up and running. Test Scenarios The following two test scenarios will be used to demonstrate the Selenium test execution on Selenium Grid with Jenkins: Test Scenario 1 Navigate to the Login Screen of the Juice Shop website.Click on the “Not yet a customer?” link.Fill in the registration details and click on the “Register” button.Verify the message “Registration completed successfully. You can now log in.” Test Scenario 2 Verify the Login Page title is “Login.”Enter the registered credentials and click on the “Log in” button.Verify that the “Logout” option is displayed on successful login. Implementation Selenium WebDriver with Java is used for implementing test scenarios. A Maven project has been created, and the dependencies for Selenium WebDriver, TestNG, and DataFaker have been added to the project’s pom.xml file. The DataFaker library is used to create random test data in real-time. This will allow us to register new users hassle-free on every test run. Base Test A BaseTest class is created for browser setup and configuration. The ThreadLocal class is used to set drivers because it is thread-safe and well-suited for parallel test execution. With ThreadLocal, each thread has its own isolated variable, ensuring that threads cannot access or interfere with each other’s values, even when using the same ThreadLocal object. Java @Parameters ("browser") @BeforeClass (alwaysRun = true) public void setup (final String browser) { try { if (browser.equalsIgnoreCase ("chrome")) { final ChromeOptions chromeOptions = new ChromeOptions (); chromeOptions.setCapability ("se:name", "Test on Grid - Chrome"); setDriver (new RemoteWebDriver (new URL ("http://selenium-hub:4444"), chromeOptions)); } else if (browser.equalsIgnoreCase ("firefox")) { final FirefoxOptions firefoxOptions = new FirefoxOptions (); firefoxOptions.setCapability ("se:name", "Test on Grid - Firefox"); setDriver (new RemoteWebDriver (new URL ("http://selenium-hub:4444"), firefoxOptions)); } else if (browser.equalsIgnoreCase ("localchrome")) { setDriver (new ChromeDriver ()); } else if (browser.equalsIgnoreCase ("localfirefox")) { setDriver (new FirefoxDriver ()); } else { throw new Error ("Browser configuration is not defined!!"); } } catch (final MalformedURLException e) { throw new Error ("Error setting up browsers in Grid"); } getDriver ().manage () .window () .maximize (); getDriver ().manage () .timeouts () .implicitlyWait (Duration.ofSeconds (30)); } The RemoteWebDriver class is used since we will be running the tests on the remote machine, not on the local one. The values for the browsers will be supplied using the testng.xml file, so the browser on which the tests need to be run can be updated externally without modifying the code. 
Java try { if (browser.equalsIgnoreCase ("chrome")) { final ChromeOptions chromeOptions = new ChromeOptions (); chromeOptions.setCapability ("se:name", "Test on Grid - Chrome"); setDriver (new RemoteWebDriver (new URL ("http://selenium-hub:4444"), chromeOptions)); } else if (browser.equalsIgnoreCase ("firefox")) { final FirefoxOptions firefoxOptions = new FirefoxOptions (); firefoxOptions.setCapability ("se:name", "Test on Grid - Firefox"); setDriver (new RemoteWebDriver (new URL ("http://selenium-hub:4444"), firefoxOptions)); } else if (browser.equalsIgnoreCase ("localchrome")) { setDriver (new ChromeDriver ()); } else if (browser.equalsIgnoreCase ("localfirefox")) { setDriver (new FirefoxDriver ()); } else { throw new Error ("Browser configuration is not defined!!"); } } catch (final MalformedURLException e) { throw new Error ("Error setting up browsers in Grid"); } There are three conditions defined for the browsers in the setup() method; the first one is for the Chrome browser, which will start the Chrome browser session if the browser is specified as “chrome.” Similarly, if “firefox” is specified in the testng.xml, the second condition will launch the Firefox browser session. When a session starts, a new RemoteWebDriver instance is created using the URL “http://selenium-hub:4444” along with the corresponding ChromeOptions or FirefoxOptions, depending on the selected browser. This URL(“http://selenium-hub:4444”) is updated as we are running the test on Selenium Grid 4 with Docker Compose (Remember, we had set up the custom Docker network). The “se:name” sets a custom capability to label the test name in Selenium Grid 4 UI. This will help us while checking the logs and verifying which test is running in Selenium Grid 4. Test Data Generation The test data for registering a new user is generated in real-time using the DataFaker library. The Builder design pattern in Java is used in this process of test data generation. Additionally, the Lombok library is used to automatically generate boilerplate code such as getters, setters, and constructors, reducing manual coding and keeping the classes clean and concise. Java @Data @Builder public class RegistrationData { private String email; private String password; private String securityAnswer; private String securityQuestion; } The RegistrationData class contains the fields from the Registration window for which the test data needs to be supplied. Java public class RegistrationDataBuilder { public static RegistrationData getRegistrationData () { final Faker faker = new Faker (); final String password = "Pass@123"; final String securityQuestion = "Your favorite book?"; final String securityAnswer = "Agile Testing"; return RegistrationData.builder () .email (faker.internet () .emailAddress ()) .password (password) .securityQuestion (securityQuestion) .securityAnswer (securityAnswer) .build (); } } The RegistrationDataBuilder class generates and supplies random test data in real time while running the registration tests. Writing the Tests A new class, JuiceShopTests, is created to implement test scenarios for user registration and verify the user login. 
The first test scenario for user registration is implemented in the testRegisterUser() method: Java @Test public void testRegisterUser () { getDriver ().get ("http://host.docker.internal:3000/"); final HomePage homePage = new HomePage (getDriver ()); final LoginPage loginPage = homePage.openLoginPage (); assertEquals (loginPage.pageHeaderText (), "Login"); final RegistrationPage registrationPage = loginPage.openRegistrationPage (); assertEquals (registrationPage.pageHeaderText (), "User Registration"); registrationPage.registerNewUser (this.registrationData); assertEquals (registrationPage.registrationSuccessText (), "Registration completed successfully. You can now log in."); registrationPage.waitForSnackBarToDisappear (); } This method will navigate to the “http://host.docker.internal:3000”, which is the URL for the OWASP Juice Shop website running inside the Docker container. It will open the Login page and verify that the header text “Login” is displayed. Next, from the Login page, it will open the Registration page and verify its header text “User Registration.” The user registration steps will be performed next. After successful registration, the following message text will be verified: “Registration completed successfully. You can now log in.” The second test scenario for user login is implemented in the testUserLogin() method: Java @Test public void testUserLogin () { final LoginPage loginPage = new LoginPage (getDriver ()); assertEquals (loginPage.pageHeaderText (), "Login"); final HomePage homePage = loginPage.userLogin (this.registrationData.getEmail (), this.registrationData.getPassword ()); assertTrue (homePage.isLogoutButtonDisplayed ()); } This method will verify that the Login page header shows the text “Login.” After checking the header text, it will perform user login using the registered user's email and password. On successful login, it will verify that the “Logout” button is displayed on the screen. Testng.xml File The following testng.xml file has been created with the name “testng-seleniumgridjenkins.xml”: XML <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd"> <suite name="Selenium WebDriver Selenium Grid Jenkins test suite"> <test name="Selenium Grid with Jenkins tests on Chrome"> <parameter name="browser" value="chrome"/> <classes> <class name="io.github.mfaisalkhatri.seleniumgridjenkins.JuiceShopTests"> <methods> <include name="testRegisterUser"/> <include name="testUserLogin"/> </methods> </class> </classes> </test> <test name="Selenium Grid with Jenkins tests on Firefox"> <parameter name="browser" value="firefox"/> <classes> <class name="io.github.mfaisalkhatri.seleniumgridjenkins.JuiceShopTests"> <methods> <include name="testRegisterUser"/> <include name="testUserLogin"/> </methods> </class> </classes> </test> </suite> This testng.xml file will execute the same tests on two different browsers, i.e., Chrome and Firefox, on Selenium Grid 4. 
Setting Up the Jenkins Job to Run the Selenium Tests Let’s create a Jenkins Job for the Maven project to run the Selenium WebDriver tests on Selenium Grid 4 with the following steps: Step 1: Create a new Jenkins Job for the Maven Project Step 2: Update the configuration for the project as follows: Select “Git” in the Source Code Management Section and update the Repository URL with https://github.com/mfaisalkhatri/selenium-demo.git.Set the branch to “main.” Step 3: Update the following details in the Build section: Root POM: pom.xmlGoals and options: clean install -Dsuite-xml=testng-seleniumgridjenkins.xml The command in the “Goal and Options” field is updated with the full name of the testng suite.xml, as in my GitHub project, there are multiple testng xml files that run different tests. Check out Working with multiple testng.xml files for more details. Click on the Save and Apply button to save the configuration settings. Running the Jenkins Job There are two ways to run the Jenkins Job: By clicking on the “Build Now” button.Using webhooks to run the build as soon as the code is pushed to the remote repository. Let’s click on the “Build Now” button on the left-hand pane of the Jenkins Job screen to start the job. The details of the job, which is in progress, can be checked by clicking on the “Builds in progress” section. Let’s check the live test execution on the Selenium Grid 4 by opening a new tab and navigating to http://localhost:4444. It can be seen that the test execution has begun on the Chrome browser. Click on the “video” icon shown on the right-hand side of the Chrome browser. It will open the Sessions window, where we can see the live test execution in the Chrome browser. It will ask for the password to view the session; the password is “secret.” Similarly, the test execution for the Firefox browser can also be viewed. After all the tests are executed successfully, the details of the test execution can be viewed in the console logs of the Jenkins Job. A total of four tests were executed successfully. The details of the Jenkins Job can be checked on the Job dashboard. The dashboard displays everything in green, indicating that the job was run successfully and all the tests passed. Summary Web automation testing requires a browser setup to run the tests. Combining the powerful Selenium WebDriver, Selenium Grid 4, and Jenkins with Docker Compose, an on-demand local infrastructure can be set up with minimal effort. With Docker Compose, we started Jenkins, Selenium Grid 4, and the OWASP Juice Shop website. Docker network is a powerful feature of Docker that allows different containers to communicate easily. Using Docker network, we connected all the containers to communicate with each other, allowing us to seamlessly run Selenium WebDriver tests on Selenium Grid using a Jenkins Job. With Docker, local infrastructure can be set up at the fingertips, allowing us to create an automated CI/CD pipeline for getting faster feedback. Happy testing!
Introduction

Let's talk about an uncomfortable truth: most of us are shipping Docker images that are embarrassingly large. If you're deploying ML models, there's a good chance your containers are over 2GB. Mine were pushing 3GB until recently. The thing is, we know better. We've all read the best practices. But when you're trying to get a model into production, it's tempting to just FROM pytorch/pytorch and call it a day. This article walks through the practical reality of optimizing Docker images, including the trade-offs nobody mentions.

In this article, we embark on two pivotal expeditions into the world of Docker optimization. First, we'll explore the immediate and gratifying gains from choosing a leaner "slim" base image, using our slim_image project as our guide. Then, as complexities often arise when we trim the fat (especially in AI), we'll unveil the elegant power of multi-stage builds with our multistage_image project, a technique that truly separates the workshop from the showroom.

The "Slim" Advantage

A Lighter Foundation (Project: slim_image)

It seems almost too simple, doesn't it? Like choosing a lighter frame for a vehicle to improve its mileage. The Python maintainers offer "slim" variants of their official Docker images. These are akin to their full-bodied siblings but have shed many non-essential components like documentation, development headers, and miscellaneous tools that, while useful in a general-purpose environment, are often just passengers in a dedicated application container.

You'll see tags like python:3.10-slim, and also more specific ones like python:3.10-slim-bookworm or python:3.10-slim-bullseye. What's the difference? The -slim tag on its own typically points to the latest stable Debian release paired with that Python version. Using an explicit version like -slim-bookworm (Debian 12) or -slim-bullseye (Debian 11) pins the underlying operating system, giving us more predictable and reproducible builds, a cornerstone of good software practice. For our demonstration, we'll use python:3.10-slim, but we encourage you to adopt explicit distro tagging in your own projects.

Consider the Dockerfile from our slim_image project:

# Use the slim Python image instead of the full one.
FROM python:3.10-slim

WORKDIR /app

COPY slim_image/requirements.txt ./requirements.txt
COPY slim_image/app/ ./app/
COPY slim_image/sample_data/ ./sample_data/

RUN pip install --no-cache-dir -r requirements.txt

CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"]

Building this with the command docker build -t bert-classifier:slim -f slim_image/Dockerfile slim_image/ yields a striking difference:

bert-classifier-naive: 2.54GB (56s build)
bert-classifier-slim: 1.66GB (51s build)

If you also build the naive_image, you can run docker image ls to compare the two images and see the difference made by -slim:

docker image ls
REPOSITORY              TAG      IMAGE ID       CREATED             SIZE
bert-classifier-slim    latest   5111f608f68b   59 minutes ago      1.66GB
bert-classifier-naive   latest   e16441728970   About an hour ago   2.54GB

Just by altering that single FROM line, we've shed 880MB and shaved 5 seconds off the build time! It's a compelling first step, like the initial, satisfying clearing of rubble from an archaeological site. We encourage you to run dive on both of these images and see exactly where all of those 880MB went.

The Inevitable Catch

And yet, as with many seemingly straightforward paths in the world of software, a subtlety emerges. The very leanness of -slim images can become a challenge.
Many powerful AI libraries, or indeed custom components like our dummy_c_extension within the slim_image project, require compilation from C/C++ source. This compilation demands tools: a C compiler (gcc), Python development headers (python3-dev), build-essential, and sometimes more. Our svelte -slim image, by design, often lacks these. Attempting to install our dummy_c_extension directly in the simple slim_image/Dockerfile would falter. The slim_image/Dockerfile.fixed project demonstrates a common, albeit somewhat cumbersome, solution: # slim_image/Dockerfile.fixed RUN apt-get update && apt-get install -y --no-install-recommends \ # Essential build tools for C compilation build-essential \ gcc \ # Python development headers (needed for C extensions) python3-dev \ # Now install Python packages && pip install --no-cache-dir -r requirements.txt \ # Install our dummy C extension (this would fail without build tools) && pip install --no-cache-dir ./dummy_c_extension/ \ # Clean up build dependencies to keep image small && apt-get purge -y --auto-remove \ build-essential \ gcc \ python3-dev \ # Remove package lists and cache && rm -rf /var/lib/apt/lists/* \ && rm -rf /root/.cache/pip This intricate RUN command temporarily inflates the layer with build tools, performs the necessary compilations and installations, and then meticulously cleans up after itself, all within that single layer to avoid permanent bloat. The resulting bert-classifier-slim-fixed image comes in at 1.67GB, accommodating our C extension. It works, but it feels like carefully packing and then unpacking tools for every single task on an assembly line. But there must be a better way to organize our workshop. Multi-Stage Builds The Art of Separation (Project: multistage_image) Enter the multi-stage build – a concept so elegantly simple, yet so powerful, it often feels like discovering a hidden passage to a cleaner workspace. Multi-stage builds allow us to define distinct phases within a single Dockerfile, each starting with its own FROM instruction. We can name these stages (e.g., AS builder) and, crucially, copy artifacts from one stage to another (COPY --from=builder ...). Only the final stage contributes to the image you ultimately run. Let's examine the Dockerfile from our multistage_image project: # This version implements a multi-stage build to separate build-time dependencies # ====== BUILD STAGE ====== FROM python:3.10 AS builder WORKDIR /app COPY multistage_image/requirements.txt runtime_requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt COPY multistage_image/app/ ./app/ # ====== RUNTIME STAGE ====== # Use slim image for the final runtime FROM python:3.10-slim AS runtime WORKDIR /app # Copy only the runtime requirements COPY multistage_image/runtime_requirements.txt ./ RUN pip install --no-cache-dir -r runtime_requirements.txt COPY multistage_image/app/ ./app/ COPY multistage_image/sample_data/ ./sample_data/ CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"] Observe the two distinct acts: The builder Stage: Here, we use python:3.10 (the full python image, suitable for compilation). It installs all dependencies from multistage_image/requirements.txt. If this requirements.txt included packages needing C compilation (or our dummy_c_extension), this is where they would be built, leveraging the comprehensive environment of the builder. The runtime Stage: This stage begins anew with python:3.10-slim. 
It only installs the packages listed in multistage_image/runtime_requirements.txt – precisely those needed for the application to run, excluding any development tools or build-time-only Python packages that might have been in the builder's requirements.txt. The application code is then copied.

This separation is profound. The builder stage acts as our fully equipped workshop, handling all the messy compilation and preparation. The runtime stage is the clean, minimalist showroom, receiving only the polished final product. If our dummy_c_extension (or any other compiled library) was built in the builder, we would then COPY the necessary compiled files (like .so files or an installed package directory from the builder's site-packages) into the runtime stage. We encourage you to experiment: try adding the dummy_c_extension to the builder stage and copying its output to the runtime stage to see this in action.

Let's build this (docker build -t bert-classifier:multistage -f multistage_image/Dockerfile multistage_image/):

bert-classifier:multistage: 832MB (23s build)

Now let's list all the images:

docker image ls

You will get:

REPOSITORY                   TAG      IMAGE ID       CREATED             SIZE
bert-classifier-multistage   latest   7a6dd03b3310   48 minutes ago      832MB
bert-classifier-slim-fixed   latest   fe857de45a1a   55 minutes ago      1.67GB
bert-classifier-slim         latest   5111f608f68b   59 minutes ago      1.66GB
bert-classifier-naive        latest   e16441728970   About an hour ago   2.54GB

The results speak volumes: an 832MB image, our smallest yet, and the fastest build time at a mere 23 seconds.

A Glimpse Inside With dive

We strongly encourage you to run dive bert-classifier-multistage yourself. You'll be able to confirm that the final stage is indeed based on python:3.10-slim. Critically, you will find no traces of build-essential, gcc, or other build tools in its layers. The largest layer in this final image, as seen in our own dive exploration, is the RUN pip install --no-cache-dir -r runtime_requirements.txt command, contributing 679MB – this is the home of PyTorch, Transformers, and their kin. The surrounding OS layers are minimal. The image details from dive for bert-classifier-multistage (ID 7a6dd03b3310) report a total image size of 832MB with an efficiency score of 99% and only 5.5MB of potential wasted space, likely inherent to the base slim image layers or minor filesystem artifacts.

Weighing the Options

Choosing a slim base image is almost always a net positive, offering significant size reduction with minimal effort, provided you are prepared to handle potential compilation needs. Multi-stage builds, while adding a little length to your Dockerfile, bring clarity, robustness, and substantial size savings by ensuring your final image is unburdened by build-time apparatus. For AI applications with their often complex dependencies, this technique is less a luxury and more a necessity for professional-grade containers.
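As a follow-up to the multi-stage workflow above: if you prefer to drive builds through Docker Compose rather than raw docker build commands, Compose can target a specific stage of the same Dockerfile via build.target. The snippet below is a minimal sketch under stated assumptions (the service name and image tag are illustrative, and the context and dockerfile paths mirror the build command used above); it is not part of the original project.

YAML
# compose.yaml (sketch): build and tag only the final runtime stage
services:
  classifier:
    build:
      context: multistage_image
      dockerfile: Dockerfile
      target: runtime      # the stage declared "AS runtime" in the multi-stage Dockerfile
    image: bert-classifier:multistage

Running docker compose build classifier then produces the same lean runtime image, which is convenient when the classifier is one service among several in a larger Compose stack.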
Amazon EKS makes running containerized applications easier, but it doesn’t give you automatic visibility into JVM internals like memory usage or garbage collection. For Java applications, observability requires two levels of integration:

Cluster-level monitoring for pods, nodes, and deployments
JVM-level APM instrumentation for heap, GC, threads, latency, etc.

New Relic provides both: a Helm-installed Kubernetes integration for infrastructure metrics and a lightweight Java agent for full JVM observability.

In containerized environments like Kubernetes, surface-level metrics (CPU, memory) aren’t enough. For Java apps, especially those built on Spring Boot, the real performance story lies inside the JVM. Without insight into heap usage, GC behavior, and thread activity, you're flying blind. New Relic bridges this gap by combining infrastructure-level monitoring (via Prometheus and kube-state-metrics) with application-level insights from the JVM agent. This dual visibility helps teams reduce mean time to resolution (MTTR), avoid OOMKilled crashes, and tune performance with confidence.

This tutorial covers:

Installing New Relic on EKS via Helm
Instrumenting your Java microservice with New Relic’s Java agent
JVM tuning for container environments
Monitoring GC activity and memory usage
Creating dashboards and alerts in New Relic
Optional values.yaml file, YAML bundle, and GitHub repo

Figure 1: Architecture of JVM monitoring on Amazon EKS using New Relic. The Java microservice runs inside an EKS pod with the New Relic JVM agent attached. It sends GC, heap, and thread telemetry to New Relic APM. At the same time, Prometheus collects Kubernetes-level metrics, which are forwarded to New Relic for unified observability.

Prerequisites

Amazon EKS cluster with kubectl and helm configured
A Java-based app (e.g., Spring Boot) deployed in EKS
New Relic account (the free tier is enough)
Basic understanding of JVM flags and Kubernetes manifests

Install New Relic’s Kubernetes Integration (Helm)

This installs the infrastructure monitoring components for cluster-, pod-, and container-level metrics.

Step 1: Add the New Relic Helm repository

Shell
helm repo add newrelic https://helm-charts.newrelic.com
helm repo update

Step 2: Install the monitoring bundle

Shell
helm install newrelic-bundle newrelic/nri-bundle \
  --set global.licenseKey=<NEW_RELIC_LICENSE_KEY> \
  --set global.cluster=<EKS_CLUSTER_NAME> \
  --namespace newrelic --create-namespace \
  --set newrelic-infrastructure.enabled=true \
  --set kube-state-metrics.enabled=true \
  --set prometheus.enabled=true

Replace <NEW_RELIC_LICENSE_KEY> and <EKS_CLUSTER_NAME> with your actual values.

Instrument Your Java Microservice With the New Relic Agent

Installing the Helm chart sets up cluster-wide observability, but to monitor JVM internals like heap usage, thread activity, or GC pauses, you need to attach the New Relic Java agent.
This gives you:

JVM heap, GC, and thread metrics
Response times, error rates, and transaction traces
GC pauses and deadlocks

Dockerfile (add agent):

Dockerfile
ADD https://download.newrelic.com/newrelic/java-agent/newrelic-agent/current/newrelic-java.zip /opt/
RUN unzip /opt/newrelic-java.zip -d /opt/

JVM startup args:

Shell
-javaagent:/opt/newrelic/newrelic.jar

Required environment variables:

YAML
- name: NEW_RELIC_APP_NAME
  value: your-app-name
- name: NEW_RELIC_LICENSE_KEY
  valueFrom:
    secretKeyRef:
      name: newrelic-license
      key: license_key

Create the secret:

Shell
kubectl create secret generic newrelic-license \
  --from-literal=license_key=<YOUR_NEW_RELIC_LICENSE_KEY>

Capture Kubernetes Metrics

The New Relic Helm install includes:

newrelic-infrastructure → Node, pod, and container metrics
kube-state-metrics → Kubernetes objects
prometheus-agent → Custom metrics support

Verify locally:

Shell
kubectl top pods
kubectl top nodes

In the New Relic UI, go to: Infrastructure → Kubernetes

JVM Tuning for GC and Containers

To avoid OOMKilled errors and track GC behavior, tune your JVM for Kubernetes.

Recommended JVM flags:

Shell
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=75.0
-XshowSettings:vm
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/tmp/gc.log

Make sure /tmp is writable or mount it via emptyDir.

Pod resources:

YAML
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"

Align MaxRAMPercentage with limits.memory.

Why JVM Monitoring Matters in Kubernetes

Kubernetes enforces resource limits on memory and CPU, but by default, the JVM doesn’t respect those boundaries. Without proper tuning, the JVM might allocate more memory than allowed, triggering OOMKilled errors. Attaching the New Relic Java agent gives you visibility into GC pauses, heap usage trends, and thread health, all of which are critical in autoscaling microservice environments. With these insights, you can fine-tune JVM flags like `MaxRAMPercentage`, detect memory leaks early, and make data-driven scaling decisions.

Dashboards and Alerts in New Relic

Create an alert for GC pause time:

Go to Alerts & AI → Create alert
Select metric: JVM > GC > Longest GC pause
Set threshold: e.g., pause > 1000 ms

Suggested dashboards:

JVM heap usage
GC pause trends
Pod CPU and memory usage
Error rate and latency

Use New Relic’s dashboard builder or import JSON from your repo.

Forwarding GC Logs to Amazon S3

While New Relic APM provides GC summary metrics, storing full GC logs is helpful for deep memory analysis, tuning, or post-mortem debugging. Since container logs are ephemeral, the best practice is to forward these logs to durable storage like Amazon S3.

Why S3?

Persistent log storage beyond pod restarts
Useful for memory tuning, forensic reviews, or audits
Cost-effective compared to real-time log ingestion services

Option: Use Fluent Bit with the S3 Output Plugin

1. Enable GC logging with:

Shell
-Xloggc:/tmp/gc.log

2. Mount /tmp with emptyDir in your pod.
3. Deploy Fluent Bit as a sidecar or DaemonSet (a minimal sidecar sketch follows below).

Make sure your pod or node has an IAM role with s3:PutObject permission to the target bucket. This setup ensures your GC logs are continuously shipped to S3 for safe, long-term retention, even after the pod is restarted or deleted.
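As a concrete starting point for the sidecar option, here is a minimal, hypothetical Deployment sketch. The names (java-app, gc-log-shipper, fluent-bit-gc-config) and the image references are placeholders, and it assumes a ConfigMap that holds a standard Fluent Bit configuration with a tail input on /tmp/gc.log and an s3 output; adapt the bucket, region, and IAM wiring to your environment.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app                            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      serviceAccountName: gc-log-shipper    # assumed to map to an IAM role with s3:PutObject
      containers:
        - name: app
          image: your-registry/your-java-app:latest   # placeholder application image
          env:
            - name: JAVA_TOOL_OPTIONS       # picked up by the JVM at startup
              value: "-javaagent:/opt/newrelic/newrelic.jar -XX:MaxRAMPercentage=75.0 -Xloggc:/tmp/gc.log"
          volumeMounts:
            - name: gc-logs
              mountPath: /tmp               # GC log is written here
        - name: fluent-bit                  # sidecar that tails gc.log and ships it to S3
          image: fluent/fluent-bit:2.2
          volumeMounts:
            - name: gc-logs
              mountPath: /tmp
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/   # default config location in the official image
      volumes:
        - name: gc-logs
          emptyDir: {}                      # shared, ephemeral volume for /tmp/gc.log
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-gc-config      # assumed ConfigMap with the tail input and s3 output

A node-level DaemonSet is usually cheaper at scale, but the sidecar keeps the example self-contained: the GC log path, the shared emptyDir, and the shipper all live in one pod spec.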
Troubleshooting Tips

Problem → Fix
APM data not showing → Verify the license key, agent path, and app traffic
JVM metrics missing → Check the -javaagent setup and environment variables
GC logs not collected → Check the -Xloggc path, permissions, and volume mount
Kubernetes metrics missing → Ensure Prometheus is enabled in the Helm values

Check logs with:

Shell
kubectl logs <pod-name> --container <container-name>

Conclusion

New Relic allows you to unify infrastructure and application observability in Kubernetes environments. With JVM insights, GC visibility, and proactive alerts, DevOps and SRE teams can detect and resolve performance issues faster. After setting up JVM and Kubernetes monitoring, consider enabling distributed tracing to get visibility across service boundaries. You can also integrate New Relic alerts with Slack, PagerDuty, or Opsgenie to receive real-time incident notifications. Finally, use custom dashboards to compare performance across dev, staging, and production environments, helping your team catch regressions early and optimize for reliability at scale.