Cloud Architecture

Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!

Latest Premium Content
Trend Report: Cloud Native
Refcard #370: Data Orchestration on Cloud Essentials
Refcard #379: Getting Started With Serverless Application Architecture

DZone's Featured Cloud Architecture Resources

Docker Multi-Stage Builds: Optimizing Development and Production Workflows

By Mahitha Adapa
Hey there, fellow Docker enthusiasts! If you've been containerizing applications for a while, you've probably run into this all-too-familiar frustration: your Docker images are absolutely massive, they take forever to build and deploy, and you're left wondering if there's got to be a better way. Trust me, I've been there—staring at a 1.4GB image thinking "surely this can't be right?" After years of wrestling with bloated containers (and some very unhappy DevOps teammates), I finally embraced multi-stage builds—and honestly, it's been a complete game-changer. In this article, I'll share what I've learned about this powerful but often overlooked Docker feature that could revolutionize your containerization workflow.

The Problem: Those Darn Bloated Docker Images

Let's face it—we've all written a Dockerfile like this at some point:

Dockerfile
FROM node:18
WORKDIR /app
# Install build tools and dependencies
COPY package*.json ./
RUN apt-get update && apt-get install -y build-essential python3
RUN npm install
# Copy source code
COPY . .
# Build the application
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]

Sure, it works. Your app runs. But this approach is like packing your entire workshop into your car just to change a lightbulb at a friend's house. You're lugging around:

- All those dev dependencies: Your production container is stuffed with packages you only needed during build time.
- Build tools galore: Why is Python in your Node.js production image? (We've all been there!)
- Source code that's already been compiled: Just dead weight at this point.
- A mountain of unnecessary baggage: I've actually seen Node.js apps balloon to well over 1GB this way! One project I worked on hit 1.14GB for what was essentially a simple Express server.

The worst part? This isn't just about wasted disk space (though your cloud storage bill will definitely feel it). These bloated images slow down your CI/CD pipelines, make deployments feel like watching paint dry, and create a larger attack surface for potential security vulnerabilities. Plus, try explaining to your team why a "quick deployment" takes 10 minutes to download.

The Solution: Multi-Stage Builds to the Rescue

This is where multi-stage builds come in—and they're exactly what they sound like: building your Docker image in stages, where each stage can cherry-pick only what it needs from previous stages. Here's a simplified example of the same Node.js application, but with a multi-stage approach:

Dockerfile
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Stage 2: Production environment
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist /app/dist
# Only production dependencies
COPY package*.json ./
RUN npm install --only=production
EXPOSE 3000
CMD ["node", "dist/server.js"]

The difference? This approach creates an image that's often 88% smaller (around 130MB), significantly more secure, and deploys in a fraction of the time. It's like Marie Kondo for your Docker images—keeping only what "sparks joy" in production. You can find this example and others in this Docker Multi-Stage Examples repository, which includes complete, runnable applications for multiple languages.

How Multi-Stage Builds Work: The Magic Behind the Scenes

The beauty of multi-stage builds lies in their simplicity.
Here's what's happening:

- Multiple FROM statements: Each one starts a fresh stage in your build
- AS keyword: Gives your stage a name so you can reference it later
- COPY --from: The secret sauce—this copies only specific files from a previous stage
- Final FROM: Only the last stage actually produces your output image

Everything else—your build tools, development dependencies, source code—gets discarded after it's served its purpose. It's like having a professional kitchen to prepare your meal, but only taking the finished dish home. I've created this diagram to show exactly how artifacts move between stages. Notice how the final image contains only what's needed to run the application, not what's needed to build it.

Real-World Examples: Multi-Stage Builds Across Languages

One thing I love about multi-stage builds is how well they work across different programming languages. Let me highlight a few examples from my repository:

Python With Poetry

For Python applications using Poetry, multi-stage builds can reduce image size by about 81%. The key is separating the Poetry installation and dependency management from the runtime environment. I've watched teams cut their Python images from over 1.3GB down to just 250MB—that's the kind of improvement that makes deployment teams do a happy dance. Check out the complete Python example in the repository to see how Poetry and multi-stage builds work together.

Java With Maven

Java applications often include large build tools like Maven or Gradle that aren't needed at runtime. With multi-stage builds, you can compile your application in one stage and only copy the resulting JAR file to a slim JRE image. We're talking about going from a hefty 1.4GB image down to a much more reasonable 297MB. The Java example in the repository demonstrates this approach with a Spring Boot application.
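As a rough sketch of that Maven-to-JRE pattern (the base image tags, artifact name, and port below are assumptions for illustration, not the exact Dockerfile from the repository):

Dockerfile
# Build stage: full JDK plus Maven, used only to compile and package
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn -q package -DskipTests

# Production stage: slim JRE image containing only the packaged JAR
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=builder /app/target/app.jar ./app.jar
EXPOSE 8080
CMD ["java", "-jar", "app.jar"]

The shape is identical to the Node.js example: the heavyweight toolchain lives only in the builder stage, and the runtime stage copies across a single artifact.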
Go Applications

Go's compilation model makes it perfect for multi-stage builds. Since Go compiles to a single binary, you can use the empty "scratch" image as your final stage, resulting in incredibly small images. This is where things get really exciting!

Dockerfile
# Build stage
FROM golang:1.23 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .

# Production stage - using scratch (empty) image
FROM scratch
COPY --from=builder /app/app .
EXPOSE 8080
CMD ["./app"]

This Go example is my personal favorite—we're talking about shrinking a 1.13GB build image down to just 9.08MB. That's not a typo! That's a mind-blowing 99.2% reduction. Check out the Go example in the repository for the complete implementation. As you can see from this chart, the size reductions are dramatic across all languages. The Go example achieves a truly mind-blowing 99.2% reduction—that's the kind of optimization that makes you question everything you thought you knew about container images!

Advanced Techniques: Taking Multi-Stage Builds Further

Once you've mastered the basics, there are some really cool advanced techniques you can use:

Using Build Arguments Across Stages

Need to keep versions consistent? Pass build arguments between stages:

Dockerfile
ARG VERSION=latest
FROM node:${VERSION} AS builder
# Build stage instructions...
FROM node:${VERSION}-alpine
# Production stage instructions...

Creating Debug Variants

Sometimes you need a debug version with extra tools. With multi-stage builds, you can create multiple targets from the same Dockerfile:

Dockerfile
# Production stage
FROM node:18-alpine AS production
# Production instructions...

# Debug stage - extends production
FROM production AS debug
RUN apk add --no-cache curl htop strace
# Add debugging tools and configurations

You can find this example in the advanced directory of this repository. Build the variant you need:

PowerShell
docker build --target production -t myapp:prod .
docker build --target debug -t myapp:debug .

I've used this technique countless times during late-night debugging sessions. It's honestly so much cleaner than maintaining separate Dockerfiles for every possible scenario.

Real-World Results: What You Can Actually Expect

Let me share the real numbers from this demonstration project. I tested multi-stage builds across four different languages, and honestly, the results still surprise me every time I look at them:

Language | Traditional Size | Multi-Stage Size | Size Reduction
Go       | 1.13GB           | 9.08MB           | 99.2%
Node.js  | 1.14GB           | 130MB            | 88.6%
Python   | 1.34GB           | 250MB            | 81.3%
Java     | 1.4GB            | 297MB            | 78.8%

These aren't theoretical numbers—these are actual, reproducible results you can get by running the demo script in this repository. The Go result in particular makes people think I've made a mistake when I show it to them. But nope, we really did go from over a gigabyte down to 9 megabytes.

What this means in practice:

- Faster deployments: Your containers download and start in seconds, not minutes
- Happy DevOps teams: Less bandwidth usage, faster CI/CD pipelines
- Lower costs: Smaller images mean reduced storage and transfer costs
- Better security: Fewer packages in production = smaller attack surface

The smaller images also reduced infrastructure costs significantly. When you're deploying dozens or hundreds of containers, these size differences add up fast—both in terms of money and sanity.

Best Practices I've Learned the Hard Way

After implementing multi-stage builds across dozens of projects, here are my top tips:

- Name your stages for clarity and maintainability—your future self will thank you
- Use specific base image tags rather than 'latest' to ensure consistency
- Minimize the number of layers by combining related commands
- Order instructions by change frequency to optimize caching
- Use .dockerignore to prevent unnecessary files from being copied (see the sketch just after this list)
- Consider distroless or scratch images for the final stage when possible
- Pin dependency versions for reproducible builds
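On the .dockerignore tip above, a minimal sketch for a Node.js project might look like this (the entries are illustrative; keep whatever your build genuinely needs):

.dockerignore
node_modules
dist
.git
.env
*.md
Dockerfile

Excluding node_modules and .git alone often speeds up COPY . . dramatically and keeps files like .env out of the image.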
When to Use Multi-Stage Builds (And When Not To)

Multi-stage builds are your best friend when you have:

- Applications with complex build processes
- Projects with large development dependencies
- Environments where deployment speed matters
- Security-sensitive applications
- Microservice architectures with many containers

But honestly? They might be overkill for:

- Simple applications with minimal dependencies
- Development-only containers
- Scenarios where image size genuinely isn't a concern

Here's my take: I generally recommend starting with multi-stage builds by default and simplifying only if needed. The benefits almost always outweigh the small amount of additional Dockerfile complexity. Plus, your future self will thank you when you inevitably need to optimize later.

If you're still using single-stage Dockerfiles, I can't recommend multi-stage builds enough. The benefits are immediate and substantial: smaller images, faster deployments, improved security, and cleaner Dockerfiles. Best of all, implementing multi-stage builds requires minimal changes to your existing workflow. It's one of those rare optimizations that gives you massive benefits for relatively little effort.

Try out this Docker Multi-Stage Examples repository to get started with working examples across multiple languages. I've included everything you need to experiment with these techniques in your own projects—just run ./scripts/demo.sh to see the magic happen! Your operations team, security team, and cloud billing department will thank you. And honestly, you'll thank yourself the next time you're not sitting around waiting for a deployment to complete.
Tail Sampling: The Future of Intelligent Observability in Distributed Systems

By Rishab Jolly
Observability has become a critical component for maintaining system health and performance. While traditional sampling methods have served their purpose, the emergence of tail sampling represents a paradigm shift in how we approach trace collection and analysis. This intelligent sampling strategy is revolutionizing the way organizations handle telemetry data, offering unprecedented precision in capturing the most valuable traces while optimizing storage costs and system performance.

Understanding the Sampling Landscape

Before diving into tail sampling, it's essential to understand the broader context of sampling strategies. Traditional head-based sampling makes decisions at the beginning of a trace's lifecycle, determining whether to collect or discard telemetry data based on predetermined criteria such as sampling rates or simple rules. While effective for reducing data volume, this approach often results in the loss of critical information about error conditions, performance anomalies, or rare but important system behaviors.

Tail sampling addresses these limitations by deferring sampling decisions until after a trace is complete or nearly complete. This approach enables sampling systems to make informed decisions based on the full context of a request's journey through distributed services, considering factors such as error rates, latency patterns, and business-critical indicators.

The Mechanics of Tail Sampling

Tail sampling operates on the principle of delayed decision-making. Instead of immediately deciding whether to keep or discard a trace, the system temporarily buffers trace data while collecting additional context. Once sufficient information is available, sophisticated algorithms evaluate the trace against multiple criteria to determine its value for observability purposes.

The process typically involves several key components: trace collection and buffering, decision evaluation engines, and storage optimization mechanisms. Modern implementations leverage machine learning algorithms and statistical models to continuously improve sampling accuracy based on historical patterns and system behavior.

Implementing Tail Sampling With OpenTelemetry

OpenTelemetry provides robust support for tail sampling through its collector architecture. Here's a practical implementation example:

YAML
# OpenTelemetry Collector Configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 10
    policies:
      - name: error_policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: latency_policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic_policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      - name: rate_limiting_policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 100

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [jaeger]

This configuration demonstrates a comprehensive tail sampling setup that evaluates traces based on multiple criteria including error status, latency thresholds, and rate limiting policies.
For application-level instrumentation, developers can leverage OpenTelemetry SDKs to provide rich context for sampling decisions:

Python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

# Configure tracer with resource information
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "1.2.0",
    "environment": "production"
})

trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Export spans to collector for tail sampling
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Application code with rich context
def process_payment(payment_request):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", payment_request.amount)
        span.set_attribute("payment.currency", payment_request.currency)
        span.set_attribute("customer.tier", payment_request.customer_tier)

        try:
            result = execute_payment(payment_request)
            span.set_attribute("payment.status", "success")
            return result
        except PaymentException as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise

Advanced Tail Sampling Strategies

Modern tail sampling implementations support sophisticated decision-making policies that go beyond simple threshold-based rules. Machine learning-enhanced sampling can adapt to changing system patterns, identifying anomalies and adjusting sampling rates dynamically based on historical data and real-time system behavior.

Business-context aware sampling represents another advancement, where sampling decisions incorporate domain-specific knowledge such as customer importance, transaction value, or regulatory requirements. This approach ensures that business-critical traces are preserved regardless of their technical characteristics.

Composite sampling policies enable organizations to create complex decision trees that evaluate multiple criteria simultaneously. For example, a policy might preserve all traces containing errors while applying probabilistic sampling to successful requests, with higher sampling rates for high-value customers or critical business processes.
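As a sketch of what such a composite policy could look like in the collector's tail_sampling processor (the policy names, attribute key, and percentages are illustrative, and the exact syntax may vary between collector versions):

YAML
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      # Always keep traces that contain errors
      - name: keep_errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep a higher share of traffic for premium customers:
      # both sub-policies must match for this policy to sample a trace
      - name: premium_customers
        type: and
        and:
          and_sub_policy:
            - name: premium_tier
              type: string_attribute
              string_attribute:
                key: customer.tier
                values: [premium]
            - name: premium_rate
              type: probabilistic
              probabilistic:
                sampling_percentage: 50
      # Sample the remaining, successful traffic at a low baseline rate
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5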
Benefits and Impact

The advantages of tail sampling extend far beyond simple cost optimization. Organizations implementing tail sampling report significant improvements in mean time to detection (MTTD) and mean time to resolution (MTTR) for system issues. By preserving traces that matter most for debugging and analysis, teams can quickly identify root causes and understand system behavior patterns.

Storage cost optimization represents another major benefit, with many organizations achieving 60-80% reductions in telemetry storage costs while maintaining or improving observability coverage. This efficiency enables teams to retain data for longer periods, supporting better trend analysis and capacity planning initiatives.

The improved signal-to-noise ratio in observability data leads to more effective alerting and monitoring. When sampling systems intelligently preserve relevant traces while filtering out routine operations, alert fatigue decreases and incident response becomes more focused and efficient.

Challenges and Considerations

Despite its advantages, tail sampling introduces certain complexities that organizations must address. The buffering requirements for delayed decision-making can increase memory usage and system complexity. Proper configuration of buffer sizes and decision timeouts becomes critical for system stability.

Latency in sampling decisions may impact real-time monitoring scenarios where immediate trace availability is required. Organizations must balance the benefits of intelligent sampling against the need for immediate observability data access.

The stateful nature of tail sampling processors requires careful consideration of high availability and failover scenarios. Unlike stateless head sampling, tail sampling systems must maintain trace state across decision periods, necessitating robust backup and recovery mechanisms.

Future Directions

The evolution of tail sampling continues with emerging trends in artificial intelligence and machine learning integration. Predictive sampling models are being developed that can anticipate which traces will be valuable based on early indicators, reducing buffer requirements while maintaining sampling accuracy.

Integration with AIOps platforms represents another frontier, where tail sampling decisions incorporate broader system context including infrastructure metrics, application performance indicators, and business metrics. This holistic approach promises even more intelligent sampling decisions that align with organizational priorities.

Edge computing scenarios are driving development of distributed tail sampling architectures that can make intelligent decisions closer to data sources while coordinating with centralized observability platforms. These developments will enable more efficient telemetry processing in geographically distributed systems.

Conclusion

Tail sampling represents a fundamental advancement in observability technology, offering organizations the ability to maintain comprehensive system visibility while optimizing costs and reducing noise. As distributed systems continue to grow in complexity and scale, intelligent sampling strategies become increasingly critical for effective system management.

The combination of OpenTelemetry's robust implementation support and the continuous evolution of sampling algorithms positions tail sampling as a cornerstone technology for modern observability platforms. Organizations investing in tail sampling capabilities today are building the foundation for more intelligent, efficient, and effective observability practices that will serve them well as their systems continue to evolve.

The future of observability lies not in collecting more data, but in collecting the right data. Tail sampling provides the intelligence needed to make these critical distinctions, ensuring that observability systems remain valuable and actionable as they scale with organizational growth and system complexity.
Understanding Memory Page Sizes on Arm64
By Dave Neary
Building Scalable, Resilient Workflows With State Machines on GCP
By Ravi Teja Thutari
AWS SNS (Amazon Simple Notification Service) and Spring Boot With Email as Subscriber
By Milan Karajovic
WebAssembly: From Browser Plugin to the Next Universal Runtime

For decades, the digital world has converged on a single, universal computing platform: the web browser. This remarkable piece of software, present on nearly every device, promised a "write once, run anywhere" paradigm, but with a crucial limitation, it only spoke one language natively: JavaScript. While incredibly versatile, JavaScript's nature as a dynamically typed, interpreted language created a performance ceiling. For computationally intensive tasks, developers often hit a wall, unable to achieve the raw speed of native applications. This limitation also meant that the vast, mature ecosystems of code written in languages like C++, C, and Rust were largely inaccessible on the web without cumbersome and often inefficient cross-compilation to JavaScript. Into this landscape emerged WebAssembly (Wasm). Often referred to as a fourth standard language for the web alongside HTML, CSS, and JavaScript, Wasm was not designed to replace JavaScript but to be its powerful companion. It is a binary instruction format, a low-level, assembly-like language that serves as a portable compilation target. This simple yet profound idea meant that developers could take existing code written in high-performance languages, compile it into a compact Wasm binary, and run it directly within the browser at near-native speeds. This breakthrough unlocked a new class of applications that were previously impractical for the web, from sophisticated in-browser tools to full-fledged 3D gaming engines. The design of WebAssembly was forged in the demanding and often hostile environment of the public internet, leading to a set of foundational principles that would define its destiny. It had to be fast, with a compact binary format that could be decoded and executed far more efficiently than parsing text-based JavaScript. It had to be secure, running inside a tightly controlled, memory-safe sandbox that isolated it from the host system and other browser tabs. And it had to be portable, a universal format independent of any specific operating system or hardware architecture. These very principles, essential for its success in the browser, were also the seeds of a much grander vision. This article charts the remarkable journey of WebAssembly, following its evolution from a browser-based performance booster into a foundational technology that is reshaping our approach to cloud, edge, and distributed computing, promising a future built on a truly universal runtime. Beyond the Browser With the WebAssembly System Interface (WASI) WebAssembly's potential was too significant to remain confined within the browser. Developers and architects quickly recognized that a portable, fast, and secure runtime could be immensely valuable for server-side applications. However, a critical piece of the puzzle was missing. Wasm modules running in the browser can interact with its environment through a rich set of Web APIs, allowing it to fetch data, manipulate the screen, or play audio. Server-side applications have a completely different set of needs: they must read and write files, access environment variables, open network sockets, and interact with the system clock. Without a standardized way to perform these basic operations, server-side Wasm would be a collection of incompatible, proprietary solutions, shattering its promise of portability. The solution is the WebAssembly System Interface (WASI), an evolving set of APIs. 
It's crucial to understand that WASI is not a single, monolithic standard but is currently in a significant transition, from the stable but limited WASI Preview 1 (which lacks standardized networking) to the fundamentally redesigned WASI Preview 2. This newer version is built upon the still-in-proposal Component Model and introduces modular APIs for features like HTTP and sockets. Looking ahead, the next iteration, WASI Preview 3, is anticipated for release in August 2025, promising further advancements such as native async and streaming support.

This layer of abstraction is the key to preserving Wasm's "write once, run anywhere" superpower. The WASI standard allows developers to write code in their preferred programming language (including Rust, C/C++, C#, Go, JavaScript, TypeScript, and Python), compile it into a single Wasm binary, and run it on any operating system or CPU architecture using a compliant runtime. In the browser, the JavaScript engine acts as the host runtime; outside the browser, this role is filled by standalone runtimes such as Wasmtime, Wasmer, or WasmEdge, which implement the WASI standard to provide secure access to system resources.

More than just enabling server-side execution, WASI introduced a fundamentally different and more secure way for programs to interact with the system. Traditional applications, following a model established by POSIX, typically inherit the permissions of the user who runs them. If a user can access a file, any program they run can also access that file, which creates a broad and implicit grant of authority.

WASI, in contrast, implements a capability-based security model. By default, a Wasm module running via WASI can do nothing. It has no access to the filesystem, no ability to make network connections, and no visibility into system clocks or environment variables. To perform any of these actions, the host runtime must explicitly grant the module a 'capability'. For example, to allow a module to read files, the host must grant it a capability for a specific directory. The module receives a handle to that directory and can operate only within its confines. Any attempt to access a path outside of it will fail at the runtime level with a 'permission denied' error, even if the user running the process has permissions for that file. This enforces the Principle of Least Privilege at a granular level, a stark contrast to the traditional POSIX model where a process inherits all the ambient permissions of the user. This "deny-by-default" posture represents a paradigm shift in application security.

The decision to build WASI around a capability-based model was not merely a technical convenience; it was a deliberate architectural choice that transformed Wasm from a simple performance tool into a foundational building block for trustworthy computing. The browser sandbox provided an implicit security boundary designed to protect users from malicious websites. Simply mirroring traditional OS permissions on the server would have compromised this security-first ethos. Instead, by externalizing permission management from the application to the host runtime, WASI makes security an explicit, auditable contract. This has profound implications, making Wasm uniquely suited for scenarios where the code being executed cannot be fully trusted. This includes multi-tenant serverless platforms running customer-submitted functions, extensible applications with third-party plugin systems, and edge devices executing logic from various sources.
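For a concrete sense of how such a capability grant looks in practice, here is a minimal sketch using the standalone Wasmtime runtime (flag names can differ between runtimes and versions):

Shell
# Run with no capabilities: any filesystem or network access inside the module fails
wasmtime run app.wasm

# Grant access to a single host directory (and nothing else)
wasmtime run --dir=./data app.wasm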
WASI did not just allow Wasm to run on the server; it defined how it would run: securely, with granular permissions, and by default, with no authority at all. A Different Kind of Isolation: Wasm vs. Containers For many developers today, the container has become the default unit of application deployment, a standardized box for packaging and running software. The rise of WebAssembly has introduced a new model, prompting a comparison that is less about which technology is superior and more about understanding two fundamentally different philosophies for achieving portability and isolation. The container philosophy centers on porting the entire environment. A container image, such as one built with Docker, packages an application along with a complete slice of its user-space operating system: a filesystem, system libraries, configuration files, and all other dependencies. It achieves isolation from the host and other containers by leveraging OS-level virtualization features, primarily Linux namespaces and control groups (cgroups), which create the illusion of a private machine. The container's promise is that this self-contained environment will run consistently everywhere a container engine is installed. The WebAssembly philosophy, in contrast, is about porting only the application logic. A Wasm module is a single, self-contained binary file containing just the compiled application code. It brings no operating system, no filesystem, and no system bundled libraries. Instead, it relies on the host runtime to provide a standardized environment and to mediate access to system resources through the WASI interface. Wasm's promise is that the application logic, compiled once, will run consistently everywhere a compliant Wasm runtime is present. This philosophical divergence leads to significant practical trade-offs in size, speed, and security. Because a container must package a slice of an operating system, its image size is measured in (hundreds of) megabytes, even for simple applications. A Wasm module, containing only the application code, is orders of magnitude smaller, typically measured in kilobytes or a few megabytes. This dramatic difference impacts everything from storage costs and network transfer times to the density of workloads that can run on a single machine. The most critical distinction, particularly for modern cloud architectures, is startup speed. A container must initialize its packaged environment: a process that involves setting up namespaces, mounting the filesystem, and booting the application. This "cold start" can take hundreds of milliseconds, or even several seconds. A Wasm module, on the other hand, is instantiated by an already-running runtime, a process that can take less than a millisecond (for compiled languages like Rust, C or Go). This near-instantaneous startup effectively eliminates the cold start problem, making Wasm an ideal technology for event-driven, scale-to-zero architectures like serverless functions, where responsiveness is paramount. The security models also differ profoundly. Containers provide isolation at the OS kernel level. This means all containers on a host share the same kernel, which represents a large and complex attack surface. Security vulnerabilities often center on kernel exploits or misconfigurations that allow a process to "escape" its container and gain access to the host system. WebAssembly introduces an additional, finer-grained layer of isolation: the application-level sandbox. 
The attack surface is not the entire OS kernel, but the much smaller and more rigorously defined boundary of the Wasm runtime and the WASI interface. Combined with its capability-based security model, this makes Wasm "secure by default" and a far safer choice for running untrusted or third-party code.

Feature              | WebAssembly (Wasm)                                                   | Containers
Unit of Portability  | Application logic (a .wasm binary)                                   | Application environment (an OCI image with an OS filesystem)
Isolation Model      | Application-level sandbox (deny-by-default)                          | OS-level virtualization (namespaces, cgroups)
Security Boundary    | Wasm runtime & WASI interface (small, well-defined)                  | Host OS kernel (large, complex attack surface)
Startup Time         | Sub-millisecond ("zero cold start")                                  | Hundreds of milliseconds to seconds ("cold start" problem)
Size / Footprint     | Kilobytes to megabytes                                               | Megabytes to gigabytes
Platform Dependency  | Runtime-dependent (any OS/arch with a Wasm runtime)                  | OS- and architecture-dependent (e.g., linux/amd64)
Ideal Use Case       | Serverless functions, microservices, edge computing, plugin systems  | Lift-and-shift legacy apps, complex stateful services, databases

Ultimately, these two technologies are not adversaries but complements. It is common to run Wasm workloads inside containers as a first step toward integrating them into existing infrastructure. Each technology is optimized for different scenarios. Containers excel at lifting and shifting existing, complex applications that depend on a full POSIX-compliant environment, such as databases or legacy monolithic services. WebAssembly shines in the world of greenfield, cloud-native development, offering a lighter, faster, and more secure foundation for building the next generation of microservices and serverless functions.

New Foundations for Platform Engineering: The Cloud and the Edge

For WebAssembly to fulfill its potential as a server-side technology, it must integrate seamlessly into the dominant paradigm for cloud infrastructure management: Kubernetes. This integration is not just possible; it is already well underway, enabled by the extensible architecture of the cloud-native ecosystem. At its core, Kubernetes orchestrates workloads by communicating with a high-level container runtime, such as containerd, on each of its worker nodes. This high-level runtime is responsible for managing images and container lifecycles, but it delegates the actual task of running a process to a low-level runtime. For traditional Linux containers, this runtime is typically runc.

The key to running Wasm on Kubernetes lies in replacing this final link in the chain. Projects like runwasi provide a "shim", a small piece of software that acts as a bridge, allowing containerd to communicate with a WebAssembly runtime (like Wasmtime or WasmEdge) just as it would with runc. This makes the Wasm runtime appear to Kubernetes as just another way to run workloads. The final piece of the integration is a Kubernetes object called a RuntimeClass, which acts as a label. By applying this label to a workload definition, developers can instruct the Kubernetes scheduler to deploy that specific workload to nodes configured with the Wasm shim, enabling Wasm modules and traditional containers to run side-by-side within the same cluster. Projects like SpinKube are emerging to automate this entire setup process, making it easier for organizations to adopt Wasm without rebuilding their infrastructure from scratch.
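To make the RuntimeClass wiring concrete, a minimal sketch might look like the following, assuming a containerd Wasm shim has already been registered on the node under the handler name used here (the handler and image names are illustrative):

YAML
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmtime
handler: wasmtime              # must match the shim configured in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: hello-wasm
spec:
  runtimeClassName: wasmtime   # schedule this pod onto the Wasm shim instead of runc
  containers:
    - name: hello
      image: registry.example.com/hello-wasm:latest   # OCI artifact wrapping the .wasm module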
This deep integration enables new and more efficient approaches to platform engineering: the discipline of building and managing the internal platforms that development teams use to ship software. In this pattern, the platform team provides standardized components that encapsulate common, cross-cutting concerns like logging, metrics, network access, and security policies. Application developers, in turn, focus solely on writing a "user" component that contains pure business logic. At deployment time, these two pieces are composed into a single, tiny, and secure Wasm binary. This creates a powerful separation of concerns. Developers are freed from boilerplate code and infrastructure details, while the platform team can enforce standards, patch vulnerabilities, and evolve the platform's capabilities centrally and transparently, without requiring application teams to rebuild or redeploy their code. While these patterns are transforming the cloud, it is at the network's edge where WebAssembly's advantages become not just beneficial, but essential. Edge computing involves moving computation away from centralized data centers and closer to where data is generated and consumed: on IoT devices, in factory machinery, at retail locations, or within telecommunication networks. These environments are often severely resource-constrained, with limited CPU, memory, and power, making heavyweight containers impractical or impossible to run. WebAssembly is a near-perfect fit for this world. Its incredibly small binary size and minimal resource footprint allow it to run on devices where containers cannot. Its near-instantaneous startup times are critical for the event-driven, real-time processing required in many edge scenarios. And its true platform independence, the ability for a single compiled binary to run on any CPU architecture, be it x86, ARM, or RISC-V, is a necessity in the heterogeneous hardware landscape of the edge. This has unlocked a new wave of applications, from running machine learning inference models to executing dynamic logic within Content Delivery Networks (CDNs) with ultra-low latency. The ability of WebAssembly to operate seamlessly across these diverse environments reveals its most profound impact. Historically, software development has been siloed; building for the browser, the cloud, and embedded devices required different tools, different languages, and different deployment models. Containers helped unify deployment in the cloud, but they are foreign to the browser and too cumbersome for much of the edge. WebAssembly is the first technology to provide a single, consistent application runtime that spans this entire compute continuum. The true strength of WebAssembly lies in how its ecosystem bridges the historically separate worlds of the browser, cloud, and edge. While the final .wasm module is often tailored for its specific environment, Wasm as a standard provides a common compilation target. This allows developers to deploy applications across a vast spectrum: from a rich user interface in a web browser, to large-scale processing orchestrated by Kubernetes, and even to tiny, resource-constrained IoT devices. This reality enables a future where developers write their core business logic once and can deploy it to the most appropriate location: close to the user for low latency, in the cloud for heavy computation, or in the browser for interactivity without needing to rewrite or repackage it. 
This capability breaks down the architectural barriers that have long defined distributed systems, paving the way for a truly fluid and unified model of computation. The Future is Composable: The WebAssembly Component Model Despite its portability and security, a final, fundamental challenge has historically limited WebAssembly's potential: true interoperability. While a single Wasm module is a self-contained unit, getting multiple modules to communicate with each other effectively has been remarkably difficult. The core Wasm specification only allows for the passing of simple numeric types, integers and floats, between modules. Exchanging more complex data structures like strings, lists, or objects requires developers to manually manage pointers and memory layouts, a process that is deeply tied to the conventions of the source language and compiler. This "impedance mismatch" means that a Wasm module compiled from Rust cannot easily call a function in a module compiled from Go, as they represent data in fundamentally incompatible ways. This has been the primary barrier to creating a vibrant, language-agnostic ecosystem of reusable Wasm libraries, forcing developers into fragile, language-specific linking models where modules must share a single linear memory space. The WebAssembly Component Model is the ambitious proposal designed to solve this final challenge. It is critical, however, to understand its current status: the Component Model is an active proposal under development, not a finalized W3C standard. While tooling and runtimes are rapidly implementing it, the specification is still subject to change. It is an evolution of the core standard that elevates Wasm from a format for individual, isolated modules into a system for building complex applications from smaller, interoperable, and language-agnostic parts. The most effective analogy for the Component Model is that it turns Wasm modules into standardized "LEGO bricks". Each component is a self-contained, reusable piece of software with well-defined connection points, allowing them to be snapped together to build something larger. Two key concepts make this possible: WIT and “worlds”. The WebAssembly Interface Type (WIT) is an Interface Definition Language (IDL) used to describe the "shape" of the connectors on these metaphorical LEGO bricks. A WIT file defines the high-level functions and rich data types such as strings, lists, variants, and records that a component either exports (provides to others) or imports (requires from its environment). Crucially, the standard WASI interfaces themselves (e.g. for filesystems or sockets) are also defined using WIT. This means developers can use the exact same language to extend the default system capabilities with their own domain-specific interfaces, creating a unified and powerful way to describe any interaction. A "world" is a WIT definition that describes the complete set of interfaces a component interacts with, effectively declaring all of its capabilities and dependencies. Tooling built around the Component Model, such as wit-bindgen, then automatically generates the necessary "binding code" for each language. This code handles the complex task of translating data between a language's native representation (e.g., a Rust String or a Python list) and a standardized, language-agnostic memory layout known as the Canonical ABI. 
The result is seamless interoperability: a component written in C++ can call a function exported by a component written in TinyGo, passing complex data back and forth as if they were native libraries in the same language, without either needing any knowledge of the other's internal implementation. This enables a fundamentally different approach to software composition compared to the container world. Container-based architectures are typically composed at design time. Developers build discrete services, package them into containers, and then define how they interact, usually over a network via APIs, using orchestration configurations like Kubernetes manifests or Docker Compose files. This is a model for composing distributed systems. The WebAssembly Component Model enables granular composition at runtime. Components communicate through fast, standardized in-memory interfaces rather than network protocols, allowing them to be linked together within the same process. This creates a model for building applications from secure, sandboxed, and interchangeable parts. A prime example is wasmCloud. In this platform, components (called actors) declare dependencies on abstract interfaces, like a key-value store. At runtime, they are dynamically linked to providers that offer concrete implementations (e.g. a Redis provider). The key advantage is that these links can be changed on the fly. You can swap the Redis provider for a different one without restarting or recompiling the application, perfectly realizing the goal of building flexible systems from truly interchangeable parts. This shift from source-level libraries to compiled, sandboxed components as the fundamental unit of software reuse represents a paradigm shift. It is the technical realization of architectural concepts like Packaged Business Capabilities (PBCs), where distinct business functions are encapsulated as autonomous, deployable software components. A Wasm component provides a near-perfect implementation of a PBC: it is a compiled, portable, and secure artifact that encapsulates specific logic. The Component Model, therefore, is not just a technical upgrade for linking code. It is the foundation for a future where software is no longer just written, but composed. Developers will be able to assemble applications from a universal ecosystem of secure, pre-built components that provide best-of-breed solutions for specific tasks, fundamentally altering the nature of the software supply chain and accelerating innovation across all languages and platforms. Conclusion: From a Faster Web to a Universal Runtime WebAssembly's journey has been one of remarkable and accelerating evolution. Born from the practical need to overcome performance bottlenecks in the web browser, its core principles of speed, portability, and security proved to be far more powerful than its creators may have initially envisioned. What began as a way to run C++ code alongside JavaScript has grown into a technology that is fundamentally reshaping our conception of software. The introduction of the WebAssembly System Interface (WASI) was the pivotal moment, transforming Wasm from a browser-centric tool into a viable, universal runtime for server-side computing. Its capability-based security model offered a fresh, "secure-by-default" alternative to traditional application architectures. 
This new foundation allowed Wasm to emerge as a compelling counterpart to containers, offering an unparalleled combination of lightweight footprint, near-instantaneous startup, and a hardened security sandbox that is ideally suited for the demands of serverless functions and the resource-constrained world of edge computing. Today, Wasm is not just a technology for the browser, the cloud, or the edge; it is the first to provide a single, consistent runtime that spans this entire continuum, breaking down long-standing silos in software development. Now, with the advent of the Component Model, WebAssembly is poised for its next great leap. By solving the final, critical challenge of language-agnostic interoperability, it lays the groundwork for a future where applications are not monoliths to be built, but solutions to be composed from a global ecosystem of secure, reusable, and portable software components. WebAssembly is more than just a faster way to run code; it is a foundational shift toward a more modular, more secure, and truly universal paradigm for the next era of computing.

By Graziano Casto
Amadeus Cloud Migration on Ampere Altra Instances

“You might not be familiar with Amadeus, because it is a B2B company [but] when you search for a flight or a hotel on the Internet, there is a good chance that you are using an Amadeus-powered service behind the scenes,” according to Didier Spezia, a cloud architect for Amadeus. Amadeus is a leading global travel IT company, powering the activities of many actors in the travel industry: airlines, hotel chains, travel agencies, airports, among others. One of Amadeus’ activities is to provide shopping services to search and price flights for travel agencies and companies like Kayak or Expedia. Amadeus also supports more advanced capabilities, such as budget-driven queries and calendar-constrained queries, which require pre-calculating multi-dimensional indexes. Searching for suitable flights with available seats among many airlines is surprisingly difficult. Getting the optimal solution is considered an NP-hard problem, so to provide a best-effort answer, Amadeus uses a combination of brute force, graph algorithms, and heuristics. It requires large-scale, distributed systems and consumes a lot of CPUs, running on thousands of machines today on Amadeus’ premises. To fulfill customer requests, Amadeus operates multiple on-prem facilities worldwide and also runs workloads on multiple cloud service providers. The Project A few years ago, Amadeus began a large, multi-year project to migrate most of Amadeus’ on-prem resources to Azure. For this specific use case, Amadeus worked jointly with Microsoft to validate Ampere ARM-based virtual machines (VMs). During the discussion, Mo Farhat from Microsoft commented: “From our position… [Microsoft] wants to give our customers a choice. We're not driving [them] towards one architecture versus another… or one CPU versus another. We want to provide a menu of options and provide trusted advice…” Initially, as part of the transition, Amadeus was not necessarily interested in introducing a different architecture. According to Spezia, “We only introduce a different architecture because we expect some benefit. […] We are very interested in the performance/price ratio we can get from Ampere. […] We want the capability to mix machines with traditional x86 CPUs and machines with Ampere CPUs and run workloads on the CPUs best suited for that workload.” They chose a large, distributed, compute-intensive C++ application as the first one to run on Ampere, as they felt that this would provide the greatest comparative benefit over x86. “We thought ARM-based machines could be a good match, but of course, we needed to validate and confirm our assumptions. We started by running a number of synthetic benchmarks. […] The results were positive, but synthetic benchmarks are not extremely relevant. Since introducing a new CPU architecture in the ecosystem is not neutral, we needed a better guarantee and decided to benchmark with real application code. […] The application is a large C++ code base. It depends on a good number of low-level open-source libraries, plus some Amadeus middleware libraries, and finally, the functional code itself. A subset of this code has been isolated for the benchmark to run in a testbed.” One of the factors that enabled the project to be successful was the ability of the Amadeus team to obtain Ampere servers early in the project. “To start, Amadeus installed a couple of machines with Ampere Altra CPUs on-prem. They were used for the initial porting work, and they still run our CI/CD today. 
Since we are in the middle of a migration to the public cloud and very much in the hybrid model with a complex ecosystem, we appreciated the flexibility to deploy some machines on-prem, with the same CPU architecture as the VM delivered in Azure by Microsoft. We found it invaluable to use machines running the target architecture for CI/CD and testing, rather than doing cross-compilation," according to Didier. The application's CI continues to run on an Ampere server in the Amadeus lab.

Challenges

"Porting our code started by recompiling everything using an Arm64-compatible toolchain (Aarch64 target), with implications on our CI/CD." The porting process of getting this code working on Ampere went very smoothly, although it did reveal some issues. Some platform-specific compiler behavior, such as whether the "char" data type is signed or unsigned, differs between x86 and Arm64, and the application had made assumptions about that behavior. To compile their large C++ code base, Amadeus uses both the GCC and Clang C++ compilers. Among the changes required as part of the port, a number of open source dependencies needed upgrades to take advantage of improved Arm64 support, and some of those upgrades involved API or behavior changes that required further code changes. In addition, several latent issues in the code base related to undefined or platform-defined behavior, which had never surfaced on x86, were exposed and fixed as part of the migration.

Deployment

In the cloud, Amadeus applications are deployed on OpenShift clusters (Red Hat's Kubernetes-based container platform). To be operated in production, the applications require a full middleware ecosystem (enterprise service bus, logging and monitoring facilities, etc.), which is also hosted in OpenShift. Amadeus did not want to migrate their entire application infrastructure to Arm64. Red Hat, another trusted partner, has delivered into OpenShift, as a supported feature, the Kubernetes capability to mix hardware architectures within a single cluster. Concretely, this means a single OpenShift cluster can include both x86 and Arm compute nodes. By defining node sets with both x86 and Arm64 nodes, and using labels and "taints" on the containers to be deployed, developers can easily decide which type of VM their pods are scheduled on. The supporting components of the Amadeus application infrastructure can therefore run on traditional x86 VMs, while the application pods that Amadeus chooses to run on Arm64 for cost and performance reasons can run on Azure Dpsv5 VMs powered by Ampere Altra CPUs. Heterogeneous clusters are instrumental in supporting an incremental migration and in avoiding doubling the number of OpenShift clusters to be operated.
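As an illustration of the labels-and-taints mechanism described above, a pod that should land only on the Arm64 node pool can be expressed roughly like this (the taint key, values, and names are hypothetical; kubernetes.io/arch is a standard node label):

YAML
apiVersion: v1
kind: Pod
metadata:
  name: shopping-engine            # hypothetical workload name
spec:
  nodeSelector:
    kubernetes.io/arch: arm64      # only schedule onto Arm64 nodes
  tolerations:
    - key: "arch"                  # hypothetical taint keeping other workloads off the Arm64 pool
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  containers:
    - name: app
      image: registry.example.com/shopping-engine:latest   # hypothetical image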
Results

Obviously, before moving into production, Amadeus wanted to validate their assumptions with some benchmarking. With the cpubench1a synthetic benchmark on 32-vCPU VMs, a single Ampere Altra VM (D32ps_v5) delivered 20% higher raw throughput and a 50% performance/price improvement over equivalent Intel VMs, and 13% higher raw throughput and a 27% performance/price improvement over equivalent AMD VMs. When benchmarking with the realistic shopping application benchmark, there was a tradeoff between throughput and response time: the higher the throughput, the more response time was impacted. The Ampere Altra VMs yielded a 47% performance/price improvement, with an acceptable degradation of 11% in mean response time over Intel VMs, and a 37% performance/price improvement with a 9% degradation in average response time over AMD.

Amadeus has now ported enough application components to run the real application (not just benchmarks). The company is currently completing integration tests and validating the last bits of the platform. Once done, Amadeus will begin ramping up the production environment in multiple Azure regions.

Check out the full Ampere article collection here.

By Craig Hardy
Self-Managed Keycloak for App Connect Dashboard and Designer Authoring

With the release of the IBM® App Connect Operator version 12.1.0, you can now use your existing Keycloak instance to configure authentication and authorization for App Connect Dashboard and Designer Authoring. Building on top of the capability to use Keycloak, which was first available in IBM® App Connect Operator version 11.0.0, this feature extends the supported platforms from Red Hat® OpenShift® Container Platform (OCP) only to also include Kubernetes. It has in addition removed the dependencies on the IBM® Cloud Pak foundational services and IBM® Cloud Pak for Integration operators. It is worth noting that this new feature is only available with App Connect licenses. This article contains a tutorial on how to use your Keycloak instance to manage authentication and authorization for App Connect Dashboard and Designer Authoring. There are two scenarios. Scenario 1 covers how to configure App Connect Dashboard with your Keycloak instance on Kubernetes, whilst scenario 2 covers configuring App Connect Designer Authoring on Kubernetes. While you can follow this tutorial for OCP with the kubectl command-line tool, the App Connect documentation and a related tutorial on how to use Keycloak with IBM® App Connect Operator version 11.0.0 provide further guidance on how to use your Keycloak instance from the Red Hat UI. Prerequisites Install the IBM® App Connect Operator version 12.1.0 or later.Use App Connect licenses only (such as AppConnectEnterpriseProduction).Use App Connect Dashboard and Designer Authoring versions 12.0.12.3-r1 or later.Kubernetes version 1.25, 1.27, 1.28 or 1.29.Install the kubectl command-line tool.Install a Keycloak instance and obtain the following information: The URL of Keycloak endpoint.The Certificate Authority (CA) certificate from Keycloak.The URL and credentials to access the Keycloak Admin Console. For this tutorial, we configured a keycloak instance with the Keycloak operator version 24.0.3 on Kubernetes. If you do not have an existing Keycloak instance and would like to create one to complete this tutorial, you can follow the documentation for the Keycloak operator. You MUST skip versions 25.0.0 and 25.0.1, which introduced a defect, where user client roles are not returned by token introspection from Keycloak. When you are following that documentation to set up a database for Keycloak, you must modify the default values for POSTGRES_USER and POSTGRES_PASSWORD in the example yaml. Article index Scenario 1: Create and access App Connect Dashboard with your Keycloak instance on Kubernetes. Part 1: Create a keycloak client for App Connect Dashboard.Part 2: Create Keycloak related secrets on your Kubernetes cluster.Part 3: Create an App Connect Dashboard to use your Keycloak instance.Part 4: Access your App Connect Dashboard.Scenario 2: Create and access App Connect Designer Authoring with your Keycloak instance on Kubernetes. Part 1: Create a keycloak client for App Connect Designer Authoring.Part 2: Create Keycloak related secrets on your Kubernetes cluster.Part 3: Create an App Connect Designer Authoring to use your Keycloak instance.Part 4: Access your App Connect Designer Authoring.Troubleshooting. Invalid parameter: redirect_uri.Something went wrong: initial connection from App Connect Dashboard or Designer UI to Keycloak.Something went wrong: error validating Keycloak client roles. Note: In this article, resource names are highlighted in dark red. Keywords that are displayed on a UI are highlighted in bold. 
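Before you start scenario 1, it can be worth confirming the prerequisites from a terminal. The following is a minimal sketch and is not part of the original setup: it assumes kubectl is already configured for your cluster and that the Keycloak operator CRDs are installed; adjust the names to your environment.
Shell
# Confirm kubectl can reach the cluster and report the server version
kubectl version

# Confirm the App Connect operator CRDs (API group appconnect.ibm.com) are present
kubectl get crd | grep appconnect.ibm.com

# If you used the Keycloak operator, list the Keycloak instances it manages
kubectl get keycloaks --all-namespaces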
Scenario 1: Create and Access App Connect Dashboard With Your Keycloak Instance on Kubernetes Part 1: Create a Keycloak Client for App Connect Dashboard Let's create and configure a Keycloak client, so that your Keycloak instance can authenticate incoming requests from the App Connect Dashboard. From your Keycloak admin console, use the navigation pane to select a realm from the drop-down list. In this tutorial, we have set up a realm called exampleRealm. Next, on the navigation pane, select Clients, and then click Create client to create a Keycloak client. Set Client ID and click Next. In this tutorial, the client ID for App Connect Dashboard is set to dash-ace-keycloak-k8s-example-iam-11111. It contains a number of parts to make it uniquely identifiable, such as the App Connect resource type (dash for App Connect Dashboard), the namespace where the App Connect resource will be created, the name of the App Connect resource and a random five digits at the end. Toggle to enable Client authentication and Authorization. Click Next. Click Save to create the client. You will come back to set Valid redirect URIs and Valid post logout redirect URIs later, so that Keycloak can redirect you to the App Connect Dashboard UI after a successful login, and to the landing page after logout respectively. Note that logout is only available on App Connect Operator version 12.4.0 or later and operand version 13.0.1.0-r1 or later. When using a Keycloak version from v24.0.0 onwards, you should make use of light-weight access tokens. Without light-weight access tokens you may see problems logging in when a user is assigned a large number of roles. On the navigation pane, select Clients. Click the client ID you have created from the Client ID column. Click Advanced, then jump to the Advanced settings section. Toggle on the setting for Always use lightweight access token and then click Save. Next you need to configure the client with the available roles and required client scope for App Connect Dashboard. On the navigation pane, select Clients. Click the client ID you have created from the Client ID column. To create roles, click Roles, then click Create role. There are two roles available for App Connect Dashboard, which are dashboard-viewer and dashboard-admin. The former gives you a view-only access to the Dashboard, which means you can only view resources. The latter enables you to perform administrative tasks, such as creating an IntegrationRuntime and uploading a BAR file. In this tutorial, we will create both roles for the Keycloak client. To create a viewer role, enter dashboard-viewer in Role name and click Save. Next, repeat this step to create a dashboard-admin role. Now you will add a required mapper to the Keycloak client. Click Client scopes, then click Add client scope. Next, click the client scope named dash-ace-keycloak-k8s-example-iam-11111-dedicated. Click Add mapper and select By configuration. This displays a table of predefined mappings. From the table, click to select User Client Role. From the Add mapper editing window for the User Client Role mapper type: Set Name to a name of your choice. 
In this tutorial, we set it to effective-client-role.From the Client ID drop-down list, select your Keycloak client.Set Token Claim Name to effective-roles, which is a required value for the App Connect Dashboard and Designer Authoring to validate user roles.Toggle to enable Multivalued, Add to ID token, Add to access token, Add to lightweight access token, Add to userinfo and Add to token introspection.Finally, click Save to complete this mapper. Note: The Red Hat Keycloak interface might not have the toggle option for Add to token introspection. In that case, it is enabled by default. Part 2: Create Keycloak Related Secrets on Your Kubernetes Cluster To enable Transport Layer Security (TLS) between Keycloak and App Connect resources (Dashboard and Designer Authoring), you need to provide a couple of credentials on your Kubernetes cluster. The credentials should be stored as a Kubernetes Secret resource, and in a namespace that is accessible by your App Connect resources. Create a namespace named ace-keycloak-k8s. In this tutorial, this namespace will be used to install these secrets as well as App Connect resources. To create the namespace with the kubectl command, you need to log into your Kubernetes cluster from a terminal, and then run the following command:kubectl create namespace ace-keycloak-k8sSecret 1: Keycloak client secret - Create a secret to store credentials of the Keycloak client. This secret must contain two key-value pairs. The keys must be named CLIENT_ID and CLIENT_SECRET. Copy the following YAML template into a file named kcClientSecret.yaml. YAML kind: Secret apiVersion: v1 metadata: name: dash-ace-keycloak-k8s-example-iam-11111 namespace: ace-keycloak-k8s labels: app: keycloak data: CLIENT_ID: ZGFzaC1hY2Uta2V5Y2xvYWstazhzLWV4YW1wbGUtaWFtLTExMTEx CLIENT_SECRET: modify-this-value type: Opaque The value of CLIENT_ID is a base64-encoded value of the client ID that you created in part 1. The base64-encoded value can be obtained by running the following command in a terminal:echo -n "dash-ace-keycloak-k8s-example-iam-11111" | base64 Note: You should change metadata.name and the CLIENT_ID values accordingly, when you are creating a client secret for App Connect Designer Authoring. You must replace the value of CLIENT_SECRET with the following steps: Select your Keycloak client from the Keycloak admin console, and then click Credentials. Copy the client secret from the Client Secret field. Note that the following example shows a Keycloak client for App Connect Dashboard, you should change the CLIENT_SECRET value accordingly, when you are creating a client secret for Designer Authoring. Base64 encode the copied value. For example, run the following command in a terminal:echo -n "client-secret-value" | base64Use the base64-encoded value to set CLIENT_SECRET in the yaml file.Create the secret on your cluster by running the following command:kubectl apply -f kcClientSecret.yaml -n ace-keycloak-k8s Secret 2: CA certificate Secret - Create a secret to store the CA certificate from Keycloak. This secret must contain a key-value pair. The name of the key is not fixed, but is default to ca.crt. You can specify your own key name, such as myca.crt. In that case, you must specify it in the App Connect Dashboard or Designer Authoring CR fields spec.authentication.integrationKeycloak.tls.caCertificate and spec.authorization.integrationKeycloak.tls.caCertificate. Copy the following YAML template into a file called kcCASecret.yaml. 
YAML
kind: Secret
apiVersion: v1
metadata:
  name: example-tls-secret
  namespace: ace-keycloak-k8s
  labels:
    app: keycloak
data:
  ca.crt: modify-this-value
type: Opaque
You must replace the value for ca.crt with the following steps:
Contact your Certification Authority to obtain the CA certificate.
Base64 encode the CA certificate. For example, in a terminal, run the following command: echo -n "-----BEGIN CERTIFICATE----- abcdefg -----END CERTIFICATE-----" | base64
Use the base64-encoded value to set the value for ca.crt in the yaml file.
Create the secret on your cluster by running the following command: kubectl apply -f kcCASecret.yaml -n ace-keycloak-k8s
Part 3: Create an App Connect Dashboard To Use Your Keycloak Instance
On Kubernetes, create an ingress resource for your App Connect Dashboard. Follow the documentation to create one and note down the spec.tls.hosts in the ingress YAML file. Note: With App Connect Operator 12.8.0 or later, an App Connect Dashboard at version 13.0.2.1-r1 or later supports ingress out of the box on IBM Cloud Kubernetes Service. So, to automatically create an ingress resource for your Dashboard, simply set spec.ingress.enabled to true in the Dashboard CR, as described in the Creating ingress resources for your App Connect Dashboard and integration runtimes out of the box on IBM Cloud Kubernetes Service blog. Copy the following YAML template into a file named dashboard_iam.yaml.
YAML
apiVersion: appconnect.ibm.com/v1beta1
kind: Dashboard
metadata:
  name: example-iam-dash
  labels:
    backup.appconnect.ibm.com/component: dashboard
  namespace: ace-keycloak-k8s
spec:
  api:
    enabled: true
  license:
    accept: true
    license: L-XRNH-47FJAW
    use: AppConnectEnterpriseProduction
  pod:
    containers:
      content-server:
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 50m
            memory: 50Mi
      control-ui:
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 50m
            memory: 125Mi
  imagePullSecrets:
    - name: ibm-entitlement-key
  switchServer:
    name: default
  authentication:
    integrationKeycloak:
      auth:
        clientSecretName: dash-ace-keycloak-k8s-example-iam-11111
      enabled: true
      endpoint: 'https://example-keycloak.test.com'
      realm: exampleRealm
      tls:
        secretName: example-tls-secret
        ingressHost: example-iam.example-keycloak.test.com
  authorization:
    integrationKeycloak:
      auth:
        clientSecretName: dash-ace-keycloak-k8s-example-iam-11111
      enabled: true
      endpoint: 'https://example-keycloak.test.com'
      realm: exampleRealm
      tls:
        secretName: example-tls-secret
        ingressHost: example-iam.example-keycloak.test.com
  storage:
    size: 5Gi
    type: persistent-claim
    class: ibmc-file-gold-gid
  displayMode: IntegrationRuntimes
  replicas: 1
  version: '12.0.12.3-r1'
Set spec.authentication.integrationKeycloak.auth.clientSecretName and spec.authorization.integrationKeycloak.auth.clientSecretName to dash-ace-keycloak-k8s-example-iam-11111. This is the client secret that was created in part 2.
Ensure spec.authentication.integrationKeycloak.enabled and spec.authorization.integrationKeycloak.enabled are set to true, which enables authentication and authorization for App Connect Dashboard.
Set spec.authentication.integrationKeycloak.endpoint and spec.authorization.integrationKeycloak.endpoint to the URL of the Keycloak endpoint. You can find the value in the KC_HOSTNAME environment variable in your Keycloak pod.
Run the following command to get the value:
kubectl get pod <keycloak-pod-name> -n <namespace-for-keycloak-pod> -o=jsonpath='{.spec.containers[0].env[?(@.name == "KC_HOSTNAME")].value}'
Note: If the endpoints are not provided, whilst authentication and authorization are enabled, the IBM® Cloud Pak foundational services must be installed to provide authentication and authorization for App Connect Dashboard and Designer Authoring. This is supported on OCP only.
Set spec.authentication.integrationKeycloak.realm and spec.authorization.integrationKeycloak.realm to the Keycloak realm where the Keycloak client dash-ace-keycloak-k8s-example-iam-11111 exists. In this tutorial, it is exampleRealm.
Set spec.authentication.integrationKeycloak.tls.secretName and spec.authorization.integrationKeycloak.tls.secretName to example-tls-secret. This is the CA secret that was created in part 2. Because we used the default key name ca.crt for the CA secret, we do not need to specify spec.authentication.integrationKeycloak.tls.caCertificate and spec.authorization.integrationKeycloak.tls.caCertificate. Therefore, the caCertificate fields are not included in the example CR dashboard_iam.yaml.
Set spec.authentication.integrationKeycloak.tls.ingressHost and spec.authorization.integrationKeycloak.tls.ingressHost to the spec.tls.hosts value obtained in step 1. Note: Skip step 8 if you are creating a Dashboard instance at version 13.0.2.1-r1 or later and have set spec.ingress.enabled to true in the Dashboard CR.
Follow the documentation on the entitlement key to create an ibm-entitlement-key Secret. This enables you to download the required images for App Connect Dashboard.
Follow the documentation on Dashboard storage to set spec.storage.class.
(Optional) Set spec.version to 12.0 to pick up the latest App Connect Dashboard operand version.
Create the App Connect Dashboard resource with the following command: kubectl apply -f dashboard_iam.yaml -n ace-keycloak-k8s
Once the App Connect Dashboard deployment is ready, you can navigate to part 4 to access it.
Part 4: Access Your App Connect Dashboard
From your Keycloak admin console, use the navigation pane to select a realm from the drop-down list. In this tutorial, we have set up a realm called exampleRealm. Next, on the navigation pane, select Clients and click the client dash-ace-keycloak-k8s-example-iam-11111. Set Valid redirect URIs to <ACE_INGRESS_HOSTNAME>/oauth/callback. ACE_INGRESS_HOSTNAME is the URL of the App Connect Dashboard UI. You can obtain the value of ACE_INGRESS_HOSTNAME with the following command:
kubectl get configmap example-iam-dash-dash -o=jsonpath='{.data.ACE_INGRESS_HOSTNAME}' -n ace-keycloak-k8s
The name of the configmap resource is in the format of <dashboard metadata.name>-dash.
If you are on App Connect Operator version 12.4.0 or later and Dashboard operand version 13.0.1.0-r1 or later, set Valid post logout redirect URIs to https://<ACE_INGRESS_HOSTNAME>.
Now you need to create a user to log in to the Dashboard. To do so, you can follow steps 6 to 10 in the Create a user and configure user roles section of the Keycloak tutorial for IBM® App Connect Operator version 11.0.0.
In a Web browser, navigate to <ACE_INGRESS_HOSTNAME>. As a result, a request is sent to the control-ui container in the Dashboard pod. With information on the Keycloak client and Keycloak endpoint, the request is redirected to Keycloak to provide authentication and authorization for App Connect Dashboard. You can use the user created in step 3 to log in.
If you are directed to an error page, refer to the troubleshooting section. Keycloak validates the user information, and forwards the request to the Valid redirect URIs that you configured in the Keycloak client. As a result, congratulations, you are now logged into the App Connect Dashboard. If you are directed to an error page, refer to the troubleshooting section. Scenario 2: Create and Access App Connect Designer Authoring With Your Keycloak Instance on Kubernetes Part 1: Create a Keycloak Client for App Connect Designer Authoring Let's create and configure a Keycloak client, so that your Keycloak instance can authenticate incoming requests from the App Connect Designer Authoring. You can follow Part 1: Create a Keycloak client for App Connect Dashboard in scenario 1, with variations as follows: Step 2: Set the client ID to designer-ace-keycloak-k8s-example-iam-11111.Step 5 and 6: Ensure the client ID designer-ace-keycloak-k8s-example-iam-11111 is selected.Step 7: Instead of creating Dashboard specific roles, you need to create a role for App Connect Designer Authoring. There is one role available, which is designerauthoring-admin. The role enables you to perform administrative tasks, such as creating and importing a flow.Step 8: In addition to adding a mapper named User Client Role, which is also required by App Connect Designer Authoring, you need to add three new mappers for App Connect Designer Authoring as follows: Add a mapper of the User Attribute type: Click Add mapper and select By configuration. Click User Attribute from the table of predefined mappings.Set Name to a name of your choice. In this tutorial, we set it to LDAP_ID.Set User Attribute to LDAP_ID.Set Token Claim Name to ldap_id, which is a required value for the App Connect Designer Authoring to validate user roles.Toggle to enable Add to ID token, Add to access token, Add to lightweight access token and Add to userinfo.Finally, click Save to complete this mapper.Add a mapper of the User Session Note type: Click Add mapper and select By configuration. Click User Session Note from the table of predefined mappings.Set Name to a name of your choice. In this tutorial, we set it to identity_provider.Set User Session Note to identity_provider.Set Token Claim Name to identity_provider, which is a required value for the App Connect Designer Authoring to validate user roles.Toggle to enable Add to ID token, Add to access token, Add to lightweight access token and Add to userinfo.Finally, click Save to complete this mapper.Add a mapper of the User Session Note type: Click Add mapper and select By configuration. Click User Session Note from the table of predefined mappings.Set Name to a name of your choice. In this tutorial, we set it to identity_provider_identity.Set User Session Note to identity_provider_identity.Set Token Claim Name to identity_provider_identity, which is a required value for the App Connect Designer Authoring to validate user roles.Toggle to enable Add to ID token, Add to access token, Add to lightweight access token and Add to userinfo.Finally, click Save to complete this mapper. 
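Before moving on to the secrets, you can optionally sanity-check that the mappers above produce the expected claims (effective-roles, ldap_id, identity_provider, identity_provider_identity). The following sketch is not part of the original tutorial: it assumes Direct access grants is enabled on the client, that curl and jq are available, and it reuses the placeholder hostname, realm, and credentials from this tutorial.
Shell
# Request a token for a test user from the example realm (placeholder values)
TOKEN=$(curl -sk -X POST \
  "https://example-keycloak.test.com/realms/exampleRealm/protocol/openid-connect/token" \
  -d "grant_type=password" \
  -d "client_id=designer-ace-keycloak-k8s-example-iam-11111" \
  -d "client_secret=<client-secret>" \
  -d "username=<test-user>" \
  -d "password=<test-password>" | jq -r '.access_token')

# The mappers were added to userinfo, so the claims should be visible here
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://example-keycloak.test.com/realms/exampleRealm/protocol/openid-connect/userinfo" | jq .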
Part 2: Create Keycloak Related Secrets on Your Kubernetes Cluster Follow Part 2: Create Keycloak related secrets on your Kubernetes cluster to create the required secrets, with variations as follows: If you have already completed Scenario 1 in the same Kubernetes environment, you can skip the creation of the ace-keycloak-k8s namespace, and the secret to store the CA certificate from Keycloak.You need to use the client ID for App Connect Designer Authoring, which is designer-ace-keycloak-k8s-example-iam-11111. The CLIENT_SECRET should be obtained from this client ID. Part 3: Create an App Connect Designer Authoring To Use Your Keycloak Instance On Kubernetes you need to create an ingress route for your Designer Authoring. Follow the documentation to create one and note down the spec.tls.hosts in the ingress yaml. Note: With App Connect Operator 12.9.0 or later, an App Connect Designer Authoring at version 13.0.2.2-r1 or later supports ingress out of the box on IBM Cloud Kubernetes Service. So, to automatically create an ingress resource for your Designer Authoring, simply set spec.ingress.enabled to true in the Designer Authoring CR, as described in the Creating ingress resources for your App Connect Designer Authoring and switch servers out of the box on IBM Cloud Kubernetes Service blog. Copy the following YAML template into a file named designer_iam.yaml. YAML apiVersion: appconnect.ibm.com/v1beta1 kind: DesignerAuthoring metadata: name: example-iam-designer labels: backup.appconnect.ibm.com/component: designerauthoring namespace: ace-keycloak-k8s spec: imagePullSecrets: - name: ibm-entitlement-key license: accept: true license: L-XRNH-47FJAW use: AppConnectEnterpriseProduction couchdb: storage: size: 10Gi type: persistent-claim class: ibmc-file-gold-gid replicas: 1 designerMappingAssist: incrementalLearning: schedule: Every 15 days enabled: false authentication: integrationKeycloak: auth: clientSecretName: designer-ace-keycloak-k8s-example-iam-11111 enabled: true endpoint: 'https://example-keycloak.test.com' realm: exampleRealm tls: secretName: example-tls-secret ingressHost: example-iam-designer.example-keycloak.test.com authorization: integrationKeycloak: auth: clientSecretName: designer-ace-keycloak-k8s-example-iam-11111 enabled: true endpoint: 'https://example-keycloak.test.com' realm: exampleRealm tls: secretName: example-tls-secret ingressHost: example-iam-designer.example-keycloak.test.com designerFlowsOperationMode: local replicas: 1 version: '12.0.12.3-r1' Set spec.authentication.integrationKeycloak.auth.clientSecretName and spec.authorization.integrationKeycloak.auth.clientSecretName to designer-ace-keycloak-k8s-example-iam-11111. This is the client secret that was created in part 2.Ensure spec.authentication.integrationKeycloak.enabled and spec.authorization.integrationKeycloak.enabled are set to true, which enables authentication and authorization for App Connect Designer Authoring.Set spec.authentication.integrationKeycloak.endpoint and spec.authorization.integrationKeycloak.endpoint to the URL of Keycloak endpoint. You can find the value in the KC_HOST environment variable in your Keycloak pod. 
Run the following command to get the value:kubectl get pod <keycloak-pod-name> -n <namespace-for-keycloak-pod> -o=jsonpath='{.spec.containers[0].env[?(@.name == "KC_HOSTNAME")].value}' Note: If the endpoints are not provided, whilst authentication and authorization are enabled, the IBM® Cloud Pak foundational services must be installed to provide authentication and authorization for App Connect Dashboard and Designer Authoring. This is supported on OCP only.Set spec.authentication.integrationKeycloak.realm and spec.authorization.integrationKeycloak.realm to the Keycloak realm, where the Keycloak client designer-ace-keycloak-k8s-example-iam-11111 exists. In this tutorial, it is exampleRealm.Set spec.authentication.integrationKeycloak.tls.secretName and spec.authorization.integrationKeycloak.tls.secretName to example-tls-secret. This is the CA secret that was created in part 2. Because we used the default key name ca.crt for the CA secret, we do not need to specify spec.authorization.integrationKeycloak.tls.caCertificate and spec.authorization.integrationKeycloak.tls.caCertificate. Therefore the caCertificate fields are not included in the example CR designer_iam.yaml.Set spec.authentication.integrationKeycloak.tls.ingressHost and spec.authorization.integrationKeycloak.tls.ingressHost to the spec.tls.hosts value obtained in step 1. Note: Skip step 8 if you are creating a Designer Authoring instance at version 13.0.2.2-r1 or later and have set spec.ingress.enabled to true in the Designer Authoring CR. Follow the documentation on entitlement key to create a ibm-entitlement-key Secret. This enables you to download the required images for the App Connect Designer Authoring.Follow the documentation on Designer Authoring storage to set spec.storage.class.(Optional) Set spec.version to 12.0 to pick up the latest App Connect Designer Authoring operand version.Create the Designer Authoring resource with the following command:kubectl apply -f designer_iam.yaml -n ace-keycloak-k8sOnce the Designer Authoring deployment is ready, you can navigate to part 4 to access the App Connect Designer Authoring resource. Part 4: Access Your App Connect Designer Authoring From your Keycloak admin console, use the navigation pane to select a realm from the drop-down list. In this tutorial, we have set up a realm called exampleRealm. Next, on the navigation pane, select Clients and click the client designer-ace-keycloak-k8s-example-iam-11111. Set Valid redirect URIs to <FIREFLY_ROUTE_UI>/auth/icp/callback, where FIREFLY_ROUTE_UI specifies the URL of the App Connect Designer UI. You can get the value of FIREFLY_ROUTE_UI with the following command:kubectl get configmap example-iam-designer-designer-env -o=jsonpath='{.data.FIREFLY_ROUTE_UI}' -n ace-keycloak-k8s The name of the configmap resource is in the format of <designer authoring metadata.name>-designer-env.If you are on App Connect Operator version 12.4.0 or later and Designer Authoring operand version 13.0.1.0-r1 or later, set Valid post logout redirect URIs to <FIREFLY_ROUTE_UI>.Now you need to create a user to log in to App Connect Designer Authoring. To do so, you can follow steps 6 to 10 in the Create a user and configure user roles section of the Keycloak tutorial for IBM® App Connect Operator version 11.0.0.In a Web browser, navigate to <FIREFLY_ROUTE_UI>. As a result, a request is sent to the ui container in the Designer Authoring pod. 
With information on the Keycloak client and Keycloak endpoint, the request is redirected to Keycloak to provide authentication and authorization for App Connect Designer Authoring. You can use the user information created in step 3 to log in. If you are directed to an error page, refer to the troubleshooting section. Keycloak validates the user information, and forwards the request to Valid redirect URIs that you configured in the Keycloak client. As a result, congratulations, you are logged into the App Connect Designer Authoring. If you are directed to an error page, refer to the troubleshooting section. Conclusion The IBM® App Connect Operator (version 12.1.0 or later) offers enhanced Keycloak support, which enables you to use an existing Keycloak instance to configure authentication and authorization for App Connect Dashboard and Designer Authoring. This new feature is available on both OCP and Kubernetes. Troubleshooting We are sorry: Invalid parameter: redirect_uri.How to recreate this problem? You entered the URL of the App Connect Dashboard UI or Designer UI on a Web browser. It directed you to the following error page, before reaching the Keycloak UI. This could indicate one of the following: The client secret, which you created in part 2, does not contain the correct name or credential for the Keycloak client. You need to verify that the secret contains the expected keys and correct values. You then need to update the secret otherwise. If you updated the client secret, you must recreate the related App Connect Dashboard or Designer Authoring to pick up the change.The Keycloak client does not contain a correct Valid redirect URIs. You need to verify this parameter on your Keycloak client.Something went wrong: initial connection from App Connect Dashboard or Designer UI to Keycloak.How to recreate this problem? You entered the URL of the App Connect Dashboard UI or Designer UI on a Web browser. It directed you to the following error page, before reaching the Keycloak UI. This could indicate one of the following: Check the logs from your App Connect Dashboard or Designer Authoring pod. Run the following commands:kubectl logs <dashboard pod name> -c control-ui | grep -i "InternalOAuthError: Failed to obtain access token"kubectl logs <designer authoring ui pod name> -c <designer-authoring-name>-ui | grep -i "InternalOAuthError: Failed to obtain access token"If the InternalOAuthError: Failed to obtain access token error message is in the pod log, it indicates that the value of ca.crt, which you set as a key-value pair in the CA certificate secret in part 2 in scenario 1, is incorrect. Check the value in the secret (which is named example-tls-secret in this tutorial). If you have specified your own key name, you must specify it in spec.authorization.integrationKeycloak.tls.caCertificate and spec.authorization.integrationKeycloak.tls.caCertificate in the App Connect Dashboard or Designer Authoring CR. You can run the following command to verify it matches the root CA certificate:openssl s_client -showcerts -verify 5 -connect example-keycloak.test.com:443 < /dev/null Note: You must replace example-keycloak.test.com with your Keycloak endpoint. This command returns a certificate chain with a depth of five. Use the root CA from the chain, which is the last entry in the output.The CA certificate is copied into /tmp/certs.crt for the App Connect Dashboard and Designer Authoring pods to access. 
You can verify that the content matches the root CA certificate you obtained from the openssl command above. The following shows how to exec into a Dashboard pod to check this file, from the Kubernetes UI. The following screen shows how to exec into a Designer Authoring UI pod to check this file, from the Kubernetes UI. If you updated the client secret, you must recreate the related App Connect Dashboard or Designer Authoring to pick up the change. If you are using a Keycloak version from v24.0.0 onwards, check the logs from your App Connect Dashboard or Designer Authoring pod. Run the following commands:kubectl logs <dashboard pod name> -c control-ui | grep -i "Cannot read properties of null (reading 'exp')"kubectl logs <designer authoring ui pod name> -c <designer-authoring-name>-ui | grep -i "Cannot read properties of null (reading 'exp')"If the Cannot read properties of null (reading 'exp') error message is in the pod log, it indicates that the size of the access token is too large for a cookie. Ensure you have followed step 5 in part 1 in scenario 1, and toggled on the setting for Always use lightweight access token. Something went wrong: error validating Keycloak client roles.How to recreate this problem? You entered the URL of the App Connect Dashboard UI or Designer UI on a Web browser, which took you to the Keycloak UI to log in as follows. After you entered the username and password, and then clicked Sign In, you arrived at the following error page. Check the logs from your App Connect Dashboard or Designer Authoring pod. Run the following commands:kubectl logs <dashboard pod name> -c control-ui | grep -i "cannot find the highest role"kubectl logs <designer authoring ui pod name> -c <designer-authoring-name>-ui | grep -i "cannot find the highest role"If the cannot find the highest role error message is in the pod log, you need to ensure the User Client Role mapper, which you added in step 7 of part 1 in scenario 1, was added to your Keycloak client. Ensure the Token Claim Name is set to effective-roles, and Add to Token Introspection is enabled.Otherwise you can change the string after grep -i to failed to obtain access token or InternalOAuthError. If one of these error message is found in the pod log, you need to ensure the App Connect Dashboard or Designer Authoring has been recreated, if you have updated the secret containing the CA certificate.
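If you prefer the command line to the Kubernetes UI for the /tmp/certs.crt check above, a kubectl exec along the following lines works as well (a minimal sketch; the pod names are placeholders):
Shell
# Dashboard: print the CA bundle mounted into the control-ui container
kubectl exec -n ace-keycloak-k8s <dashboard-pod-name> -c control-ui -- cat /tmp/certs.crt

# Designer Authoring: the same check against the UI container
kubectl exec -n ace-keycloak-k8s <designer-authoring-ui-pod-name> -c <designer-authoring-name>-ui -- cat /tmp/certs.crt

# Compare the output against the root CA returned by the openssl command shown earlier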

By Shanna Xu
Docker Offload: One of the Best Features for AI Workloads
Docker Offload: One of the Best Features for AI Workloads

As I mentioned in my previous post about Docker Model Runner and why it's a game-changing feature. I also mentioned that the best is yet to come, and Docker finally announced during the "WeAreDevelopers" event in Berlin, about their new feature, "Docker Offload." In this article, I will explain what exactly Docker Offload is and why we need it as developers, and why I say it's one of the best features released by Docker in recent times. What Is Docker Offload? If you are like me, who struggled to try out those cool AI models or data processing pipelines locally but were unable to do so due to the limitations of not having a GPU or a powerful machine to run them on, then continue reading. I always end up utilizing cloud resources, which often come with a hefty price tag. This is where the Docker team saw an opportunity and came up with yet another cool feature, which is the Docker Offload. Docker Offload is a fully managed service that enables users to execute Docker builds and run containers on cloud infrastructure, while preserving the local development experience. This drastically reduces the load on your local machine. It seems and feels like magic behind the scenes, as you will still continue to run the same Docker commands that you are familiar with; however, Docker Desktop creates a secure SSH tunnel to a Docker daemon running in the cloud. It provides local experience; however, your containers are created, and all workloads are executed on cloud infrastructure. Why Use Docker Offload? Run containers that are compute-intensive and require more resources than your local machine.Leverage cloud infrastructure for offloading heavy builds.Docker Offload is ideal for high-velocity development workflows that need the flexibility of the cloud without sacrificing local development experience.Instant access to GPU-powered environments.Accelerate both development and testing without worrying about cloud infrastructure setup.Develop efficiently in restricted environments, such as VDIs. How to Get Started Prerequisites Docker Desktop 4.43.0 version or above.Docker Hub Account with Docker Offload access: Docker Offload is currently in beta, so you'll need to sign up for access. Visit https://www.docker.com/products/docker-offload/ to sign up for beta access. You will receive an email after which you can enable Docker Offload in your Docker Desktop.No restrictive proxy or firewall blocking traffic to Docker Cloud. How to Start Docker Offload Once you receive the access, you can enable Docker Offload in two ways: either using Docker Desktop or via Terminal. Start via Docker Desktop In the screenshot below, you can see a toggle button at the top to enable Docker Offload. As soon as you do that, Docker Desktop color will be changed to purple, and you'll see a cloud icon in the header, which means Docker Desktop is now securely connected to a cloud environment that mimics your local experience. Now, when you run builds or containers, they execute remotely, but as a user, you will not see any difference. Start via Terminal Open a terminal and execute: PowerShell docker offload start The command will give you a prompt, as shown in the image below, to choose which account you want to use for Docker Offload. In the next prompt, it will ask whether you need GPU support or not. Once you select the options, it will display a "New Docker context created: docker-cloud" message, and the color of the Docker Desktop will change to purple. Note: Docker is currently giving 300 free GPU minutes to get started! 
After credits expire, usage is priced at $0.015 per GPU minute. For more details on billing and usage, refer to https://docs.docker.com/offload/usage/. How to Run a Container With Docker Offload Before running a container, make sure to verify Docker Offload is running properly. Execute the command below in the terminal to check the status. PowerShell docker offload status You can also see a cloud icon with Offload + GPU running at the bottom left of your Docker Desktop window. To get more information about Docker Offload status, run the below command, which gives more details about the daemon and status, and if there were any failed reasons when running Docker Offload, as shown in the screenshot below. PowerShell docker offload diagnose How to Build a Container With Docker Offload For this article demo, I have forked a Docker Offload demo app created by Ajeet Raina, a developer advocate at Docker. Ajeet created this awesome demo application, which I couldn't resist sharing here. PowerShell git clone https://github.com/sunnynagavo/docker-offload-demo.git cd .\docker-offload-demo\ Once you are inside the folder, let's run the Docker build command as shown below and see the magic. The build happens on cloud infrastructure instead of your local machine, which you can also see in the logs as shown in the screenshot below. You can also click on the Build tab on Docker Desktop to see the details of the build you just ran. In order to run the application with GPUs, execute the following command in the terminal. PowerShell docker run --rm --gpus all -p 3000:3000 docker-offload-demo You can see the output in the logs where it leveraged GPU (NVIDIA L4 by default) on the Docker cloud. Navigate to the URL http://localhost:3000 to see the demo application as shown in the screenshot below. The demo application shows a comprehensive view of your system's performance by providing confirmation that you are utilizing Docker Offload. The app clearly shows the stats about what GPU hardware (NVIDIA L4) is used in running, along with the resources used in running the application. How to Stop Docker Offload Just like starting Docker Offload, there are two ways to stop Docker Offload. Stop via Docker Desktop You can use the same toggle button at the top to stop Docker Offload and color changes back to your regular theme. Stop via Terminal Open a terminal and execute: PowerShell docker offload stop Once it's stopped, all the previously built images and containers will be cleaned up. When you run the above commands, you will now build images and run containers locally. Conclusion Docker Offload closes the gap between ease of local development and the power of cloud scalability. It helps developers to run compute-intensive workloads in the cloud without losing the experience of working locally. Docker Offload is like adding a supercomputer brain to your local machine. I highly encourage developers to give it a try and experience the power of Docker Offload. What are you waiting for? Go ahead and submit the request for getting access to Docker Offload. For more details, check out Docker's official documentation on Docker Offload at https://docs.docker.com/offload/quickstart/.

By Naga Santhosh Reddy Vootukuri DZone Core CORE
WAN Is the New LAN!?!?
WAN Is the New LAN!?!?

For decades, the Local Area Network (LAN) was the heart of enterprise IT. It represented the immediate, high-speed connectivity within an office or campus. But in today's cloud-first, globally distributed world, the very definition of "local" has expanded. The Wide Area Network (WAN) was long considered the expensive, constrained link. Today, however, its agility and intelligent fabric make it reliable enough to extend LAN-like connectivity globally. The paradigm shift is clear: "WAN is the new LAN". This transformation hasn't happened overnight; it is the product of years of research and more than two decades of evolution. It's a journey that began with the limitations of traditional Multiprotocol Label Switching (MPLS) infrastructure, evolved through the revolutionary capabilities of Software-Defined Wide Area Networking (SD-WAN), and is now culminating in the promise of hyper-scale Cloud WAN.
The Reign of MPLS
In the early 2000s, MPLS was the undisputed king of enterprise WANs. Enterprises relied heavily on MPLS-based circuits to connect their data centers and branch offices with guaranteed Quality of Service (QoS). With MPLS, the path is known in advance: packets travel along predefined, high-speed routes, ensuring reliability and high performance for mission-critical, real-time applications like voice and video. However, MPLS came with its own significant challenges:
High Costs: MPLS circuits were prohibitively expensive for many mid-size companies and startups to adopt, and bandwidth upgrades were a costly affair.
Lack of Flexibility: Adding new sites or increasing capacity was a lengthy, complex, and often manual process, involving weeks or even months of provisioning. This rigidity made it difficult for businesses to adapt to rapid changes.
Cloud Incompatibility: As applications migrated to the cloud (SaaS, IaaS), MPLS's hub-and-spoke architecture forced cloud traffic to "backhaul" through a central data center. This introduced latency, negated cloud benefits, and created bottlenecks.
Limited Visibility and Control: Enterprises often lacked granular control over the MPLS network, and because paths are predefined, a failure along a path typically required the service provider's help to troubleshoot the dropped traffic.
The rise of cloud computing and the distributed workforce exposed MPLS's limitations, paving the way for a more dynamic solution.
SD-WAN: The Agile Overlay Revolution
The mid-2010s ushered in the era of Software-Defined Wide Area Networking (SD-WAN), a game-changer that addressed many of MPLS's shortcomings. SD-WAN decouples the network control plane from the underlying hardware, creating a virtual overlay that can intelligently utilize various transport methods, including inexpensive broadband internet, LTE, and even existing MPLS circuits. Key advantages of SD-WAN over traditional MPLS include:
Cost Efficiency: SD-WAN uses readily available and less expensive Internet broadband for communication, significantly reducing WAN costs by 50-60% compared to MPLS.
Enhanced Agility and Flexibility: Centralized, software-based management allowed for rapid deployment of new sites, quick policy changes, and dynamic traffic steering.
New branches could be brought online in days, not months, often with zero-touch provisioning.Optimized Cloud Connectivity: SD-WAN does destination based routing and prioritizes cloud bound traffic directly to the Internet, instead of data center routing improving application performance and reducing the round trip time. It understood individual applications and their SLA requirements, ensuring optimal traffic delivery.Improved Performance and Resiliency: SD-WAN actively monitors network conditions across multiple links, automatically selecting the best path for applications and providing sub-second failover in case of an outage. This built-in redundancy dramatically increased network resilience.Centralized Management and Visibility: A single pane of glass provided comprehensive visibility into network performance, application usage, and security policies, empowering IT teams with greater control. SD-WAN quickly moved from emerging tech to mainstream, with nearly 90% of enterprises rolling out some form of it by 2022. It became the enabling technology for a cloud-centric world, making the public internet the new enterprise WAN backbone. Fast Forward to Cloud WAN: The Planet-Scale Network as a Service While SD-WAN brought immense benefits, the increasing complexity of multi-cloud environments, distributed workforces, and the burgeoning demands of AI and IoT workloads have led to the next evolution: Cloud WAN. Cloud WAN represents a shift from managing fragmented network components to consuming a fully managed, globally distributed network as a service. Hyperscale cloud providers, like Google Cloud, are now extending their massive private backbone networks, traditionally used for their own services, directly to enterprises. Google Cloud WAN, for example, leverages Google's planet-scale network encompassing millions of miles of fiber and numerous subsea cables to provide a unified, high-performance, and secure enterprise backbone. It's designed to simplify global networking by: Unified Global Connectivity: Connecting geographically dispersed data centers, branch offices, and campuses over a single, highly performant backbone, acting as a modern alternative to traditional WANs.Simplified Management: Abstracting the underlying network complexity and providing a policy-based framework for declarative management. This means enterprises can focus on business requirements rather than intricate technical configurations.Optimized for the AI Era: Designed to handle the low-latency, high-throughput demands of AI-powered workloads and other data-intensive applications, offering up to 40% faster performance compared to the public internet.Cost Savings: By consolidating network infrastructure and leveraging a managed service, Cloud WAN can offer significant total cost of ownership (TCO) reductions (e.g., up to 40% savings over customer-managed solutions).Integrated Security: Cloud WAN solutions often come with integrated, multi-layered security capabilities, ensuring consistent security policies across the entire network.Flexibility and Choice: While a managed service, Cloud WAN platforms often integrate with leading SD-WAN and SASE (Secure Access Service Edge) vendors, allowing enterprises to protect existing investments and maintain consistent security policies. The "WAN is the new LAN" paradigm isn't just about faster connections; it's about a fundamental shift in how enterprises approach their global network. 
It's about consuming connectivity as a seamless, software-defined service that adapts to business needs rather than a static, hardware-centric infrastructure. As businesses continue their digital transformation journeys, embracing hybrid and multi-cloud strategies and leveraging advanced technologies like AI, the evolution of WAN to Cloud WAN will be critical to unlocking their full potential. The network is no longer just a utility; it's a strategic enabler, performing with the speed, agility, and intelligence once reserved for the most localized of networks.

By Harika Rama Tulasi Karatapu
AI-Powered AWS CloudTrail Analysis: Using Strands Agent and Amazon Bedrock for Intelligent AWS Access Pattern Detection
AI-Powered AWS CloudTrail Analysis: Using Strands Agent and Amazon Bedrock for Intelligent AWS Access Pattern Detection

Background/Challenge
AWS CloudTrail logs capture a comprehensive history of API calls made within an AWS account, providing valuable information about who accessed what resources and when. However, these logs can be overwhelming to analyze manually due to their volume and complexity. Security teams need an efficient way to:
Identify unusual access patterns
Detect potential security threats
Understand resource usage patterns
Generate human-readable reports from technical log data
My approach combines AWS native services with generative AI to transform raw log data into actionable security insights. By leveraging the power of Amazon Bedrock and the Strands Agent framework, I have created a scalable, automated system that significantly reduces the manual effort required for CloudTrail analysis while providing more comprehensive results than traditional methods.
Solution Overview
This solution leverages AWS CloudTrail logs, Strands Agents, and Amazon Bedrock's generative AI capabilities to automatically analyze access patterns and generate insightful reports. The system queries CloudTrail logs, performs pattern analysis, and uses Anthropic Claude (via Amazon Bedrock) to transform raw data into actionable security insights.
Prerequisites
AWS Resources
AWS account with CloudTrail enabled
IAM permissions (add more as needed): CloudTrail:LookupEvents, Bedrock:InvokeModel
Python Environment
Python 3.12+
Required packages: boto3, Strands Agents SDK (for the agent framework)
Configuration
AWS credentials configured locally (via AWS CLI or environment variables)
Amazon Bedrock access to the Claude model (us.anthropic.claude-3-5-sonnet-20241022-v2:0)
Solution Architecture Overview
Set Up the Environment
Follow the quickstart guide to create a Strands agent project. Once your environment is ready, replace the agent.py with trailInsightAgent.py and add files as shown in the image below. The solution consists of two main components:
1. Orchestration Layer (trailInsightAgent.py)
Uses the Strands Agent framework to manage the workflow
Registers the `trail_analysis` tool (decorated with '@tool' in queryCloudTrail.py)
Executes the analysis and displays results
AI-Powered Insight Generation
Connects to Amazon Bedrock
Sends the analysis data to Claude with a specialized prompt
Processes the AI-generated response
Returns formatted insights
Python
# trailInsightAgent.py
from strands import Agent, tool
from queryCloudTrail import trail_analysis


def main():
    # Initialize the agent with the trail_analysis tool
    agent = Agent(tools=[trail_analysis])

    # Define the prompt for CloudTrail analysis
    prompt = """Review the cloudtrail logs for the last 3 days and provide a report in a tabular format. \
Focus on identifying unusual access patterns and security concerns, and give remediation to address any findings."""

    # Execute the agent with the message
    response = agent(prompt)

    # Print the response
    print(response)


if __name__ == "__main__":
    main()
2. CloudTrail Log Retrieval (queryCloudTrail.py)
This component has three functions, as follows. The first function, query_cloudtrail_logs, retrieves CloudTrail events using the AWS SDK (boto3).
Python
# queryCloudTrail.py
import boto3
from datetime import datetime, timedelta
from strands import tool

region = "us-west-2"  # TODO: read the region from an environment variable


def query_cloudtrail_logs(days=7, max_results=10):
    # Create CloudTrail client
    client = boto3.client('cloudtrail', region_name=region)

    # Calculate start and end time
    end_time = datetime.now()
    start_time = end_time - timedelta(days=days)

    # Query parameters
    params = {
        'StartTime': start_time,
        'EndTime': end_time,
        'MaxResults': max_results
    }

    # Execute the query
    response = client.lookup_events(**params)
    return response['Events']
The second function, analyze_access_patterns, processes CloudTrail events to identify patterns:
Most frequent API calls
Most active users
Most accessed AWS services
Most accessed resources
Python
# Access Pattern Analysis (queryCloudTrail.py)
def analyze_access_patterns(events):
    # Initialize counters
    event_counts = {}
    user_counts = {}
    resource_counts = {}
    service_counts = {}

    for event in events:
        # Count events by name
        event_name = event.get('EventName', 'Unknown')
        event_counts[event_name] = event_counts.get(event_name, 0) + 1

        # Count events by user
        username = event.get('Username', 'Unknown')
        user_counts[username] = user_counts.get(username, 0) + 1

        # Extract service name from event source
        event_source = event.get('EventSource', '')
        service = event_source.split('.')[0] if '.' in event_source else event_source
        service_counts[service] = service_counts.get(service, 0) + 1

        # Count resources accessed
        if 'Resources' in event:
            for resource in event['Resources']:
                resource_name = resource.get('ResourceName', 'Unknown')
                resource_counts[resource_name] = resource_counts.get(resource_name, 0) + 1

    return {
        'event_counts': event_counts,
        'user_counts': user_counts,
        'service_counts': service_counts,
        'resource_counts': resource_counts
    }
The third function, trail_analysis, ties everything together:
Retrieves CloudTrail logs for the last 3 days
Analyzes the access patterns
Returns the formatted insights
You can add error-handling logic to extend this function.
Python
# Trail_analysis Tool (queryCloudTrail.py)
@tool
def trail_analysis() -> dict:
    # Query CloudTrail logs (customize parameters as needed)
    events = query_cloudtrail_logs(
        days=3,          # Look back 3 days
        max_results=10   # Get up to 10 results
    )

    # Analyze access patterns
    analysis = analyze_access_patterns(events)
    return analysis
Verify It
To test this solution, run the following command in a terminal window. Make sure you are inside the logAgent directory.
python3 trailInsightAgent.py
Summary
In this post, I showed you how this architecture automates the AWS CloudTrail log analysis process, reducing manual effort and improving security insights. The solution combines CloudTrail data retrieval, pattern analysis, and generative AI to transform complex log data into actionable security recommendations. By leveraging Amazon Bedrock and the Strands Agent framework, I have created a system that addresses concerns regarding the complexity and volume of CloudTrail logs while providing meaningful security insights. Try out this approach for your own AWS environments and share your feedback and questions in the comments. You can extend this solution by hosting it in AWS Lambda and exposing it using API Gateway, adding scheduled execution, integrating with security information and event management (SIEM) systems, or customizing the analysis for your specific security requirements.
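If you want to preview the raw events the agent will analyze, or narrow the query before customizing the Python code, the AWS CLI exposes the same LookupEvents API. The following is a minimal sketch; the username filter and time window are placeholder values:
Shell
# Preview CloudTrail events for one user over a few days (placeholder values)
aws cloudtrail lookup-events \
  --region us-west-2 \
  --lookup-attributes AttributeKey=Username,AttributeValue=<some-user> \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-03T00:00:00Z \
  --max-results 10 \
  --query 'Events[].{Time:EventTime,Name:EventName,User:Username}' \
  --output table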
Cost Consideration While this solution offers automated analysis capabilities, costs can be managed effectively through several strategies: Adjust query frequency: Schedule analyses at appropriate intervals rather than running on-demandOptimize query size: Limit the ‘max_results’ parameter to retrieve only necessary dataFine-tune bedrock usage: Adjust token limits based on required detail levelUse targeted filters: Apply specific filters (username, event type) to focus on relevant data The primary cost drivers are: CloudTrail storage Amazon Bedrock API calls Remember to delete all resources after implementing this architecture if you are only validating the solution, to prevent incurring unnecessary costs.

By Anil Malakar
From Raw Data to Model Serving: A Blueprint for the AI/ML Lifecycle With Kubeflow
From Raw Data to Model Serving: A Blueprint for the AI/ML Lifecycle With Kubeflow

Are you looking for a practical, reproducible way to take a machine learning project from raw data all the way to a deployed, production-ready model? This post is your blueprint for the AI/ML lifecycle: you’ll learn how to use Kubeflow and open-source tools such as Feast to build a workflow you can run on your laptop and adapt to your own projects. We’ll walk through the entire ML lifecycle — from data preparation to live inference — leveraging the Kubeflow platform to create a cohesive, production-grade MLOps workflow. Project Overview The project implements a complete MLOps workflow for a fraud detection use case. Fraud detection is a critical application in financial services, where organizations need to identify potentially fraudulent transactions in real-time while minimizing false positives that could disrupt legitimate customer activity. Our fraud detection system leverages machine learning to analyze large volumes of transaction data, learn patterns from historical behavior, and flag suspicious transactions that deviate from normal patterns. The model considers various features such as transaction amounts, location data, merchant information, and user behavior patterns to make predictions. This makes fraud detection an ideal use case for demonstrating MLOps concepts because it requires: Real-time inference: Fraud detection decisions must be made instantly as transactions occurFeature consistency: The same features used in training must be available during inference to ensure model accuracyScalability: The system must handle high transaction volumesContinuous learning: Models need regular retraining as fraud patterns evolveCompliance and auditability: Financial services require comprehensive model tracking and governance The workflow ingests raw transaction data, proceeds through data preparation and feature engineering, then model training and registration, and finally deploys the model as a production-ready inference service that can evaluate transactions in real-time. The entire workflow is orchestrated as a Kubeflow Pipeline, which provides a powerful framework for defining, deploying, and managing complex machine learning pipelines on Kubernetes. Here is a high-level overview of the pipeline: A Note on the Data The pipeline assumes that the initial datasets (train.csv, test.csv, etc.) are already available. For readers who wish to follow along or generate their own sample data, a script is provided in the synthetic_data_generation directory. This script was used to create the initial data for this project, but is not part of the automated Kubeflow pipeline itself. Why Kubeflow? This project demonstrates the power of using Kubeflow to abstract away the complexity of Kubernetes infrastructure, allowing AI Engineers, Data Scientists, and ML engineers to focus on what matters most: the data and model performance. Key Benefits Infrastructure Abstraction Instead of manually managing Kubernetes deployments, service accounts, networking, and storage configurations, the pipeline handles all the infrastructure complexity behind the scenes. You define your ML workflow as code, and Kubeflow takes care of orchestrating the execution across your Kubernetes cluster. Focus on AI, Not DevOps With the infrastructure automated, you can spend your time on the activities that directly impact model performance. 
- Experimenting with different feature engineering approaches
- Tuning hyperparameters and model architectures
- Analyzing prediction results and model behavior
- Iterating on data preparation and validation strategies
Reproducible and Scalable
The pipeline ensures that every run follows the same steps with the same environment configurations, making your experiments reproducible. When you’re ready to scale up, the same pipeline can run on larger Kubernetes clusters without code changes.
Production-Ready From Day One
By using production-grade tools like KServe for model serving, Feast for feature management, and the Model Registry for governance, your development pipeline is already structured for production deployment.
Portable and Cloud-Agnostic
The entire workflow runs on standard Kubernetes, making it portable across different cloud providers or on-premises environments. What works on your laptop will work in production.
This approach shifts the cognitive load from infrastructure management to data science innovation, enabling faster experimentation and more reliable production deployments.
Getting Started: Prerequisites and Cluster Setup
Before diving into the pipeline, you need to set up your local environment. This project is designed to run on a local Kubernetes cluster using kind.
Prerequisites
- A container engine, like Podman or Docker
- Python (3.11 or newer)
- uv: A fast Python package installer
- kubectl
- kind
- mc (MinIO Client)
Note: This setup was tested on a VM with 12GB RAM, 8 CPUs, and 150GB of disk space.
1. Create a Local Kubernetes Cluster
First, create a kind cluster. The following command will set up a new cluster with a specific node image compatible with the required components:
Shell
kind create cluster -n fraud-detection-e2e-demo --image kindest/node:v1.31.6
2. Deploy Kubeflow Pipelines
With your cluster running, the next step is to deploy Kubeflow Pipelines. For this project, the standalone installation is recommended, as it’s lighter and faster to set up than a full Kubeflow deployment. Follow the official Kubeflow Pipelines standalone installation guide for the latest instructions.
3. Upload the Raw Data to MinIO
MinIO is an open-source, S3-compatible object storage system. In this project, MinIO is used to store raw datasets, intermediate artifacts, and model files, making them accessible to all pipeline components running in Kubernetes.
Before uploading, you need to port-forward the MinIO service so it’s accessible locally. Run the following command in a separate terminal window:
Shell
kubectl port-forward --namespace kubeflow svc/minio-service 9000:9000
Next, generate the synthetic data and copy it to feature_engineering/feature_repo/data/input/ if you haven’t done so yet. The synthetic data generation script creates the raw_transaction_datasource.csv file that serves as the primary input for the pipeline.
Shell
cd synthetic_data_generation
uv sync
source .venv/bin/activate
python synthetic_data_generation.py
cp raw_transaction_datasource.csv ../feature_engineering/feature_repo/data/input
cd ..
You should see output similar to the following. The generation may take a few minutes depending on your hardware.
Shell
Using CPython 3.11.11
Creating virtual environment at: .venv
Resolved 7 packages in 14ms
Installed 6 packages in 84ms
 + numpy==2.3.0
 + pandas==2.3.0
 + python-dateutil==2.9.0.post0
 + pytz==2025.2
 + six==1.17.0
 + tzdata==2025.2
loading data...
generating transaction level data...
0 of 1,000,000 (0%) complete
100,000 of 1,000,000 (10%) complete
200,000 of 1,000,000 (20%) complete
300,000 of 1,000,000 (30%) complete
400,000 of 1,000,000 (40%) complete
500,000 of 1,000,000 (50%) complete
600,000 of 1,000,000 (60%) complete
700,000 of 1,000,000 (70%) complete
800,000 of 1,000,000 (80%) complete
900,000 of 1,000,000 (90%) complete
Next, install and configure the MinIO Client (mc) if you haven’t already. Then, set up the alias and upload the datasets:
Shell
mc alias set minio-local http://localhost:9000 minio minio123
mc mb minio-local/mlpipeline
mc cp -r feature_engineering/feature_repo/data/input/ minio-local/mlpipeline/artifacts/feature_repo/data/
mc cp feature_engineering/feature_repo/feature_store.yaml minio-local/mlpipeline/artifacts/feature_repo/
This will create the required bucket and directory structure in MinIO and upload your raw datasets, making them available for the pipeline. Once the upload is complete, you can stop the port-forward process.
4. Install Model Registry, KServe, Spark Operator, and Set Policies
While the datasets are uploading to MinIO, you can proceed to install the remaining Kubeflow components and set up the required Kubernetes policies. The following steps summarize what’s in setup.sh:
Install Model Registry
Shell
kubectl apply -k "https://github.com/kubeflow/model-registry/manifests/kustomize/overlays/db?ref=v0.2.16"
Install KServe
Shell
kubectl create namespace kserve
kubectl config set-context --current --namespace=kserve
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash
kubectl config set-context --current --namespace=kubeflow
Install Kubeflow Spark Operator
Shell
helm repo add --force-update spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace
# Make sure the Spark Operator is watching all namespaces:
helm upgrade spark-operator spark-operator/spark-operator --set spark.jobNamespaces={} --namespace spark-operator
Apply Service Accounts, Roles, Secrets, and Serving Runtime
The manifests/ directory contains several YAML files that set up the necessary service accounts, permissions, secrets, and runtime configuration for both KServe and Spark jobs. Here’s what each file does:
- kserve-sa.yaml: Creates a service account for KServe, referencing the MinIO secret.
- kserve-minio-secret.yaml: Creates a secret with MinIO credentials and endpoint info, so KServe can access models and artifacts in MinIO.
- kserve-role.yaml: Defines a ClusterRole allowing management of KServe InferenceService resources.
- kserve-role-binding.yaml: Binds the above ClusterRole to the pipeline-runner service account in the kubeflow namespace, so pipeline steps can create/manage inference services.
- serving-runtime.yaml: Registers a custom ServingRuntime for ONNX models, specifying the container image and runtime configuration for model serving.
- spark-sa.yaml: Creates a service account for Spark jobs in the kubeflow namespace.
- spark-role.yaml: Defines a Role granting Spark jobs permissions to manage pods, configmaps, services, secrets, PVCs, and SparkApplication resources in the namespace.
- spark-role-binding.yaml: Binds the above Role to both the spark and pipeline-runner service accounts in the kubeflow namespace.
- kustomization.yaml: A Kustomize manifest that groups all the above resources for easy application.
Apply all of these with:
Shell
kubectl apply -k ./manifests -n kubeflow
These resources ensure that KServe and Spark jobs have the right permissions and configuration to run in your Kubeflow environment.
Building and Understanding the Pipeline Images
In Kubeflow pipelines, each step of a pipeline runs inside a container. This containerized approach provides several key benefits: isolation between steps, reproducible environments, and the ability to use different runtime requirements for different stages of your pipeline. While Kubeflow Pipelines provides default images for common tasks, most real-world ML projects require custom images tailored to their specific needs. Each pipeline component in this project uses a specialized container image that includes the necessary dependencies, libraries, and code to execute that particular step of the ML workflow. This section covers how to build these custom images. For detailed information about what each image does and how the code inside each container works, refer to the individual pipeline step sections that follow.
Note: You only need to build and push these images if you want to modify the code for any of the pipeline components. If you’re using the project as-is, you can use the prebuilt images referenced in the pipeline.
The pipeline uses custom container images for the following components:
Image Locations
- data_preparation/Containerfile
- feature_engineering/Containerfile
- pipeline/Containerfile
- rest_predictor/Containerfile
- train/Containerfile
How to Build
You can build each image using Podman or Docker. For example, to build the data preparation image:
Shell
cd data_preparation
podman build -t fraud-detection-e2e-demo-data-preparation:latest .
# or
# docker build -t fraud-detection-e2e-demo-data-preparation:latest .
You can also refer to the build_images.sh script in the project root to see how to build all images in sequence. Repeat this process for each component, adjusting the tag and directory as needed.
Entry Points
- data_preparation: python main.py
- feature_engineering: python feast_feature_engineering.py
- pipeline: Used for orchestrating the pipeline steps (see fraud-detection-e2e.py)
- rest_predictor: python predictor.py
- train: python train.py
Pushing Images
After building, push the images to a container registry accessible by your Kubernetes cluster. Update the image references in your pipeline as needed.
The Kubeflow Pipeline
The main pipeline definition is in pipeline/fraud-detection-e2e.py. This file is the entry point for the Kubeflow pipeline and orchestrates all the steps described below. With your environment and permissions set up, you’re ready to run the end-to-end pipeline. Let’s walk through each stage of the workflow and see how Kubeflow orchestrates the entire machine learning lifecycle — from data preparation to real-time inference.
1. Data Preparation With Spark
Apache Spark is a powerful open-source engine for large-scale data processing and analytics. In this project, we use Spark to efficiently process and transform raw transaction data before it enters the ML pipeline. To run Spark jobs on Kubernetes, we use the Kubeflow Spark Operator. The Spark Operator makes it easy to submit and manage Spark applications as native Kubernetes resources, enabling scalable, distributed data processing as part of your MLOps workflow.
Container Image for Data Preparation
This pipeline step uses a custom container image built from data_preparation/Containerfile.
The image includes:
- PySpark and dependencies: Required libraries for distributed data processing
- MinIO client libraries: For reading from and writing to object storage
- Custom data processing code: The main.py script that implements the data transformation logic
The container runs with the entry point python main.py, which orchestrates all the data preparation tasks within the Spark job. The pipeline begins by launching a Spark job that performs several key data preparation steps, implemented in data_preparation/main.py:
Combining Datasets
The job reads the raw train.csv, test.csv, and validate.csv datasets, adds a set column to each, and combines them:
Python
train_set = spark.read.csv(INPUT_DIR + "train.csv", header=True, inferSchema=True)
test_set = spark.read.csv(INPUT_DIR + "test.csv", header=True, inferSchema=True)
validate_set = spark.read.csv(INPUT_DIR + "validate.csv", header=True, inferSchema=True)

train_set = train_set.withColumn("set", lit("train"))
test_set = test_set.withColumn("set", lit("test"))
validate_set = validate_set.withColumn("set", lit("valid"))

all_sets = train_set.unionByName(test_set).unionByName(validate_set)
Type Conversion and Feature Engineering
It converts certain columns to boolean types and generates unique IDs:
Python
all_sets = all_sets.withColumn("fraud", col("fraud") == 1.0)
all_sets = all_sets.withColumn("repeat_retailer", col("repeat_retailer") == 1.0)
all_sets = all_sets.withColumn("used_chip", col("used_chip") == 1.0)
all_sets = all_sets.withColumn("used_pin_number", col("used_pin_number") == 1.0)
all_sets = all_sets.withColumn("online_order", col("online_order") == 1.0)

w = Window.orderBy(lit(1))
all_sets = (
    all_sets
    .withColumn("idx", row_number().over(w))
    .withColumn("user_id", concat(lit("user_"), col("idx") - lit(1)))
    .withColumn("transaction_id", concat(lit("txn_"), col("idx") - lit(1)))
    .drop("idx")
)
Timestamping
The job adds created and updated timestamp columns:
Python
for date_col in ["created", "updated"]:
    all_sets = all_sets.withColumn(date_col, current_timestamp())
Point-in-Time Feature Calculation
Using the raw transaction history, the Spark job calculates features such as the number of previous transactions, average/max/stddev of previous transaction amounts, and days since the last/first transaction.
Python
def calculate_point_in_time_features(label_dataset: DataFrame, transactions_df: DataFrame) -> DataFrame:
    # ... (see full code in data_preparation/main.py)
    # Aggregates and joins features for each user at each point in time
Output
The final processed data is saved as both a CSV (for entity definitions) and a Parquet file (for feature storage) in MinIO:
Python
entity_df.write.option("header", True).mode("overwrite").csv(entity_file_name)
df.write.mode("overwrite").parquet(parquet_file_name)
All of this logic is orchestrated by the prepare_data component in the pipeline, which launches the Spark job on Kubernetes.
2. Feature Engineering With Feast
Feast is an open-source feature store that enables you to manage and serve features for both training and inference, ensuring consistency and reducing the risk of training/serving skew. In machine learning, a “feature” is an individual measurable property or characteristic of the data being analyzed — in our fraud detection case, features include transaction amounts, distances from previous transactions, merchant types, and user behavior patterns that help the model distinguish between legitimate and fraudulent activity.
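To make the training/serving-consistency point concrete, here is a minimal, hypothetical sketch of pulling point-in-time-correct historical features from a Feast repository for training. The feature names mirror the FeatureView shown in the next section, but the repository path and entity rows are assumptions for illustration, not this project’s actual code.
Python
import pandas as pd
from feast import FeatureStore

# Assumed local repo path; the real project syncs its feature repo through MinIO
store = FeatureStore(repo_path="feature_engineering/feature_repo")

# Entity rows: which users, and "as of" which timestamps, we want features for
entity_df = pd.DataFrame(
    {
        "user_id": ["user_0", "user_1"],
        "event_timestamp": pd.to_datetime(["2025-01-01", "2025-01-01"], utc=True),
    }
)

# Point-in-time join: each row gets feature values as of its event_timestamp,
# which is what prevents training/serving skew
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "transactions:distance_from_home",
        "transactions:ratio_to_median_purchase_price",
    ],
).to_df()
print(training_df.head())
The same feature definitions later back the online store queried at inference time, so training and serving read from one source of truth.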
Container Image for Feature Engineering
This pipeline step uses a custom container image built from feature_engineering/Containerfile. The image includes:
- Feast feature store: The complete Feast installation for feature management
- Python dependencies: Required libraries for feature processing and materialization
- Feature repository definition: The repo_definition.py file that defines the feature views and entities
- MinIO client libraries: For uploading the materialized features and online store to object storage
The container runs with the entry point python feast_feature_engineering.py, which handles the Feast operations including applying feature definitions, materializing features, and uploading the results to MinIO.
After data preparation, the pipeline uses Feast to register, materialize, and store features for downstream steps. This process starts with defining the features you want to use. For example, in feature_repo/repo_definition.py, you’ll find a FeatureView that lists features like distance_from_home and ratio_to_median_purchase_price:
Python
transactions_fv = FeatureView(
    name="transactions",
    entities=[transaction],
    schema=[
        Field(name="user_id", dtype=feast.types.String),
        Field(name="distance_from_home", dtype=feast.types.Float32),
        Field(name="ratio_to_median_purchase_price", dtype=feast.types.Float32),
        # ... other features
    ],
    online=True,
    source=transaction_source,
)
Once the features are defined, the pipeline runs two key Feast commands. First, it applies the feature definitions to the store:
Python
subprocess.run(["feast", "apply"], cwd=feature_repo_path, check=True)
Then, it materializes the computed features from the Parquet file into Feast’s online store, making them available for real-time inference:
Python
subprocess.run(["feast", "materialize", start_date, end_date], cwd=feature_repo_path, check=True)
Finally, the resulting feature data and the online store database are uploaded to MinIO, so they’re accessible to the rest of the pipeline:
Python
client.fput_object(MINIO_BUCKET, object_path, local_file_path)
By using Feast in this way, you ensure that the same features are available for both model training and real-time predictions, making your ML workflow robust and reproducible.
3. Model Training
With the features materialized in Feast, the next step is to train the fraud detection model. The pipeline’s train_model component retrieves the processed features and prepares them for training. The features used include behavioral and transaction-based signals such as distance_from_last_transaction, ratio_to_median_purchase_price, used_chip, used_pin_number, and online_order.
Container Image for Model Training
This pipeline step uses a custom container image built from train/Containerfile. The image includes:
- Machine learning libraries: TensorFlow/Keras for neural network training, scikit-learn for data preprocessing
- ONNX Runtime: For converting and exporting the trained model to ONNX format
- PySpark: For loading and processing the feature data from Parquet files
- MinIO client libraries: For downloading features and uploading the trained model artifacts
The container runs with the entry point python train.py. The training script loads the features, splits the data into train, validation, and test sets, and scales the input features for better model performance:
Python
train_features = features.filter(features["set"] == "train")
validate_features = features.filter(features["set"] == "valid")
test_features = features.filter(features["set"] == "test")
# ... select and scale features ...
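The selection, scaling, and class-weight handling are elided above. Purely as a hypothetical illustration of those steps (this is not the project’s train.py, and it runs on synthetic data), a scikit-learn version might look like this:
Python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for the five model inputs and the heavily imbalanced fraud label
rng = np.random.default_rng(42)
x_train = rng.normal(size=(1000, 5))
y_train = (rng.random(1000) < 0.05).astype(int)  # roughly 5% fraud

# Fit the scaler on training data only, then reuse it for validation/test and at inference
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)

# "balanced" weights penalize mistakes on the rare fraud class more heavily
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)
class_weights = dict(enumerate(weights))
print(class_weights)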
It then builds and trains a neural network model using Keras, handling class imbalance and exporting the trained model in ONNX format for portable, high-performance inference.
Python
model = build_model(feature_indexes)
model.fit(x_train, y_train, epochs=2, validation_data=(x_val, y_val), class_weight=class_weights)
save_model(x_train, model, model_path)  # Exports to ONNX
By structuring the training step this way, the pipeline ensures that the model is trained on the same features that will be available at inference time, supporting a robust and reproducible MLOps workflow.
4. Model Registration
Once the model is trained, it’s important to track, version, and manage it before deploying to production. This is where the Kubeflow Model Registry comes in. The Model Registry acts as a centralized service for managing machine learning models and their metadata, making it easier to manage deployments, rollbacks, and audits.
Container Image for Model Registration
This pipeline step uses a custom container image built from pipeline/Containerfile. The image includes:
- Kubeflow Pipelines SDK: For pipeline orchestration and component definitions
- Model Registry client: Python libraries for interacting with the Kubeflow Model Registry
- Pipeline orchestration code: The core pipeline definition and component functions
The container is used as the base image for the register_model component, which executes the model registration logic inline within the pipeline definition. This approach allows the registration step to run lightweight operations without requiring a separate, specialized container image.
In the pipeline, the register_model component takes the trained model artifact and registers it in the Model Registry. This process includes:
- Assigning a unique name and version: The model is registered with a name (e.g., "fraud-detection") and a version, which is typically tied to the pipeline run ID for traceability.
- Storing metadata: Along with the model artifact, metadata such as the model format, storage location, and additional tags or descriptions can be stored for governance and reproducibility.
- Making the model discoverable: Registered models can be easily found and referenced for deployment, monitoring, or rollback.
Here’s how the registration step is implemented in the pipeline:
Python
@dsl.component(base_image=PIPELINE_IMAGE)
def register_model(model: Input[Model]) -> NamedTuple('outputs', model_name=str, model_version=str):
    from model_registry import ModelRegistry

    registry = ModelRegistry(
        server_address="http://model-registry-service.kubeflow.svc.cluster.local",
        port=8080,
        author="fraud-detection-e2e-pipeline",
        user_token="non-used",
        is_secure=False
    )
    model_name = "fraud-detection"
    model_version = ""
    registry.register_model(
        name=model_name,
        uri=model.uri,
        version=model_version,
        model_format_name="onnx",
        model_source_class="pipelinerun",
        model_source_group="fraud-detection",
        model_source_id="",
        model_source_kind="kfp",
        model_source_name="fraud-detection-e2e-pipeline",
    )
    return (model_name, model_version)
By registering the model in this way, you ensure that every model deployed for inference is discoverable, reproducible, and governed — an essential part of any production-grade MLOps workflow.
5. Real-Time Inference With KServe
The final stage of the pipeline is deploying the registered model as a real-time inference service using KServe. KServe is an open-source model serving platform for Kubernetes that standardizes how you deploy, scale, and manage machine learning models in production.
Container Image for Real-Time Inference
This pipeline step uses a custom container image built from rest_predictor/Containerfile. The image includes:
- KServe Python SDK: For building custom model serving endpoints
- ONNX Runtime: For running the trained model in ONNX format
- Feast feature store client: For retrieving real-time features during inference
- Model Registry client: For downloading the registered model artifacts
- Custom predictor code: The predictor.py script that implements the inference logic
The container runs with the entry point python predictor.py. The pipeline’s serve component creates a KServe InferenceService using this custom Python predictor. This is done by creating a Kubernetes custom resource (CR) of kind InferenceService, which tells KServe how to deploy and manage the model server. The resource specifies the container image, command, arguments, and service account to use for serving the model. Here’s how the InferenceService is defined and created in the pipeline:
Python
inference_service = kserve.V1beta1InferenceService(
    api_version=kserve.constants.KSERVE_GROUP + "/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(
        name=model_name + "-" + job_id,
        namespace=kserve.utils.get_default_target_namespace(),
        labels={
            "modelregistry/registered-model-id": model.id,
            "modelregistry/model-version-id": model_version.id
        },
    ),
    spec=kserve.V1beta1InferenceServiceSpec(
        predictor=kserve.V1beta1PredictorSpec(
            service_account_name="kserve-sa",
            containers=[
                V1Container(
                    name="inference-container",
                    image=rest_predictor_image,
                    command=["python", "predictor.py"],
                    args=["--model-name", model_name, "--model-version", model_version_name]
                )
            ]
        )
    ),
)

ks_client = kserve.KServeClient()
ks_client.create(inference_service)
The custom predictor does more than just run the model: it also integrates directly with the Feast online feature store. When a prediction request arrives with a user_id, the predictor first fetches the user’s latest features from Feast and then feeds them to the ONNX model for inference. Here’s a simplified view of the predictor’s logic:
Python
class ONNXModel(kserve.Model):
    def load(self):
        # ... download model and initialize Feast feature store ...
        self.feature_store = FeatureStore(repo_path=feature_repo_path)
        self.model = ort.InferenceSession("/app/model")
        self.ready = True

    async def predict(self, payload: Dict) -> Dict:
        user_id = payload.get("user_id")
        feature_dict = self.feature_store.get_online_features(
            entity_rows=[{"user_id": user_id}],
            features=features_to_request,
        ).to_dict()
        input_data = np.array([
            [
                feature_dict["distance_from_last_transaction"][0],
                feature_dict["ratio_to_median_purchase_price"][0],
                feature_dict["used_chip"][0],
                feature_dict["used_pin_number"][0],
                feature_dict["online_order"][0],
            ]
        ], dtype=np.float32)
        result = self.model.run(None, {self.model.get_inputs()[0].name: input_data})
Note: By default, KServe supports several model serving runtimes, including Triton Inference Server (often used via the kserve-tritonserver runtime). However, the official Triton server does not support macOS/arm64, which is why this project uses a custom Python predictor for local development and demonstration. If you are running on a supported platform (such as x86_64 Linux), you may want to use the kserve-tritonserver runtime for production workloads, as it offers high performance and native ONNX support. If you want to use Feast for online feature retrieval at inference time, a custom Python predictor (like the one in this repo) is the most straightforward approach.
If you use the standard kserve-tritonserver runtime, you would need to implement feature fetching as a Triton Python backend or as a pre-processing step outside of Triton, since Triton itself does not natively integrate with Feast. By structuring the inference step this way, the pipeline ensures that the deployed model always uses the freshest features for each prediction, supporting robust, real-time fraud detection.
Importing and Running the Pipeline
Once your environment is set up and the data is uploaded, you’re ready to run the pipeline.
Import the Pipeline
- Open the Kubeflow Pipelines UI (usually at http://localhost:8080 if you used the default port-forward).
- Click Pipelines in the sidebar, then click Upload pipeline.
- Upload the compiled pipeline YAML file (e.g., pipeline/fraud-detection-e2e.yaml).
Run the Pipeline
- After uploading, click on your pipeline in the list.
- Click Create run.
- Optionally customize the run name and description (the defaults work fine), then click Start.
You can monitor the progress and view logs for each step directly in the UI.
Testing the Live Endpoint
With the inference service running, you can now interact with your deployed model in real time. Let’s see how to send prediction requests and interpret the results. Before sending requests, port-forward the inference pod so the service is accessible locally. Run this command in a separate terminal window:
Shell
kubectl -n kubeflow get pods -l component=predictor -o jsonpath="{.items[*].metadata.name}" | tr ' ' '\n' | grep '^fraud-detection' | head -n1 | xargs -I {} kubectl port-forward -n kubeflow pod/{} 8081:8080
With the port-forward active, you can now send a request to the model:
Shell
curl -X POST http://localhost:8081/v1/models/onnx-model:predict \
  -H "Content-Type: application/json" \
  -d '{"user_id": "user_0"}'
The service retrieves features for user_0, runs a prediction, and returns the fraud probability.
Shell
{"user_id":"user_0","prediction":[[0.8173668384552002]]}
Note: The result of the prediction may vary depending on the initial raw data you uploaded. Try sending requests with a few different user_id values (e.g., "user_1", "user_2", etc.) to see how the predictions change.
Conclusion
This post has walked you through a complete, reproducible AI/ML workflow — from raw data to a live model serving endpoint — using Kubeflow and open-source tools. Along the way, you’ve seen how to prepare data with Spark, manage features with Feast, train and register models, and deploy real-time inference services with KServe, all orchestrated in a portable pipeline you can run on your own laptop. By following this blueprint, you can adapt and extend the process for your own machine learning projects, whether you’re working locally or scaling up to production. Kubeflow’s modular platform and ecosystem make it possible to manage the entire ML lifecycle in a consistent, automated, and open way. Ready to try it yourself? The complete source code for this project is available on GitHub.

By Helber Belmiro DZone Core CORE
Choosing Between GCP Workflows, AWS Step Functions, and Temporal for Stateful Workflow Orchestration

Stateful workflow orchestration tools help engineers reliably coordinate multi-step processes across services. Google Cloud Workflows (GCP Workflows) and AWS Step Functions are fully managed cloud services for defining workflows as a series of steps/states, whereas Temporal is an open-source orchestration engine that developers can self-host or use via a managed offering . All three aim to handle long-running, stateful sequences of tasks with built-in reliability. This article compares GCP Workflows, AWS Step Functions, and Temporal from a senior engineer’s perspective, focusing on developer usability and experience. We examine their workflow modeling approaches, error handling capabilities, observability, cost and scalability considerations, and deployment models. The goal is to help you choose the right tool for your use case in a vendor-neutral way. Workflow Modeling Approach (Visual vs. Code-Based) GCP Workflows and AWS Step Functions – Declarative Definitions: Both GCP Workflows and AWS Step Functions use a declarative syntax to model workflows, though in different flavors. AWS Step Functions expresses workflows in the Amazon States Language (JSON-based, with YAML support via tools). Each state machine (workflow) in Step Functions is defined by states and transitions in JSON, and you can design it visually using AWS’s Workflow Studio. Google Cloud Workflows uses its own YAML-based DSL (or JSON) to describe a sequence of steps that execute in order . In GCP’s YAML, steps implicitly flow to the next unless directed otherwise, similar to a coding style . Both systems support conditional branches (e.g. AWS Choice state, GCP switch statements) and loops, but you author them as config rather than writing general-purpose code. Example: A simple “Hello World” workflow illustrates the syntax difference. In AWS Step Functions JSON, every state must declare its Type and explicit Next state or end state. In GCP Workflows YAML, steps are listed sequentially with optional next labels. For instance, an AWS Step Functions definition might start like this in JSON: JSON { "StartAt": "Hello", "States": { "Hello": { "Type": "Pass", "Result": "Hello", "Next": "World" }, "World": { "Type": "Pass", "Result": "World", "End": true } } } Whereas the equivalent in GCP Workflows YAML is more concise: YAML main: steps: - Hello: next: World - World: return: "World" Both achieve the same result (passing “Hello” to “World”), but the YAML feels closer to scripting with implicit ordering, while the JSON is an explicit state machine. The developer experience here depends on preference: some engineers find YAML/JSON definitions straightforward for simple flows, especially with a visual editor. AWS’s console will render a state machine graph from the JSON, letting you visually design and trace executions. GCP’s console allows deploying and viewing executions, though its editing is primarily text-based (some community tools can visualize the YAML). These declarative approaches require a mindset of describing workflows declaratively and handling data via JSON paths or variables, which can be less familiar than writing code. The benefit is that the orchestrator handles the flow for you, and in the case of Step Functions, many service integrations are abstracted with minimal code (for example, invoking AWS services by referencing special ARN patterns) . GCP Workflows similarly provides built-in connectors to call Google Cloud APIs easily, treating them as steps without writing full HTTP calls . 
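To make the "definitions as config deployed through an API" point concrete, here is a hedged sketch that creates and starts the Hello World state machine above using boto3. The execution role ARN and account ID are placeholders, not real values, and the snippet assumes AWS credentials and a default region are already configured.
Python
import json
import boto3

sfn = boto3.client("stepfunctions")  # assumes configured AWS credentials and region

# The same Hello World definition shown above, expressed as a Python dict
definition = {
    "StartAt": "Hello",
    "States": {
        "Hello": {"Type": "Pass", "Result": "Hello", "Next": "World"},
        "World": {"Type": "Pass", "Result": "World", "End": True},
    },
}

# Placeholder role ARN; replace with an IAM role your account actually has
machine = sfn.create_state_machine(
    name="hello-world",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)

execution = sfn.start_execution(stateMachineArn=machine["stateMachineArn"])
print(execution["executionArn"])
The point is that the workflow itself is data: you deploy JSON and the managed engine runs it, rather than running any orchestration code of your own.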
Temporal – Code-First Workflows: In contrast, Temporal takes a programmatic approach. Developers write workflow definitions in a general-purpose programming language (Go, Java, Python, etc.) using Temporal’s SDK . A Temporal workflow is essentially a function annotated/registered as a workflow, and you orchestrate by calling activities (tasks) within code. This means your workflow logic lives in standard code with loops, conditionals, and function calls, rather than a JSON/YAML graph. For example, in a Java-based Temporal workflow you might write: Java // Inside a Temporal Workflow method: String data = activities.fetchData(); // call an activity try { String result = activities.processData(data); activities.sendResult(result); } catch (ActivityFailure e) { activities.handleFailure(data); throw e; } Here, the sequence and error handling are expressed with normal try/catch and method calls. Temporal’s engine will persist each step’s state behind the scenes, so even though it looks like regular code, it survives process crashes or restarts. The developer experience is akin to writing a standard service: you get compile-time type checking and can use IDE refactoring, unit tests, and debugging techniques on your workflow code . This can be a huge advantage for complex logic – as one engineer noted, large workflows can be easier to manage in code than in “10x lines of JSON” . However, code-first workflows require understanding Temporal’s rules (e.g. deterministic execution) and running worker processes. There’s no built-in graphical diagram of the flow, though Temporal provides a web UI to view workflow histories. In summary, visual vs. code-based boils down to preference and complexity. AWS Step Functions (and to an extent GCP Workflows) shine for visually orchestrating cloud services with minimal code, using declarative definitions and a console that highlights execution paths. Temporal shines for developers who prefer writing workflow logic as code for maximum flexibility and integration with normal development tools. Each approach impacts how you design and maintain workflows – for simpler service integrations, a visual/config model might suffice, whereas for intricate business logic with lots of conditional paths, Temporal’s code approach can offer more flexibility. Error Handling, Retries, and Compensation Logic Robust error handling is critical in workflow orchestration, and all three solutions provide mechanisms to handle failures and implement retry or compensation logic for reliability. In an ideal world, every task would succeed on the first try – in reality, tasks can fail or need retries, and workflows must gracefully recover or roll back as needed. AWS Step Functions: Step Functions has built-in primitives for error handling in its state definitions. Each state (especially Task states) can include a Retry field specifying retry policies (e.g. which error types to retry, how many times, exponential backoff) and a Catch field specifying fallbacks on failure . 
For example, you might define a state with: JSON "DoPayment": { "Type": "Task", "Resource": "<Lambda ARN>", "Retry": [{ "ErrorEquals": ["TransientError"], "IntervalSeconds": 5, "MaxAttempts": 3, "BackoffRate": 2.0 }], "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "PaymentFailed" }], "Next": "Shipping" } This means “try the DoPayment task up to 3 times on TransientError with exponential backoff, and if any error (States.ALL) still occurs, go to a PaymentFailed state.” Such definitions let you declaratively encode resilient behavior. Step Functions also supports compensation logic by chaining states – e.g., after a failure you might invoke specific “undo” Lambda functions in the Catch path. However, implementing a saga (transaction compensation across multiple steps) requires designing the state machine to track which steps succeeded and call the appropriate compensating actions, which can become complex. (Older versions of Step Functions made saga patterns tricky , though new features like dynamic parallelism and updated SDKs have made it more feasible.) GCP Workflows: Google Workflows likewise has structured error handling using a try/except block within the YAML. You can wrap one or multiple steps in a try clause and then provide an except clause to catch the error as a variable and execute recovery steps . For instance, a Workflow step can be written as: YAML - call_api: try: call: http.get args: { url: "https://example.com/api" } result: api_response except: as: e steps: - log_error: call: sys.log args: { message: ${"API failed: " + e.message} } - compensate: call: someCompensationTask - raise: ${e} # rethrow if needed This will attempt the HTTP call and, on any error, capture the error info in e, log it, perform a compensation step, and then rethrow the error (which could be caught by an outer scope or cause the workflow to fail). GCP Workflows also supports a retry attribute for steps or uses predefined retry policies that can be referenced for convenience . Both GCP and AWS let you control retry intervals, max attempts, and backoff. A nice feature in GCP Workflows is the ability to define a named retry policy (with rules for max attempts, etc.) and reuse it across steps , avoiding repetition. In practice, GCP and AWS offer very similar error-handling capabilities – they ensure that if a task fails, you can catch the exception, retry it a certain way, or route the workflow to alternate steps to handle the failure . These mechanisms provide the “plumbing” for building reliable workflows. Temporal: Temporal’s philosophy is that error handling is part of your code, enriched by built-in reliability from the platform. Each Temporal activity invocation can have an automatic retry policy configured (e.g. retry 5 times with exponential backoff) via the SDK, which Temporal will execute for you if the activity fails. In your workflow code, you can also use standard try/catch logic to implement custom compensation. For example, it’s straightforward to implement the Saga pattern: after each successful activity, register a compensating activity to be executed later if needed . The Temporal Java SDK even provides a Saga class utility – you can add compensation callbacks as you go, and if an error occurs, call saga.compensate() to run all the accumulated compensations in reverse order . This is powerful because your compensation logic can be as dynamic as needed (e.g. only compensate the steps that actually ran). 
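For comparison with the Java Saga helper described above, here is a rough, hypothetical sketch of manual retry and compensation logic using the Temporal Python SDK. The activity names, argument, and timeouts are invented for illustration; it is not a drop-in implementation of the Saga class.
Python
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        compensations = []  # (activity_name, arg) pairs for steps that completed
        try:
            await workflow.execute_activity(
                "charge_card", order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3),  # automatic retries with backoff
            )
            compensations.append(("refund_card", order_id))
            await workflow.execute_activity(
                "ship_order", order_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
        except Exception:
            # Only compensate the steps that actually ran, in reverse order
            for name, arg in reversed(compensations):
                await workflow.execute_activity(
                    name, arg, start_to_close_timeout=timedelta(seconds=30)
                )
            raise
        return "completed"
Because the compensation list is built up as the workflow progresses, the rollback path adapts automatically to however far the workflow got before failing.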
Temporal’s approach essentially gives you full flexibility: since you’re writing code, you can handle errors with any logic you want – retries can be done manually or via annotations, different exceptions can trigger different code paths, etc. Moreover, Temporal guarantees exactly-once execution of each activity or marks workflows as failed, so you don’t get partial ambiguity – you either handle the failure in code or let the workflow fail and possibly fix forward. One thing to note: Step Functions and GCP Workflows also guarantee at-least-once or exactly-once execution semantics within their domain , but as a developer you configure those via the JSON/YAML. With Temporal, these guarantees are built-in and you handle the “what to do on failure” in code. In summary, all three platforms support robust error handling and retries. AWS Step Functions and GCP Workflows provide declarative knobs for retries and catch/fallback states, making it easy to configure common policies (with Step Functions going to “insane lengths” internally to ensure no step is silently dropped on error ). Temporal, being code-driven, offers more flexibility for complex compensation logic and fine-grained error handling – developers can utilize familiar try/catch patterns and even update running workflows or roll back state programmatically, which is harder or not possible in the managed services (you can’t modify a running Step Functions execution) . Choosing between them may depend on how complex your failure recovery needs are: simple retries and catches are equally covered in all, but if you anticipate elaborate rollback sequences or dynamic error responses, implementing those in Temporal might be more straightforward in code. Observability and Monitoring Runtime Visibility is a key part of the developer experience for workflows – you need to know what’s happening inside your workflows, where failures occur, how long steps take, etc. The three platforms approach observability in different ways, but all aim to provide insight into workflow executions. AWS Step Functions: One of Step Functions’ strengths is its visual execution tracing in the AWS Console. When you run a state machine, the console shows a graphical diagram of the workflow with each state highlighted as it executes or fails, and you can click states to see input/output data and error details. This makes debugging intuitive: you literally see which step failed (colored red) in the context of the whole flow. Under the hood, Step Functions emits execution history events (state entered, succeeded, failed, etc.) that you can also get via APIs or CloudWatch Logs if you enable logging. AWS provides CloudWatch metrics for Step Functions executions as well – for example, metrics like ExecutionsStarted, ExecutionsSucceeded, ExecutionsFailed, and execution time are published automatically . You can set CloudWatch alarms on these (e.g. alert if a workflow fails). Each state machine can be configured to send its detailed execution history to CloudWatch Logs, which allows aggregating logs or tracing issues across many runs. In practice, the combination of the AWS Console’s live visualization and CloudWatch metrics/logs gives a pretty robust observability toolkit. Many third-party tools (Datadog, Dynatrace, etc.) also integrate with Step Functions metrics . 
For distributed tracing, Step Functions doesn’t natively integrate with AWS X-Ray for the state machine itself (as of now), but the services it calls (like Lambda) can be traced, so you might need to correlate that manually.
GCP Workflows: Google Cloud Workflows similarly integrates with Google’s Cloud Operations suite (formerly Stackdriver). By default, Workflows will send execution logs to Cloud Logging and metrics to Cloud Monitoring. Every workflow execution’s steps and results can be viewed in the GCP console; you can inspect the history of execution steps to see inputs/outputs for each step. Cloud Monitoring provides metrics such as number of executions, execution latencies, and errors. In fact, GCP Workflows defines a set of metric types (e.g., step counts, callback counts, etc.) that are recorded per workflow. Developers can set up dashboards or alerts on these metrics just like they would for any Google Cloud service. For logging, each step’s execution (and any explicit sys.log calls in the workflow) can be found in Cloud Logging, making it straightforward to troubleshoot failed runs by reading the error stack or custom logs. While GCP’s UI might not have the same graphical state machine diagram as AWS, it lists the steps in sequence with their status. In short, observability is baked in: you don’t have to do much to get basic monitoring of workflows – the platform will record logs and metrics. This is valuable for understanding performance (e.g., which step is slow) and reliability (error rates, etc.).
Temporal: Observability in Temporal is more developer-driven, but quite powerful. Since Temporal workflows run in your application environment, you have the freedom to instrument them with standard tools. Temporal itself provides a few layers of visibility:
- Workflow History: Every Temporal workflow has a complete event history (stored by the Temporal server) that you can query. Temporal’s Web UI exposes this – you can see each event (started, each activity scheduled/completed, any timer or signal events) and drill into details. This is great for debugging a specific execution, albeit more low-level than a high-level flowchart.
- Logging: You control logging within activities and workflows (using your language’s logging framework). Temporal ensures logs can include workflow identifiers, etc., so you can trace logs for a given execution. There’s also support in some SDKs to tag logs or send them to external systems.
- Metrics & Tracing: Temporal exposes metrics from the server (and clients) that can be collected by Prometheus or other monitoring systems. These include task queue latency, workflow execution counts, etc. Additionally, you can instrument workflows with OpenTelemetry for distributed tracing if you want to trace through workflows and the services they call. In essence, you have to set up the observability plumbing (since you’re hosting Temporal or using its cloud, you would connect it to your monitoring solution), but all the hooks are there. Temporal Cloud, for example, offers built-in metrics and a UI for some monitoring out of the box.
- Testing and Debugging: A big plus for Temporal’s developer experience is that you can unit test workflows in code (using the Temporal testing libraries) and even step through them in a debugger (with some limitations) as if they were regular code; a minimal sketch follows this list. This is not “monitoring” in production, but it greatly improves confidence and debuggability during development.
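As a rough illustration of that last point, the following is a minimal, hypothetical unit test using the Temporal Python SDK's time-skipping test environment (the workflow, task queue name, and IDs are made up; the Java and Go SDKs offer comparable test frameworks):
Python
import asyncio

from temporalio import workflow
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

@workflow.defn
class GreetWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        return f"Hello, {name}!"

async def main():
    # Spins up an in-memory test server; time-based waits are skipped automatically
    env = await WorkflowEnvironment.start_time_skipping()
    try:
        async with Worker(env.client, task_queue="test-queue", workflows=[GreetWorkflow]):
            result = await env.client.execute_workflow(
                GreetWorkflow.run, "Temporal", id="greet-1", task_queue="test-queue"
            )
            assert result == "Hello, Temporal!"
    finally:
        await env.shutdown()

asyncio.run(main())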
Overall, AWS and GCP’s managed services provide turnkey observability – just use the platform and inspect the console or CloudWatch/Cloud Logging for insights. This is convenient and requires little effort to set up. Temporal provides a high degree of insight as well, but it relies on you to leverage its tools and integrate with your own monitoring stack (which a seasoned team might prefer, as they can plug in existing observability systems). Temporal’s advantage is the ability to do deep debugging (replaying workflow execution, for instance) and fine-grained monitoring, since you can emit custom metrics or logs from your workflow code easily. When choosing, consider how important a built-in visual trace is (AWS Step Functions excels there), and whether you’re comfortable setting up observability for a self-run system (Temporal) versus using what the cloud provides out of the box.
Deployment Model (Managed Service vs. Self-Hosted)
The deployment model is a fundamental difference among these tools. AWS Step Functions and GCP Workflows are fully managed services provided by AWS and Google Cloud respectively. There are no servers for you to manage, no upgrades or patches – you simply use the service via API/console, and the cloud provider runs the underlying orchestration engine in a highly available way. This offers great convenience: your focus is on developing the workflow definitions and code for tasks (like Lambda functions or Cloud Functions) and not on the reliability of the orchestrator itself. Both AWS and Google automatically handle redundancy, failover, and scaling of the workflow engine. The trade-off is that you are tied to that cloud environment. Step Functions is accessible only in AWS regions and operates with AWS identities (IAM); GCP Workflows likewise runs only in GCP projects using GCP IAM.
Temporal, by contrast, started as a self-hosted solution (an evolution of Uber’s Cadence project). Using Temporal typically means you will deploy the Temporal server (which includes multiple microservices and a database) in your environment – be it on-premises, on your own cloud infrastructure, or even on your laptop for dev. This gives you a lot of control: you can run workflows on any cloud or hybrid environment, and integrate with services across boundaries. It’s open-source, so you’re not beholden to a single vendor’s cloud. The obvious downside is operational overhead. You need to ensure Temporal servers are running, durable storage is configured (Temporal relies on a persistent store like Cassandra or MySQL/Postgres), and you handle scaling and updates. Teams choosing Temporal often do so because they need the flexibility (or want to avoid cloud lock-in) and are willing to invest in operating it, or because they require workflows that run in their own data center or across multiple clouds. There is now Temporal Cloud, a managed service by Temporal Technologies, which attempts to offer the best of both worlds (Temporal’s capabilities delivered as a hosted service). That can eliminate most ops burden, but it introduces a new vendor relationship (Temporal as the provider) rather than your primary cloud provider.
When deciding, consider these points:
- Integration with existing infrastructure: If your stack is already all-in on AWS, using Step Functions is a natural fit – it works seamlessly with AWS security (IAM roles) and other services. Similarly, GCP Workflows fits neatly into GCP’s ecosystem. If you have a multi-cloud strategy or on-prem requirements, Temporal stands out because you can run it anywhere and even orchestrate across clouds. For example, Temporal can call AWS and GCP services from the same workflow (since you write the code to do so), whereas Step Functions can call AWS services easily but would need something like an HTTP task to call outside AWS (and GCP Workflows can call AWS APIs via HTTP but lacks direct AWS integrations).
- Maintenance and Reliability: Using the managed services means AWS or Google is responsible for uptime of the orchestrator. These services come with high SLAs and you don’t need to worry about patching bugs in the workflow engine. Temporal gives you more responsibility here – you’ll have to upgrade it periodically and monitor the health of the Temporal cluster. That said, Temporal is built to be fault-tolerant (e.g., if a worker crashes, workflows don’t lose state).
- Serverless vs. Not: Step Functions and GCP Workflows are often described as serverless orchestration – you don’t think about servers at all, and they can scale from zero to whatever is needed. Temporal (self-hosted) isn’t serverless in the operational sense; it’s more like running a distributed system (although the workflows you write are “serverless” from the application perspective). If you want a purely serverless architecture where you write minimal infrastructure code, the cloud services have an edge. For example, an AWS developer can string together Lambdas with Step Functions and never manage a single VM or container. In contrast, adopting Temporal would introduce at least a few persistent services to run (unless using Temporal’s own cloud service).
In a comment comparing the two approaches, a Temporal founder noted: “Step Functions are hosted…and great for some use cases and can be a huge cost and performance burden for others. If you want more control over your application deployment due to security or other reasons, the open source nature of Temporal is very valuable.” This captures it well: fully managed is great for agility and simplicity, but self-hosted can be valuable for control, customization, and potentially performance/cost optimization in specific scenarios.
To sum up, if you prefer not to run your own infrastructure and are comfortable with the cloud vendor tie-in, GCP Workflows or AWS Step Functions provide a hassle-free deployment model – just an API endpoint in your cloud account. They are the natural choice for cloud-centric deployments and fast onboarding. If your organization demands an on-prem solution, needs to orchestrate across various environments, or wants the flexibility of an open-source tool that you can extend, Temporal is attractive – just be prepared to handle its deployment or use its specialized cloud service. It’s a strategic decision: convenience and managed stability versus control and flexibility.
Conclusion: Choosing the Right Tool
When choosing between GCP Workflows, AWS Step Functions, and Temporal, there is no one-size-fits-all answer – each tool shines in different scenarios. The decision should hinge on your specific use case, environment, and priorities in developer experience:
- If you are deeply invested in a single-cloud ecosystem, the respective managed service is often the path of least resistance. On AWS, Step Functions integrates effortlessly with AWS Lambda and other AWS services (no need to reinvent integration or authentication).
On GCP, Workflows makes it easy to orchestrate Cloud Functions, Cloud Run, and Google APIs with minimal boilerplate. These services are fully managed, meaning you get scalability, reliability, and security out of the box with zero infrastructure to maintain. They are excellent for event-driven workflows that glue together cloud services – e.g., processing files when they land in storage, orchestrating a data pipeline, or managing long-running transactions using cloud functions and services. The visual nature of AWS Step Functions is a plus for quickly understanding and communicating workflow logic. GCP Workflows, while not visual, keeps things simple and readable in YAML. If your workflows are not extremely complex (say, under a few hundred steps with moderate logic), and you want quick development with managed reliability, these are strong choices. Cost-wise, expect a pay-per-use model: great for sporadic workloads (you pay pennies for a few runs) but consider costs if you plan millions of invocations (though even then, the overhead per execution is small, and designing workflows to do more work in each step can mitigate costs).
- If your workflows are highly complex, long-running (months+), or span multiple cloud environments or on-prem systems, Temporal might be the better fit. Temporal provides a developer-friendly programming model for workflows that can handle virtually any requirement since you’re writing code (loops, complex if/else logic, dynamic parallelism, human-in-the-loop waits, etc.). It shines for scenarios where business logic doesn’t map cleanly to a state machine diagram or when you need the utmost flexibility in error handling and state management. Temporal has been used for things like migrating thousands of servers with many failure contingencies, coordinating multi-step sagas in microservice architectures, and other “complex, long-running, multi-cloud, or hybrid workflows.” In these cases, the ability to update workflows, use versioned code, and test them like regular software is invaluable. You should also consider Temporal if vendor lock-in is a concern – it’s essentially infrastructure you control. The trade-off is the operational overhead and learning curve. Your team should be ready to run (or purchase) the Temporal service and handle its integration. Once set up, it can replace a lot of home-grown workflow logic and even substitute simpler cases that Step Functions or Workflows handle – but it especially pays off as complexity grows. As one community recommendation puts it: use Step Functions for simpler workflows tightly integrated with AWS services, and use Temporal for the more complex, long-running processes that need ultimate flexibility.
In terms of developer experience, ask what your team is more comfortable with: writing JSON/YAML and using a visual orchestrator, or writing workflow code in a programming language. Each approach has its fans. If your team values quick iteration with code and the ability to leverage IDE tools, Temporal will feel natural. If they prefer configuration-driven approaches or are already used to CloudFormation/IaC-style development, Step Functions or GCP Workflows will seem straightforward. Finally, consider maintenance and ecosystem. If you want the least ops burden and are fine with what the cloud offers, the managed services are hard to beat. If you need an open solution with no per-use fees (and perhaps you have the platform engineering capacity to run it), Temporal offers a compelling, proven technology.
All three options are powerful and capable. Often, it’s not about which is “better” overall, but which is a better fit for your constraints:
- Choose AWS Step Functions if you’re in AWS and need a reliable workflow engine with rich service integrations and a user-friendly console – especially for orchestrating serverless apps and AWS services with minimal custom code.
- Choose GCP Workflows if you’re on Google Cloud and want a straightforward, serverless orchestrator for Google Cloud services and APIs, with a simple YAML syntax and no infrastructure fuss.
- Choose Temporal if your workflows demand the full expressiveness of code, need to run outside a single cloud, or require features like state persistence with custom business logic, and you’re prepared to manage (or pay for) the infrastructure. It gives you ultimate control to build “invincible” long-running workflows in software.
By evaluating your use case against these factors – modeling approach, error handling needs, observability expectations, cost profile, and deployment preferences – you can confidently select the workflow orchestration tool that will serve your engineering team best. Each of these technologies has a successful track record; the key is aligning their strengths to your project’s requirements. With the right choice, you’ll be able to build reliable, maintainable, and scalable workflow automation that makes complex processes easier for developers to manage and for systems to execute.

By Ravi Teja Thutari DZone Core CORE
The Developer's Guide to Cloud Security Career Opportunities

Your organization's entire infrastructure moved to the cloud last year, but your security team is still thinking like it's 2015. They're applying traditional network security controls to cloud environments, creating bottlenecks that slow down your deployments and leave massive security gaps. Meanwhile, you're getting blamed when security incidents happen, even though you never had input on the security architecture in the first place. If this sounds familiar, you're not alone. The cloud security skills gap is creating unprecedented opportunities for developers who understand both sides of the equation. Organizations desperately need professionals who can code secure applications AND understand cloud infrastructure security. The question isn't whether you should consider cloud security — it's how quickly you can position yourself to take advantage of these opportunities. Why Developers Are Perfectly Positioned for Cloud Security Traditional security professionals often struggle with cloud environments because everything is software-defined. Network configurations, access controls, monitoring systems — they're all managed through code. This is where your development background becomes invaluable. When you're already comfortable with APIs, infrastructure as code, and automated deployments, learning cloud security becomes a natural extension of your existing skillset. You understand how applications actually work, which gives you insights that traditional security teams often miss. You know that securing the build pipeline is just as important as securing the runtime environment. More importantly, you understand the business pressure to ship features quickly. Security solutions that slow down development or create friction get ignored. Your dual perspective allows you to design security controls that actually get adopted by development teams instead of being circumvented. The Career Paths That Are Opening Up Cloud security isn't just one job — it's an entire ecosystem of specializations that didn't exist five years ago. Each path offers different opportunities depending on your interests and current skill set. Cloud Security Engineer roles focus on designing and implementing security controls across cloud infrastructure. You'd work with services like AWS Config, Azure Security Center, and Google Cloud Security Command Center to build automated security monitoring and compliance systems. These positions typically pay US$86,144 - US$101,705 and require understanding both cloud platforms and security principles. DevSecOps Engineer positions blend development, operations, and security responsibilities. You'd integrate security testing into CI/CD pipelines, automate vulnerability scanning, and build security guardrails that prevent developers from deploying insecure code. The automation aspect makes this particularly appealing for developers who enjoy building tools and systems. Cloud Security Architect roles involve designing security solutions for large-scale cloud environments. You'd make strategic decisions about identity management, network security, and compliance frameworks. These positions command around US$191,850 a year and require deep technical knowledge combined with business acumen. Application Security Engineer positions focus specifically on securing cloud-native applications. You'd work with container security, serverless security, and API security. Your development background gives you a huge advantage here because you understand how applications are built and deployed. 
The Technical Skills That Matter Most

The cloud security landscape prioritizes practical skills over theoretical knowledge. Organizations need people who can implement solutions, not just identify problems. Here's what actually matters in the current market:

Infrastructure as Code proficiency is essential. Whether it's Terraform, CloudFormation, or ARM templates, you need to understand how to define security controls in code. This includes implementing security groups, IAM policies, and network configurations that can be version-controlled and automated.

Container and Kubernetes security knowledge is increasingly critical. Most cloud applications run in containers, and securing containerized workloads requires understanding image scanning, runtime security, and network policies. Tools like Twistlock, Aqua Security, and Falco are becoming standard requirements.

Cloud-native monitoring and incident response capabilities distinguish experienced practitioners from newcomers. You need to understand how to use tools like AWS CloudTrail, Azure Monitor, and Google Cloud Logging to detect and respond to security incidents in real time.

Identity and Access Management expertise is fundamental across all cloud platforms. This includes understanding how to implement the principle of least privilege, manage service accounts, and integrate with external identity providers. Single sign-on, multi-factor authentication, and privileged access management are core competencies (a small audit sketch follows the certification list below).

Strategic Certifications That Accelerate Your Career

While experience matters more than certifications, the right credentials can open doors and validate your expertise to hiring managers. Focus on certifications that demonstrate practical skills rather than theoretical knowledge.

• AWS Certified Security - Specialty is the gold standard for AWS environments. It covers data protection, logging and monitoring, infrastructure security, and incident response. The exam requires hands-on experience with AWS security services, making it valuable for demonstrating practical capability.
• Certified Cloud Security Professional (CCSP) provides broad coverage of cloud security concepts across multiple platforms. It's vendor-neutral and covers cloud architecture, design, operations, and legal considerations. Many organizations prefer candidates with CCSP because it demonstrates platform-agnostic security knowledge.
• Google Cloud Professional Cloud Security Engineer is particularly valuable as more organizations adopt multi-cloud strategies. It covers Google Cloud-specific security services and demonstrates expertise in a rapidly growing platform.

As John Berti, who helped create the CCSP certification for ISC2 and is co-founder at Destination Certification, explains, "The key to advancing in cloud security isn't collecting every possible certification — it's developing deep practical skills that solve real business problems. It's important to know that certifications aren't golden tickets to career advancement. Organizations value professionals who can implement security solutions that actually work in production environments."
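To make the IAM competency above concrete, here is a minimal audit sketch using boto3 (the 90-day key-age threshold is an assumption for illustration). It flags users without an MFA device and active access keys past the rotation window, and it needs only read-only IAM permissions.

Python
import datetime

import boto3

MAX_KEY_AGE_DAYS = 90  # assumed rotation policy for illustration

iam = boto3.client("iam")
now = datetime.datetime.now(datetime.timezone.utc)
findings = []

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        name = user["UserName"]

        # Flag users with no MFA device registered.
        if not iam.list_mfa_devices(UserName=name)["MFADevices"]:
            findings.append(f"{name}: no MFA device registered")

        # Flag active access keys older than the rotation window.
        for key in iam.list_access_keys(UserName=name)["AccessKeyMetadata"]:
            age_days = (now - key["CreateDate"]).days
            if key["Status"] == "Active" and age_days > MAX_KEY_AGE_DAYS:
                findings.append(
                    f"{name}: access key {key['AccessKeyId']} is {age_days} days old"
                )

for finding in findings:
    print(finding)

Small scripts like this are also a low-risk way to volunteer for security work inside your current team.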
The Market Reality You Need to Understand

The demand for cloud security professionals is growing faster than the supply of qualified candidates. According to ISC2's 2024 Cybersecurity Workforce Study, the global cybersecurity workforce gap is 4 million professionals, with cloud security being one of the most acute shortages.

This shortage is creating opportunities for career changers who might not have traditional security backgrounds. Organizations are willing to hire developers with cloud experience and train them in security principles. The key is demonstrating that you understand both the technical and business aspects of security.

Compensation reflects this market reality. Entry-level cloud security positions typically start around $90,000 to $120,000, while experienced practitioners can earn $200,000 or more in major markets. Remote work opportunities are abundant because organizations compete nationally for talent.

Building Your Skills While You Work

You don't need to quit your current job to transition into cloud security. Start by improving the security of your current applications and infrastructure. Implement automated security testing in your CI/CD pipelines. Learn to use cloud security tools in your development environment.

Volunteer for security-related projects within your organization. When security incidents occur, get involved in the response process. This gives you practical experience and demonstrates your interest to management.

Contribute to open-source security projects. Many cloud security tools are open source, and contributing to projects like Falco, Open Policy Agent, or Terraform security modules builds your reputation and network.

The Path Forward

Cloud security represents one of the fastest-growing career paths in technology. Your development background gives you unique advantages in a field that's increasingly defined by automation and code-driven solutions.

The organizations that succeed in cloud security will be those that integrate security into their development processes rather than treating it as an afterthought. As a developer with security knowledge, you can be the bridge between these traditionally separate worlds.

The question isn't whether cloud security is a good career choice — it's whether you're ready to take advantage of the opportunities available right now. The market is moving fast, and the best positions are going to professionals who can demonstrate both technical skills and business understanding.

Start building your cloud security skills today. Your future self will thank you for positioning yourself in one of the most in-demand specializations in technology.

By Philip Piletic DZone Core
Disaster Recovery Risks and Solutions

Understanding Disaster Recovery in Data Management

Disaster recovery (DR) is a structured plan designed to restore critical systems, applications, and data in the event of disruptions. For data analysts, DR is the difference between seamless access to information and complete analytical paralysis. When data disappears or becomes corrupted, decision-making halts, reports become unreliable, and entire strategies can crumble. Here’s what can go wrong:

• Server crashes – A hardware failure wipes out key datasets.
• Data corruption – Errors in storage or transmission lead to unusable data.
• Cyberattacks – Ransomware locks analysts out of critical files.
• Natural disasters – Floods, fires, or earthquakes destroy physical data centers.

A well-executed disaster recovery strategy ensures that even in the worst-case scenario, analysts can still access the data they need, maintain data integrity, and keep workflows moving. No scrambling, no lost insights — just business as usual.

Risks of Not Having a Disaster Recovery Plan

When something goes wrong without a plan in place, the consequences pile up quickly. Let’s take a look at some of them:

• Downtime disrupts analysis. When systems go down, analysts are left in the dark. Without access to real-time and historical data, reporting stalls, forecasts become unreliable, and executives are forced to make decisions based on guesswork.
• Lost data, lost insights. A single outage can erase months or years of valuable historical trends. Without that context, analysts can’t spot patterns, fine-tune strategies, or validate business assumptions.
• Regulatory compliance at risk. Many industries require strict data protection measures. Failure to recover lost data can lead to GDPR, HIPAA, or CCPA violations, which can result in fines, legal issues, and loss of customer trust.
• Reputation on the line. A data failure isn’t just an internal issue. Clients, partners, and stakeholders expect reliability. If reports are delayed, errors occur, or data is lost, confidence in the business weakens, sometimes permanently.

“Today’s sophisticated cyber threats specifically target backup systems before primary data, rendering traditional disaster recovery approaches dangerously inadequate,” according to Alex Lekander, Owner and Editor in Chief at Cyber Insider. “Your disaster recovery strategy isn’t merely about business continuity. It’s now a critical component of your overall security posture.”

Overall, having a disaster recovery plan doesn’t mean avoiding problems; it means preventing them from turning into long-term setbacks.

Disaster Recovery Solutions for Data Analysts

When systems fail, the whole decision-making engine of the business is affected. Data analysts are at the heart of this engine, and a solid DRaaS solution ensures that the essential data required for critical decisions is always accessible, no matter the obstacles. Implementing a comprehensive disaster recovery and backup solution can significantly enhance your organization’s resilience. Here’s what a top-tier disaster recovery strategy must include to ensure no vital insight is left behind.

Identifying critical data and workflows

Not all data is mission-critical. Pinpoint the datasets, tools, and workflows that drive decisions, so recovery efforts focus on what truly matters. If a disruption happens, teams shouldn’t waste time restoring irrelevant files while essential data remains inaccessible. Understanding system dependencies is just as crucial — when one piece fails, you need to know what else is at risk.

Defining recovery objectives

Establishing clear recovery point objectives (RPO) and recovery time objectives (RTO) prevents guesswork during a crisis:

• RPO determines how much data loss is acceptable before it impacts operations.
• RTO sets the maximum downtime allowed before recovery must be completed.

For example, an RPO of 15 minutes means backups or replication must capture changes at least every 15 minutes, while an RTO of one hour means failover and validation have to finish within 60 minutes. A simple automated check against these targets is sketched below.
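To make the RPO side of that concrete, here is a minimal freshness check, assuming backups land in an S3 bucket under a known prefix (both names are placeholders). It compares the age of the newest backup object against the RPO and exits non-zero so a scheduler or monitoring job can raise an alert.

Python
import datetime
import sys

import boto3

# Placeholder values for illustration; point these at your own backup location.
BUCKET = "analytics-backups"
PREFIX = "warehouse/daily/"
RPO = datetime.timedelta(minutes=15)

s3 = boto3.client("s3")


def latest_backup_age():
    """Return the age of the most recent object under the backup prefix."""
    newest = None
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest:
                newest = obj["LastModified"]
    if newest is None:
        raise RuntimeError("no backups found; treat this as an incident")
    return datetime.datetime.now(datetime.timezone.utc) - newest


if __name__ == "__main__":
    age = latest_backup_age()
    if age > RPO:
        print(f"RPO breach: newest backup is {age} old (target {RPO})")
        sys.exit(1)
    print(f"OK: newest backup is {age} old")

The RTO side can be verified the same way in spirit: time a scripted restore into a scratch environment during DR drills and compare it against the target.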
Implementing automated and secure backups

Backups should be frequent, encrypted, and automatic — no manual work, no human error. On-premises backups offer fast restores, while cloud copies provide an extra layer of security. Geo-redundancy prevents a single point of failure, and AI-driven anomaly detection spots corruption or cyber threats before they spread. (A configuration sketch appears just before the conclusion.)

Enabling real-time data replication

Backups are essential, but real-time replication keeps downtime near zero. When primary systems fail, replicated data takes over instantly, preventing business disruptions. Compression and deduplication optimize replication speed without overloading network resources. Hybrid cloud replication ensures resilience beyond on-premises infrastructure, giving businesses the flexibility to recover wherever and whenever needed.

Securing analyst access

Data recovery is useless if analysts can’t retrieve what they need. Multi-factor authentication (MFA) and role-based access control (RBAC) restrict entry to authorized users only. Virtual desktops or secure VPNs enable remote work without exposing sensitive data. Every access attempt should be logged and monitored to detect suspicious activity before it turns into a full-blown security breach.

Testing, monitoring, and adapting

A “set and forget” approach doesn’t work for a DR plan. Regular testing ensures systems recover as expected. Disaster drills help teams practice real-world recovery scenarios, while automated compliance checks keep businesses audit-ready with minimal effort. After every incident, analyze what went wrong, update the strategy, and stay ahead of future threats.

Disaster Recovery Best Practices

It’s worth remembering that data analysts aren’t just passive users in disaster recovery. They play a crucial role in ensuring data remains accessible and actionable when disruptions arise. Beyond relying on IT teams, analysts must take proactive steps to safeguard their workflows and minimize downtime. Key actions include:

• Aligning with IT teams to ensure DR plans consider analytical workflows. Generic disaster recovery plans often overlook analytics. Analysts must ensure critical BI tools, data pipelines, and external dependencies are prioritized in recovery strategies. Without this, restored systems may lack key data sources, delaying insights.
• Tracking backup frequency and prioritizing crucial datasets. Real-time dashboards, compliance reports, and financial models need frequent, geo-redundant backups. Historical archives can follow a relaxed schedule, but all backups must include raw data, processed outputs, and reports to prevent workflow gaps.
• Undergoing DR training to navigate recovery tools efficiently. Analysts must know how to retrieve lost data without waiting for IT. Learning how to use recovery tools, versioning systems, and cloud failover ensures quick, independent restoration. In addition, regular DR drills reinforce readiness.
• Regularly reviewing DR plans to keep them relevant. New tools, cloud migrations, and evolving regulations require ongoing DR updates. Analysts should audit backups, test recovery scenarios, and work with IT to close gaps before disaster strikes.

Outcome: When analysts take ownership of disaster recovery best practices, they reduce downtime, maintain analysis continuity, and prevent costly data loss.
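Several of the items above, particularly automated encrypted backups and tracking backup frequency, map directly to a few storage settings. As a minimal sketch, assuming the backups live in an S3 bucket (the bucket and key names are placeholders), the following enables versioning and default KMS encryption and expires old object versions; geo-redundancy would additionally use cross-region replication or a second copy in another region.

Python
import boto3

# Placeholder names for illustration; substitute your own resources.
BUCKET = "analytics-backups"
KMS_KEY_ID = "alias/backup-key"

s3 = boto3.client("s3")

# Versioning keeps prior object versions, which protects against
# accidental overwrites and some ransomware-style tampering.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Default encryption ensures every backup object is encrypted at rest
# with the designated KMS key, with no action required from analysts.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ID,
                }
            }
        ]
    },
)

# Expire old noncurrent versions so the backup bucket doesn't grow unbounded.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)

Running a script like this from IaC or a bootstrap pipeline keeps the settings reviewable and repeatable rather than clicked into a console.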
Conclusion: The Strategic Value of Disaster Recovery

Disruptions are inevitable, but losing access to critical data doesn’t have to be. A solid disaster recovery plan ensures analysts can keep delivering insights, businesses stay compliant, and decisions remain data-driven. Organizations that prioritize DR are taking important steps toward safeguarding their ability to act fast and stay ahead.

By Olivia Cox

Top Cloud Architecture Experts


Abhishek Gupta

Principal PM, Azure Cosmos DB,
Microsoft

I mostly work on open-source technologies including distributed data systems, Kubernetes and Go

Srinivas Chippagiri

Sr. Member of Technical Staff

Srinivas Chippagiri is a highly skilled software engineering leader with over a decade of experience in cloud computing, distributed systems, virtualization, and AI/ML applications across multiple industries, including telecommunications, healthcare, energy, and CRM software. He is currently involved in the development of core features for analytics products at a Fortune 500 CRM company, where he collaborates with cross-functional teams to deliver innovative, scalable solutions. Srinivas has a proven track record of success, demonstrated by multiple awards recognizing his commitment to excellence and innovation. With a strong background in systems and cloud engineering at GE Healthcare, Siemens, and RackWare Inc., Srinivas also possesses expertise in designing and developing complex software systems in regulated environments. He holds a Master's degree from the University of Utah, where he was honored for his academic achievements and leadership contributions.

Vidyasagar (Sarath Chandra) Machupalli FBCS

Executive IT Architect,
IBM

Executive IT Architect, IBM Cloud | BCS Fellow, Distinguished Architect (The Open Group Certified)

Pratik Prakash

Principal Solution Architect,
Capital One

Pratik, an experienced solution architect and passionate open-source advocate, combines hands-on engineering expertise with extensive experience in multi-cloud and data science. Leading transformative initiatives across current and previous roles, he specializes in large-scale multi-cloud technology modernization. Pratik's leadership is highlighted by his proficiency in developing scalable serverless application ecosystems, implementing event-driven architecture, deploying AI/ML and NLP models, and crafting hybrid mobile apps. Notably, his strategic focus on an API-first approach drives digital transformation while embracing SaaS adoption to reshape technological landscapes.

The Latest Cloud Architecture Topics

7 AWS Services Every Data Engineer Should Master
In 2025, S3, Glue, Lambda, Athena, Redshift, EMR, and Kinesis form the core AWS toolkit for building fast, reliable, and scalable data pipelines.
October 6, 2025
by Sai Mounika Yedlapalli
· 406 Views · 1 Like
Enabling Risk Management With AI/ML Powered by Cloud Native Data Architecture
Leverage cloud-native architectures to improve data quality, automate risk management, and enhance ML-driven financial crime detection.
October 3, 2025
by Pradyumna Shukla
· 1,212 Views
Multi-Cloud Infrastructure Challenges and Solutions
Multi-cloud brings resilience but also complexity, drift, security, and cost challenges. IaC, automation, and unified monitoring turn it into an advantage.
October 3, 2025
by Durojaye Olusegun
· 1,125 Views · 1 Like
Infrastructure as Code (IaC) in a Multi-Cloud Environment: Consistency and Security Issues
IaC streamlines multi-cloud and DevOps with automation, but challenges like drift, fragmentation, leaks, and weak encryption remain.
October 2, 2025
by Olha Krasnozhon
· 1,269 Views
Policy-as-Code for Terraform in Regulated Environments
Treat your security rules and compliance like tests that run every time you perform Terraform Plan. Learn how Policy-as-Code (PaC) allows you to do that.
October 1, 2025
by Jas Tandon
· 1,430 Views · 2 Likes
Centralized vLLM on Kubernetes for Scalable LLM Infrastructure
This article demonstrates how we can run vLLM on Kubernetes for a centralized LLM serving engine that is production-ready and can be used by multiple applications.
October 1, 2025
by Rajashree Mandaogane
· 1,413 Views
Caching Mechanisms Using Spring Boot With Redis or AWS ElastiCache
From database bottlenecks to lightning-fast APIs, improve your app’s performance by implementing caching in Spring Boot with Redis and ElastiCache for microservices.
September 30, 2025
by Damodhara Reddy Palavali
· 2,525 Views · 1 Like
Creating Real-Time Dashboards Using AWS OpenSearch, EventBridge, and WebSockets
Build real-time, serverless dashboards by streaming events with EventBridge, OpenSearch, WebSockets, eliminating polling and delivering instant updates at scale.
September 30, 2025
by Bharath Kumar Reddy Janumpally
· 2,131 Views · 1 Like
Networking’s Open Source Era Is Just Getting Started
Open source is transforming networking from slow, standards-driven protocols into agile, programmable, Kubernetes-ready infrastructure.
September 26, 2025
by Nico Vibert
· 2,273 Views · 2 Likes
Boosting Developer Productivity in Kubernetes-Driven Workflows: A Practical Checklist
Teams can ultimately boost Kubernetes productivity with the use of better tooling, automation, observability, and platform engineering practices.
September 25, 2025
by Louis-Guillaume Morand
· 1,977 Views · 2 Likes
AWS Glue Crawlers: Common Pitfalls, Schema Challenges, and Best Practices
Learn key challenges and best practices for using AWS Glue crawlers, from handling CSV schema issues to schema evolution, partitions, and ETL jobs.
September 25, 2025
by Saradha Nagarajan
· 1,723 Views
Death by a Thousand YAMLs: Surviving Kubernetes Tool Sprawl
Kubernetes growth brings cluster and tool sprawl, driving complexity, cost, and security risks. Learn about emerging solutions like platform engineering and AI.
September 24, 2025
by Yitaek Hwang DZone Core
· 2,740 Views · 1 Like
Running AI/ML on Kubernetes: From Prototype to Production — Use MLflow, KServe, and vLLM on Kubernetes to Ship Models With Confidence
Run scalable, reliable AI/ML inference on Kubernetes with MLflow, KServe, and AutoML, and explore deployment, orchestration, and performance at scale.
September 23, 2025
by Boris Zaikin DZone Core
· 3,873 Views · 1 Like
AI Readiness: Why Cloud Infrastructure Will Decide Who Wins the Next Wave
Most cloud teams aren’t AI ready: Only 51% of infra is automated, and there are major governance gaps and rising costs. Infra maturity (not GPUs) will decide who thrives.
September 23, 2025
by Aharon Twizer
· 1,572 Views · 1 Like
Azure IOT Cloud-to-Device Communication Methods
Learning and choosing the correct cloud-to-device communication method to send a message to the device using the Azure IoT Hub to build an effective IoT system.
September 22, 2025
by Anup Rao
· 1,394 Views
Distributed Cloud-Based Dynamic Configuration Management
This article is about creating dynamic configurations and efficiently reloading them with a cache to avoid charges for excessive calls.
September 19, 2025
by Alankrit Kharbanda
· 2,116 Views · 1 Like
Blueprint for Agentic AI: Azure AI Foundry, AutoGen, and Beyond
Azure AI Foundry and AutoGen enable scalable, collaborative AI agents that automate complex tasks with minimal human effort.
September 18, 2025
by Anand Singh
· 2,413 Views · 1 Like
Enable AWS Budget Notifications With SNS Using AWS CDK
Learn how to set up AWS Budget notifications with SNS using AWS CDK in TypeScript, including gotchas around IAM and KMS policies.
September 17, 2025
by Jeroen Reijn DZone Core
· 2,052 Views · 1 Like
Building a Platform Abstraction for EKS Cluster Using Crossplane
Build Crossplane abstractions that let your devs request EKS clusters without writing infrastructure code. Create XRD and composition to hide details.
September 17, 2025
by Ramesh Sinha
· 2,952 Views · 2 Likes
Terraform Compact Function: Clean Up and Simplify Lists
The compact() function ensures that only valid, non-null elements are included, helping prevent runtime errors during the apply phase.
September 17, 2025
by Mariusz Michalowski
· 1,935 Views · 1 Like