IoT

IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.

Latest Premium Content

Trend Report: Edge Computing and IoT
Refcard #214: MQTT Essentials
Refcard #263: Messaging and Data Infrastructure for IoT

DZone's Featured IoT Resources

A Guide to Using Browser Network Calls for Data Processing

By Rajesh Vakkalagadda
It was a sunny day in Seattle, and my wife wanted to try the famous viral Dubai Chocolate Pistachio Shake. With excitement, we decided to visit the nearest Shake Shack, and to our surprise it was sold out; we were told to call before visiting, and because of limited supply there was no guarantee it would be available the next day either. Two days later, I went there again to see if there would be any, and again I was met with disappointment. I didn't like that I either had to call them to check for an item or go to their store to see if it was available. This led to a few ideas:

1. What if there was an AI that would make the calls for me and update me when I need a reservation, or wait in that long customer-service line and connect me when I am ready to talk with someone? Some companies are already working on automating this.
2. Is there a way to be notified when an item is available at Shake Shack? If not, can I build a cloud infrastructure for this?

Continuing with my second idea, I started looking into their website. It is possible to check whether an item is available and add it to the cart for online ordering, which means there are network calls through which we can identify whether the Dubai Chocolate Pistachio Shake is available.

Implementation

To get the availability information, we need a few data points:

- Is there a way to get the store information?
- How do we differentiate whether a store has the shake or not?

For the store information: when we open inspect element and look at the network calls after selecting Washington, we see a few interesting requests. Washington state has seven locations, and we need to know which of them has the shake. From the response of the region request, we can get all the state information.

Shell

curl 'https://ssma24.com/production-location/regions?isAlpha=true' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'authorization: Basic removedSecretCodeHere==' \
  -H 'cache-control: no-cache' \
  -H 'origin: https://shakeshack.com' \
  -H 'platform-os: macos' \
  -H 'platform-version: 1.71.20' \
  -H 'pragma: no-cache' \
  -H 'priority: u=1, i' \
  -H 'referer: https://shakeshack.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: cross-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \
  -H 'x-requested-with: XMLHttpRequest'

According to this, the WA region ID is a3d65b58-ee3c-42af-adb9-9e39e09503c3, and we get the store information if we pass regionId to the locations API.
Shell

curl 'https://ssma24.com/production-location/locations?regionId=a3d65b58-ee3c-42af-adb9-9e39e09503c3&channel=WEB&includePrivate=false' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'authorization: Basic removedSecretCodeHere' \
  -H 'cache-control: no-cache' \
  -H 'origin: https://shakeshack.com' \
  -H 'platform-os: macos' \
  -H 'platform-version: 1.71.20' \
  -H 'pragma: no-cache' \
  -H 'priority: u=1, i' \
  -H 'referer: https://shakeshack.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: cross-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \
  -H 'x-requested-with: XMLHttpRequest'

This information is still not sufficient until we know how to get to a store page and check availability there. If a store has the shake, it displays the Dubai shake in its shake section. After investigating the network calls, I found the call for the menu:

Shell

curl 'https://ssma24.com/v1.0/locations/82099/menus?includeOptionalCategories=utensils&platform=web' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'authorization: Basic removedSecretCodeHere==' \
  -H 'cache-control: no-cache' \
  -H 'channel: WEB' \
  -H 'origin: https://shakeshack.com' \
  -H 'platform-os: macos' \
  -H 'platform-version: 1.71.20' \
  -H 'pragma: no-cache' \
  -H 'priority: u=1, i' \
  -H 'referer: https://shakeshack.com/' \
  -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: cross-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \
  -H 'x-requested-with: XMLHttpRequest'

If the shake is available, we see it in the product section of the response; if a store does not have it, it simply does not appear. So all we need to do is get all the store information and verify which stores have the shake. If you look into the store queries, the store's oloId is what links the menu queries to a store. It can be mapped back to the store information from the previous calls, which is how I collected all the store IDs. With some basic shell scripting, I created this curl script, which tells us which stores have the shake.
Shell

for store in 203514 236001 265657 82099 62274 96570 203515; do
  echo $store
  curl "https://ssma24.com/v1.0/locations/$store/menus?includeOptionalCategories=utensils&platform=web" \
    -H 'accept: */*' \
    -H 'accept-language: en-US,en;q=0.9' \
    -H 'authorization: Basic removedSecretCodeHere==' \
    -H 'cache-control: no-cache' \
    -H 'channel: WEB' \
    -H 'origin: https://shakeshack.com' \
    -H 'platform-os: macos' \
    -H 'platform-version: 1.71.20' \
    -H 'pragma: no-cache' \
    -H 'priority: u=1, i' \
    -H 'referer: https://shakeshack.com/' \
    -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \
    -H 'sec-ch-ua-mobile: ?0' \
    -H 'sec-ch-ua-platform: "macOS"' \
    -H 'sec-fetch-dest: empty' \
    -H 'sec-fetch-mode: cors' \
    -H 'sec-fetch-site: cross-site' \
    -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \
    -H 'x-requested-with: XMLHttpRequest' | jq | grep '"name": "Dubai Chocolate Pistachio Shake",'
done

Here each number is a storeId. I used the jq utility to format each response as pretty-printed JSON and ran grep on the item name, so stores that have the shake produce a hit.

Sample response

Based on this, we know that stores 82099, 62274, and 96570 have the shakes. I got the store addresses from the previous calls and got the shakes. :)

Conclusion

In this article, we went through the browser's network calls to identify the information we need and figured out a way to get the store and availability information through those calls. That said, we could also automate this through Selenium (which is compute heavy), or use an AI to analyze the website and figure out how to get this infrastructure ready for us. That will be an advancement that removes certain QA-related jobs from the industry (we are not there yet, but we will be in that position very soon). Most of the approaches we saw today were manual. In my next articles, I will show how we can take this simple idea and build a cloud infrastructure around it. I will be writing about building:

- A cloud-based serverless call to get the info from the web (a minimal sketch appears below).
- A cloud-based trigger service that will trigger the serverless call.
- A storage system to store this data for caching — where can we store the responses?
- A notification system to notify through a mobile app or in various other ways.

If we gather enough data points, I will also show how to build an ML model using sparse data points. This is for educational purposes; my partner (SGG) and I took this real-world example to build a cloud infrastructure around it. The code and other artifacts will not be published, to ensure Shake Shack's website is not overloaded.
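As a preview of the first roadmap item above, here is a minimal, hypothetical Python sketch of the same availability check the shell loop performs. It is illustrative only: the store IDs and authorization value are placeholders, and the endpoint and substring check simply mirror the network calls captured earlier.

Python

# Hypothetical sketch: poll each store's menu endpoint and report which
# stores list the item. Store IDs and the authorization value below are
# placeholders; the endpoint mirrors the calls shown above.
import requests

STORE_IDS = ["203514", "236001", "265657", "82099", "62274", "96570", "203515"]
ITEM_NAME = "Dubai Chocolate Pistachio Shake"
AUTH_HEADER = "Basic removedSecretCodeHere=="  # placeholder

def stores_with_item():
    hits = []
    for store_id in STORE_IDS:
        url = f"https://ssma24.com/v1.0/locations/{store_id}/menus"
        response = requests.get(
            url,
            params={"includeOptionalCategories": "utensils", "platform": "web"},
            headers={"authorization": AUTH_HEADER, "channel": "WEB"},
            timeout=10,
        )
        response.raise_for_status()
        # Same substring check as the grep in the shell loop above
        if ITEM_NAME in response.text:
            hits.append(store_id)
    return hits

if __name__ == "__main__":
    print("Stores with the shake:", stores_with_item())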
Implementing a Multi-Agent KYC System

By Narendhira Chandraseharan
Every engineer who has implemented KYC systems has dealt with a frustrating reality. You build rule-based engines that break every time regulations change. Document processing takes days because everything goes through manual review queues. API integrations become brittle nightmares when you're trying to coordinate identity verification, OCR services, and watchlist screening.

The numbers tell the story: most KYC systems process documents in 2–3 days, with false positive rates hitting 15–20%. That means one in five legitimate customers gets flagged for manual review. Meanwhile, compliance teams burn out reviewing thousands of documents daily, and customer support fields endless calls about delayed approvals.

Modern regulations make it worse. Real-time compliance monitoring and comprehensive audit trails create complexity that sequential processing simply can't handle at scale. When you're processing thousands of applications daily across multiple jurisdictions, traditional approaches fall apart.

Agentic AI changes the game completely. Instead of rigid decision trees, you get systems that can reason through complex scenarios, adapt to new patterns, and maintain detailed decision trails that auditors actually understand. These aren't just smarter chatbots — they're autonomous software agents that can orchestrate complex workflows across multiple systems and data sources.

Multi-Agent Architecture Design

The architecture splits KYC processing across five specialized agents. Each agent handles a specific domain but communicates through a central orchestrator that manages workflow state and ensures consistency.

Python

import uuid
import asyncio
import logging
from dataclasses import dataclass, field
from typing import Protocol, Optional, Literal, Dict, Any, List

# ----- Typed results passed between agents -----------------------------------

@dataclass(frozen=True)
class OnboardingResult:
    submitted_documents: Dict[str, Any]
    kyc_level: Literal["L1", "L2", "L3"]

@dataclass(frozen=True)
class VerificationResult:
    is_complete: bool
    normalized_data: Dict[str, Any]
    issues: List[str] = field(default_factory=list)

@dataclass(frozen=True)
class RiskResult:
    risk_score: float
    drivers: List[str] = field(default_factory=list)  # e.g., "PEP match", "Device mismatch"

@dataclass(frozen=True)
class ComplianceDecision:
    status: Literal["approved", "manual_review", "rejected"]
    rationale: str
    case_id: str

1. Smart Document Requirements Collection

The onboarding agent eliminates the one-size-fits-all approach to document collection. Instead of requesting every possible document upfront, it analyzes the customer profile, jurisdiction requirements, and initial risk indicators to create tailored requests.

2. Advanced Document Processing and Verification

Document verification goes beyond simple OCR. The agent implements layered validation combining multiple OCR providers, computer vision-based authenticity detection, and sophisticated entity resolution to catch potential duplicates.
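To make the agent contract concrete before the formal interfaces below, here is a hypothetical stub of a document agent. It is illustration only: the naive "no OCR text" check stands in for the layered validation described above.

Python

# Hypothetical stub, for illustration: flags any document that produced no
# OCR text. A production agent would combine multiple OCR providers,
# CV-based authenticity detection, and entity resolution as described above.
class StubDocumentAgent:
    async def process_documents(self, submitted_documents: Dict[str, Any], case_id: str) -> VerificationResult:
        issues = [name for name, doc in submitted_documents.items() if not doc.get("ocr_text")]
        return VerificationResult(
            is_complete=not issues,
            normalized_data={name: doc.get("ocr_text") for name, doc in submitted_documents.items()},
            issues=[f"Unreadable document: {name}" for name in issues],
        )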
Python

# ----- Agent interfaces (Protocols) ------------------------------------------

class OnboardingAgent(Protocol):
    async def determine_requirements(self, customer_data: Dict[str, Any], case_id: str) -> OnboardingResult: ...

class DocumentAgent(Protocol):
    async def process_documents(self, submitted_documents: Dict[str, Any], case_id: str) -> VerificationResult: ...

class RiskAgent(Protocol):
    async def calculate_risk_profile(self, verification: VerificationResult,
                                     customer_data: Dict[str, Any], case_id: str) -> RiskResult: ...

class ComplianceAgent(Protocol):
    async def make_final_decision(self, verification: VerificationResult,
                                  risk: RiskResult, case_id: str) -> ComplianceDecision: ...

class MonitoringAgent(Protocol):
    async def initialize_monitoring(self, customer_data: Dict[str, Any],
                                    risk_score: float, case_id: str) -> None: ...

3. Multi-Source Risk Assessment Engine

Risk assessment integrates data from multiple external APIs and databases to build comprehensive customer risk profiles. The agent handles parallel data collection, intelligent caching, and sophisticated scoring algorithms.

4. Compliance Decision Engine and Continuous Monitoring

The compliance agent makes final determinations with full audit trails, while the monitoring agent handles post-approval surveillance.

Python

# ----- Orchestrator with decisions, timeouts and logging ---------------------

class KYCOrchestrator:
    def __init__(
        self,
        *,
        onboarding: OnboardingAgent,
        documents: DocumentAgent,
        risk: RiskAgent,
        compliance: ComplianceAgent,
        monitoring: MonitoringAgent,
        step_timeout_sec: float = 30.0,
        logger: Optional[logging.Logger] = None,
    ) -> None:
        self.onboarding = onboarding
        self.documents = documents
        self.risk = risk
        self.compliance = compliance
        self.monitoring = monitoring
        self.step_timeout_sec = step_timeout_sec
        self.log = logger or logging.getLogger("kyc.orchestrator")

    async def process_kyc_application(self, customer_data: Dict[str, Any]) -> ComplianceDecision:
        case_id = str(uuid.uuid4())
        self.log.info("KYC case created",
                      extra={"case_id": case_id, "customer_id": customer_data.get("customer_id")})
        try:
            onboarding_result = await asyncio.wait_for(
                self.onboarding.determine_requirements(customer_data, case_id),
                timeout=self.step_timeout_sec,
            )
            self.log.debug("Onboarding complete",
                           extra={"case_id": case_id, "kyc_level": onboarding_result.kyc_level})

            verification_result = await asyncio.wait_for(
                self.documents.process_documents(onboarding_result.submitted_documents, case_id),
                timeout=self.step_timeout_sec,
            )
            self.log.debug("Document verification complete",
                           extra={"case_id": case_id, "issues": verification_result.issues})

            risk_result = await asyncio.wait_for(
                self.risk.calculate_risk_profile(verification_result, customer_data, case_id),
                timeout=self.step_timeout_sec,
            )
            self.log.debug("Risk assessment complete",
                           extra={"case_id": case_id, "risk_score": risk_result.risk_score})

            decision = await asyncio.wait_for(
                self.compliance.make_final_decision(verification_result, risk_result, case_id),
                timeout=self.step_timeout_sec,
            )
            self.log.info("Compliance decision",
                          extra={"case_id": case_id, "status": decision.status, "rationale": decision.rationale})

            if decision.status == "approved":
                # Start post-approval surveillance for approved customers
                await asyncio.wait_for(
                    self.monitoring.initialize_monitoring(customer_data, risk_result.risk_score, case_id),
                    timeout=self.step_timeout_sec,
                )
            return decision
        except Exception as e:
            # Catch-all to avoid dropping the case; keep the audit trail intact
            self.log.exception("Unhandled error in KYC flow", extra={"case_id": case_id})
            return ComplianceDecision(
                status="manual_review",
                rationale=f"Unexpected error: {type(e).__name__}. Route to L3 review.",
                case_id=case_id,
            )

5. Performance Assessment

Production implementations demonstrate significant improvements across key metrics. Processing time drops from hours to minutes on average.
User abandonment during onboarding decreases while compliance accuracy improves drastically.

The technical foundation centers on robust error handling, explainable decision-making, secure data processing, and clear audit trails. Each agent operates independently with clear interfaces, enabling horizontal scaling and fault isolation. State management persists workflow context across agent interactions, ensuring consistency even during partial failures.

Conclusion

This multi-agent architecture addresses traditional KYC pain points through autonomous reasoning and comprehensive audit trails. With the regulatory landscape getting more complex and fraudsters becoming more advanced, KYC is not only a compliance step; it can become a competitive advantage when implemented well. The future of KYC is adaptable and agentic. Agents will reason like analysts, execute reliably, document thoroughly, and maintain audit trails. With agentic AI-driven KYC, organizations can deliver faster onboarding that delights customers.
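For completeness, here is a hypothetical wiring sketch showing how the orchestrator composes implementations of the protocols above. The Stub* classes are imagined stand-ins (like the StubDocumentAgent sketched earlier), not part of the article's production code.

Python

# Hypothetical wiring: any objects satisfying the Protocols can be injected.
import asyncio

async def run_demo() -> None:
    orchestrator = KYCOrchestrator(
        onboarding=StubOnboardingAgent(),   # imagined stand-in
        documents=StubDocumentAgent(),      # sketched earlier
        risk=StubRiskAgent(),               # imagined stand-in
        compliance=StubComplianceAgent(),   # imagined stand-in
        monitoring=StubMonitoringAgent(),   # imagined stand-in
        step_timeout_sec=30.0,
    )
    decision = await orchestrator.process_kyc_application(
        {"customer_id": "c-123", "jurisdiction": "US"}
    )
    print(decision.status, decision.rationale)

# asyncio.run(run_demo())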
Networking’s Open Source Era Is Just Getting Started
By Nico Vibert
LLMs at the Edge: Decentralized Power and Control
By Bhanuprakash Madupati
Stop Reactive Network Troubleshooting: Monitor These 5 Metrics to Prevent Downtime
By Sascha Neumeier
Azure IoT Cloud-to-Device Communication Methods

Today, managing communication between the cloud and millions of smart devices is challenging. Suppose you are managing a huge number of devices and need to push a critical device state update to them all, but many of them are offline or have spotty network connections; how do you make sure the message gets through? Azure IoT Hub provides three major cloud-to-device communication mechanisms: C2D messages, direct methods, and desired properties in the device twin. Each is designed for different use cases, and knowing when to use which will help you build reliable, scalable, and robust IoT solutions. This article presents how to select among these methods effectively.

1. Cloud-to-Device (C2D) Messages

This method is perfect when a message has to get to the device with guaranteed delivery, but not necessarily instantly. When the cloud sends such a message, it is not pushed directly to the device; instead, IoT Hub stores the message in a queue, where it waits until the device connects and is ready to pick it up. This guarantees delivery even when the device is offline or has spotty connectivity. C2D messages are ideal whenever guaranteed delivery matters more than immediacy.

Sequence Diagram

But what if waiting isn’t an option and the device needs to take action immediately? That’s when you move on to the second method: direct methods.

2. Direct Methods

These are request-response interactions: the cloud makes a direct request to the device and receives a response from it. This makes them perfect for situations where immediate action is needed. Think of an emergency stop that a device must perform immediately, where you need to know that the command was received and executed. A limitation of direct methods is that they have throttling limits. This is an important design consideration, as scalability becomes a real issue if thousands of devices suddenly all need to use direct methods. They are not well suited to devices calling a direct method every few seconds; scale that to thousands of devices and you are going to have issues. Direct methods are better suited to less frequently used commands.

Sequence Diagram

So far we have C2D messages for reliability and direct methods for immediate actions. What about when you want to manage the configuration or state of a whole set of devices? Let's talk about the third method: desired properties in the device twin.

3. Desired Properties in the Device Twin

A device twin is a cloud-side copy of a physical device. It stores the device's information, including its current state and properties. Think of scenarios where you are rolling out a new setting to all devices and don't want to target each device individually, such as turning a feature flag on or off for a whole fleet. The device reads the desired properties asynchronously: when it comes online, it checks its device twin and syncs itself to the configuration the cloud wants. This is not immediate but eventually consistent. The device can also report its state back to the cloud, so the cloud knows what is really happening. This scales massively; you just set the configuration in the device twin, and it eventually syncs to the devices. In short, desired properties are for long-term configuration management, keeping devices in the same state as configured in the cloud.
Sequence Diagram

Below is the server-side and device-side implementation of how to talk to IoT Hub using the methods explained above. To send any message to IoT Hub, it is a prerequisite to have an Azure IoT Hub resource set up; refer to this link to create an IoT Hub.

Cloud-Side Implementation for IoT Hub Cloud-to-Device Communication

TypeScript

import { Client, Message, DeviceMethodParams, Registry, Twin } from "azure-iothub";

export class IoTHub {
  private _client: Client | null = null;
  private _registry: Registry | null = null;

  // Replace the hub endpoint and key with your real values
  private static readonly CONNECTION_STRING: string =
    "HostName=myhubdeviot.azure-devices.net;SharedAccessKeyName=iothubowner;SharedAccessKey=<yourKeyHere>";

  /**
   * Initialize the IoT Hub service client and registry.
   */
  public async init(): Promise<void> {
    if (!this._client) {
      this._client = Client.fromConnectionString(IoTHub.CONNECTION_STRING);
      await this._client.open();
    }
    if (!this._registry) {
      this._registry = Registry.fromConnectionString(IoTHub.CONNECTION_STRING);
    }
  }

  /**
   * 1. Send a Cloud-to-Device (C2D) message.
   */
  public async sendC2DMessage(
    deviceId: string,
    messageId: string,
    data: any,
    expiryMins: number
  ): Promise<void> {
    if (!this._client) {
      throw new Error("IoTHub client not initialized. Call init().");
    }
    const message = new Message(JSON.stringify(data));
    message.messageId = messageId;
    message.ack = "none";
    message.expiryTimeUtc = Date.now() + expiryMins * 60 * 1000;
    await this._client.send(deviceId, message);
  }

  /**
   * 2. Invoke a direct method on a device.
   */
  public async invokeDirectMethod(
    deviceId: string,
    options: DeviceMethodParams
  ): Promise<any> {
    if (!this._client) {
      throw new Error("IoTHub client not initialized. Call init().");
    }
    const response = await this._client.invokeDeviceMethod(deviceId, options);
    return response.result;
  }

  /**
   * 3. Update desired properties on a device twin.
   */
  public async updateDesiredProperty(
    deviceId: string,
    twinPatch: any,
    etag: string = "*"
  ): Promise<Twin> {
    if (!this._registry) {
      throw new Error("IoTHub registry not initialized. Call init().");
    }
    return await this._registry.updateTwin(deviceId, twinPatch, etag);
  }

  /**
   * Close the IoT Hub service client.
   */
  public async close(): Promise<void> {
    if (this._client) {
      await this._client.close();
      this._client = null;
    }
    this._registry = null;
  }
}

Device-Side Handlers Implementation for IoT Hub Cloud-to-Device Communication

TypeScript

import { Client as DeviceClient, Message, Twin, DeviceMethodRequest, DeviceMethodResponse } from "azure-iot-device";
import { Mqtt } from "azure-iot-device-mqtt";

// Replace with your device connection string
const deviceConnectionString = "HostName=myhubdeviot.azure-devices.net;DeviceId=myDeviceId;SharedAccessKey=<deviceKey>";

async function main() {
  const client = DeviceClient.fromConnectionString(deviceConnectionString, Mqtt);

  // Open the connection
  await client.open();
  console.log("Device connected to IoT Hub.");

  // 1. Handle Cloud-to-Device messages
  client.on("message", (msg: Message) => {
    const data = msg.getData().toString();
    console.log("Received C2D message:", data, "MessageId:", msg.messageId);
    // Complete the message to remove it from the IoT Hub queue
    client.complete(msg, (err) => {
      if (err) console.error("Error completing C2D message:", err);
    });
  });

  // 2. Handle direct method calls
  client.onDeviceMethod("reboot", (request: DeviceMethodRequest, response: DeviceMethodResponse) => {
    console.log("Direct method 'reboot' invoked with payload:", request.payload);
    // Simulate device action
    const result = { status: "Device will reboot in " + request.payload.delay + " seconds" };
    response.send(200, result, (err) => {
      if (err) console.error("Failed sending method response:", err);
      else console.log("Direct method response sent.");
    });
  });

  // 3. Handle desired property updates
  client.getTwin((err, twin: Twin) => {
    if (err) {
      console.error("Error getting twin:", err);
      return;
    }
    console.log("Twin initialized. Current desired properties:", twin.properties.desired);
    twin.on("properties.desired", (desiredChange) => {
      console.log("Desired property update received:", desiredChange);
    });
  });
}

main().catch((err) => console.error("Device error:", err));

Conclusion

Choosing the right IoT communication method allows engineers to build systems that are truly resilient and responsive. The next time you design for a device, think about which scenarios your system should support. Does the message need to get there eventually, or right now? Or does it need to sync over time through desired properties? The answers to these questions frame your selection and ultimately determine the success of the IoT system.

By Anup Rao
How TBMQ Uses Redis for Persistent Message Storage

TBMQ was primarily designed to aggregate data from IoT devices and reliably deliver it to backend applications. Applications subscribe to data from tens or even hundreds of thousands of devices and require reliable message delivery. Additionally, applications often experience periods of downtime due to system maintenance, upgrades, failover scenarios, or temporary network disruptions. IoT devices typically publish data frequently but subscribe to relatively few topics or updates. To address these differences, TBMQ classifies MQTT clients as either application clients or standard IoT devices. Application clients are always persistent and rely on Kafka for session persistence and message delivery. In contrast, standard IoT devices — referred to as DEVICE clients in TBMQ — can be configured as persistent depending on the use case.

This article provides a technical overview of how Redis is used within TBMQ to manage persistent MQTT sessions for DEVICE clients. The goal is to provide practical insights for software engineers looking to offload database workloads to persistent caching layers like Redis, ultimately improving the scalability and performance of their systems.

Why Redis?

In TBMQ 1.x, DEVICE clients relied on PostgreSQL for message persistence and retrieval, ensuring that messages were delivered when a client reconnected. While PostgreSQL performed well initially, it had a fundamental limitation — it could only scale vertically. We anticipated that as the number of persistent MQTT sessions grew, PostgreSQL’s architecture would eventually become a bottleneck. For a deeper look at how PostgreSQL was used and the architectural limitations we encountered, see our blog post. To address this, we evaluated alternatives that could scale more effectively with increasing load. Redis was quickly chosen as the best fit due to its horizontal scalability, native clustering support, and widespread adoption.

Migration to Redis

With these benefits in mind, we started our migration process with an evaluation of data structures that could preserve the functionality of the PostgreSQL approach while aligning with Redis Cluster constraints to enable efficient horizontal scaling.

Redis Cluster Constraints

While working on the migration, we recognized that replicating the existing data model would require multiple Redis data structures to efficiently handle message persistence and ordering. This, in turn, meant using multiple keys for each persistent MQTT client session. Redis Cluster distributes data across multiple slots to enable horizontal scaling. However, multi-key operations must access keys within the same slot; if the keys reside in different slots, the operation triggers a cross-slot error, preventing the command from executing. To address this, we used the persistent MQTT client ID as a hash tag in our key names. By enclosing the client ID in curly braces {}, Redis ensures that all keys for the same client are hashed to the same slot. This guarantees that related data for each client stays together, allowing multi-key operations to proceed without errors.

Atomic Operations via Lua Scripts

Consistency is critical in high-throughput environments where many messages may arrive concurrently for the same MQTT client. Hash tagging helps avoid cross-slot errors, but without atomic operations there is a risk of race conditions or partial updates, which could lead to message loss or incorrect ordering. It is important to make sure that operations updating the keys for the same MQTT client are atomic.
Redis is designed to execute individual commands atomically. However, in our case, we need to update multiple data structures as part of a single operation for each MQTT client. Executing these sequentially without atomicity opens the door to inconsistencies if another process modifies the same data between commands. That’s where Lua scripting comes in. A Lua script executes as a single, isolated unit: during script execution, no other commands can run concurrently, ensuring that the operations inside the script happen atomically. Based on this, we decided that each operation, such as saving messages or retrieving undelivered messages upon reconnection, would execute as a separate Lua script. This ensures that all operations within a single Lua script reside in the same hash slot, maintaining atomicity and consistency.

Choosing the Right Redis Data Structures

One of the key requirements for persistent session handling in an MQTT broker is maintaining message order across client reconnects. After evaluating various Redis data structures, we found that sorted sets (ZSETs) provided an efficient solution to this requirement. Redis sorted sets naturally organize data by score, enabling quick retrieval of messages in ascending or descending order. However, while sorted sets provided an efficient way to maintain message order, storing full message payloads directly in them led to excessive memory usage. Redis does not support per-member TTL within sorted sets; as a result, messages persisted indefinitely unless explicitly removed. As with PostgreSQL, we had to perform periodic cleanups, using ZREMRANGEBYSCORE to delete expired messages. This operation carries a complexity of O(log N + M), where M is the number of elements removed. To overcome this limitation, we decided to store message payloads in the string data structure, while storing references to those keys in the sorted set.

In the key names below, client_id is a placeholder for the actual client ID, while the curly braces {} around it create a hash tag. The score continues to grow even when the MQTT packet ID wraps around; let’s take a closer look at how this works. At first, the reference for the message with MQTT packet ID 65534 was added to the sorted set:

Shell

ZADD {client_id}_messages 65534 {client_id}_messages_65534

Here, {client_id}_messages is the sorted set key name, where {client_id} acts as a hash tag derived from the persistent MQTT client’s unique ID. The suffix _messages is a constant added to each sorted set key name for consistency. Following the sorted set key name, the score value 65534 corresponds to the MQTT packet ID of the message received by the client. Finally, we see the reference key that points to the actual payload of the MQTT message. Similar to the sorted set key, the message reference key uses the MQTT client’s ID as a hash tag, followed by the _messages suffix and the MQTT packet ID value.

In the next iteration, we add the message reference for the MQTT message with packet ID 65535 into the sorted set. This is the maximum packet ID, as the range is limited to 65535.

Shell

ZADD {client_id}_messages 65535 {client_id}_messages_65535

At the next iteration, the MQTT packet ID wraps around to 1, while the score continues to grow and becomes 65536.
Shell

ZADD {client_id}_messages 65536 {client_id}_messages_1

This approach ensures that message references are properly ordered in the sorted set regardless of the packet ID’s limited range. Message payloads are stored as string values with SET commands that support expiration (EX), providing O(1) complexity for writes and TTL application:

Shell

SET {client_id}_messages_1 "{ \"packetType\":\"PUBLISH\", \"payload\":\"eyJkYXRhIjoidGJtcWlzYXdlc29tZSJ9\", \"time\":1736333110026, \"clientId\":\"client\", \"retained\":false, \"packetId\":1, \"topicName\":\"europe/ua/kyiv/client/0\", \"qos\":1 }" EX 600

Another benefit, aside from efficient updates and TTL application, is that message payloads can be retrieved:

Shell

GET {client_id}_messages_1

or removed:

Shell

DEL {client_id}_messages_1

with constant complexity O(1), without affecting the sorted set structure. Another very important element of our Redis architecture is the use of a string key to store the last MQTT packet ID processed:

Shell

GET {client_id}_last_packet_id
"1"

This serves the same purpose as in the PostgreSQL solution: when a client reconnects, the server must determine the correct packet ID to assign to the next message saved in Redis. Initially, we considered using the sorted set’s highest score as a reference. However, since there are scenarios where the sorted set could be empty or completely removed, we concluded that the most reliable solution is to store the last packet ID separately.

Managing Sorted Set Size Dynamically

This hybrid approach, leveraging sorted sets and string data structures, eliminates the need for periodic time-based cleanups, as per-message TTLs are now applied. In addition, following the PostgreSQL design, we needed to address the cleanup of the sorted set based on the message limit set in the configuration.

YAML

# Maximum number of PUBLISH messages stored for each persisted DEVICE client
limit: "${MQTT_PERSISTENT_SESSION_DEVICE_PERSISTED_MESSAGES_LIMIT:10000}"

This limit is an important part of our design, allowing us to control and predict the memory allocation required for each persistent MQTT client. For example, a client might connect, triggering the registration of a persistent session, and then rapidly disconnect. In such scenarios, it is essential to ensure that the number of messages stored for the client (while waiting for a potential reconnection) remains within the defined limit, preventing unbounded memory usage.

Java

if (messagesLimit > 0xffff) {
    throw new IllegalArgumentException("Persisted messages limit can't be greater than 65535!");
}

To reflect the natural constraints of the MQTT protocol, the maximum number of persisted messages per client is capped at 65535. To handle this within the Redis solution, we implemented dynamic management of the sorted set’s size. When new messages are added, the sorted set is trimmed to ensure the total number of messages remains within the desired limit, and the associated strings are also cleaned up to free memory.
Lua

-- Get the number of elements to be removed
local numElementsToRemove = redis.call('ZCARD', messagesKey) - maxMessagesSize
-- Check if trimming is needed
if numElementsToRemove > 0 then
    -- Get the elements to be removed (oldest ones)
    local trimmedElements = redis.call('ZRANGE', messagesKey, 0, numElementsToRemove - 1)
    -- Iterate over the elements and remove them
    for _, key in ipairs(trimmedElements) do
        -- Remove the message from the string data structure
        redis.call('DEL', key)
        -- Remove the message reference from the sorted set
        redis.call('ZREM', messagesKey, key)
    end
end

Message Retrieval and Cleanup

Our design not only ensures dynamic size management during the persistence of new messages but also supports cleanup during message retrieval, which occurs when a device reconnects to process undelivered messages. This approach keeps the sorted set clean by removing references to expired messages.

Lua

-- Define the sorted set key
local messagesKey = KEYS[1]
-- Define the maximum allowed number of messages
local maxMessagesSize = tonumber(ARGV[1])
-- Get all elements from the sorted set
local elements = redis.call('ZRANGE', messagesKey, 0, -1)
-- Initialize a table to store retrieved messages
local messages = {}
-- Iterate over each element in the sorted set
for _, key in ipairs(elements) do
    -- Check if the message key still exists in Redis
    if redis.call('EXISTS', key) == 1 then
        -- Retrieve the message value from Redis
        local msgJson = redis.call('GET', key)
        -- Store the retrieved message in the result table
        table.insert(messages, msgJson)
    else
        -- Remove the reference from the sorted set if the key does not exist
        redis.call('ZREM', messagesKey, key)
    end
end
-- Return the retrieved messages
return messages

By leveraging Redis’ sorted sets and strings, along with Lua scripting for atomic operations, the new design achieves efficient message persistence and retrieval, as well as dynamic cleanup. This addresses the scalability limitations of the PostgreSQL-based solution.

Migration From Jedis to Lettuce

To validate the scalability of the new Redis-based architecture for persistent message storage, we selected a point-to-point (P2P) MQTT communication pattern as the performance testing scenario. Unlike fan-in (many-to-one) or fan-out (one-to-many) scenarios, the P2P pattern typically involves one-to-one communication and creates a new persistent session for each communicating pair. This makes it well suited for evaluating how the system scales as the number of sessions grows. Before starting large-scale tests, we conducted a prototype test that revealed a limit of 30k msg/s throughput when using PostgreSQL for persistent message storage.

At the moment of migration to Redis, we used the Jedis library for Redis interactions, primarily for cache management. As a result, we initially decided to extend Jedis to handle message persistence for persistent MQTT clients. However, the initial results of the Redis implementation with Jedis were unexpected. While we anticipated that Redis would significantly outperform PostgreSQL, the performance improvement was modest, reaching only 40k msg/s throughput compared to the 30k msg/s limit with PostgreSQL. This led us to investigate the bottlenecks, and we discovered that Jedis was a limiting factor. While reliable, Jedis operates synchronously, processing each Redis command sequentially. This forces the system to wait for one operation to complete before executing the next.
In high-throughput environments, this approach significantly limited Redis’s potential, preventing the full utilization of system resources. RedisInsight shows ~66k commands/s per node, aligning with TBMQ’s 40k msg/s, as Lua scripts trigger multiple Redis operations per message. To overcome this limitation, we migrated to Lettuce, an asynchronous Redis client built on top of Netty. With Lettuce, our throughput increased to 60k msg/s, demonstrating the benefits of non-blocking operations and improved parallelism. At 60k msg/s, RedisInsight shows ~100k commands/s per node, in line with the expected increase from 40k msg/s, which produced ~66k commands/s per node. Lettuce allows multiple commands to be sent and processed in parallel, fully exploiting Redis’s capacity for concurrent workloads. Ultimately, the migration unlocked the performance gains we expected from Redis, paving the way for successful P2P testing at scale. For a deep dive into the testing architecture, methodology, and results, check out our detailed performance testing article.

Conclusion

In distributed systems, scalability bottlenecks often emerge when vertically scaled components, like traditional databases, are used to manage high-volume, session-based workloads. Our experience with persistent MQTT sessions for DEVICE clients demonstrated the importance of designing around horizontally scalable solutions from the start. By offloading session storage to Redis and implementing key architectural improvements during the migration, TBMQ 2.x built a persistence layer capable of supporting a high number of concurrent sessions with exceptional performance and guaranteed message delivery. We hope our experience provides practical guidance for engineers designing scalable, session-aware systems in distributed environments.

By Dmytro Shvaika
Implementing Write-Through Cache for Real-Time Data Processing: A Scalable Approach

Real-time data processing systems often struggle to balance performance and data consistency when handling high volumes of transactions. This article explores how a write-through local cache can optimize performance.

Introduction to Write-Through Caches

A write-through cache is a caching strategy where data is written to both the cache and the backing store simultaneously. This approach ensures that the cache always contains the most recent data while maintaining consistency with the underlying data store. In distributed systems processing event streams, write-through caches can significantly reduce latency and database load by localizing data access patterns. Unlike write-back caching (which delays updates to the backing store) or read-through caching (which only populates on read operations), write-through caching provides stronger consistency guarantees while still offering performance benefits.

The Entity-Signal Pattern

Before diving into implementation details, let's understand the data model we're working with:

- EntityIDs: Unique identifiers for items in the system (e.g., users, devices)
- Signals: Attributes or events associated with these entities
- Signal history: A collection of signals for each entity that may need to be accessed frequently

This pattern is common in analytics systems, fraud detection platforms, recommendation engines, and IoT applications, where entity behavior needs to be tracked over time.

Architecture Overview

Our write-through cache implementation focuses on optimizing stream processing workloads where:

- Each processing node handles specific entityIDs (partitioned by hash)
- Signal history for these entities is frequently accessed
- New signals continually arrive via stream processing (e.g., Kinesis)

The local cache serves three primary purposes:

- Periodically load signals for entityIDs from Redis
- Store new signals as they arrive
- Always provide complete and current signal data for any cached entity

Implementation Details

The implementation consists of two coordinated components:

1. Loading Cache

The loading cache is responsible for periodically refreshing entity data from the backend Redis store:

Java

private final LoadingCache<String, Set<Signal>> loadingCache;

// In constructor
this.loadingCache = CacheBuilder.newBuilder()
    .maximumSize(LOADING_CACHE_MAX_SIZE)
    .expireAfterWrite(LOADING_CACHE_TIMEOUT, TimeUnit.MINUTES)
    .removalListener(this)
    .recordStats()
    .build(new SignalCacheLoader(
        redisDAO,
        secretTokenizer,
        weblabClient,
        metricsManager));

Key configurations include:

- Maximum size: Prevents unbounded memory growth (LOADING_CACHE_MAX_SIZE = 1000)
- Expire-after-write: Ensures data is refreshed periodically (LOADING_CACHE_TIMEOUT = 5 minutes)
- Removal listener: Notifies when entries are evicted

2. Signal Cache

The signal cache stores incoming signals between refresh cycles:

Java

// Starts with an initial capacity of 100 and grows up to the size limit of the
// loading cache. As the loading cache expires its entries, this class acts as
// a removal listener and removes the entry from the signal cache as well.

When new signals need to be stored:

Java

/**
 * Index signals to the entityId.
 * @param key entityId.
 * @param value Set of signals.
 */
@Override
public void put(String key, Set<Signal> value) {
    if (key != null && value != null) {
        // Save the signal in the local cache
        localCache.put(key, Lists.newArrayList(value));
    }
}

When requesting signals for an entity:

Java

@Override
public Set<Signal> getIfPresent(Object key) {
    if (key != null && (key instanceof String)) {
        // getUnchecked avoids the checked ExecutionException of get(); copy into
        // a mutable set since the loader's result may be shared or immutable
        final Set<Signal> signals = new HashSet<>(loadingCache.getUnchecked((String) key));
        if (log.isDebugEnabled()) {
            log.debug("Number of signals retrieved from redis cache loader for [entity: {}] is {}",
                getResult((String) key), signals.size());
        }
        // Fetch any recent signals for the entity that arrived after the redis load()
        final List<Signal> inMemorySignals = (List<Signal>) localCache.getIfPresent(key);
        if (inMemorySignals != null) {
            if (log.isDebugEnabled()) {
                log.debug("Number of signals retrieved from local in-memory cache for [entity: {}] is {}",
                    getResult((String) key), inMemorySignals.size());
            }
            signals.addAll(inMemorySignals);
        }
        return signals;
    }
    return Collections.emptySet();
}

Memory Management Strategy

The cache implementation employs a coordinated memory management strategy:

- Loading cache: Configured with a maximum size and expiry time. When entries expire or are evicted due to size constraints, removal notifications are triggered.
- Signal cache: Starts with a defined initial capacity and grows as needed. Its size is effectively controlled by the loading cache through the removal notification mechanism.

Java

/**
 * When loadingCache expires its entry, this listener removes the key from
 * localCache as well.
 * @param removalNotification
 */
@Override
public void onRemoval(RemovalNotification removalNotification) {
    if (log.isDebugEnabled()) {
        log.debug("Invalidating local cache for [entity: {}]",
            secretTokenizer.getHashedResult((String) removalNotification.getKey()));
    }
    localCache.invalidate(removalNotification.getKey());
}

This coordinated approach ensures both caches stay in sync and allows the JVM to reclaim memory when entries are no longer needed.

Business Benefits

1. Reduced Transaction Costs

By localizing data access, this caching strategy dramatically reduces Redis operations:

- Without cache: Each incoming signal requires a Redis read and write.
- With cache: Redis reads occur once per expiry interval per entity.
- Cost reduction: For high-volume systems processing millions of events, this can translate to a 95%+ reduction in Redis operations.

2. Improved Latency

Stream processing latency is significantly reduced:

- Without cache: 10-20 ms per signal (Redis round-trip)
- With cache: Sub-millisecond response for cached entities
- Benefit: Lower processing latency leads to more responsive systems

3. Enhanced Resilience

The write-through cache provides resilience against temporary Redis failures:

- Processing can continue with cached data during short outages
- Writes can be queued and flushed when connectivity is restored
- System availability increases while dependency on Redis availability decreases

4. Optimized Cost Structure

Beyond direct transaction cost reduction:

- Infrastructure savings: Reduced Redis cluster size needed for the same workload
- Scaling efficiency: More efficient processing means fewer nodes needed for the same throughput
- Operational simplicity: Less complex Redis scaling and management

Considerations and Best Practices

While implementing this pattern, keep these guidelines in mind:

- Cache size tuning: Set the maximum cache size based on entity cardinality and available memory.
- Expiry intervals: Balance freshness needs against Redis load.
- Monitoring: Implement cache hit/miss metrics to verify efficiency.
- Consistency controls: Consider adding version stamps for conflict resolution.
- Warm-up strategy: Pre-populate caches for critical entities during startup.

Conclusion

A well-implemented write-through cache can transform the performance and cost profile of entity-based stream processing systems. By intelligently caching entity signals at the processing node level and coordinating with a backing Redis store, organizations can achieve both high performance and data consistency while significantly reducing infrastructure costs. This pattern is particularly valuable for high-throughput systems where entities exhibit locality (the same entities appear repeatedly in the stream) and where signal history needs to be maintained and frequently accessed during processing. Whether you're building a fraud detection system, user behavior analytics, or an IoT data processor, consider implementing this write-through caching pattern to boost performance while keeping your database costs under control.

By Rohith Narasimhamurthy
AI on the Fly: Real-Time Data Streaming From Apache Kafka to Live Dashboards

In the current fast-paced digital age, many data sources generate an unending flow of information: a never-ending torrent of facts and figures that, while perplexing when examined separately, provides profound insights when examined together. Stream processing is useful in this situation. It fills the void between real-time data collection and actionable insights; it is a data processing practice that handles continuous data streams from an array of sources.

Real-time data streaming has started to have an important impact on modern AI models for applications that need quick decisions. Consider a few examples where AI models must deliver instant decisions: self-driving cars, fraud detection in stock market trading, and smart factories that use sensors, robots, and data analytics to automate and optimize manufacturing processes. Real-time data streaming plays a key role for AI models because it allows them to handle and respond to data as it comes in, instead of relying only on old, fixed datasets. This speed matters a lot for tasks where quick choices make a big difference, like spotting fraud in money transfers, tweaking suggestions in online shops, or steering self-driving cars, as in the examples above. By leveraging real-time data, AI models can maintain a current understanding of their environment, adapt quickly to changes, and improve performance through continuous updates. What's more, real-time streaming helps AI work in edge computing and IoT setups where quick processing is often needed. Without the ability to work in real time, AI systems risk becoming stale, slow to react, and less useful in fast-moving, data-heavy settings.

From Source to Streams

Over the past few years, Apache Kafka has emerged as the leading standard for streaming data. Fast-forward to the present day: Kafka has achieved ubiquity, being adopted by at least 80% of the Fortune 100. Kafka's architectural versatility makes it exceptionally suitable for streaming data at vast 'internet' scale, ensuring the fault tolerance and data consistency crucial for supporting mission-critical applications. Apache Flink is a high-throughput, unified batch and stream processing engine, renowned for its capability to handle continuous data streams at scale. It integrates seamlessly with Kafka and offers robust support for exactly-once semantics, ensuring each event is processed precisely once, even amid system failures. Flink emerges as a natural choice as a stream processor for Kafka. While Flink enjoys significant success and popularity as a tool for real-time data processing, accessing sufficient resources and current examples for learning Flink can be challenging.

Turning Streams Into Insights

After ingesting a real-time data stream into a multi-node Apache Kafka cluster and subsequently integrating it with a Flink cluster, the streaming data can be enhanced, filtered, aggregated, and altered. This has a significant impact on AI systems, as it enables real-time feature engineering to take place before feeding the data to models. As an example of such an AI system, consider TensorFlow, an open-source platform and framework for machine learning, including libraries and tools based on Python and Java, designed for training machine learning and deep learning models on data.
Here is pseudocode in Java that demonstrates how we can pass processed stream data from Apache Flink to a TensorFlow AI model. We use Flink's DataStream API to ingest streaming data from a Kafka topic, parse and process the data, and eventually send the processed data to TensorFlow for prediction.

Java

// Flink execution environment (pseudocode assumes the usual setup)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Step 1: Ingest streaming data from a source (e.g., Kafka)
DataStream<String> rawStream = env.addSource(new FlinkKafkaConsumer<>(...));

// Step 2: Parse and process the data; extract features suitable for model input
DataStream<FeatureVector> processedStream = rawStream
    .map(new ParseAndTransformFunction());

// Step 3: Send processed features to TensorFlow for prediction
DataStream<PredictionResult> predictions = processedStream
    .map(new RichMapFunction<FeatureVector, PredictionResult>() {
        // If calling a TensorFlow SavedModel locally
        transient SavedModelBundle model;

        @Override
        public void open(Configuration parameters) {
            // Load the TensorFlow SavedModel from a local or distributed file system
            model = SavedModelBundle.load("/path/to/saved_model", "serve");
        }

        @Override
        public PredictionResult map(FeatureVector input) throws Exception {
            // Convert FeatureVector to Tensor
            Tensor inputTensor = Tensor.create(input.toTensorArray());
            // Run inference
            Tensor resultTensor = model.session()
                .runner()
                .feed("input_tensor_name", inputTensor)
                .fetch("output_tensor_name")
                .run()
                .get(0);
            // Parse and return the result
            return new PredictionResult(resultTensor);
        }

        @Override
        public void close() {
            model.close(); // Close the model session
        }
    });

// ALTERNATIVE: If using a REST API to call TensorFlow Serving
DataStream<PredictionResult> predictionsViaAPI = processedStream
    .map(new MapFunction<FeatureVector, PredictionResult>() {
        @Override
        public PredictionResult map(FeatureVector input) throws Exception {
            // Serialize the feature vector as JSON
            String jsonPayload = serializeToJSON(input);
            // Send an HTTP POST request to the TensorFlow Serving REST API
            String response = HttpClient.post("http://localhost:8501/v1/models/model:predict", jsonPayload);
            // Parse the response and return the prediction
            return parsePredictionFromJSON(response);
        }
    });

// Step 4: Use predictions (e.g., sink to a database, alerting system, etc.)
predictions.addSink(new MyPredictionSink());

// Execute the pipeline
env.execute("Flink + TensorFlow Streaming Inference");

What’s Next From TensorFlow?

Moving predictions from a TensorFlow model to Grafana for dynamic visualization isn't straightforward; we need a few steps in between. Grafana is a multi-platform, open-source analytics and interactive visualization web application. It doesn't work with machine learning models directly; instead, it connects to databases that store data over time. For continuous predictions, we can use InfluxDB, TimescaleDB (a PostgreSQL extension), or another time-series database. This approach makes it ideal for deploying and tracking models in production, supporting real-time monitoring, historical trend analysis, and ML model observability.
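As a small illustration of that hand-off, here is a hypothetical Python sketch that writes each prediction to InfluxDB using the influxdb-client package; Grafana can then query the bucket as a regular time-series data source. The URL, token, org, bucket, and measurement names are placeholders, not values from this article.

Python

# Hypothetical sketch: persist model predictions to InfluxDB so Grafana can
# chart them. Connection details and names below are placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

def record_prediction(model_name: str, score: float) -> None:
    # One point per prediction; Grafana queries the "ml-monitoring" bucket
    point = (
        Point("model_predictions")
        .tag("model", model_name)
        .field("score", score)
    )
    write_api.write(bucket="ml-monitoring", record=point)

record_prediction("fraud-detector-v1", 0.93)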
Conclusion

In today's world, where every millisecond counts, closing the gap between AI and real-time data isn't just a technical achievement; it gives us an edge over competitors. When we stream data through Apache Kafka and display insights right away on live dashboards, we are not just watching what's happening now, we are shaping it too. This real-time AI system turns raw data into instant intelligence, whether it's spotting unusual patterns, boosting recommendation systems, or guiding operational choices. As information flows faster, those who can respond to it will lead the way. So, connect to the data stream, let our models think, and breathe life into our dashboards. Of course, there are numerous technical problems to solve along the way, from data cleansing to choosing the right deployment strategy and architectural approach. Thank you for reading! If you found this article valuable, please consider liking and sharing it.

By Gautam Goswami DZone Core
Blockchain, AI, and Edge Computing: Redefining Modern App Development

The overall landscape of app development is undergoing a transformative shift driven by the latest technologies, including artificial intelligence (AI), edge computing, and blockchain. These innovations enhance the efficiency and functionality of apps, add new layers of security, improve scalability, and elevate the user experience. Adoption of these technologies is high among app development companies, which use them to optimize app performance. This article examines how these technologies interconnect, and what their specific contributions and impacts are on modern app development.

Role of Blockchain in App Development

Blockchain has become one of the most useful technologies for app development, best known for powering cryptocurrencies such as Bitcoin (Gad et al. 2022). Because it is decentralized, blockchain delivers major benefits, including transparency, improved security, and data immutability. Over the past two decades, the emergence of smartphones, mobile internet, and applications has significantly boosted the e-commerce industry, playing a substantial role in economic growth. Blockchain removes the need for a central authority and reduces single points of failure. In addition, modern cryptographic techniques protect data from unauthorized access, and the transaction history is visible to and verifiable by all participants. Blockchain shapes application development by automating processes through smart contracts, delivering secure digital identities, and ensuring better transparency in supply chains and logistics.

Role of AI in App Development

AI is changing both how apps are developed and how they function. It enables apps to learn user behavior and adapt in real time, improving decision-making, automation, and personalization (Badmus et al. 2024). AI's contribution is everywhere: ML models support dynamic adaptability and predictive analytics, NLP improves communication through voice assistants and chatbots, and computer vision enables image recognition as well as augmented reality features. Recent technologies have also addressed the bandwidth and latency issues of classic cloud computing architectures, primarily by supporting 5G networks; in this context, mobile edge computing (MEC) delivers emerging solutions that overcome various cloud computing issues. AI is applied differently across sectors: healthcare apps use it to help diagnose patients' conditions through data analysis, retail apps use it to personalize the shopping experience, and financial institutions such as banks adopt it for risk management and fraud detection.

Role of Edge Computing in App Development

Edge computing addresses the limitations of centralized cloud systems by processing data close to where it is generated, resulting in reduced latency, enhanced performance, and improved privacy (Mansouri and Babar, 2021). Low latency is critical for real-time applications such as IoT and gaming; processing at the edge also reduces the need for high-volume data transfers and helps maintain data privacy. Edge computing has many applications and is widely used in IoT devices such as autonomous vehicles, smart home products, and wearable gadgets.
This technology is also used in VR and AR, where real-time processing delivers immersive experiences to the user. In addition, it is widely used in telecommunication app development, as it improves 5G network capabilities and thereby enhances the customer experience. Currently, various stakeholders, including Google, IBM, Microsoft, and other multinationals, offer cloud-based services for steering enterprise computations toward edge computing.

Figure 1: Global edge computing market (Source: Market.us, 2024)

The global edge computing market is growing steadily worldwide. The figure above shows that the use of edge computing in app development is significant and is projected to peak by the end of 2033.

Figure 2: Edge computing market revenue worldwide (Source: Statista, 2024)

The global edge computing market is expected to reach nearly $350 billion by the end of 2027. The figure above shows that adoption keeps increasing: edge computing brings compute, network capabilities, and storage to the local network, which reduces latency, improves performance, and minimizes cost.

Figure 3: Global mobile app-based edge computing market (Source: Market.us, 2024)

The figure above shows the use of edge computing in app development across sectors; the IT and telecom industries make the heaviest use of this technology in their apps.

Conclusion

The adoption of AI, edge computing, and blockchain has redefined the boundaries of modern app development. Working together, these technologies address critical challenges while unlocking opportunities for innovation. Blockchain matters for app development because it ensures a high level of trust and security. AI provides intelligent automation, which boosts the overall performance and operations of an application. Edge computing, in turn, delivers and sustains high performance in the app, which improves the user experience. By embracing this triad, developers can create apps that are not only technologically current but also aligned with present-day demands of the digital landscape.

Further Improvement

Several steps can be taken to further improve how AI, blockchain, and edge computing enhance the app development process. Standardization is needed, with protocols developed for interoperability. Scalability must also be addressed, both the limitations of blockchains and the capabilities of edge devices. Stronger ethical safeguards are required to ensure that AI models are fair and transparent. Additionally, proper training and resources are necessary to develop expertise in these three technologies. By focusing on these areas, the integration and adoption of these transformative technologies can improve, which will be important for building smarter apps.

By Balasubramani Murugesan
Implementing Scalable IoT Architectures on Azure

The Internet of Things (IoT) comprises smart devices connected to a network, sending and receiving large amounts of data to and from other devices, which generates a substantial amount of data to be processed and analyzed. Edge computing, a strategy for computing on location where data is collected or used, allows IoT data to be gathered and processed at the edge rather than sent back to a data center or cloud. Together, IoT and edge computing are a powerful way to rapidly analyze data in real time.

In this tutorial, I lay out the components and considerations for designing IoT solutions based on Azure IoT and its services. Azure IoT offers a robust, flexible cloud platform designed to handle the massive data volumes, device management, and analytics that modern IoT systems demand.

Why Choose Azure IoT?

Key advantages include:

Scalability: Whether a handful of devices or millions, Azure’s cloud infrastructure scales effortlessly.
Security: Built-in end-to-end security features protect data and devices from cyber threats.
Integration: Seamlessly connects with existing Microsoft tools like Azure AI, Power BI, and Dynamics 365.
Global reach: Microsoft’s global data centers ensure low latency and compliance with regional regulations.

Core Azure IoT Components

Azure IoT Hub: Centralized management of IoT devices with secure, bi-directional communication.
Azure Digital Twins: Create comprehensive digital models of physical environments to optimize operations.
Azure Sphere: Secure microcontroller units designed to safeguard IoT devices from threats.
Azure Stream Analytics: Real-time data processing and analysis to enable immediate decision-making.

For businesses aiming for scale, Azure provides tools that simplify device provisioning, firmware updates, and data ingestion — all while maintaining reliability.

How to Build Scalable IoT Solutions With Azure IoT

With Azure IoT Hub, companies can manage device identities, monitor device health, and securely transmit data (a minimal telemetry sketch appears at the end of this section). This reduces manual overhead and streamlines operations.

Azure IoT’s layered security approach includes:

Hardware-based security modules (Azure Sphere)
Device authentication and access control
Data encryption at rest and in transit
Threat detection with Azure Security Center

This comprehensive security framework protects critical business assets.

Successfully leveraging Azure IoT requires deep expertise in cloud architecture, security, and integration. IoT consultants guide businesses through:

Solution design aligned with strategic goals
Secure device provisioning and management
Custom analytics and reporting dashboards
Compliance with industry regulations

This ensures rapid deployment and maximized ROI.
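To ground the device-to-cloud path described above, here is a minimal sketch in Python using the azure-iot-device SDK that sends one telemetry message to an IoT Hub. The connection string and payload fields are placeholders; production code would add retry logic and, typically, provisioning through the Device Provisioning Service.

Python

# Minimal sketch: sending one telemetry message to Azure IoT Hub.
# Assumes the azure-iot-device package (pip install azure-iot-device);
# the connection string and payload fields below are placeholders.
import json

from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

payload = {"temperature": 22.4, "humidity": 41.0}
message = Message(json.dumps(payload))
message.content_type = "application/json"
message.content_encoding = "utf-8"

client.send_message(message)  # Device-to-cloud telemetry
client.shutdown()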
Core Building Blocks of a Scalable IoT Solution

There are six foundational components:

Modular edge devices: Using devices capable of handling more data types, protocols, or workloads prepares the system for future enhancements.
Edge-to-cloud architecture: Real-time processing at the edge, with long-term analytics in the cloud, is critical for responsiveness and scale.
Scalable data pipelines: This includes event streaming, transformation, and storage layers that can dynamically adjust.
Centralized management and provisioning: Remote provisioning tools and cloud-based dashboards that support secure lifecycle management.
Future-ready analytics layer: Integrating a cloud-agnostic analytics engine capable of anomaly detection, predictive maintenance, and trend analysis.
API-first integration approach: APIs ensure that the IoT system can integrate with existing asset management tools and industry-specific software.

Mistakes to Avoid When Scaling IoT

Skipping a pilot that includes scale planning: Don’t just prove it works — prove it grows.
Building for today’s traffic only: Plan for 10X the number of devices and data volume.
Locking into one vendor without flexibility: Use open APIs and portable formats to reduce vendor risk.
Treating security as a plug-in: It must be designed from the start and built into every component.
Underestimating operational complexity: Especially when support, maintenance, and updates kick in.

Key Practical Challenges and Solutions for Scalable IoT

1. Edge Processing and Local Intelligence

Devices that only collect data aren’t scalable. They need to filter, compress, or even analyze data at the edge before sending it upstream. This keeps bandwidth manageable and lowers latency for time-sensitive decisions.

2. Cloud-Native Backend (Azure IoT)

The backend is where most scale issues live or die. Choose cloud-native platforms that provide:

Autoscaling message brokers (MQTT, AMQP)
Managed databases (for structured and time-series data)
Easy integrations with analytics tools
Secure API gateways

3. Unified Device Management

A pilot with 10 sensors is easy. Managing 10,000 across countries is not. Invest early in device lifecycle management tools that:

Handle provisioning, updates, and decommissioning
Track firmware versions and configurations
Provide automated alerts and health checks

This is where experienced IoT consultants can guide you in picking a platform that matches your hardware and business goals.

4. Scalable Security and Access Controls

Security is about ensuring that only the right users, systems, and apps have access to the right data. Key points to consider:

Role-based access control (RBAC)
Multi-tenant security layers (if you serve multiple customers or sites)
End-to-end encryption across every node
Regular key rotation and patch automation

Scalability means being able to onboard 500 new devices without creating 500 new headaches.

5. Data Governance and Normalization

Imagine 50 device types all reporting “temperature” — but each one does it differently. That’s why standardized data models and semantic labeling matter (a normalization sketch follows this section). Your architecture should include:

Stream processing for cleanup
Schema validation
Data cataloging and tagging
Integration with your BI and ML systems

A smart IoT strategy ensures you don’t drown in your own data once scale hits. Scalability in IoT isn’t about planning for massive growth — it’s about removing obstacles to growth when it happens. Whether it’s 10 sensors today or 10,000 tomorrow, your architecture should support the same performance, security, and agility.
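To illustrate the normalization point, here is a minimal Python sketch that maps two hypothetical device payload shapes reporting "temperature" into one canonical record. The payload field names and units are invented for the example, not a real vendor schema.

Python

# Minimal sketch: normalizing "temperature" from two hypothetical device
# payload shapes into one canonical record. Field names and units here are
# invented for illustration, not a real vendor schema.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CanonicalReading:
    device_id: str
    metric: str           # semantic label, e.g. "temperature"
    value_celsius: float  # single canonical unit
    recorded_at: datetime


def normalize(payload: dict) -> CanonicalReading:
    """Map a raw device payload to the canonical model."""
    if "temp_f" in payload:  # hypothetical vendor A reports Fahrenheit
        celsius = (payload["temp_f"] - 32) * 5.0 / 9.0
        device_id = payload["id"]
    elif "temperature" in payload:  # hypothetical vendor B reports Celsius
        celsius = payload["temperature"]["value"]
        device_id = payload["deviceId"]
    else:
        raise ValueError("unknown payload shape")
    return CanonicalReading(
        device_id=device_id,
        metric="temperature",
        value_celsius=round(celsius, 2),
        recorded_at=datetime.now(timezone.utc),
    )


print(normalize({"id": "sensor-a1", "temp_f": 72.5}))
print(normalize({"deviceId": "sensor-b7", "temperature": {"value": 21.3, "unit": "C"}}))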
As IoT continues to evolve, Azure will undoubtedly remain at the forefront of this exciting and transformative field, helping businesses drive innovation and stay competitive in an increasingly connected world. Learn more about Azure IoT here.

By Bhimraj Ghadge
Digital Twins Reborn: How AI Is Finally Fulfilling the Promise of IoT

Ten years ago, I wrote an article for DZone on The Future of IoT. When General Electric unveiled their digital twin technology for aircraft engines, we were on the cusp of an industrial revolution. The idea was compelling: create virtual replicas of physical assets that could be monitored, analyzed, and optimized in real time. However, as many early IoT enthusiasts discovered, the gap between concept and widespread implementation proved wider than anticipated. Fast-forward to 2025, and digital twins are experiencing a renaissance, powered by advances in artificial intelligence that address the challenges that once held the technology back.

The Evolution: From Static Models to Cognitive Replicas

Early digital twins were primarily data visualization tools with basic predictive capabilities. While groundbreaking, GE's aircraft engine twins required extensive manual configuration and domain expertise to extract actionable insights. Today's AI-enabled digital twins are fundamentally different. They're not just passive virtual representations but active cognitive models that learn, adapt, and even anticipate changes in their physical counterparts without explicit programming. What's changed is that modern digital twins don't just mirror physical systems—they understand them. Machine learning algorithms now identify patterns that human engineers might miss, creating a continuously improving feedback loop.

Breaking Down the Integration Barrier

Early IoT implementations required rare "full-stack" engineers comfortable with both embedded systems and cloud architecture. AI has changed this equation in several ways:

Self-configuring sensors: The next generation of IoT devices uses AI at the edge to auto-calibrate and adapt to their environment, requiring far less manual configuration.
Unified development platforms: Tools like Siemens' Xcelerator and Microsoft's Azure Digital Twins have evolved to bridge the gap between physical modeling and software development, creating environments where domain experts and software engineers can collaborate more effectively.
No-code/low-code interfaces: These allow subject-matter experts to interact directly with digital twin data without deep programming knowledge.

Many organizations report that they spent years trying to find engineers who could work across the full IoT stack. Now, with advanced tools, mechanical engineers can work with AI assistants that translate their domain knowledge into digital twin configurations, cutting implementation time by as much as 70% in some documented cases.

Real-World Impact: Where AI-Enabled Digital Twins Are Delivering Now

Predictive Maintenance That Works

Early predictive maintenance often generated excessive false alarms or missed critical failures. Today's AI-powered twins have dramatically improved accuracy. Norwegian oil company Equinor deployed AI-enhanced digital twins across offshore platforms, reducing unplanned downtime by 68% in 2024 with false positive rates below 3%, numbers that were unimaginable with previous generations of the technology.

Urban Infrastructure Management

Singapore and Barcelona are implementing comprehensive urban digital twins that integrate building systems, traffic patterns, energy usage, and environmental sensors. These systems don't just monitor; they actively optimize resource allocation based on real-time conditions and predictive modeling.
Singapore's digital twin anticipated rainfall patterns during a recent storm and automatically adjusted drainage systems, preventing flooding that would have affected thousands of residents.

Healthcare Systems Optimization

Hospital networks have deployed facility-wide digital twins that track everything from patient flow to equipment utilization. Mount Sinai Hospital in New York reported a 27% improvement in emergency department throughput after implementing an AI-enabled system that could predict patient surges hours in advance and automatically suggest staffing and resource adjustments.

The Breakthrough Technologies Driving Change

Several specific AI advances have been instrumental in the digital twin renaissance:

Foundation Models for Anomaly Detection

Large foundation models pre-trained on industrial equipment data can now detect subtle anomalies that would elude traditional rule-based systems. These models transfer knowledge between seemingly unrelated physical systems, identifying patterns that special-purpose algorithms miss.

Reinforcement Learning for Optimization

Digital twins now employ reinforcement learning algorithms that continuously test different configurations in the virtual environment before applying changes to physical systems. This allows for safe experimentation and optimization without risking operational disruptions.

Multimodal AI Integration

Modern digital twins integrate data from multiple sources—vibration sensors, thermal imaging, audio analysis, and visual inspection—and process them through specialized AI models to form a comprehensive understanding of physical assets.

Looking Ahead: The Next Five Years

The real transformation is just beginning. Here's what industry experts predict for the near future:

Self-Healing Systems

Digital twins will move beyond prediction to autonomous correction. Early examples include self-healing power grids that can reconfigure to prevent cascading failures, and manufacturing lines that automatically adjust parameters to maintain product quality despite component wear.

Cross-System Optimization

Digital twins of separate systems will increasingly interact with each other, creating networks of virtual models that optimize collectively. Imagine city traffic management systems coordinating with building energy management and public transportation to holistically optimize urban resource usage.

Knowledge Preservation and Transfer

As industries face workforce challenges from retiring experts, AI-enabled digital twins are becoming repositories of institutional knowledge, capturing the insights and intuition of experienced personnel and making them available to new generations of workers.

The Human Element Remains Essential

Despite these advances, the human element remains crucial. The most successful digital twin implementations pair AI capabilities with human expertise. The goal isn't to remove people from the equation; rather, it's to elevate human decision-making by handling the data complexity that overwhelms traditional analysis. Engineers aren't being replaced—they're being empowered.

For organizations that struggled with early IoT implementations, now may be the time to revisit digital twin technology with fresh eyes. The barriers of hardware-software integration and specialized talent requirements have significantly diminished, while the potential business value has only increased. As we look to the future, it seems the long-promised IoT revolution didn't fail—it was simply waiting for AI to catch up.

By Tom Smith DZone Core
Amazon EMRFS vs HDFS: Which One Is Right for Your Big Data Needs?

Amazon EMR is a managed service from AWS for big data processing. EMR is used to run enterprise-scale data processing tasks using distributed computing: it breaks down tasks into smaller chunks and uses multiple computers for processing, building on popular big data frameworks like Apache Hadoop and Apache Spark. EMR can be set up easily, enabling organizations to swiftly analyze and process large volumes of data without the hassle of managing servers. The two primary options for storing data in Amazon EMR are the Hadoop Distributed File System (HDFS) and the Elastic MapReduce File System (EMRFS).

HDFS is the traditional storage layer in Hadoop environments. It divides large data files into smaller segments and distributes them across a cluster of computers, replicating data across machines to enhance reliability and ensure fault tolerance. Because HDFS stores data directly on the machines within the cluster, it is a fast, low-latency option. However, it has limitations in terms of data capacity, and managing storage can become challenging as data volumes increase.

EMRFS, on the other hand, is an Amazon-specific file system that integrates seamlessly with Amazon EMR. EMRFS uses Amazon S3 to store data. This integration decouples compute and storage, allowing them to scale independently of each other. EMRFS is compatible with Hadoop applications, enabling Hadoop jobs to run on Amazon EMR while leveraging S3’s durability, availability, and performance.

An architect designing a big data application on EMR should weigh these two storage options. HDFS provides low latency and in-cluster processing, while EMRFS comes with the benefits of a managed service, along with phenomenal durability and scalability backed by S3. In the following sections, let’s dive deeper into the capabilities and limitations of these storage options.

The Key Differences Between AWS EMRFS and HDFS

EMRFS, being specifically designed for Amazon Web Services, seamlessly integrates with S3 to provide a highly scalable storage solution. Hence, it comes with all the benefits of S3; for example, it provides elastic storage scalability without the overhead of dealing with physical infrastructure, allowing applications to scale in or out effortlessly. Conversely, HDFS has been an integral part of big data systems for years. It ensures fault tolerance and accessibility by replicating data over multiple nodes, and it ensures the consistency of datasets. In summary, while EMRFS excels in cloud-native flexibility and scalability, HDFS stands out for its reliability and proven performance in traditional setups. The optimal choice between these technologies ultimately depends on the specific use case and non-functional requirements.

Performance Comparison

The biggest strength of HDFS is speed. It stores data across several nodes within the compute environment, so for workloads with iterative reads on the same dataset or disk I/O-intensive processing, HDFS provides ultra-low latency. However, since HDFS uses ephemeral storage, data stored in HDFS is deleted once the instances are terminated. EMRFS uses S3 to store data, so data is retained even after the Hadoop cluster is terminated. When dealing with massive datasets read once per run, EMRFS performs really well, providing a centralized platform to read and write large volumes of data. However, iterative reads from S3 are slower in comparison to HDFS. The sketch below illustrates how the two can be combined to work around this.
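As a rough illustration of combining the two, here is a minimal PySpark sketch, with placeholder bucket and path names, that loads a durable dataset from S3 via EMRFS, stages it on HDFS for fast repeated access, and writes the final output back to S3. This is a sketch of the staging pattern under those assumptions, not a tuned production job.

Python

# Minimal PySpark sketch of the hybrid pattern: S3 (EMRFS) for durable
# storage, HDFS for hot intermediate data. Bucket and path names are
# placeholders, not real resources.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emrfs-hdfs-hybrid").getOrCreate()

# 1. Load the durable dataset from S3 via EMRFS
transactions = spark.read.parquet("s3://my-bucket/transactions/2025-08-01/")

# 2. Stage cleaned intermediate data on HDFS for fast iterative access
cleaned = transactions.filter(F.col("amount") > 0).dropDuplicates(["txn_id"])
cleaned.write.mode("overwrite").parquet("hdfs:///tmp/transactions_cleaned/")

# 3. Iterate against the low-latency HDFS copy
staged = spark.read.parquet("hdfs:///tmp/transactions_cleaned/")
daily_totals = staged.groupBy("category").agg(F.sum("amount").alias("total"))

# 4. Persist the final output back to S3 before the cluster terminates
daily_totals.write.mode("overwrite").parquet("s3://my-bucket/output/daily_totals/")

spark.stop()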
In summary, whether you're leaning toward EMRFS or sticking with HDFS depends on your unique needs. If you have a need for speed, go with HDFS. Need durability? EMRFS has your back. The right choice will propel your large-scale data processing efforts to new heights of efficiency and effectiveness.

Cost Analysis

AWS has a pricing strategy that offers flexibility and scalability, making it an appealing choice for many organizations. EMRFS is built for the cloud, so with EMRFS you pay for what you use; there's no need to over-provision resources. With EMRFS, you don't need to provision core nodes for storage, since the data lives in S3. HDFS requires provisioning core nodes in the Hadoop cluster to store data, which adds compute costs; it also carries a replication cost, since it replicates data over several nodes. Overall, EMRFS provides better cost efficiency than HDFS.

Use Case Scenarios

A number of industry applications require big data solutions. Retailers often analyze consumer behavior data to optimize marketing strategies. Manufacturers run analytics over large volumes of data to monitor equipment health for predictive maintenance. To stay ahead of the curve and drive growth and innovation, organizations use big data systems to discover insights.

As an architect, it is critical to consider the non-functional requirements of an application. Dataset volume, durability, availability, data access patterns, and latency are a few of the key requirements to take into consideration. For example, a retailer analyzing customer behavior can run a nightly batch job on EMR over the day's customer activity. In this case, EMRFS can be used to store the dataset: the EMR job loads the dataset at the start and writes back the output after processing. This is the most cost-effective solution, since durability and volume matter more here than latency.

When it comes to real-time monitoring of equipment health, however, latency takes precedence over everything else. Relevant data should be quickly accessible and thus stored within the Hadoop cluster; here, HDFS outperforms EMRFS. With HDFS, though, it is essential to be mindful of data replication strategies; otherwise, costs can spiral quickly. Additionally, optimizing file sizes can dramatically improve performance during read and write operations.

Another great strategy is a hybrid approach: EMRFS as long-term storage, with HDFS used for caching intermediate results and as hot storage for in-flight processing. In this setup, the dataset is hosted on EMRFS, loaded onto HDFS at the start for faster processing, and removed after the job completes, as in the sketch above.

Conclusion

In conclusion, choosing between EMRFS and HDFS for your data strategy is an opportunity to optimize big data processing on Amazon EMR. Each option has its own strengths and limitations; use cases and application-specific goals and requirements ultimately determine the best fit. EMRFS can be the ideal choice if you're looking for scalability, flexibility, and smooth integration with other AWS services. HDFS, on the other hand, can be your preferred choice if you'd rather take a more conventional approach that performs well in a Hadoop ecosystem. And a hybrid approach can work wonders in specific scenarios.

By Satrajit Basu DZone Core
Debugging Distributed ML Systems

My ML model for categorizing transactions suddenly started classifying groceries as entertainment expenses. But why? What happened? I was looking at my personal finance dashboard and noticed something was completely off. The logs from each service looked normal. The health checks were green. Yet somehow, my grocery store purchases were being flagged as entertainment, and my restaurant bills were showing up as utilities.

For some background, I had recently broken my monolith finance tracker into multiple microservices. What used to be a single Flask app with traceable execution had become a distributed puzzle of API calls, Redis caches, and background jobs. When something went wrong, figuring out where and why had become a nightmare. After spending many late nights debugging issues that seemed to appear from nowhere, I finally implemented tracing using OpenTelemetry and Jaeger. It has honestly transformed my personal project from a debugging headache into something I actually enjoy maintaining. If you're running your own distributed ML project and finding yourself lost in a sea of logs, this guide will show you how to set up tracing infrastructure that makes debugging manageable instead of purely frustrating.

The Personal Project Debugging Nightmare

Let me give you a picture of what debugging used to look like in my finance tracker. After decomposing my monolith, I had five core services:

Transaction Processor: Ingests and cleans raw transaction data
Category Predictor: ML service for transaction categorization
Spending Analyzer: Computes spending patterns and insights
Fraud Detector: Identifies suspicious transactions
Budget Manager: Tracks budgets and sends alerts

A typical transaction processing flow would look like this:

A new transaction comes in via CSV upload or bank API
Transaction Processor cleans and validates the data
Category Predictor calls the ML model to classify the transaction
Fraud Detector checks for suspicious patterns
Spending Analyzer updates monthly statistics
Budget Manager checks if any budget limits were exceeded

When this worked smoothly, it felt like magic. When it didn't, debugging was pure pain. The categorization disaster I mentioned earlier started with nonsensical expense categories. Looking at individual service logs, everything seemed fine:

Transaction Processor: "Processed transaction ID 12845 ✓"
Category Predictor: "Classified 'Wholefoods#123' as 'entertainment' ✓"
Fraud Detector: "No suspicious patterns detected ✓"
Spending Analyzer: "Updated monthly totals ✓"

But clearly something was off, as my grocery shopping was being classified as entertainment. Without a way to trace a single transaction through all these services, I was basically guessing where the problem might be.

OpenTelemetry: The Game Changer for Personal Projects

OpenTelemetry might sound like overkill for a personal project, but it's worth the setup effort. The basic idea is simple: every request gets a unique trace ID that follows it through your entire system, and each service operation becomes a "span" within that trace.
Here's how I implemented it across my Python services running in Docker containers:

Python

# requirements.txt additions for all services
opentelemetry-api==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-instrumentation-flask==0.42b0
opentelemetry-instrumentation-requests==0.42b0
opentelemetry-exporter-jaeger-thrift==1.21.0

# shared/tracing.py - Common tracing setup
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def setup_tracing(service_name):
    # Set up the tracer provider
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    # Configure the Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name=os.getenv("JAEGER_HOST", "jaeger"),
        agent_port=int(os.getenv("JAEGER_PORT", "6831")),
    )

    # Add the span processor
    span_processor = BatchSpanProcessor(jaeger_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

    # Auto-instrument Flask and requests
    FlaskInstrumentor().instrument()
    RequestsInstrumentor().instrument()

    return tracer

# category_predictor/app.py - Example service implementation
from flask import Flask, request, jsonify
import joblib
import requests
from shared.tracing import setup_tracing

app = Flask(__name__)
tracer = setup_tracing("category-predictor")

# Load the categorization model
model = joblib.load('models/category_classifier.pkl')

@app.route('/categorize', methods=['POST'])
def categorize_transaction():
    transaction_data = request.json

    with tracer.start_as_current_span("categorize_transaction") as span:
        span.set_attribute("transaction.id", transaction_data.get('id'))
        span.set_attribute("transaction.amount", transaction_data.get('amount'))
        span.set_attribute("merchant", transaction_data.get('description', '')[:50])

        try:
            # Get user spending patterns for context
            user_id = transaction_data.get('user_id')
            with tracer.start_as_current_span("fetch_user_context") as context_span:
                context_response = requests.get(
                    f"http://spending-analyzer:5000/patterns/{user_id}"
                )
                context_span.set_attribute("context.status_code", context_response.status_code)
                user_context = context_response.json() if context_response.ok else {}

            # Prepare features for the ML model
            with tracer.start_as_current_span("prepare_features") as feature_span:
                features = prepare_transaction_features(transaction_data, user_context)
                feature_span.set_attribute("features.count", len(features))
                feature_span.set_attribute("features.has_context", len(user_context) > 0)

            # Run the ML model prediction
            with tracer.start_as_current_span("model_prediction") as model_span:
                category = model.predict([features])[0]
                confidence = model.predict_proba([features]).max()
                model_span.set_attribute("prediction.category", category)
                model_span.set_attribute("prediction.confidence", float(confidence))
                model_span.set_attribute("model.version", "v1.2.3")

            span.set_attribute("result.category", category)
            span.set_attribute("result.confidence", float(confidence))

            return jsonify({
                'category': category,
                'confidence': float(confidence)
            })

        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            return jsonify({'error': str(e)}), 500

def prepare_transaction_features(transaction, user_context):
    """Extract features for the ML model"""
    description = transaction.get('description', '').lower()
    amount = float(transaction.get('amount', 0))

    features = [
        len(description),
        amount,
        1 if 'grocery' in description or 'kroger' in description else 0,
        1 if 'restaurant' in description or 'cafe' in description else 0,
        user_context.get('avg_grocery_amount', 50.0),  # User's average grocery spending
        user_context.get('grocery_frequency', 0.1),    # How often they buy groceries
    ]
    return features

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

The beauty of this setup is that OpenTelemetry automatically traces HTTP requests between services. I only needed to add manual spans for ML-specific operations like feature preparation and model inference.

Setting Up Jaeger: Trace Dashboard

Getting Jaeger running alongside my finance tracker services was straightforward with Docker Compose:

YAML

# docker-compose.yml
version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"    # Jaeger UI
      - "6831:6831/udp"  # Jaeger agent
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - finance-network

  transaction-processor:
    build: ./transaction-processor
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
      - redis
    networks:
      - finance-network

  category-predictor:
    build: ./category-predictor
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
    networks:
      - finance-network

  spending-analyzer:
    build: ./spending-analyzer
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
      - redis
    networks:
      - finance-network

  fraud-detector:
    build: ./fraud-detector
    environment:
      - JAEGER_HOST=jaeger
      - JAEGER_PORT=6831
    depends_on:
      - jaeger
    networks:
      - finance-network

  redis:
    image: redis:alpine
    networks:
      - finance-network

networks:
  finance-network:
    driver: bridge

After running docker-compose up, I could access the Jaeger UI at http://localhost:16686 and see traces from all my services. The first time I saw a complete transaction flow visualized end-to-end was a good feeling.

Solving the Great Categorization Bug

Remember that mysterious categorization bug I mentioned? Here's exactly how distributed tracing helped me solve it. With tracing in place, I could finally follow individual problematic transactions through the entire pipeline. I found a recent transaction that had been miscategorized and pulled up its trace in Jaeger. The trace looked normal at first glance: all services responded successfully, and the flow proceeded as expected. But when I examined the spans more carefully, I noticed something weird in the "fetch_user_context" span:

Plain Text

span: fetch_user_context
  transaction.id = "12845"
  user_id = "user123"
  context.status_code = 200
  context.cache_key = "patterns:user12"  // Missing digit!
  context.cache_hit = true
  patterns.user_id = "user12"            // Wrong user!

The smoking gun was right there in the span attributes. My Redis cache key was getting truncated due to a string formatting bug, so user123's categorization was using spending patterns from user12. No wonder the predictions were nonsensical—the model was getting context from someone with completely different spending habits. The bug was in my caching code:

Python

# The buggy version
def get_user_patterns(user_id):
    cache_key = f"patterns:{user_id[:6]}"  # BUG: Truncating the user ID!
    cached = redis.get(cache_key)
    # ... rest of the logic

Python

# The fixed version
def get_user_patterns(user_id):
    cache_key = f"patterns:{user_id}"  # Fixed: Use the full user ID
    cached = redis.get(cache_key)
    # ... rest of the logic

Without distributed tracing, this bug would have taken me days to track down. I would have suspected the ML model, the feature engineering, the data preprocessing—everything except a caching bug. With tracing, I found and fixed it in 20 minutes.

By Ramya Boorugula
WAN Is the New LAN!?!?

For decades, the Local Area Network (LAN) was the heart of enterprise IT. It represented the immediate, high-speed connectivity within an office or campus. But in today's cloud-first, globally distributed world, the very definition of "local" has expanded. The Wide Area Network (WAN) was long considered the most expensive link; today, however, its agility and intelligent fabric make it more reliable and allow the LAN to effectively expand globally. The paradigm shift is clear: "WAN is the new LAN."

This transformation didn't happen overnight; it took more than two decades of research and innovation. It's a journey that began with the limitations of traditional Multiprotocol Label Switching (MPLS) infrastructure, evolved through the revolutionary capabilities of Software-Defined Wide Area Networking (SD-WAN), and is now culminating in the promise of hyper-scale Cloud WAN.

The Reign of MPLS

In the early 2000s, MPLS was the undisputed king of enterprise WANs. Enterprises relied heavily on MPLS-based circuits to connect their data centers and branch offices with guaranteed Quality of Service (QoS). With MPLS, the path is known in advance: packets travel along predefined, high-speed routes, ensuring reliability and high performance for mission-critical applications like voice and video that need real-time streaming. However, MPLS has significant challenges:

High Costs: MPLS circuits are too expensive for most mid-size companies and startups to adopt, and bandwidth upgrades are also an expensive affair.
Lack of Flexibility: Adding new sites or increasing capacity was a lengthy, complex, and often manual process, involving weeks or even months of provisioning. This rigidity made it difficult for businesses to adapt to rapid changes.
Cloud Incompatibility: As applications migrated to the cloud (SaaS, IaaS), MPLS's hub-and-spoke architecture forced cloud traffic to "backhaul" through a central data center. This introduced latency, negated cloud benefits, and created bottlenecks.
Limited Visibility and Control: Enterprises often lack granular control over the MPLS network, and since the path is predefined, a failure along the path means the service provider must help troubleshoot the dropped traffic.

The rise of cloud computing and the distributed workforce exposed MPLS's limitations, paving the way for a more dynamic solution.

SD-WAN: The Agile Overlay Revolution

The mid-2010s ushered in the era of Software-Defined Wide Area Networking (SD-WAN), a game-changer that addressed many of MPLS's shortcomings. SD-WAN decouples the network control plane from the underlying hardware, creating a virtual overlay that can intelligently utilize various transport methods, including inexpensive broadband internet, LTE, and even existing MPLS circuits. Key advantages of SD-WAN over traditional MPLS include:

Cost Efficiency: SD-WAN uses readily available, less expensive internet broadband for communication, reducing WAN costs by 50-60% compared to MPLS.
Enhanced Agility and Flexibility: Centralized, software-based management allows for rapid deployment of new sites, quick policy changes, and dynamic traffic steering. New branches can be brought online in days, not months, often with zero-touch provisioning.
Optimized Cloud Connectivity: SD-WAN performs destination-based routing and steers cloud-bound traffic directly to the internet instead of through the data center, improving application performance and reducing round-trip time. It understands individual applications and their SLA requirements, ensuring optimal traffic delivery.
Improved Performance and Resiliency: SD-WAN actively monitors network conditions across multiple links, automatically selecting the best path for applications and providing sub-second failover in case of an outage. This built-in redundancy dramatically increases network resilience.
Centralized Management and Visibility: A single pane of glass provides comprehensive visibility into network performance, application usage, and security policies, empowering IT teams with greater control.

SD-WAN quickly moved from emerging tech to mainstream, with nearly 90% of enterprises rolling out some form of it by 2022. It became the enabling technology for a cloud-centric world, making the public internet the new enterprise WAN backbone.

Fast Forward to Cloud WAN: The Planet-Scale Network as a Service

While SD-WAN brought immense benefits, the increasing complexity of multi-cloud environments, distributed workforces, and the burgeoning demands of AI and IoT workloads have led to the next evolution: Cloud WAN. Cloud WAN represents a shift from managing fragmented network components to consuming a fully managed, globally distributed network as a service. Hyperscale cloud providers, like Google Cloud, are now extending their massive private backbone networks, traditionally used for their own services, directly to enterprises. Google Cloud WAN, for example, leverages Google's planet-scale network, encompassing millions of miles of fiber and numerous subsea cables, to provide a unified, high-performance, and secure enterprise backbone. It's designed to simplify global networking through:

Unified Global Connectivity: Connecting geographically dispersed data centers, branch offices, and campuses over a single, highly performant backbone, acting as a modern alternative to traditional WANs.
Simplified Management: Abstracting the underlying network complexity and providing a policy-based framework for declarative management. This means enterprises can focus on business requirements rather than intricate technical configurations.
Optimized for the AI Era: Designed to handle the low-latency, high-throughput demands of AI-powered workloads and other data-intensive applications, offering up to 40% faster performance compared to the public internet.
Cost Savings: By consolidating network infrastructure and leveraging a managed service, Cloud WAN can offer significant total cost of ownership (TCO) reductions (e.g., up to 40% savings over customer-managed solutions).
Integrated Security: Cloud WAN solutions often come with integrated, multi-layered security capabilities, ensuring consistent security policies across the entire network.
Flexibility and Choice: While a managed service, Cloud WAN platforms often integrate with leading SD-WAN and SASE (Secure Access Service Edge) vendors, allowing enterprises to protect existing investments and maintain consistent security policies.

The "WAN is the new LAN" paradigm isn't just about faster connections; it's about a fundamental shift in how enterprises approach their global network. It's about consuming connectivity as a seamless, software-defined service that adapts to business needs rather than a static, hardware-centric infrastructure. As businesses continue their digital transformation journeys, embracing hybrid and multi-cloud strategies and leveraging advanced technologies like AI, the evolution of WAN to Cloud WAN will be critical to unlocking their full potential. The network is no longer just a utility; it's a strategic enabler, performing with the speed, agility, and intelligence once reserved for the most localized of networks.

By Harika Rama Tulasi Karatapu

Top IoT Experts


Tim Spann

Senior Sales Engineer,
Snowflake

Tim Spann is a Senior Sales Engineer. He works with Python, SQL, Snowflake, Cortex AI, Apache Iceberg, ML, Jupyter Notebooks, generative AI, LLMs, vectors, Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Delta Lake, Apache Spark, big data, IoT, cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal, and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton and NYC on big data, cloud, IoT, deep learning, streaming, NiFi, blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit, and many more. He holds a BS and MS in computer science.

Alejandro Duarte

Developer Relations Engineer,
MariaDB

Alejandro Duarte is a Software Engineer, published author, Developer Relations Engineer at MariaDB, and consultant in software development. Alejandro has been programming computers since the mid-90s. Starting with BASIC, Alejandro transitioned to C, C++, and Java during his academic years at the National University of Colombia. He relocated first to the UK and then to Finland to foster his involvement in the open-source industry. Alejandro is a recognized figure in Java and MariaDB circles, with articles, videos, and presentations that amass millions of views. You can contact him through his personal blog at programmingbrain dot com or on most of the popular social media networks such as LinkedIn, X, Bluesky, and others.

The Latest IoT Topics

A Guide to Using Browser Network Calls for Data Processing
This article will cover how I got the viral Dubai Chocolate Pistachio Shake using basic network calls and built a scalable cloud infrastructure for ML services.
September 29, 2025
by Rajesh Vakkalagadda
· 1,649 Views · 3 Likes
Networking’s Open Source Era Is Just Getting Started
Open source is transforming networking from slow, standards-driven protocols into agile, programmable, Kubernetes-ready infrastructure.
September 26, 2025
by Nico Vibert
· 2,259 Views · 2 Likes
Implementing a Multi-Agent KYC System
Multi-agent KYC architectures use specialized AI agents to automate document verification, risk assessment, and compliance decisions with full audit trails.
September 26, 2025
by Narendhira Chandraseharan
· 1,282 Views · 2 Likes
LLMs at the Edge: Decentralized Power and Control
Deploying LLMs at the edge decentralizes intelligence, enhances privacy, reduces latency, increases autonomy, and empowers local control.
September 23, 2025
by Bhanuprakash Madupati
· 10,221 Views · 2 Likes
Stop Reactive Network Troubleshooting: Monitor These 5 Metrics to Prevent Downtime
The difference between reactive and proactive monitoring comes down to tracking the right network metrics and catching issues before they impact users.
September 22, 2025
by Sascha Neumeier
· 1,262 Views · 2 Likes
Azure IOT Cloud-to-Device Communication Methods
Learn how to choose the correct cloud-to-device communication method for sending messages to devices using Azure IoT Hub to build an effective IoT system.
September 22, 2025
by Anup Rao
· 1,384 Views
How TBMQ Uses Redis for Persistent Message Storage
Learn how we use Redis to scale MQTT session persistence in TBMQ — replacing PostgreSQL and improving performance with Lua scripts and the Lettuce async client.
September 16, 2025
by Dmytro Shvaika
· 3,902 Views · 19 Likes
AI on the Fly: Real-Time Data Streaming From Apache Kafka to Live Dashboards
Real-time data streaming plays a key role for AI models as it allows them to handle and respond to data as it comes in, instead of just using old fixed datasets.
September 11, 2025
by Gautam Goswami DZone Core
· 4,833 Views
Implementing Write-Through Cache for Real-Time Data Processing: A Scalable Approach
Learn how a write-through local cache boosts performance and consistency in high-volume data streams while reducing database load.
August 29, 2025
by Rohith Narasimhamurthy
· 1,690 Views · 2 Likes
Implementing Scalable IoT Architectures on Azure
Microsoft’s Azure IoT platform has emerged as a leading choice, powering innovative solutions across industries — from manufacturing floors to smart buildings.
August 27, 2025
by Bhimraj Ghadge
· 2,245 Views · 3 Likes
Blockchain, AI, and Edge Computing: Redefining Modern App Development
Blockchain, AI, and edge computing are advancing modern app development by enhancing security, performance, and automation.
August 27, 2025
by Balasubramani Murugesan
· 2,042 Views · 1 Like
Digital Twins Reborn: How AI Is Finally Fulfilling the Promise of IoT
AI is revolutionizing digital twin technology, finally overcoming IoT's integration challenges and delivering real business value across industries.
August 27, 2025
by Tom Smith DZone Core
· 1,772 Views · 10 Likes
Debugging Distributed ML Systems
My ML model misclassified groceries as entertainment. Distributed tracing with OpenTelemetry and Jaeger helped me quickly find a caching bug causing it.
August 25, 2025
by Ramya Boorugula
· 998 Views
Amazon EMRFS vs HDFS: Which One Is Right for Your Big Data Needs?
Unlock the potential of your data strategy. Discover how EMRFS and HDFS can optimize big data processing on Amazon EMR. Make an informed choice for success.
August 15, 2025
by Satrajit Basu DZone Core
· 2,007 Views · 2 Likes
How IoT Devices Communicate With Alexa, Google Assistant, and HomeKit — A Developer’s Deep Dive
Discover how voice assistants like Alexa, Google Assistant, and Siri communicate with IoT devices through cloud APIs, secure protocols, and smart home hubs.
August 14, 2025
by Praveen Chinnusamy
· 8,285 Views · 5 Likes
WAN Is the New LAN!?!?
Enterprises are replacing legacy LAN and MPLS with agile, cloud-optimized WANs powered by SD-WAN and Cloud WAN to meet global, digital demands.
August 1, 2025
by Harika Rama Tulasi Karatapu
· 2,139 Views · 5 Likes
Analysis of the Data Processing Framework of Pandas and Snowpark Pandas API
This is a process analysis of migrating existing Pandas workflows to an almost lift-and-shift approach using the Snowpark Pandas API to meet ever-growing data needs.
July 15, 2025
by Prasath Chetty Pandurangan
· 2,013 Views · 1 Like
How Developers Are Driving Supply Chain Innovation With Modern Tech
This blog shows how developers use modern tech to transform supply chains with smart, code-first solutions and real-time systems.
July 7, 2025
by Jatin Lamba
· 1,426 Views
Orchestrating Edge Computing with Kubernetes: Architectures, Challenges, and Emerging Solutions
Learn about orchestrating edge computing with Kubernetes, its architectural frameworks, operational challenges, and state-of-the-art solutions.
July 7, 2025
by Venkatesan Thirumalai
· 2,229 Views · 4 Likes
The Shift to Open Industrial IoT Architectures With Data Streaming
Modernize OT with Data Streaming, Kafka, MQTT, and OPC-UA to replace legacy middleware, enabling real-time data + scalable cloud integration.
July 4, 2025
by Kai Wähner DZone Core
· 2,312 Views · 1 Like
