DZone Spotlight

Tuesday, March 10
Best Practices to Make Your Data AI-Ready

By Mykhailo Kopyl
The key problem organizations encounter when implementing AI is not the technology itself, but the data needed to feed AI models. Many companies have plenty of data, but when it comes to quality, it often turns out to be messy, inconsistent, or biased. If you want your AI investments to deliver real value, you must make your data AI-ready first. Below, I share some best practices for building an AI-ready culture and establishing a data management framework that ensures high-quality data pipelines for AI initiatives.

Start with Understanding Which Data You Need

AI readiness begins with use cases. You need to understand what type of data, and how much of it, you require to build an efficient data analytics platform. Start by defining how AI will change a specific process, decision, or metric for your company. A good AI data strategy aligns data usage with business goals. This approach prevents you from investing time and resources in cleaning data you won't use. Trust me, it can greatly reduce the cost of your AI projects.

Once you have defined your use cases, specify the exact data requirements, including formats, fields, latency, and more. A common mistake I see is making vague statements instead of focused specifications. For example, "customer data" is too broad; it's better to break it down into specific fields like "customer ID," "email address," and "signup date." This makes validation concrete and automatable.

Build Strong Data Governance and Ownership

One thing I know for sure is that AI projects fail fast if no one owns the data quality process. You need someone in your organization accountable for field definitions, data catalogs, access policies, and quality metrics. Without clear ownership, data changes often go unnoticed. Governance should also enforce role-based access, encryption standards, and lineage tracking so that data is traceable from source to model input. These measures help you comply with regulations like GDPR while also reducing risk in AI decision-making.

Use Metadata and Catalogs to Make Data Discoverable

Metadata helps you quickly understand what each dataset contains, how it was created, and how it changes over time. This makes data easy to find for analysts and AI engineers. Build or use a data catalog that:

• indexes tables, schemas, and fields;
• documents ownership and definitions;
• tracks lineage and usage.

Metadata catalogs also serve as the basis for trust and reproducibility. When someone knows exactly where a dataset came from and how it has been transformed, they can validate that the model is working with reliable inputs.

Maintain a Central Data Platform

Data silos are a common problem for most organizations. Implementing data analysis in healthcare, I experienced this firsthand: data tied up in departmental systems slows discovery and increases fragmentation. I'm not saying you need an "everything goes here" system; that would be risky. But you do need a data management layer that lets you find, query, and monitor data from a single place. Think of it like a shared library.

Start by registering your most critical datasets, not everything at once. Document ownership, field definitions, refresh frequency, and known quality issues. Standardize access through shared query interfaces, whether teams use SQL, APIs, or other tools. Also, build quality checks directly into pipelines, adding validation rules for freshness, completeness, and schema changes at ingestion.

Track and Improve Quality Continuously

AI models require fresh data for retraining, so ensuring data quality is an ongoing process. Automate checks and set thresholds that trigger alerts. This allows your team to intervene before issues become costly problems. If a pipeline breaks or a critical field starts missing values, you should know before a model retrains on bad data.

Once models are live, monitor their outputs and link them back to data quality signals. If a model consistently makes errors tied to certain data fields, trace the issue back and fix it upstream.

Test AI Readiness Before Full Deployment

Implementing AI iteratively has become best practice, and the same applies to testing data for AI readiness. Before committing to full production, run small pilot projects to validate that data quality is sufficient and measure whether the dataset actually supports the business use case.

In one project I worked on, we tried to build an employee attrition model using HR system data and moved too quickly toward implementation. We assumed core fields like job level, manager ID, and role history were reliable. During model testing, we realized that role changes were overwritten instead of tracked over time. As a result, the model learned misleading patterns. We had to step back, redesign the data model, and introduce proper history tracking before continuing. Pilot tests like this help catch gaps and adjust quality standards without significant risk.

Wrapping Up

AI success depends on data that is complete, accurate, and structured. Models trained on partial or inconsistent data will perform poorly and produce misleading results. In this article, I intentionally didn't focus on cleaning and preparing a specific dataset, but rather on building a framework for effective data management in organizations pursuing AI projects. To see real results from your AI initiatives, ensure a consistent and reliable data flow. This reduces costly errors and transforms data into a strategic asset rather than just a byproduct of operations.
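The field-level specifications and pipeline checks described above (concrete fields instead of vague "customer data," plus completeness rules with alert thresholds) can be sketched as a small validation routine. This is an illustrative sketch only; the field names, validators, and thresholds are made-up examples, not a reference to any particular data-quality tool:

```typescript
// Illustrative data-quality check: field specs plus a batch completeness ratio.
// All field names and rules here are hypothetical examples.
interface FieldSpec {
  name: string;
  required: boolean;
  validate: (value: unknown) => boolean;
}

const customerSpec: FieldSpec[] = [
  { name: 'customer_id', required: true, validate: (v) => typeof v === 'string' && v.length > 0 },
  { name: 'email', required: true, validate: (v) => typeof v === 'string' && v.includes('@') },
  { name: 'signup_date', required: false, validate: (v) => !Number.isNaN(Date.parse(String(v))) },
];

// Returns the list of problems found in one record; empty means clean.
function checkRecord(record: Record<string, unknown>, specs: FieldSpec[]): string[] {
  const problems: string[] = [];
  for (const spec of specs) {
    const value = record[spec.name];
    if (value === undefined || value === null) {
      if (spec.required) problems.push(`missing required field: ${spec.name}`);
    } else if (!spec.validate(value)) {
      problems.push(`invalid value for: ${spec.name}`);
    }
  }
  return problems;
}

// Completeness across a batch: the fraction of clean records.
// A pipeline could alert when this drops below a chosen threshold.
function completenessRatio(records: Record<string, unknown>[], specs: FieldSpec[]): number {
  if (records.length === 0) return 1;
  const clean = records.filter((r) => checkRecord(r, specs).length === 0).length;
  return clean / records.length;
}

const batch: Record<string, unknown>[] = [
  { customer_id: 'c1', email: 'a@example.com', signup_date: '2024-01-05' },
  { customer_id: '', email: 'not-an-email' }, // two problems, missing signup_date is allowed
];
console.log(checkRecord(batch[1], customerSpec));
console.log(completenessRatio(batch, customerSpec)); // 0.5 for this batch
```

Running a check like this at ingestion, and wiring the ratio to an alert threshold, is one simple way to catch bad data before a model retrains on it.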
2026 Developer Research Report

By Carisse Dumaua
Hello, our dearest DZone Community! Last year, we asked you for your thoughts on emerging and evolving software development trends, your day-to-day as devs, and workflows that work best — all to shape our 2026 Community Research Report. The goal is simple: to better understand our community and provide the right content and resources developers need to support their career journeys.

After crunching some numbers and piecing the puzzle together, it is finally in (and we have to warn you, it's quite a handful)! This report summarizes the survey responses we collected from December 9, 2025, to January 27 of this year, and includes an overview of the DZone community, the stacks developers are currently using, the rising trend in AI adoption, year-over-year highlights, and so much more. Here are a few takeaways worth mentioning:

• AI use climbs this year, with 67.3% of readers now adopting it in their workflows.
• While most use multiple languages in their developer stacks, Python takes the top spot.
• Readers visit DZone primarily for practical learning and problem-solving.

This is just a small glimpse of what's waiting in our report, made possible by you. You can read the rest of it in the full 2026 Community Research Report.

We really appreciate you lending your time to help us improve your experience and nourish DZone into a better go-to resource every day. Here's to new learnings and even newer ideas!

— Your DZone Content and Community team

Trend Report

Database Systems

Every organization is now in the business of data, but they must keep up as database capabilities and the purposes they serve continue to evolve. Systems once defined by rows and tables now span regions and clouds, requiring a balance between transactional speed and analytical depth, as well as integration of relational, document, and vector models into a single, multi-model design. At the same time, AI has become both a consumer and a partner that embeds meaning into queries while optimizing the very systems that execute them. These transformations blur the lines between transactional and analytical, centralized and distributed, human-driven and machine-assisted. Amidst all this change, databases must still meet what are now considered baseline expectations: scalability, flexibility, security and compliance, observability, and automation. With the stakes higher than ever, it is clear that for organizations to adapt and grow successfully, databases must be hardened for resilience, performance, and intelligence. In the 2025 Database Systems Trend Report, DZone takes a pulse check on database adoption and innovation, ecosystem trends, tool usage, strategies, and more — all with the goal of helping practitioners and leaders alike reorient our collective understanding of how old models and new paradigms are converging to define what's next for data management and storage.


Refcard #388

Threat Modeling Core Practices

By Apostolos Giannakidis

Refcard #401

Getting Started With Agentic AI

By Lahiru Fernando

More Articles

The Inner Loop Is Eating The Outer Loop

For as long as most of us have been building software, there has been a clean split in the development lifecycle: the inner loop and the outer loop. The inner loop is where a developer lives day to day. Write code, run it locally, check if it works, iterate. It is fast, tight, and personal. The outer loop is everything after you push: continuous integration pipelines, integration tests, staging deployments, and code review. It is comprehensive but slow, and for good reason. Running your entire test suite against every keystroke would be insane. So we optimized: fast feedback locally, thorough validation later.

This split was not some grand architectural decision. It was a pragmatic response to a real constraint. Comprehensive validation, testing against real dependencies in a realistic environment, was slow and expensive. So developers made a tradeoff: sacrifice thoroughness for speed in the inner loop and defer the real testing to continuous integration (CI). Write a unit test, mock a dependency or two, and move on. The comprehensive stuff runs later, in a pipeline, and you deal with failures when they show up. Sometimes hours later. Sometimes the next day. That tradeoff only made sense when we had no alternative. Now, the model is evolving into a single loop where validation happens at every stage of the software development lifecycle (SDLC).

The Constraint That Created Two Loops Is Breaking

The inner and outer loop split was never about two fundamentally different kinds of work. It was about a limitation: you could not perform comprehensive validation fast enough for it to be part of the development loop. Integration testing meant spinning up services, provisioning databases, and waiting for environments. That was a 15-minute-to-hours proposition, not a seconds proposition. So it got batched into CI. Now, infrastructure has caught up. Ephemeral environments can spin up in seconds, giving you real integration testing against actual dependencies on a branch, pre-merge. There is no wait. The technical barrier to comprehensive but fast validation is gone.

Continuous Delivery Becomes Practical for Everyone

The idea of pushing smaller units of code to production more frequently is not new, but most teams still struggle to pull it off in distributed, cloud-native architectures. In a microservices architecture, testing a small change properly means validating it against multiple downstream consumers. Historically, this meant slow environment provisioning, waiting in a queue for a staging spot, or relying on mocked dependencies. To cope, teams batched changes, running massive integration suites nightly or weekly. When something broke, debugging spanned days of commits.

With access to fast, comprehensive ephemeral environments, continuous delivery becomes highly practical. A developer can make a focused change, spin up a sandbox that routes traffic through the modified service, validate against real dependencies in seconds, and push it forward. The per-change cost of validation drops low enough that batching becomes unnecessary. Debugging is vastly simplified because the blast radius is limited to a single small, well-understood change. Ultimately, the path from code written to running in production shrinks from days to hours.

For Agents, Fast Validation Is a Critical Infrastructural Change

This merging of the loops is an exciting evolution for software development as a whole. But for teams implementing agentic workflows at scale, it is a structural necessity. Agents are now writing most of our code, and they have a very different relationship with validation than humans do. Fast feedback is not a preference for agents. It is essential. An agent does not get frustrated waiting for tests, but the speed and fidelity of feedback directly impact what an agent can accomplish. An agent that can validate a change against real services in 10 seconds will iterate 30 times in the same window in which an agent waiting on a five-minute environment spin-up iterates once. Speed is not just a quality-of-life improvement for agents. It is a throughput multiplier.

Humans traded thoroughness for speed because those two things were in tension. You could have fast but shallow local mocked tests or slow but thorough CI integration tests. Pick one. With fast, ephemeral environments, agents do not face that trade-off. They get comprehensive validation at inner-loop speed. They can test against real dependencies, real services, and real data flows to validate the behavior of their changes in seconds.

What Agents Do With Fast, Comprehensive Environments

When an agent picks up a coding task with access to the right environment and tools, the workflow looks nothing like the old inner and outer loop divide. The agent writes code, then validates it. It does not rely on mocked unit tests; rather, it tests against real dependencies in an ephemeral environment that spins up in seconds. It finds a problem, fixes it, and validates again. It might run through this cycle dozens of times before a PR ever exists. Each iteration is both fast and thorough.

Then the agent goes further. It reviews its own code, or has another agent review it. It checks edge cases. It verifies that the change works correctly within the broader dependency graph. All of this happens on a branch, pre-merge, in seconds per cycle. By the time anything gets pushed toward main, it has already been through a level of validation that most traditional pipelines would envy. The outer loop has very little left to catch, allowing CI to act as a lightweight, continuous feedback mechanism for the agent rather than a heavy, delayed gatekeeper.

Code Review Gets Absorbed Into the Workflow

Here is another piece of the outer loop that is collapsing inward: code review. Agentic code review is quickly becoming standard. But the interesting shift is not just that AI can review code. It is that the review becomes part of the agent's own development loop rather than a separate phase. An agent writes code, validates it in a sandbox, reviews the change, addresses issues, and re-validates. Only then does it create a PR. By the time a developer sees a PR, if they need to see it at all, the mechanical quality issues are already resolved. The PR becomes less of a gate to check work and more of a record of what was done, how it was validated, and the evidence that it works.

Developer review does not disappear entirely. Architecture decisions, security-sensitive changes, and novel approaches still benefit from human judgment. But the outer-loop review bottleneck, where PRs sit in a queue waiting for an overloaded engineer to context-switch into reviewer mode, largely goes away.

The Tooling Ceiling Becomes the Agent Ceiling

If this thesis is right, and the inner loop really is absorbing the outer loop, it creates a very clear bottleneck: the quality of environments and tools available to the agent. An agent with only local unit testing will catch local bugs. Give it access to fast ephemeral environments with real dependency graphs, and it catches integration issues, configuration drift, and behavioral regressions. Give it access to performance benchmarks, security scanners, and observability data, and it catches even more.

This shifts where the highest-leverage infrastructure investment is. Instead of building more elaborate post-merge CI pipelines, the winning bet is making comprehensive, realistic validation available pre-merge. It must be fast enough and cheap enough that agents can use it on every iteration, not just on PR submission.

Conclusion

The organizations that figure this out first and invest in giving agents fast, comprehensive, pre-merge validation will be the ones that actually achieve continuous delivery. With validation happening continuously, the outer loop becomes part of the inner loop. CI becomes more lightweight, serving as one of several layers of validation and feedback in a true continuous delivery flow. The inner loop is merging with the outer loop. The question is not whether this shift is happening. It is whether your validation tooling is ready for it.
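The agent workflow described above (write, validate in a fast sandbox, fix, repeat until green, then open a PR) can be sketched as a simple loop. Everything below is a hypothetical stand-in written for illustration: the sandbox validation, the "fix" step, and the iteration cap are simulated, not a real agent framework or CI API.

```typescript
// Hypothetical sketch of an agent's merged inner/outer loop.
// The environment, checks, and fix step are illustrative stand-ins.
interface ValidationResult {
  passed: boolean;
  failures: string[];
}

// Stand-in for validating a change in an ephemeral environment.
// Here we simply report whatever failures are still outstanding.
function validateInSandbox(outstanding: string[]): ValidationResult {
  return { passed: outstanding.length === 0, failures: [...outstanding] };
}

// Each iteration: validate, and if anything failed, "fix" one failure.
// Fast validation is what makes dozens of iterations per task affordable.
function agentLoop(initialFailures: string[], maxIterations = 30): { iterations: number; merged: boolean } {
  const outstanding = [...initialFailures];
  for (let i = 1; i <= maxIterations; i++) {
    const result = validateInSandbox(outstanding);
    if (result.passed) {
      return { iterations: i, merged: true }; // open the PR with validation evidence attached
    }
    outstanding.pop(); // simulate fixing one failure, then re-validate next iteration
  }
  return { iterations: maxIterations, merged: false }; // give up and escalate to a human
}

console.log(agentLoop(['integration test: checkout flow', 'schema drift: orders table']));
// → { iterations: 3, merged: true }
```

The point of the sketch is the shape of the loop, not the details: validation sits inside every iteration rather than after the merge.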

By Arjun Iyer
Square, SumUp, Shopify: Data Streaming for Real-Time Point-of-Sale (POS)

Point-of-Sale (POS) systems are no longer just cash registers. They are becoming real-time, connected platforms that handle payments, manage inventory, personalize customer experiences, and feed business intelligence. Small and medium-sized merchants can now access capabilities once reserved for enterprise retailers. Mobile payment platforms like Square, SumUp, and Shopify make it easy to sell anywhere and integrate sales channels seamlessly. At the same time, data streaming technologies such as Apache Kafka and Apache Flink are transforming retail operations. They enable instant insights and automated actions across every store, website, and supply chain partner. This post explores the current state of mobile payment solutions, the role of data streaming in retail, how Kafka and Flink power POS systems, the SumUp success story, and the future impact of Agentic AI on the checkout experience.

Mobile Payment and Business Solutions for Small and Medium-Sized Merchants

The payment landscape for small and medium-sized merchants has undergone a rapid transformation. For years, accepting card payments meant expensive contracts, bulky hardware, and complex integration. Today, companies like Square, SumUp, and Shopify have made accepting payments simple, mobile, and affordable.

Block (Square) offers a unified platform that combines payment processing, POS systems, inventory management, staff scheduling, and analytics. It is especially popular with small retailers and service providers who value flexibility and ease of use. SumUp started with mobile card readers but has expanded into full POS systems, online stores, invoicing tools, and business accounts. Their solutions target micro-merchants and small businesses, enabling them to operate in markets that previously lacked access to digital payment tools. Shopify integrates its POS offering directly into its e-commerce platform. This allows merchants to sell in physical stores and online with a single inventory system, unified analytics, and centralized customer data.

These companies have blurred the lines between payment providers, commerce platforms, and business management systems. The result is a market where even the smallest shop can deliver a payment experience once reserved for large retailers.

Data Streaming in the Retail Industry

Retail generates more event data every year. Every scan at a POS, every online click, every shipment update, and every loyalty point redemption is a data event. In traditional systems, these events are collected in batches and processed overnight or weekly. The problem is clear: by the time insights are available, the opportunity to act has often passed.

Data streaming solves this by making all events available in real time. Retailers can instantly detect low stock in a store, trigger replenishment, or offer dynamic discounts based on current shopping patterns. Fraud detection systems can block suspicious transactions before completion. Customer service teams can see the latest order updates without contacting the warehouse. Across the retail industry, data streaming has powered:

• Omnichannel inventory visibility for accurate stock counts across stores and online channels.
• Dynamic pricing engines that adjust prices based on demand and competitor activity.
• Personalized promotions triggered by live purchase behavior.
• Real-time supply chain monitoring to handle disruptions immediately.

Emerging Trend: Unified Commerce

The next stage beyond omnichannel is Unified Commerce. Here, all sales channels — physical stores, online shops, mobile apps, marketplaces, and social commerce — operate on a single, real-time data foundation. Instead of integrating separate systems after the fact, every transaction, inventory update, and customer interaction flows through one unified platform. Data streaming technologies like Apache Kafka make Unified Commerce possible by ensuring all touchpoints share the same up-to-date information instantly. This enables consistent pricing, seamless cross-channel returns, accurate loyalty balances, and personalized experiences no matter where the customer shops. Unified Commerce turns fragmented retail technology into a single, connected nervous system.

Data Streaming with Apache Kafka and Flink for POS in Retail

In an event-driven retail architecture, Apache Kafka acts as the backbone. It ingests payment transactions, inventory updates, and customer interactions from multiple channels. Kafka ensures these events are stored durably, replayable for compliance, and available to downstream systems within milliseconds. Apache Flink adds continuous stream processing capabilities. For POS use cases, this means:

• Running fraud detection models in real time, with alerts sent instantly to the cashier or payment gateway.
• Aggregating sales data on the fly to power live dashboards for store managers.
• Updating loyalty points immediately after a purchase to improve customer satisfaction.
• Ensuring that both physical stores and e-commerce channels reflect the same stock levels at all times.

Together, Kafka and Flink create a foundation for operational excellence. They enable a shift from manual, reactive processes to automated, proactive actions. Using data streaming at the edge for POS systems enables ultra-low latency processing and local resilience, but scaling and managing it across multiple locations can be challenging. Running data streaming in the cloud offers central scalability and simplified governance, though it depends on reliable connectivity and may introduce slightly higher latency.

SumUp: Real-Time POS at Global Scale with Data Streaming in the Cloud

SumUp processes millions of transactions per day across more than 30 countries. To handle this scale and maintain high availability, they adopted an event-driven architecture powered by Apache Kafka and fully managed Confluent Cloud. In the Confluent customer story, SumUp explains how Kafka has allowed them to:

• Process every payment event in real time.
• Maintain a unified data platform across regions, ensuring compliance with local payment regulations.
• Scale easily to handle seasonal transaction spikes without service interruptions.
• Speed up developer delivery cycles by providing event data as a service across teams.

Implementing Critical Use Cases Across the Business

More than 20 teams at SumUp now rely on Confluent Cloud to deliver mission-critical capabilities.

• Global Bank Tribe: Operates SumUp's banking and merchant payment services. Real-time data streaming keeps transaction records updated instantly in merchant accounts. Reusable data products improve resilience for high-volume processes such as 24/7 monitoring, fraud detection, and personalized recommendations.
• CRM Team: Delivers customer and product information to operational teams in real time. Moving away from batch processing creates a smoother customer experience and enables data sharing across the organization.
• Risk Data and Machine Learning Platform: Feeds standardized, near-real-time data into machine learning models. These models make decisions on the freshest data available, improving outcomes for both teams and merchants.

By embedding Confluent Cloud across multiple domains, SumUp has turned event data into a shared asset that drives operational efficiency, customer satisfaction, and innovation at scale. For merchants, this means faster transaction confirmations, improved reliability, and new digital services without downtime.

The Future of POS and Impact of Agentic AI

The POS of tomorrow will be more than a payment device. It will be a connected intelligence hub. Agentic AI, with autonomous systems capable of proactive decision-making, will play a central role. Future capabilities could include:

• AI-driven recommendations for upsells, customized to each shopper's behavior and context.
• Predictive inventory replenishment that automatically places supplier orders when stock is low.
• Automated fraud prevention that adapts in real time to emerging threats.
• Dynamic loyalty program offers tailored at the exact moment of purchase.

When Agentic AI is powered by real-time event data from Kafka and Flink, decisions will be both faster and more accurate. This will shift POS systems from passive endpoints to active participants in business growth. For small and medium-sized merchants, this evolution will unlock capabilities previously available only to enterprise retailers. The result will be a competitive, data-driven retail landscape where agility and intelligence are built into every transaction.
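The Flink-style use cases above (live sales dashboards, per-store aggregation) boil down to continuous aggregation over an event stream. The sketch below simulates a tumbling-window sum in plain TypeScript. It illustrates the idea only; it is not Kafka or Flink code, and the event fields and store names are made up:

```typescript
// Illustrative tumbling-window aggregation over POS events (not real Kafka/Flink).
interface SaleEvent {
  storeId: string;
  amount: number;
  timestampMs: number;
}

// Sums sale amounts per store within fixed, non-overlapping time windows.
// The key combines the store and the window's start time.
function tumblingWindowTotals(events: SaleEvent[], windowMs: number): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    const windowStart = Math.floor(e.timestampMs / windowMs) * windowMs;
    const key = `${e.storeId}@${windowStart}`;
    totals.set(key, (totals.get(key) ?? 0) + e.amount);
  }
  return totals;
}

const events: SaleEvent[] = [
  { storeId: 'berlin-1', amount: 10, timestampMs: 1_000 },
  { storeId: 'berlin-1', amount: 5, timestampMs: 59_000 }, // same 60s window as above
  { storeId: 'berlin-1', amount: 7, timestampMs: 61_000 }, // next window
  { storeId: 'lisbon-2', amount: 3, timestampMs: 2_000 },
];
console.log(tumblingWindowTotals(events, 60_000));
// berlin-1@0 → 15, berlin-1@60000 → 7, lisbon-2@0 → 3
```

In a real deployment, the same grouping-by-window logic would run continuously inside Flink over a Kafka topic instead of over an in-memory array.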

By Kai Wähner
Implementing Sharding in PostgreSQL: A Comprehensive Guide

As applications scale and data volumes increase, efficiently managing large datasets becomes a core requirement. Sharding is a common approach used to achieve horizontal scalability by splitting a database into smaller, independent units known as shards. Each shard holds a portion of the overall data, making it easier to scale storage and workload across multiple servers. PostgreSQL, as a mature and feature-rich relational database, offers several ways to implement sharding. These approaches allow systems to handle high data volumes while maintaining performance, reliability, and operational stability. This guide explains how sharding can be implemented in PostgreSQL using practical examples and clear, step-by-step instructions. In a sharded setup, table data is distributed across multiple nodes based on a chosen sharding key. For instance, a customer table may be split by region or customer_id, with each shard storing a specific subset of records. The primary challenge lies in routing queries and transactions to the correct shard while preserving data consistency and application transparency. PostgreSQL supports sharding through built-in features such as postgres_fdw and table partitioning, as well as extensions like Citus for more advanced and large-scale deployments. Setting Up Sharding in PostgreSQL To demonstrate the approach, consider a scenario in which sharding is implemented for a Sales table. In this example, sales data is distributed across multiple regions using region_id as the sharding key. Each region is assigned its own shard, allowing the data to be spread across multiple databases while keeping it logically organized. The configuration involves creating individual shards, setting up PostgreSQL to handle data distribution, and ensuring that queries are routed to the correct shard. The process begins with the base PostgreSQL setup. PostgreSQL should be installed on all required systems. 
A primary database is then created, which the application connects to directly. This database acts as the coordinator node, responsible for directing queries to the appropriate regional shards based on the sharding logic. SQL -- Step 1: Create the main database CREATE DATABASE sales_db; -- Step 2: Connect to the main database \c sales_db Once connected, create a schema that defines the structure of the sales table. Instead of creating a single monolithic table, define the schema without immediately populating it with data. Instead, shards will be created as partitions, with data distributed across them based on regions. SQL -- Step 3: Define the Sales table schema CREATE TABLE sales ( sale_id SERIAL PRIMARY KEY, region_id INT NOT NULL, sale_amount DECIMAL(10, 2), sale_date DATE NOT NULL ) PARTITION BY LIST (region_id); The PARTITION BY LIST clause specifies how region_id determines data placement. For each region, a partition (a shard) will be created. For example, if you have three regions, you might create separate shards as follows: SQL -- Step 4: Create individual shards for each region CREATE TABLE sales_region_1 PARTITION OF sales FOR VALUES IN (1); CREATE TABLE sales_region_2 PARTITION OF sales FOR VALUES IN (2); CREATE TABLE sales_region_3 PARTITION OF sales FOR VALUES IN (3); In this example, the sales_region_1 table will store all records where region_id = 1, while sales_region_2 will store data for region_id = 2, and so on. Each shard can be hosted on a different PostgreSQL server to provide scalability. Configuring Foreign Data Wrappers for Distributed Shards To enable distributed sharding, use PostgreSQL’s postgres_fdw extension. This extension allows you to connect to remote PostgreSQL instances and treat them as part of the database, enabling efficient queries across shards. 
Install the extension and configure it as follows: SQL -- Step 5: Enable the postgres_fdw extension CREATE EXTENSION IF NOT EXISTS postgres_fdw; -- Step 6: Create a foreign server for each shard CREATE SERVER shard_1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard1_host', dbname 'shard1_db', port '5432'); CREATE SERVER shard_2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard2_host', dbname 'shard2_db', port '5432'); CREATE SERVER shard_3 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard3_host', dbname 'shard3_db', port '5432'); -- Step 7: Create user mappings for each server CREATE USER MAPPING FOR CURRENT_USER SERVER shard_1 OPTIONS (user 'postgres', password 'password'); CREATE USER MAPPING FOR CURRENT_USER SERVER shard_2 OPTIONS (user 'postgres', password 'password'); CREATE USER MAPPING FOR CURRENT_USER SERVER shard_3 OPTIONS (user 'postgres', password 'password'); Now associate each shard (partition) with its corresponding remote server using foreign tables. This allows PostgreSQL to route queries to the appropriate server. SQL -- Step 8: Import foreign schemas for each shard CREATE FOREIGN TABLE sales_region_1 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_1 OPTIONS (schema_name 'public', table_name 'sales_region_1'); CREATE FOREIGN TABLE sales_region_2 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_2 OPTIONS (schema_name 'public', table_name 'sales_region_2'); CREATE FOREIGN TABLE sales_region_3 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_3 OPTIONS (schema_name 'public', table_name 'sales_region_3'); Testing the Sharding Setup After setting up the shards, test the configuration by inserting data into the sales table and verifying that it is correctly routed to the appropriate shard. 
SQL -- Insert data into the main sales table INSERT INTO sales (region_id, sale_amount, sale_date) VALUES (1, 100.50, '2023-10-01'); INSERT INTO sales (region_id, sale_amount, sale_date) VALUES (2, 200.75, '2023-10-02'); INSERT INTO sales (region_id, sale_amount, sale_date) VALUES (3, 300.25, '2023-10-03'); -- Verify that data is stored in respective shards SELECT FROM sales_region_1; SELECT FROM sales_region_2; SELECT * FROM sales_region_3; Each query above should retrieve the respective rows routed to the appropriate shard. This confirms that the sharding setup is functioning correctly. Querying and Maintaining Sharded Data PostgreSQL ensures that queries to the sales table are automatically redirected to the appropriate shard based on the region_id value. Complex queries, such as aggregations across all regions, are also supported, as PostgreSQL can parallelize query execution across shards using postgres_fdw. SQL -- Example: Aggregated sales across all shards SELECT SUM(sale_amount) AS total_sales FROM sales WHERE sale_date >= '2023-10-01'; Maintenance tasks, such as adding a new shard for additional regions, can be managed seamlessly by creating new partitions and foreign table mappings as required. For example, a new region (region_id = 4) can be supported by adding a new shard: SQL -- Add a new shard CREATE TABLE sales_region_4 PARTITION OF sales FOR VALUES IN (4); CREATE SERVER shard_4 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard4_host', dbname 'shard4_db', port '5432'); CREATE USER MAPPING FOR CURRENT_USER SERVER shard_4 OPTIONS (user 'postgres', password 'password'); CREATE FOREIGN TABLE sales_region_4 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_4 OPTIONS (schema_name 'public', table_name 'sales_region_4'); Conclusion Sharding in PostgreSQL provides a practical way to achieve horizontal scalability, particularly for large and growing datasets in distributed environments. 
By using built-in features such as postgres_fdw and partitioning, PostgreSQL can execute queries across shards transparently, without requiring complex logic in the application layer. This guide has walked through a step-by-step approach to implementing sharding for a table with uneven data distribution, using practical examples to demonstrate how PostgreSQL can be scaled to support high-performance, data-intensive applications.
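The list-based routing described in this guide can be sketched outside the database. Below is a small illustrative Python model (the `SHARD_MAP` and `route` names are hypothetical helpers, not part of PostgreSQL) showing how each row's region_id maps to exactly one shard, mirroring `PARTITION OF sales FOR VALUES IN (n)`:

```python
# Hypothetical sketch of list-based partition routing: each region_id
# maps to exactly one shard, mirroring "FOR VALUES IN (n)".
SHARD_MAP = {
    1: "shard_1",  # sales_region_1 on shard1_host
    2: "shard_2",  # sales_region_2 on shard2_host
    3: "shard_3",  # sales_region_3 on shard3_host
}

def route(row: dict) -> str:
    """Return the shard that should store this row; raise if no
    partition covers its region_id (PostgreSQL errors similarly)."""
    region = row["region_id"]
    if region not in SHARD_MAP:
        raise ValueError(f"no partition accepts region_id={region}")
    return SHARD_MAP[region]

rows = [
    {"region_id": 1, "sale_amount": 100.50, "sale_date": "2023-10-01"},
    {"region_id": 2, "sale_amount": 200.75, "sale_date": "2023-10-02"},
    {"region_id": 3, "sale_amount": 300.25, "sale_date": "2023-10-03"},
]

placements = {route(r) for r in rows}
print(placements)  # the three test rows land on three different shards
```

This is also why adding a region means adding both a partition and a server mapping: the routing map and the physical shard must be extended together.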

By Arvind Toorpu DZone Core CORE
Modern State Management: Signals, Observables, and Server Components

State management is a critical aspect of modern web applications. In the Angular ecosystem, reactivity has long been powered by observables (RxJS), a powerful but sometimes complex paradigm. Angular’s recent introduction of signals provides a new, intuitive reactivity model to simplify UI state handling. Meanwhile, frameworks like React are exploring server components that push some state management to the server side. This article compares these approaches: observables, signals, and server components, and when to use each in modern development. Observables and RxJS in Angular An observable represents a stream of values over time rather than a single static value. Angular makes heavy use of observables, for example, HttpClient methods return observables and NgRx uses observables for its global store. Instead of storing one value, an observable can emit a sequence of values asynchronously, and components subscribe to react to each emission. Observables follow a push model. When a new value is available, it’s pushed to all subscribers (listeners). For example: TypeScript const count$ = new BehaviorSubject(0); const doubleCount$ = count$.pipe(map(x => x * 2)); doubleCount$.subscribe(val => console.log('Double:', val)); count$.next(5); // Console: "Double: 10" In the snippet above, count$ is an observable holding a number (a BehaviorSubject with an initial value) and doubleCount$ is a derived observable that always emits twice the value of count$. When we call count$.next(5), subscribers of doubleCount$ receive the new value 10. When to use observables: Observables excel at asynchronous and event-driven scenarios. They are ideal for cases where data arrives over time, or you have multiple events to coordinate. Use observables for things like user input streams, live data updates from a server, or complex workflows that involve timing (debouncing, buffering, etc.). 
RxJS provides many operators to transform and combine streams, which is powerful for managing complex sequences. Trade-offs: The flexibility of observables comes at the cost of added complexity. You must subscribe to an observable to get its values, which introduces boilerplate. Debugging a chain of observable transformations can be challenging, and the paradigm of thinking in streams has a learning curve. For a simple UI state that doesn’t truly require asynchronous streams, using observables can feel like overkill. Signals: Fine-Grained Reactivity in Angular Signals are a newer reactive primitive that holds a single value and notifies dependents when that value changes. A signal is like a state variable that is reactive by default. Signals use a pull-based model, where you read the signal’s value directly, and Angular tracks this. When you update the signal, any code (or template) that reads it is automatically updated on the next cycle. For example, using signals in Angular: TypeScript import { signal, computed, effect } from '@angular/core'; const count = signal(0); const doubleCount = computed(() => count() * 2); effect(() => console.log('Double:', doubleCount())); count.set(5); // Console: "Double: 10" Here count is a writable signal initialized to 0. We define doubleCount as a computed signal that always equals count() * 2. The effect acts similarly to an observable subscription, running whenever doubleCount (and thus count) changes. When count.set(5) is called, the effect logs the new doubled value (10). All of this happens with no manual subscription or unsubscription. Angular handles dependency tracking and updates automatically. When to use signals: Signals are great for synchronous, local state in your Angular components or services. They shine in cases like form field states, toggles, counters, selection indicators, or any scenario where you have a value that the UI directly reflects. Signals make these cases simpler by removing RxJS boilerplate. 
You set a value, and the UI reacts. They also enable fine-grained updates; only the parts of the UI that depend on a particular signal will update when it changes, which can improve performance. Signals can replace many uses of BehaviorSubject or Observable for holding simple state. However, signals are not suited for sequences of events or asynchronous streams. If you need to handle an HTTP response, a timer, or a stream of user events, it’s better to use an observable for that and then update a signal with the result, or use Angular’s interop helpers to bridge between them. In summary, use signals for state that serves as a single source of truth, and observables for data that evolves over time. Server Components and Server-Side State Server Components represent an architecture where some components run on the server instead of the client. React’s server components (RSC) are a recent example: for the first time, React components can execute entirely on the server and deliver pre-rendered HTML to the browser. The browser receives UI output (HTML/string data), not the component code, so it has less JavaScript to download and execute. The key benefit of this approach is performance. By rendering on the server, you can keep large libraries or heavy computations out of the client bundle; only static HTML is sent down, with no need to ship those libraries to the browser. Server components are also closer to your data sources, making data fetching more efficient and keeping sensitive data or keys safely on the server. Another benefit is that the server can cache and reuse rendered results for multiple users. However, server-side state is fundamentally different from client state. It’s ephemeral: once the HTML is generated and sent to the browser, the user cannot directly trigger changes in that server-rendered UI without a round-trip to the server. In essence, server-side state is immutable during a render cycle; changing it won’t trigger re-renders in the browser. 
Therefore, purely server-rendered components work best for static or read-only parts of the UI. In practice, an application will mix server and client rendering. With React RSC, you might render the shell of a page and some data-heavy list via server components, then use client components for interactive pieces on that page. Angular’s analogue is traditional server-side rendering (SSR) using Angular Universal. SSR pre-renders the initial HTML on the server for a fast first paint, but after that, Angular takes over on the client with the full app. Unlike React’s RSC, Angular’s SSR still needs to send the entire app bundle to the browser for hydration, so the performance gain is mostly on the first load. React’s Server Components push the boundary further by never sending certain component code to the client at all. Both approaches aim to improve performance by leveraging the server, but they require thinking carefully about what can be rendered ahead of time versus what needs to be interactive. Choosing the Right Approach Each state management strategy has strengths, and they often complement each other Observables (RxJS): Use for asynchronous data streams and complex event handling. If you have values that change over time or multiple event sources, observables are the go-to solution. They come with many operators for filtering and combining data streams. Be mindful to manage subscriptions to avoid leaks and keep code maintainable.Signals: Use for a local state that represents a single value or a snapshot of data at a time. Signals simplify cases like toggling UI elements, tracking form input values, or deriving one piece of state from another without the overhead of RxJS. They make reactive code more straightforward in these cases. In general, use signals when you don’t need the full power of observables; you’ll write less code for the same result. 
If an asynchronous operation is involved, you might use an observable for the async part and then update a signal with the outcome.
Server components/SSR: Use server-driven rendering to optimize initial load and offload heavy computation from the client. In Angular, use Universal to render pages on the server for a quick first paint. The result is faster performance and less JavaScript to download. Just balance it with client-side needs; interactive parts must still use client-side state (signals or observables), so server rendering is best for content that can be largely static on load. Conclusion Modern applications can benefit from all three approaches. In Angular, signals and observables often work together: signals for local, synchronous UI state and computed values; observables for asynchronous workflows and complex streams. Angular’s SSR can then handle the initial rendering on the server. Knowing when to use each approach will help you create applications that are efficient, maintainable, and highly responsive to the user.
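The pull-based signal model described above can be sketched in a few dozen lines of framework-free code. The following is an illustrative Python toy (not Angular's actual implementation): reading a signal records the current subscriber, and writing it re-runs dependents automatically, with no manual subscribe/unsubscribe.

```python
# Minimal sketch of signal/computed/effect: pull-based reads with
# automatic dependency tracking. Illustrative only, not Angular code.
_active = []  # stack of subscribers currently being evaluated

class Signal:
    def __init__(self, value):
        self._value = value
        self._subs = set()

    def __call__(self):
        # Reading pulls the value and records who read it.
        if _active:
            self._subs.add(_active[-1])
        return self._value

    def set(self, value):
        # Writing notifies everything that read this signal.
        self._value = value
        for sub in list(self._subs):
            sub()

def computed(fn):
    out = Signal(None)
    def recompute():
        _active.append(recompute)
        try:
            out._value = fn()  # re-reads deps, re-registering itself
        finally:
            _active.pop()
        for sub in list(out._subs):
            sub()
    recompute()
    return out

def effect(fn):
    def run():
        _active.append(run)
        try:
            fn()
        finally:
            _active.pop()
    run()

count = Signal(0)
double = computed(lambda: count() * 2)
log = []
effect(lambda: log.append(double()))

count.set(5)
print(log)  # [0, 10] -- the effect re-ran with no subscription management
```

The contrast with the RxJS example earlier is the point: same `count`/`double` relationship, but dependencies are discovered by reading, not declared by piping.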

By Bhanu Sekhar Guttikonda
Consensus in Distributed Systems: Understanding the Raft Algorithm

Consider a group of friends planning a weekend outing. To make the trip successful, they need consensus on the location, schedule, and budget. Typically, one person is chosen as the leader — responsible for decisions, tracking expenses, and keeping everyone informed, including any new members who join later. If the leader steps down, the group elects another to maintain continuity. In distributed computing, clusters of servers face a similar challenge — they must agree on shared state and decisions. This is achieved through Consensus Protocols. Among the most well-known are Viewstamped Replication (VSR), Zookeeper Atomic Broadcast (ZAB), Paxos, and Raft. In this article, we will explore Raft — designed to be more understandable while ensuring reliability in distributed systems. Consensus in Distributed Computing Consensus in its simplest form refers to a general agreement. In the weekend outing analogy, it refers to all friends agreeing on a location. It's quite likely that several options are considered before the group eventually agrees on a particular location. In distributed computing, too, one or more nodes may propose values. Of all these values, one needs to be agreed upon by all the nodes. It's up to the consensus algorithm to decide upon one of these values and propagate the decision to all the nodes. Formally, a consensus algorithm must satisfy the following properties:
Uniform agreement – All nodes agree upon the same value, even if a node itself proposed a different value initially.
Integrity – Once a value is agreed upon by a node, it shouldn’t change.
Validity – If a node decides on a value, that value must have been proposed by at least one of the nodes.
Termination – Eventually, every participating node agrees upon a value.
The uniform agreement and integrity properties form the core idea of consensus — everyone agrees on the same value, and once decided, it's final. 
The validity property rules out trivial behavior wherein a node agrees to a value irrespective of what has been proposed. The termination property ensures fault tolerance: if one or more nodes fail, the cluster should still make progress and eventually agree upon a value. This also eliminates the possibility of a dictator node that takes all decisions and jeopardizes the whole cluster if it fails. Of course, if all the nodes fail, the algorithm can’t proceed. There is a limit to the number of failures an algorithm can tolerate. An algorithm that can correctly guarantee consensus amongst n nodes of which at most t fail is said to be t-resilient. In essence, the termination property is a liveness guarantee, while the remaining three are safety guarantees. Raft Raft stands for Reliable, Replicated, Redundant, and Fault-Tolerant, reflecting its design principles in distributed systems. It ensures reliability by maintaining consistent logs, replication across nodes for durability, redundancy to avoid single points of failure, and fault tolerance to continue operating despite crashes or network issues. Together, these qualities make Raft a robust consensus algorithm for distributed computing. Explanation Raft utilizes a leader-based approach to achieve consensus. In a Raft cluster, a node is either a leader or a follower. A node can also be a candidate for a brief duration when a leader is unavailable, i.e., while leader election is underway. The cluster has one and only one elected leader, which is fully responsible for managing log replication on the other nodes of the cluster. This means the leader can decide on the placement of new entries and the flow of data between itself and the other nodes without consulting them. A leader leads until it fails or disconnects, in which case the remaining nodes elect a new leader. Fundamentally, the consensus problem is broken into two independent sub-problems in Raft: Leader Election and Log Replication. 
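Before looking at each sub-problem, the quorum arithmetic behind Raft's fault tolerance is worth making concrete. The short Python sketch below (illustrative, not from the Raft paper) shows why a majority-quorum cluster of n nodes is floor((n-1)/2)-resilient: decisions need a majority, so the cluster stays live exactly as long as a majority survives.

```python
# Majority-quorum arithmetic behind Raft-style fault tolerance.

def quorum(n: int) -> int:
    """Smallest number of nodes that constitutes a majority of n."""
    return n // 2 + 1

def max_failures_tolerated(n: int) -> int:
    """t in 't-resilient': failures the cluster can absorb while a
    majority (and thus progress) remains possible."""
    return n - quorum(n)  # equals (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes: quorum {quorum(n)}, "
          f"tolerates {max_failures_tolerated(n)} failure(s)")
# 3 nodes -> quorum 2, tolerates 1
# 5 nodes -> quorum 3, tolerates 2
# 7 nodes -> quorum 4, tolerates 3
```

This is also why Raft clusters are usually sized with odd node counts: going from 3 to 4 nodes raises the quorum without tolerating any additional failures.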
Leader Election Leader election in Raft occurs when the current leader fails or during initialization. Each election begins a new term, a time period in which a leader must be chosen. A node becomes a candidate if it doesn’t receive heartbeats from a leader within the election timeout. It then increments the term, votes for itself, and requests votes from others. Nodes vote once per term, on a first-come, first-served basis. A candidate wins if it secures a majority; otherwise, a new term begins and a fresh election is held. Randomized timeouts reduce split votes by staggering candidate starts, ensuring quicker resolution and stable leadership through heartbeat messages. Raft is not Byzantine fault tolerant; the nodes trust the elected leader, and the algorithm assumes all participants are trustworthy. Log Replication The leader manages client requests and ensures consistency across the cluster. Each request is appended to the leader’s log and sent to followers. If followers are unavailable, the leader retries until replication succeeds. Once a majority of followers confirm replication, the entry is committed, applied to the leader’s own state, and considered durable. This also commits prior entries, which followers then apply to their own state, maintaining log consistency across the cluster. If a leader crashes, inconsistencies may arise when some entries were not fully replicated. A new leader resolves this by reconciling logs: it identifies the last matching entry with each follower, deletes conflicting entries in their logs, and replaces them with its own, thus ensuring consistency even after failures. Additional Considerations The Raft algorithm includes the following additional mechanisms that make it a robust consensus algorithm for distributed computing. 
Safety Guarantees Raft ensures the following safety guarantees:
Election safety – at most one leader can be elected in a given term.
Leader append-only – a leader can only append new entries to its log (it can neither overwrite nor delete entries).
Log matching – if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
Leader completeness – if a log entry is committed in a given term, then it will be present in the logs of the leaders of all later terms.
State machine safety – if a node has applied a particular log entry to its state machine, then no other node may apply a different command for the same log index.
Cluster Membership Changes Raft handles cluster membership changes using joint consensus, a transitional phase where both old and new configurations overlap. During this phase, log entries must be committed to both sets, leaders can come from either, and elections require majorities from both. Once the new configuration is replicated to a majority of its nodes, the system fully transitions. Raft also addresses three key challenges:
New nodes without logs are excluded from majorities until they catch up.
Leaders not in the new configuration step down to followers.
Nodes on the old configuration that still recognize a leader ignore disruptive vote requests.
Log Compaction Log compaction in Raft works by nodes taking snapshots of committed log entries, storing them with the last included index and term. Leaders send these snapshots to lagging nodes, which then discard their log entirely or truncate it up to the snapshot’s latest entry. This also supports durability in Raft. Limitations of Raft Raft has its own limitations, trading off scalability and flexibility compared to other consensus algorithms. Leader bottleneck – Raft relies heavily on a single leader to coordinate log replication. 
If the leader fails, the system pauses until a new leader is elected, which can slow progress.
Scaling – Raft doesn’t scale well to very large clusters — leader elections and log replication become slower and riskier as the number of nodes grows.
Network partitions – these can cause temporary unavailability, since Raft prioritizes consistency over availability. An edge case exists where the elected leader is repeatedly forced to resign and leadership switches between nodes continuously, halting the whole cluster.
Real-World Production Usage of Raft
Etcd uses Raft to manage a highly available replicated log — utilized primarily in Kubernetes clusters for configuration management.
Neo4j uses Raft to ensure consistency and safety.
Apache Kafka Raft (KRaft) uses Raft for metadata management. In recent versions, KRaft has replaced Apache ZooKeeper in Kafka.
Camunda uses the Raft consensus algorithm for data replication.
Raft vs. Paxos Raft was introduced to make consensus easier to understand and implement compared to Paxos. While Paxos is theoretically robust, it’s notoriously complex, making it hard for engineers to build reliable systems from it. Raft simplifies the process by breaking consensus into clear steps — leader election, log replication, and safety — without sacrificing correctness. This clarity makes Raft more approachable for real-world distributed systems. When to Choose
Raft – Useful when building new distributed systems where clarity, maintainability, and developer adoption matter (e.g., databases, coordination services).
Paxos – Useful in academic or highly specialized systems where theoretical rigor is prioritized over ease of implementation.
In practice, Raft is usually the better choice for modern engineering teams because it balances correctness with simplicity. Future Trends in Consensus Future consensus algorithms are moving beyond leader-based models like Raft and Paxos. A key trend is leaderless consensus, where no single node coordinates decisions. 
Instead, all nodes collaborate equally, reducing the risk of a single point of failure. This makes systems more resilient and fair, especially in global networks where reliability is critical. For example, in blockchain or distributed databases, leaderless designs help ensure trust and consistency without relying on one “boss” node. Another trend is scalability-focused consensus, which aims to cut down communication overhead. As systems grow to thousands of nodes, traditional methods struggle with efficiency. New protocols are exploring ways to minimize message exchanges while still guaranteeing agreement. Hybrid approaches are also being explored, combining leaderless designs with probabilistic or quorum-based methods. These balance speed and fault tolerance, making them suitable for high-performance applications. Finally, energy-efficient consensus is gaining attention, especially in blockchain, where proof-of-work is costly. Future algorithms will likely emphasize greener, lightweight mechanisms. Consensus is evolving toward fairness, scalability, and sustainability — ensuring distributed systems can handle global scale without sacrificing reliability. Conclusion Raft simplifies the complex world of distributed consensus by breaking it into clear steps — leader election, log replication, and safety guarantees. While engineers may not encounter Raft every day, understanding it is essential when making architectural or design decisions for systems that demand reliability and consistency. Raft ensures that clusters agree on shared state even in the face of failures, though it comes with trade‑offs like leader bottlenecks and limited scalability. Its adoption in tools such as etcd, Kafka, and Neo4j shows its practical importance. Compared to Paxos, Raft is easier to grasp and implement, making it a strong foundation for modern distributed systems. 
As consensus evolves toward leaderless and scalable designs, Raft remains a critical concept every architect should be aware of when shaping resilient, fault‑tolerant solutions. References and Further Reading ConsensusRaft AlgorithmRaft (GitHub)Designing Data-Intensive ApplicationsPatterns of Distributed Systems

By Ammar Husain DZone Core CORE
Hands-On With Kubernetes 1.35

Kubernetes 1.35 was released on December 17, 2025, bringing significant improvements for production workloads, particularly in resource management, AI/ML scheduling, and authentication. Rather than just reading the release notes, I decided to test these features hands-on in a real Azure VM environment. This article documents my journey testing four key features in Kubernetes 1.35:
In-place pod vertical scaling (GA)
Gang scheduling (Alpha)
Structured authentication configuration (GA)
Node declared features (Alpha)
All code, scripts, and configurations are available in my GitHub repository for you to follow along. Test Environment Setup:
Cloud: Azure VM (Standard_D2s_v3: 2 vCPU, 8GB RAM)
Kubernetes: v1.35.0 via Minikube
Container runtime: containerd
Cost: ~$2 for full testing session
Repository: k8s-135-labs
Why Azure VM instead of local? Testing on cloud infrastructure provides production-like conditions and helps identify real-world challenges you might face during deployment. Feature 1: In-Place Pod Vertical Scaling (GA) Theory: The Resource Management Problem Traditional Kubernetes pod resizing has a critical limitation: it requires a pod restart. Old Workflow:
User requests more CPU for pod
Pod must be deleted
New pod created with updated resources
Application downtime
State lost (unless persistent storage)
For production workloads, this causes:
Service interruptions
Lost in-memory state
Longer scaling times
Complex orchestration needed
What's New in K8s 1.35 In-place pod vertical scaling (now GA) allows resource changes without pod restart: YAML apiVersion: v1 kind: Pod spec: containers: - name: app resources: requests: cpu: "500m" memory: "256Mi" limits: cpu: "1000m" memory: "512Mi" resizePolicy: - resourceName: cpu restartPolicy: NotRequired # No restart for CPU! - resourceName: memory restartPolicy: RestartContainer # Memory needs restart Key innovation: Different restart policies for different resources. CPU changes typically don't require restart, while memory might. 
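The per-resource restart decision that resizePolicy expresses can be sketched in a few lines. The following Python toy (field names mirror the YAML above, but the code itself is illustrative, not kubelet logic) shows how a resize that touches only CPU applies in place, while one that touches memory forces a container restart:

```python
# Illustrative sketch of resizePolicy semantics: a restart is needed
# only if some changed resource's policy says RestartContainer.
RESIZE_POLICY = {
    "cpu": "NotRequired",        # apply in place
    "memory": "RestartContainer" # restart to apply
}

def restart_needed(changed_resources: set) -> bool:
    """True if any changed resource demands a container restart."""
    return any(RESIZE_POLICY.get(r) == "RestartContainer"
               for r in changed_resources)

print(restart_needed({"cpu"}))            # False: resize in place
print(restart_needed({"cpu", "memory"}))  # True: memory forces restart
```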
Hands-On Testing Repository: lab1-in-place-resize I created an automated demo script that simulates a real-world scenario: Scenario: Application scaling up to handle increased load
Initial (Light Load): 250m CPU, 256Mi memory
Target (Peak Load): 500m CPU, 1Gi memory
Increase: 2x CPU, 4x memory
Shell # Run the automated demo ./auto-resize-demo.sh Auto-resize script output showing CPU 250m → 500m and Memory 256Mi → 1Gi Results:
CPU doubled (250m → 500m) without restart
Memory quadrupled (256Mi → 1Gi) without restart
Restart count: 0
Total time: 20 seconds
Critical Discovery: QoS Class Constraints During testing, I encountered an important limitation that's not well-documented. The error: Plain Text The Pod "qos-test" is invalid: spec: Invalid value: "Guaranteed": Pod QOS Class may not change as a result of resizing QoS error message when trying to resize only requests What I learned: Kubernetes has three QoS classes:
Guaranteed: requests = limits
Burstable: requests < limits
BestEffort: no requests/limits
The rule: In-place resize cannot change QoS class. Wrong (fails): YAML # Initial: Guaranteed QoS requests: { cpu: "500m" } limits: { cpu: "500m" } # Resize attempt: Would become Burstable requests: { cpu: "250m" } limits: { cpu: "500m" } # QoS change! 
Correct (works): YAML # Resize both proportionally requests: { cpu: "250m" } limits: { cpu: "250m" } # Stays Guaranteed Production Impact Before K8s 1.35: Plain Text Monthly cost for 100 Java pods: - Startup: 2 CPUs × 5 minutes = wasted during idle - Scaling event: Pod restart required - Result: Over-provisioned or frequent restarts After K8s 1.35: Plain Text Monthly cost for 100 Java pods: - Dynamic: High CPU during startup, low during steady-state - Scaling: No restarts needed - Result: 30-40% cost savings observed in testing Key Takeaways
Production-ready: GA status means stable for critical workloads
Real savings: 30-40% cost reduction for bursty workloads
QoS constraint: Plan resource changes to maintain QoS class
Fast: Changes apply in seconds, not minutes
Best use cases:
Java applications (high startup, low steady-state)
ML inference (variable load)
Batch processing (scale down after processing)
Feature 2: Gang Scheduling (Alpha) Theory: The Distributed Workload Problem Modern AI/ML and big data workloads often require multiple pods to work together. Traditional Kubernetes scheduling treats each pod independently, leading to resource deadlocks. The problem: Shell PyTorch Training Job: Needs 8 GPU pods (1 master + 7 workers) Cluster: Only 5 GPUs available What happens: ├─ 5 worker pods scheduled → Consume all GPUs ├─ Master + 2 workers pending ├─ Training cannot start (needs all 8) ├─ 5 GPUs wasted indefinitely └─ Other jobs blocked This is called partial scheduling — some pods run, others wait, nothing works. What Is Gang Scheduling? Gang Scheduling ensures a group of pods (a "gang") schedules together atomically: Shell Training Job: Needs 8 GPU pods Cluster: Only 5 GPUs available With Gang Scheduling: ├─ All 8 pods remain pending ├─ No resources wasted ├─ Smaller jobs can run └─ Once 8 GPUs available → all schedule together Key principle: All or nothing. 
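The all-or-nothing principle can be reduced to a toy admission check. The Python sketch below is purely illustrative (real gang schedulers like scheduler-plugins and Volcano are far more involved), but it captures the invariant: a gang is placed in full or not at all, never partially.

```python
# Toy all-or-nothing gang admission: either every pod in the gang
# fits in the remaining capacity, or none are placed. Illustrative
# only; real schedulers track per-node capacity, timeouts, etc.

def try_schedule_gang(capacity_millicores: int,
                      pod_cpu_millicores: int,
                      gang_size: int) -> int:
    """Return number of pods scheduled: gang_size or 0, never partial."""
    needed = pod_cpu_millicores * gang_size
    return gang_size if needed <= capacity_millicores else 0

# Mirrors the lab scenarios on a 2 vCPU (2000m) node:
print(try_schedule_gang(2000, 500, 3))  # 3 -- small gang all fits
print(try_schedule_gang(2000, 600, 5))  # 0 -- 3000m needed, all stay pending
```

A plain scheduler would instead place 3 of the 5 large-gang pods (1800m) and strand the rest, which is exactly the partial-scheduling deadlock described above.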
Implementation Challenge Kubernetes 1.35 introduces a native Workload API for gang scheduling (Alpha), but I discovered it requires feature gates that caused kubelet instability: YAML # Attempted native approach --feature-gates=WorkloadAwareScheduling=true # Result: kubelet failed to start Error: "context deadline exceeded" Solution: Use scheduler-plugins — the mature, production-tested implementation. Hands-On Testing Repository: lab2-gang-scheduling Setup: Shell # Automated installation ./setup-gang-scheduling.sh # What it installs: # 1. scheduler-plugins controller # 2. PodGroup CRD # 3. RBAC permissions Key discovery: Works with default Kubernetes scheduler — no custom scheduler needed! Test 1: Small Gang (Success Case) YAML apiVersion: scheduling.x-k8s.io/v1alpha1 kind: PodGroup metadata: name: training-gang spec: scheduleTimeoutSeconds: 300 minMember: 3 # Requires 3 pods minimum Shell # Create 3 pods with the gang label for i in {1..3}; do kubectl apply -f training-worker-$i.yaml done Result: Plain Text NAME READY STATUS AGE training-worker-1 1/1 Running 6s training-worker-2 1/1 Running 6s training-worker-3 1/1 Running 6s All pods scheduled within 1 second of each other! 
PodGroup status: YAML status: phase: Running running: 3 Test 2: Large Gang (All-or-Nothing) Now let's prove gang behavior by creating a gang that's too large: YAML apiVersion: scheduling.x-k8s.io/v1alpha1 kind: PodGroup metadata: name: large-training-gang spec: minMember: 5 Shell # Create 5 pods requesting 600m CPU each # Total: 3000m (exceeds our 2 vCPU VM) for i in {1..5}; do kubectl apply -f large-training-$i.yaml done All 5 pods staying pending, proving all-or-nothing behavior Result: Plain Text NAME READY STATUS AGE large-training-1 0/1 Pending 15s large-training-2 0/1 Pending 15s large-training-3 0/1 Pending 15s large-training-4 0/1 Pending 15s large-training-5 0/1 Pending 15s Event: Plain Text Warning FailedScheduling 60s default-scheduler 0/1 nodes are available: 1 Insufficient cpu Perfect gang behavior: All pending, no partial scheduling, no wasted resources! Comparison: With vs. Without Gang Scheduling

| Scenario | Without Gang | With Gang |
|---|---|---|
| Small gang (3 pods, enough resources) | Schedule individually | All schedule together |
| Large gang (5 pods, insufficient resources) | ❌ Partial: 2-3 Running, rest Pending | All remain Pending |
| Resource efficiency | Wasted (partial gang can't work) | Efficient (resources available for other jobs) |
| Deadlock prevention | No protection | Protected |

Production Considerations
Alpha feature warning: Not recommended for production yet
Scheduler-plugins is the mature alternative
Native API will improve in K8s 1.36+
Production alternatives:
Volcano Scheduler
KAI Scheduler (NVIDIA)
Kubeflow with scheduler-plugins
Key Takeaways
Critical for AI/ML: Distributed training needs gang scheduling
Prevents deadlocks: All-or-nothing prevents resource waste
Works today: scheduler-plugins is production-ready
Alpha status: Native API needs maturation
Best use cases:
PyTorch/TensorFlow distributed training
Apache Spark jobs
MPI applications
Any multi-pod workload
Feature 3: Structured Authentication Configuration (GA) Theory: The Authentication Configuration Challenge Traditional Kubernetes 
authentication uses command-line flags on the API server: Shell kube-apiserver \ --oidc-issuer-url=https://accounts.google.com \ --oidc-client-id=my-client-id \ --oidc-username-claim=email \ --oidc-groups-claim=groups \ --oidc-username-prefix=google: \ --oidc-groups-prefix=google: Problems:
Command lines become extremely long
Difficult to validate before restart
No schema validation
Hard to manage multiple auth providers
Requires API server restart for changes
What's New in K8s 1.35 Structured authentication configuration moves auth config to YAML files: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://accounts.google.com audiences: - my-kubernetes-cluster claimMappings: username: claim: email prefix: "google:" groups: claim: groups prefix: "google:" Benefits:
Clear, structured format
Schema validation
Version controlled
Easy to manage multiple providers
Better error messages
Hands-On Testing Repository: lab3-structured-auth ⚠️ Warning: This lab modifies the API server configuration. While safe in minikube, this is risky in production without proper testing. The challenge: Modifying API server configuration requires editing static pod manifests — get it wrong and your cluster breaks. My approach:
Create a backup first
Test in disposable minikube
Verify thoroughly before production
Test: GitHub Actions JWT Authentication I configured the API server to accept JWT tokens from GitHub Actions: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://token.actions.githubusercontent.com audiences: - kubernetes-test claimMappings: username: claim: sub prefix: "github:" Implementation steps: Shell # 1. Create auth config cat > /tmp/auth-config.yaml <<EOF [config above] EOF # 2. Copy to minikube minikube cp /tmp/auth-config.yaml /tmp/auth-config.yaml # 3. 
Backup API server manifest minikube ssh sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/backup.yaml # 4. Add authentication-config flag sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml # Add: --authentication-config=/tmp/auth-config.yaml API server manifest showing authentication-config flag added API Server Restart: The API server automatically restarts when the manifest changes: Shell kubectl get pods -n kube-system -w | grep kube-apiserver Verification: Shell # Check authentication-config flag is active minikube ssh "sudo ps aux | grep authentication-config" Process showing --authentication-config=/tmp/auth-config.yaml flag API verification: Shell # Check authentication API is available kubectl api-versions | grep authentication Result: Shell authentication.k8s.io/v1 Success! Structured authentication is working. Before/After Comparison Before: YAML spec: containers: - command: - kube-apiserver - --advertise-address=192.168.49.2 - --authorization-mode=Node,RBAC After: YAML spec: containers: - command: - kube-apiserver - --authentication-config=/tmp/auth-config.yaml # NEW! 
    - --advertise-address=192.168.49.2
    - --authorization-mode=Node,RBAC

Multiple Providers Example

The structured format makes multiple auth providers easy:

YAML

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://token.actions.githubusercontent.com
    audiences: [kubernetes-test]
  claimMappings:
    username: {claim: sub, prefix: "github:"}
- issuer:
    url: https://accounts.google.com
    audiences: [my-cluster]
  claimMappings:
    username: {claim: email, prefix: "google:"}
- issuer:
    url: https://login.microsoftonline.com/{tenant-id}/v2.0
    audiences: [{client-id}]
  claimMappings:
    username: {claim: preferred_username, prefix: "azuread:"}

Key Takeaways

- Production-ready: GA status, safe for critical clusters
- Better management: Clear structure beats command-line flags
- Multi-provider: Easy to configure multiple identity providers
- Requires restart: API server must restart to load config

Best use cases:

- Organizations with multiple identity providers
- Complex authentication requirements
- Dynamic team structures
- Compliance requirements

Feature 4: Node Declared Features (Alpha)

Theory: The Mixed-Version Cluster Problem

During Kubernetes cluster upgrades, you typically have a rolling update:

Plain Text

Cluster During Upgrade:
├─ node-1 (K8s 1.34) → Old features
├─ node-2 (K8s 1.34) → Old features
├─ node-3 (K8s 1.35) → New features ✅
└─ node-4 (K8s 1.35) → New features ✅

The challenge:

- Scheduler doesn't know which nodes support which features
- Pods using K8s 1.35 features might land on 1.34 nodes → Fail
- Manual node labeling required
- High operational overhead

What Is Node Declared Features?
Nodes automatically advertise their supported Kubernetes features:

Plain Text

status:
  declaredFeatures:
  - GuaranteedQoSPodCPUResize
  - SidecarContainers
  - PodReadyToStartContainersCondition

Benefits:

- Automatic capability discovery
- Safe rolling upgrades
- Intelligent scheduling
- Zero manual configuration

Hands-On Testing

Repository: lab4-node-features

This Alpha feature requires enabling a feature gate in the kubelet configuration.

Initial state:

Shell

kubectl get --raw /metrics | grep NodeDeclaredFeatures

Result:

Shell

kubernetes_feature_enabled{name="NodeDeclaredFeatures",stage="ALPHA"} 0

Feature disabled by default.

Enabling the Feature

Shell

minikube ssh

# Backup kubelet config
sudo cp /var/lib/kubelet/config.yaml /tmp/backup.yaml

# Edit kubelet config

Add the feature gate:

YAML

apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  NodeDeclaredFeatures: true  # ADD THIS
authentication:
  anonymous:
    enabled: false

[Image: kubelet config after featureGates added]

Restart kubelet:

Shell

sudo systemctl restart kubelet
sudo systemctl status kubelet

Verification

Shell

# Check the node now declares features
kubectl get node minikube -o jsonpath='{.status.declaredFeatures}' | jq

Result:

JSON

[
  "GuaranteedQoSPodCPUResize"
]

Success! The node is advertising its capabilities!

The Connection to Lab 1

Notice something interesting? The declared feature is GuaranteedQoSPodCPUResize — the exact capability we tested in Lab 1!
What this means:

- A node running K8s 1.35 knows it supports in-place pod resizing
- It advertises this capability automatically
- The scheduler can route pods requiring this feature here
- Older nodes (K8s 1.34) wouldn't declare this feature

Testing Feature-Aware Scheduling

Shell

# Create a pod
kubectl apply -f feature-aware-pod.yaml

# Check scheduling
kubectl get pod feature-aware-pod

Result:

Plain Text

NAME                READY   STATUS    RESTARTS   AGE
feature-aware-pod   1/1     Running   0          7s

[Image: complete test flow showing the feature declared, the pod created, and successfully scheduled]

Pod successfully scheduled on a feature-capable node!

Future: Smart Scheduling

In future Kubernetes versions (when this reaches Beta/GA), you'll be able to:

YAML

apiVersion: v1
kind: Pod
metadata:
  name: resize-requiring-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/declared-feature-InPlacePodVerticalScaling
            operator: Exists  # Only schedule on nodes with this feature
  containers:
  - name: app
    image: myapp:latest

Key Takeaways

- Automatic discovery: Nodes advertise capabilities without manual config
- Safe upgrades: Mixed-version clusters handled intelligently
- Feature connection: Links to the Lab 1 in-place resize capability
- Alpha status: Requires a feature gate, not production-ready

Best use cases:

- Rolling cluster upgrades
- Mixed-version environments
- Feature-dependent workloads
- Testing new capabilities

Lessons Learned: What Worked and What Didn't

Challenges Encountered

Alpha features are tricky

- Native Workload API caused kubelet failures
- Solution: Used mature scheduler-plugins instead
- Lesson: Alpha doesn't mean "almost ready"

QoS constraints not well-documented

- Spent time debugging resize failures
- Discovered the QoS class immutability requirement
- Lesson: Test thoroughly, document findings

API server modifications are risky

- Required a careful backup strategy
- Minikube made recovery easy
- Lesson: Always test in disposable environments first

What Worked Well

GA features
are solid

- In-place resize: Flawless
- Structured auth: No issues
- Both ready for production

Scheduler-plugins maturity

- More reliable than native Alpha APIs
- Production-tested by many organizations
- Lesson: Mature external projects > Alpha native features

Azure VM testing environment

- Realistic conditions
- Easy to reset
- Cost-effective (~$2 total)
- Lesson: Cloud VMs are ideal for feature testing

Production Readiness Assessment

Ready for Production

1. In-place pod vertical scaling (GA)

- Stable, tested, documented
- Real cost savings (30-40%)
- Clear constraints (QoS preservation)
- Recommendation: Deploy to production now

2. Structured authentication configuration (GA)

- Mature, well-designed
- Better than command-line flags
- Requires an API server restart
- Recommendation: Use for new clusters, migrate existing ones carefully

Use With Caution ⚠️

3. Gang scheduling (Alpha)

- Native API unstable
- Use scheduler-plugins instead (production-ready)
- Essential for AI/ML workloads
- Recommendation: Use scheduler-plugins, not the native API

4. Node Declared Features (Alpha)

- Requires a feature gate
- Limited current value
- Will be critical when GA
- Recommendation: Wait for Beta/GA unless testing upgrades

Cost and Time Investment

Testing Environment Costs

- Azure VM: Standard_D2s_v3
- Duration: 8 hours of testing
- Compute cost: ~$0.77 (VM stopped between sessions)
- Storage cost: ~$0.10
- Total: Less than $1 for comprehensive testing

Time Investment

Activity                  | Time
Environment setup         | 30 min
Lab 1 (In-place resize)   | 1.5 hours
Lab 2 (Gang scheduling)   | 2 hours
Lab 3 (Structured auth)   | 1 hour
Lab 4 (Node features)     | 1.5 hours
Documentation             | 1.5 hours
Total                     | 8 hours

ROI: Knowledge gained far exceeds time invested. Testing prevented production issues.
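As a sanity check on the cost figures above, the compute number follows directly from the hourly rate. The rate below is my assumption (an approximate pay-as-you-go price for this VM size), not a figure from the labs; the 8 hours and ~$0.10 storage cost come from the article.

```python
# Rough cost check for the Azure test VM.
# ASSUMPTION: ~$0.096/hr approximates the on-demand Standard_D2s_v3 rate.
VM_HOURLY_RATE_USD = 0.096
HOURS_USED = 8            # VM stopped between sessions
STORAGE_COST_USD = 0.10   # from the article

compute_cost = VM_HOURLY_RATE_USD * HOURS_USED
total_cost = compute_cost + STORAGE_COST_USD
print(f"compute ~${compute_cost:.2f}, total ~${total_cost:.2f}")
# compute ~$0.77, total ~$0.87 — consistent with "less than $1"
```

The point is less the exact rate than the habit: a one-line arithmetic check keeps cost claims honest.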
Recommendations for Your Kubernetes Journey

If You're Running K8s 1.34 or Earlier

- Upgrade path: 1.34 → 1.35 is straightforward
- Focus on GA features first: In-place resize, structured auth
- Test in dev/staging: Use my repository as a starting point
- Measure impact: Track cost savings from in-place resize

If You're Running AI/ML Workloads

- Implement gang scheduling immediately: Use scheduler-plugins
- Test distributed training: Prevent resource deadlocks
- Monitor scheduling: Ensure all-or-nothing behavior is working
- Plan for the native API: It will mature in K8s 1.36+

If You're Managing Large Clusters

- Structured auth: Migrate now for better management
- Rolling upgrades: Plan for node feature declaration (future)
- Cost optimization: In-place resize reduces over-provisioning
- Multi-tenancy: Gang scheduling prevents noisy-neighbor issues

Complete Repository

All code, scripts, and detailed instructions are available:

GitHub: https://github.com/opscart/k8s-135-labs

Each lab includes:

- Detailed theory and background
- Step-by-step instructions
- Automated scripts where possible
- Troubleshooting guides
- Production recommendations
- Rollback procedures

Conclusion

Kubernetes 1.35 brings meaningful improvements to production workloads.

For cost optimization:

- In-place pod resize delivers real savings (30-40% in my tests)
- Eliminates over-provisioning for bursty workloads
- No application changes required

For AI/ML workloads:

- Gang scheduling prevents resource deadlocks
- Essential for distributed training
- Scheduler-plugins provides a production-ready solution

For operations:

- Structured authentication simplifies management
- Node declared features will improve rolling upgrades
- Better observability and debugging

The bottom line: K8s 1.35 GA features are production-ready and deliver immediate value. Alpha features show promising future directions but need more maturation.
Connect:

- Blog: https://opscart.com
- GitHub: https://github.com/opscart
- LinkedIn: linkedin.com/in/shamsherkhan

Other projects:

- Kubectl-health-snapshot – Kubernetes Optimization Security Validator
- k8s-ai-diagnostics – Kubernetes AI Diagnostics

References

- Kubernetes 1.35 Release Notes
- KEP-1287: In-Place Pod Vertical Scaling
- Scheduler-Plugins Documentation
- KEP-3331: Structured Authentication Configuration
- KEP-4568: Node Declared Features

By Shamsher Khan
Why “End-to-End” AI Will Always Need Deterministic Guardrails

The "Long Tail" Is Longer Than You Think

Imagine you are driving at night. Your headlights catch a figure ahead. It appears to be a large dog standing on a single wheel, moving at 10 mph. A human driver immediately processes this as: Ah, it's Halloween! It's probably a kid in a Halloween dog costume riding their unicycle home after their candy run. The driver then categorizes the "figure" as a human, gives them space, and navigates around them carefully.

A pure end-to-end (E2E) neural network deployed on an autonomous vehicle (AV), however, lacks this semantic luxury. It ingests a stream of raw data, such as image frames, LiDAR points, or similar sensor data, and here it encounters something it has likely never seen in its training set. Is it a dog? But dogs don't have wheels. Is it a vehicle with a pet inside? Pets don't typically ride in single-wheeled vehicles. Is it a kid on a unicycle? Kids are not typically furry, nor do they have tails.

These situations are often termed "long tail" events. Without the semantic context to resolve this anomaly, an E2E network's behavior may become undefined. It might choose to stop or swerve. Or, in the worst case, it might confidently decide this is an erroneous, false-positive detection and choose to ignore it entirely.

The AI industry's obsession with E2E systems is understandable. The idea of feeding raw signals into a massive model and getting back a final set of trustworthy decisions unlocks multiple benefits. For one, the architecture is elegant. Second, since the architecture is less complex, it scales easily. That works beautifully 99% of the time.

Engineering for safety-critical systems is not about the 99%. It is about understanding the nuance in edge cases. It is about the "Truck full of Stop Signs." If a truck carrying a stack of stop signs merges onto the highway, a pure E2E model might see the pattern "STOP" and slam on the brakes at highway speeds.
A human driver, however, understands the context: stop signs don't move at highway speeds; therefore, these are cargo, not traffic control. A deterministic logical layer implements this missing, context-derived logic when the human is not in the loop.

Efforts can be made to continuously optimize such E2E models to safely handle such edge cases. However, by definition, these long-tail events are never-ending. They are commonly categorized as unknown-unknowns. Therefore, we need some form of a fundamental deterministic wrapper to impose the laws of physics and logic on the probabilistic output of an AI model. In other words, it is important to establish a "Guardrail" that ensures the system exhibits safe behavior even if the E2E model suggests otherwise.

Why E2E Is a Promise Worth Pursuing

To be fair, traditional modular stacks have their own limitations. The shift to E2E is driven by tangible architectural advantages:

- Architecture: Traditional stacks require complex sub-modules; E2E promises a unified architecture.
- Communication: Traditional stacks need robust, manually defined protocols between modules; E2E systems learn internal representations during training.
- Context: Traditional stacks suffer from "information loss" as data moves down the pipeline; E2E systems retain rich metadata as secondary inputs.

Consider the traditional AV pipeline, which slices responsibility into three major silos:

- Perception: The "eyes." It ingests sensor data to classify objects such as pedestrians, traffic signs, etc.
- Planner: The "brain." It processes perception output to calculate a safe trajectory.
- Control: The "hands and feet." It converts the trajectory into steering, throttle, and brake commands.

In this pipeline, Perception might tell the Planner: "There is a pedestrian 50m ahead, defined in the given bounding box, and moving towards us at 3 mph." Unfortunately, in this distillation, valuable nuance is lost. Is the pedestrian looking at their phone? Are they stumbling?
Are they making eye contact?

An E2E system can retain rich context to enable desirable behavior. Rather than seeing a bounding box or object list, it processes the subtle nuance of the scene. It might instinctively have the AV slow down for a pedestrian whose posture suggests distraction. This ability to retain nuance is yet another reason an E2E system promises better performance. However, performance is not safety. While the E2E model is excellent at intuition, it lacks guarantees. Therefore, we need a hybrid approach.

The Architectural Solution: The 'Simplex' Pattern

We should let the E2E model drive the end-to-end decision-making process, but stop short of letting it be the sole decision maker. It suggests what to do based on complex pattern matching. However, a separate, deterministic layer must have the final authority to reject unsafe actions. This creates a hybrid architecture, formally referred to in safety engineering as a simplex architecture.

In this architecture, the E2E model might report that it sees a clear path and request acceleration to 85 mph. The Validator, which is our Deterministic Guardrail, would then evaluate this request against safety constraints such as kinematic limits, semantic consistency, or hard geofences. Finally, the Actuator executes only the commands approved by the deterministic validator.

Guardrail Implementation: A Simple Example

Let's look at what this concept could look like in practice. The following Python script simulates a Trajectory Validator. The core component is the TrajectoryValidator class. It accepts ground-truth data, specifically the legal speed limit from an HD map and the physical boundaries of the lane. We then run this validator against three scenarios, with a specific focus on Scenario B: The Vandalized Sign.
In this specific edge case, someone has tampered with a '35 mph' speed limit sign to look like an '85 mph' sign by spray-painting over the '3.' The E2E vision model, trusting its optical input, requests an acceleration to 85 mph. However, the validator performs a logic check and determines there are no 85 mph roads in the area. The map further lists the segment as a 35 mph zone. Detecting the discrepancy, the validator overrides the E2E AI's probabilistic output and limits the AV to a logical and legal speed.

Python

class TrajectoryValidator:
    def __init__(self, map_speed_limit, road_boundaries):
        self.map_speed_limit = map_speed_limit
        self.road_boundaries = road_boundaries  # [min_lat, max_lat]

    def validate_command(self, E2E_output):
        """
        Input: Dictionary containing the E2E_output of 'speed' and 'position'.
        Output: Dictionary with the 'final_cmd' and a 'status' note.
        """
        final_cmd = E2E_output.copy()
        status = "VALID OUTPUT"

        # Check 1: Speed Limit Enforcement
        # We allow a small buffer (e.g., +5 mph) but limit anything beyond that.
        # This handles cases where the AI misreads a sign (e.g., 35 as 85).
        hard_limit = self.map_speed_limit + 5
        if E2E_output['speed'] > hard_limit:
            final_cmd['speed'] = hard_limit
            status = "INVALID OUTPUT: EXCEEDS LIMIT. CLAMPING SPEED."

        # Check 2: Road Boundary Enforcement
        # If the AI proposes driving off-road, we trigger a safety stop.
        pos_lat = E2E_output['position_lat']
        if not (self.road_boundaries[0] <= pos_lat <= self.road_boundaries[1]):
            final_cmd['speed'] = 0  # Pull over to the shoulder
            final_cmd['position_lat'] = self.road_boundaries[0]
            status = "INVALID OUTPUT: OFF-ROAD DETECTED. INITIATING SAFE STOP."

        return final_cmd, status


# --- Scenario Runner ---
def run_scenarios():
    # Define our "World": 35 mph zone, road exists between lat=-5 and lat=5
    validator = TrajectoryValidator(map_speed_limit=35, road_boundaries=[-5, 5])

    # Case A: Normal Operation
    # E2E AI proposes a safe speed (30) in the middle of the lane (lat=1)
    output_a = {'speed': 30, 'position_lat': 1}
    cmd_a, status_a = validator.validate_command(output_a)
    print(f"Scenario A: E2E Output {output_a} -> Validated: {cmd_a} [{status_a}]\n")

    # Case B: The "Graffiti" Error
    # E2E AI reads a vandalized sign and proposes 85 mph
    output_b = {'speed': 85, 'position_lat': 2}
    cmd_b, status_b = validator.validate_command(output_b)
    print(f"Scenario B: E2E Output {output_b} -> Validated: {cmd_b} [{status_b}]\n")

    # Case C: Hallucination
    # E2E AI thinks the sidewalk (lat=12) is drivable space
    output_c = {'speed': 20, 'position_lat': 12}
    cmd_c, status_c = validator.validate_command(output_c)
    print(f"Scenario C: E2E Output {output_c} -> Validated: {cmd_c} [{status_c}]\n")


if __name__ == "__main__":
    run_scenarios()

The Universal Need for Guardrails Around End-to-End AI

This architectural pattern isn't specific to autonomous vehicles. It applies anywhere E2E AI black boxes make high-stakes decisions. Consider an E2E medical AI model in radiology. Such a model might analyze raw MRI scans to detect tumors. It is incredibly sensitive and can spot patterns invisible to the human eye. However, it is also prone to "shortcut learning"; perhaps it learned to associate a specific lighting artifact with benign tissue.

- The failure: The AI classifies a reasonably large 5cm tumor as "Benign" with 99% confidence because of the lighting-artifact shortcut it learned in training.
- A suitable guardrail: A deterministic rule enforcing that if a tumor with a mass > 3cm is detected, it MUST be flagged for biopsy irrespective of the model's confidence in it being benign.

The guardrail should not care about the AI's confidence in a limited, yet high-stakes, situation.
It should care about hard metrics. By layering simple rules over a complex E2E AI model, we prevent catastrophic false positives or negatives in medical AI, autonomous driving, and similar safety-critical systems.

Conclusion: Why Determinism Is Non-Negotiable

The beauty of the code above is its traceability. If the car refuses to drive 85 mph, we don't have to guess. The logs explicitly state 'INVALID OUTPUT: EXCEEDS LIMIT.' We have instant, deterministic root cause analysis. On the other hand, if a pure E2E model decides to drive 85 mph, debugging it involves dissecting millions of weights to understand why it prioritized the "8" loop of the graffiti over the context of the residential street.

We cannot train out every edge case. It will be a long time before we have enough training data for the majority of the "unknown-unknowns." However, by wrapping our probabilistic E2E AI in deterministic guardrails, we allow the AI to be brilliant while ensuring the overall system remains safe.
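To make that conclusion concrete, the radiology guardrail described earlier also reduces to a few traceable lines. This is an illustrative sketch: the function name and return labels are mine, and only the "mass > 3cm" threshold comes from the example above.

```python
def radiology_guardrail(model_label: str, model_confidence: float,
                        tumor_size_cm: float) -> str:
    """Deterministic override: any mass above the size threshold is flagged
    for biopsy, no matter how confident the model is that it is benign."""
    BIOPSY_SIZE_THRESHOLD_CM = 3.0  # hard clinical rule from the example above
    if tumor_size_cm > BIOPSY_SIZE_THRESHOLD_CM:
        return "FLAG_FOR_BIOPSY"
    return model_label  # smaller findings defer to the model's classification

# The 5 cm tumor labeled "Benign" at 99% confidence is still flagged:
print(radiology_guardrail("Benign", 0.99, 5.0))  # FLAG_FOR_BIOPSY
print(radiology_guardrail("Benign", 0.99, 1.2))  # Benign
```

As with the trajectory validator, the override reason is a plain, loggable rule rather than a buried weight pattern.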

By Nishant Bhanot
Implementing Decentralized Data Architecture on Google BigQuery: From Data Mesh to AI Excellence

In the era of generative AI and large language models (LLMs), the quality and accessibility of data have become the primary differentiators for enterprise success. However, many organizations remain trapped in the architectural paradigms of the past — centralized data lakes and warehouses that create massive bottlenecks, high latency, and "data swamps."

Enter the Data Mesh. Originally proposed by Zhamak Dehghani, Data Mesh is a sociotechnical approach to sharing, accessing, and managing analytical data in complex environments. When paired with the scaling capabilities of Google BigQuery, it creates a foundation for "AI Excellence," where data is treated as a first-class product, ready for consumption by machine learning models and business units alike.

In this technical deep-dive, we will explore how to architect a Data Mesh on Google Cloud, leveraging BigQuery's unique features to drive decentralized data ownership and AI-ready infrastructure.

1. The Architectural Shift: Why Data Mesh?

Traditional data architectures are typically centralized. A single data engineering team manages the ingestion, transformation, and distribution of data for the entire company. As the number of data sources and consumers grows, this team becomes a bottleneck.

The Four Pillars of Data Mesh

- Domain-Oriented Decentralized Data Ownership: The people who know the data best (e.g., the Marketing team) should own and manage it.
- Data as a Product: Data is not a byproduct; it is a product delivered to internal consumers with SLAs, documentation, and quality guarantees.
- Self-Serve Data Platform: A centralized infrastructure team provides the tools (like BigQuery) so domains can manage their data autonomously.
- Federated Computational Governance: Global standards for security and interoperability are enforced through automation.

Comparative Overview: Monolith vs.
Mesh

Feature        | Centralized Data Lake/Warehouse      | Decentralized Data Mesh
Ownership      | Central Data Team                    | Business Domains (Sales, HR, etc.)
Data Quality   | Reactive (fixed by data engineers)   | Proactive (managed by domain owners)
Scalability    | Linear (bottlenecks occur)           | Exponential (parallel execution)
Access Control | Uniform (often too loose or tight)   | Granular (domain-specific policies)
AI Readiness   | Low (siloed context)                 | High (context-rich data products)

2. Technical Mapping: Building the Mesh on BigQuery

Google BigQuery is uniquely suited for Data Mesh because it separates storage and compute, allowing different projects to interact with the same data without physical duplication.

Core Components

- BigQuery Datasets: Act as the boundaries for data products.
- Google Cloud Projects: Serve as the containers for domain environments.
- Analytics Hub: Facilitates secure, cross-organizational data sharing.
- Dataplex: Provides the fabric for federated governance and data discovery.

System Architecture Diagram

This diagram illustrates the relationship between domain-specific producers, the central catalog, and the AI consumers.

3. Implementing Domain Ownership and Data Products

In a Data Mesh, each domain manages its own BigQuery projects. Domains are responsible for the full lifecycle of their data products: ingestion, cleaning, and exposure.

Defining the Data Product

A data product on BigQuery is not just a table. It includes:

- The Raw Data (internal dataset)
- The Cleaned/Aggregated Data (public dataset)
- Metadata (labels and descriptions)
- Access Controls (IAM roles)

Code Example: Creating a Domain-Specific Data Product

Using SQL and gcloud, we can define a data product with specific access controls. In this example, we create a "Customer LTV" product for the Sales domain.
SQL

-- Step 1: Create the dataset in the domain project
-- This acts as the container for our data product
CREATE SCHEMA `sales-domain-prod.customer_analytics`
OPTIONS(
  location="us",
  description="High-quality customer lifetime value data for AI consumption",
  labels=[("env", "prod"), ("domain", "sales"), ("data_product", "cltv")]
);

-- Step 2: Create a secure view to expose only necessary columns
-- This follows the principle of least privilege
CREATE OR REPLACE VIEW `sales-domain-prod.customer_analytics.cltv_gold` AS
SELECT
  customer_id,
  total_spend,
  last_purchase_date,
  predicted_churn_score
FROM `sales-domain-prod.customer_analytics.raw_customer_data`
WHERE is_verified = TRUE;

Automating Governance with IAM

To ensure the domain maintains ownership while allowing the central team to monitor, we use granular IAM roles.

Shell

# Assign the Data Owner role to the Sales Domain Team
gcloud projects add-iam-policy-binding sales-domain-prod \
  --member="group:[email protected]" \
  --role="roles/bigquery.dataOwner"

# Assign the Data Viewer role to the AI/ML Consumer Service Account
gcloud projects add-iam-policy-binding sales-domain-prod \
  --member="serviceAccount:[email protected]" \
  --role="roles/bigquery.dataViewer"

4. Federated Governance with Google Dataplex

Governance in a Data Mesh cannot be manual. We use Google Dataplex to automate metadata harvesting, data quality checks, and lineage tracking across all domain projects.

The Data Flow for Governance

Data Quality Checks (The "Quality Score" Metric)

To ensure AI models aren't trained on garbage, domains must define quality rules. Dataplex allows us to run YAML-based data quality checks.

YAML

# Dataplex data quality rule example
rules:
- column: customer_id
  dimension: completeness
  threshold: 0.99
  expectation_type: expect_column_values_to_not_be_null
- column: total_spend
  dimension: validity
  expectation_type: expect_column_values_to_be_between
  params:
    min_value: 0
    max_value: 1000000

5.
From Mesh to AI: Fueling Vertex AI

Once the Data Mesh is established, AI teams no longer spend 80% of their time finding and cleaning data. They can "shop" for data in the Analytics Hub and connect it directly to Vertex AI.

Seamless Integration with Vertex AI Feature Store

BigQuery acts as the offline store for Vertex AI. Because the data is already organized into domain-driven products, creating a feature set is a simple metadata mapping.

Code Example: Training a Model on Mesh Data

Using BigQuery ML (BQML), we can train a model directly on our decentralized data product without moving it to a central location.

SQL

-- Training a churn prediction model using the Sales domain data product
CREATE OR REPLACE MODEL `ai-consumer-project.models.churn_predictor`
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  * EXCEPT(customer_id)
FROM `sales-domain-prod.customer_analytics.cltv_gold` AS data_product
JOIN `marketing-domain-prod.engagement.user_activity` AS activity_product
  ON data_product.customer_id = activity_product.user_id;

This SQL highlights the power of Data Mesh: the AI consumer joins two different data products from two different domains (Sales and Marketing) seamlessly because they adhere to global naming and identity standards.

6. Implementation Strategy: A Phased Approach

Moving to a Data Mesh is as much about culture as it is about technology. Follow this roadmap:

- Phase 1: Identification (Months 1-2): Identify 2-3 pilot domains (e.g., Sales, Logistics). Define their data product boundaries.
- Phase 2: Platform Setup (Months 3-4): Set up the BigQuery environment with Dataplex and Analytics Hub. Establish a "Self-Serve" template using Terraform.
- Phase 3: Governance Automation (Months 5-6): Implement automated data quality and cataloging. Define global tagging standards.
- Phase 4: AI Scaling (Month 6+): Enable ML teams to consume data products via Vertex AI and BigQuery ML.

7.
Challenges and Mitigations

Challenge        | Description                                        | Mitigation
Interoperability | Domains using different IDs for the same customer. | Enforce a "Master Data Management" (MDM) set of global dimensions.
Cost Management  | Decentralized teams might overspend on BigQuery slots. | Use BigQuery Reservations and quotas per project/domain.
Skills Gap       | Domain teams might lack data engineering skills.   | Provide a robust "Self-Serve" platform with easy-to-use templates.

Conclusion: The Mesh as an AI Accelerator

The ultimate goal of the Data Mesh on BigQuery is to democratize intelligence. By decentralizing data ownership, we ensure that those closest to the business logic are responsible for the data's integrity. By centralizing governance and tools, we ensure that this data remains discoverable, secure, and ready for the next generation of AI.

Building a Data Mesh is not an overnight process, but for organizations looking to scale AI beyond simple prototypes, it is the only viable path forward. Start small, treat your data as a product, and let BigQuery's infrastructure handle the scale while your domains handle the value.

For more technical guides on Google AI architecture and implementation, follow:

- Twitter/X
- LinkedIn
- GitHub
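As a closing illustration, the two Dataplex quality rules shown in Section 4 boil down to simple column-level checks. The sketch below is not the Dataplex API — the row format and helper names are mine — but it shows what "completeness" and "validity" mean operationally.

```python
def completeness(rows, column):
    """Fraction of rows where the column is non-null."""
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)

def validity_between(rows, column, min_value, max_value):
    """Fraction of rows whose value falls within [min_value, max_value]."""
    in_range = sum(1 for row in rows
                   if row.get(column) is not None
                   and min_value <= row[column] <= max_value)
    return in_range / len(rows)

rows = [
    {"customer_id": "c1", "total_spend": 120.0},
    {"customer_id": "c2", "total_spend": 0.0},
    {"customer_id": None, "total_spend": 50.0},      # violates completeness
    {"customer_id": "c4", "total_spend": 999999.0},
]

print(completeness(rows, "customer_id"))                    # 0.75 -> fails the 0.99 threshold
print(validity_between(rows, "total_spend", 0, 1_000_000))  # 1.0 -> passes
```

In the mesh, each domain owns the thresholds for its own products; the platform merely runs the checks and publishes the scores.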

By Jubin Abhishek Soni
Clean Code in the Age of Copilot: Why Semantics Matter More Than Ever

Abstract

Generative AI tools treat your codebase as a prompt; if your context is ambiguous, the output will be hallucinated or buggy. This article demonstrates how enforcing clean code principles — specifically naming, Single Responsibility, and granular unit testing — drastically improves the accuracy and reliability of AI coding assistants.

Introduction

There is a prevailing misconception that AI coding assistants (like GitHub Copilot, Cursor, or JetBrains AI) render clean code principles obsolete. The argument suggests that if an AI writes the implementation and explains it, human readability matters less.

This view is dangerous. From an architectural standpoint, AI does not fix bad code; it amplifies it. LLMs work on probability and context. If your codebase is riddled with "God Classes," ambiguous variable names (var data), and leaked abstractions, you are effectively feeding "noise" into the model's context window. The result is context contamination: the AI mimics your bad patterns, generating legacy code at lightning speed.

To leverage AI effectively, we must raise the bar on code quality. We are no longer just writing for human maintainers; we are optimizing the context for our AI pair programmers.

Prerequisites

To get the most out of this architectural deep dive, you should be familiar with:

- Java or C# syntax (examples use Java 17+)
- SOLID principles (specifically Single Responsibility)
- AI assistants (experience with Copilot, ChatGPT, or similar tools)
- Basic refactoring patterns (Extract Method, Rename Variable)

Core Concept: The Codebase Is the Prompt

Think of your current file and its imports as the "system prompt" for the AI. When an LLM suggests code, it looks at the surrounding tokens to determine intent.
- Low semantic density: Code using names like Manager, Util, or process() forces the AI to guess intent based on structural patterns rather than business logic.
- High semantic density: Code using names like InvoiceReconciliationStrategy or calculateOverdueFees() confines the AI's search space, leading to highly accurate logic generation.

The shift: Clean code is no longer just about maintainability; it is about prompt engineering via architecture.

Implementation: The "Context" Test

Let's look at a practical example of how bad abstractions confuse AI, and how refactoring fixes the generation.

Scenario 1: The "God Object" (Low Context)

We have a legacy class that handles everything regarding a user. This is a common anti-pattern.

Java

public class UserManager {
    // Ambiguous naming, mixed responsibilities
    public void handle(String id, boolean type, double val) {
        if (type) {
            // DB connection logic leaked here
            String q = "UPDATE users SET s = " + val + " WHERE id = " + id;
            Database.exec(q);
        } else {
            // Business logic mixed with persistence
            if (val > 100) {
                System.out.println("User " + id + " is high value");
                Email.send(id, "Promo");
            }
        }
    }
}

The AI failure mode: If you ask Copilot to "Add a check for suspended users" in this context, it will likely:

- Insert raw SQL queries directly into the method (mimicking the bad pattern)
- Use magic booleans or unclear variable names
- Violate the Open/Closed principle

The AI sees the mess and assumes the mess is the correct architectural style.

Scenario 2: Refactoring for Semantic Density

Let's refactor this to be "AI-readable." We will apply single responsibility (SRP) and explicit naming.

Step 1: Isolate the Data Structure

First, we create a record to define exactly what a "User" is.

Java

// Clear definition of data
public record UserScore(String userId, double loyaltyPoints, boolean isPremium) {}

Step 2: Define Clear Interfaces

We create interfaces that describe actions, not generic managers.
```java
public interface UserRepository {
    void updateLoyaltyPoints(String userId, double points);
    UserScore getUser(String userId);
}

public interface PromotionService {
    void sendHighValuePromo(String userId);
}
```

Step 3: The Business Logic (The Clean Context)

Now, we write the logic class. Notice how the code reads like natural language.

```java
public class LoyaltyTierHandler {
    private final UserRepository userRepo;
    private final PromotionService promoService;
    private static final double HIGH_VALUE_THRESHOLD = 100.0;

    public LoyaltyTierHandler(UserRepository userRepo, PromotionService promoService) {
        this.userRepo = userRepo;
        this.promoService = promoService;
    }

    /**
     * AI Instruction: This method calculates eligibility based purely on points.
     */
    public void processUserStatus(String userId, double currentPoints) {
        if (currentPoints > HIGH_VALUE_THRESHOLD) {
            promoService.sendHighValuePromo(userId);
        }
        userRepo.updateLoyaltyPoints(userId, currentPoints);
    }
}
```

The AI success mode: If you now ask Copilot to "Add a check for suspended users," the context provides clear guardrails.

  • Boundary detection: The AI sees UserRepository. It will likely suggest adding isSuspended() to the interface rather than writing raw SQL in the handler.
  • Logic placement: It sees HIGH_VALUE_THRESHOLD. It will likely create a SUSPENDED_STATUS constant rather than using magic strings.

By fixing the naming and structure, you forced the AI to generate code that adheres to your architecture.

Prompt Engineering via Architecture: The Unit Test Feedback Loop

If production code is the "context," your unit tests are the "constraints." One of the most powerful workflows for AI-assisted development is test-driven prompting. Instead of asking the AI to "write a function that does X," you write a granular, descriptive unit test that fails, and then ask the AI to "make this test pass."

The "Vague Test" Anti-Pattern

Consider a test suite with poor naming conventions and loose assertions.
```java
@Test
void testProcess() {
    // Vague setup
    Handler h = new Handler();
    var result = h.run("123", true);

    // Weak assertion
    assertNotNull(result);
}
```

The AI result: If you highlight this test and ask Copilot to generate the run method, it has zero semantic guidance. It might return a hardcoded string, a random object, or a null-check wrapper. The test passes, but the code is useless.

The "Spec-Based" Test Pattern

Now, let's apply clean code naming conventions to the test. This effectively turns your test method name into a prompt.

```java
@Test
void givenSuspendedUser_WhenProcessingTransaction_ThenThrowSecurityException() {
    // 1. Arrange: Clear context
    var user = new User("123", UserStatus.SUSPENDED);
    var handler = new TransactionHandler();

    // 2. Act & Assert: Strict constraints
    assertThrows(SecurityException.class, () -> {
        handler.process(user, 50.00);
    });
}
```

The AI result: When you ask the AI to implement process(), it analyzes the test specifically:

  • Input: It sees UserStatus.SUSPENDED.
  • Action: It sees process().
  • Outcome: It sees SecurityException.

The AI generates the implementation with near 100% accuracy because the test structure tightly constrains it.

Key Takeaways

  • Small context windows: Large "God Classes" fill up the LLM's context window with irrelevant noise. Smaller, focused classes ensure the AI focuses only on the relevant logic.
  • Tests are constraints: Use unit tests with "Given-When-Then" naming conventions to force the AI to solve a specific logic puzzle, rather than guessing your intent.
  • Mimicry is the default: AI mimics the style of the file it is editing. If you allow "dirty hacks," the AI will generate dirty hacks. Clean code acts as a style guide for the model.

Conclusion

AI hasn't killed clean code; it has monetized it. The ROI on refactoring is now immediate: cleaner code means better AI suggestions, faster development cycles, and less time debugging machine-generated technical debt.

By Nikita Kothari
The A3 Handoff Canvas

TL;DR: The A3 Handoff Canvas

The A3 Framework helps you decide whether AI should touch a task (Assist, Automate, Avoid). The A3 Handoff Canvas covers what teams often skip: how to run the handoff without losing quality or accountability. It is a six-part workflow contract for recurring AI use: task splitting, inputs, outputs, validation, failure response, and record-keeping. If you cannot write one part down, that is where errors and excuses will enter.

The Handoff Canvas closes a gap in a useful progression: from an unstructured prompt, to applying the A3 Framework, to documenting decisions with the A3 Handoff Canvas, to creating transferable skills, potentially leading to building agents.

You Solved the Delegation Question. However, Things Are Now Starting to Go Wrong.

The A3 Framework gives you a decision system: Assist, Automate, or Avoid. Practitioners who adopted it stopped prompting first and thinking second. Good; that was the point.

But a pattern keeps repeating. A Scrum Master decides that drafting a Sprint Review recap for stakeholders who could not attend falls into the Assist category. So they prompt Claude, get a draft, edit it, and send it out. It works. By the third Sprint, a colleague asks: "How did you produce that? I want to do the same." And the Scrum Master cannot explain their own process in a way that someone else could repeat. The prompt is somewhere in a chat window. The context was in their head. The validation was: "Does this look right to me?"

That is not a workflow, but a habit. Habits do not transfer to colleagues and do not survive personnel changes.

The Shape of the Solution

Look at the canvas as a whole before we walk through each element. Six parts, each forcing one decision you cannot skip:

  • Task split: What does AI do? What does the human do? Where is the boundary?
  • Inputs: What data does AI need? What format? What must be anonymized?
  • Outputs: What does "good" look like?
    What are the format, length, and quality criteria?
  • Validation: Who checks the output? Against what standard? Using what method?
  • Failure response: What happens when the output is wrong? What are the stop rules?
  • Records: What do you log? At what level of detail? Who owns the log?

As a responsible Agile practitioner, your task is simple: complete one canvas per significant AI workflow, not per prompt, but per recurring workflow.

When not to use the canvas: One-off prompts, low-stakes personal productivity tasks, and situations where the cost of record keeping exceeds the risk. The canvas is for recurring workflows where errors propagate or where other people depend on the output.

Anti-pattern to watch for: Filling out canvases after the fact to justify what you already did. That is governance theater. If the A3 Handoff Canvas does not change how you work, you are performing compliance, not practicing it.

The data confirms this is not an edge case. In our AI4Agile Practitioners Report (n = 289), 83% of respondents use AI tools, but 55.4% spend 10% or less of their work time with AI, and 85% have received no formal training on AI usage in Agile contexts [2]. Adoption is broad but shallow. Most practitioners are experimenting without structure, and the gap between "I use AI" and "I have a repeatable workflow" is where quality and accountability disappear. Dell'Acqua et al. (2025) found a similar pattern in a controlled setting: individuals working with AI matched two-person team performance; because prompting was inexperienced and the tools unoptimized, the researchers consider this a lower bound [1].

How to Use the A3 Handoff Canvas

Let us walk through all six fields of the canvas.

1. Task Split: Who Does What?

If you do not write the boundary down, the human side quietly becomes the "copy editor."

Purpose: Make both sides explicit: what AI does, what the human does, who owns the result.

What to decide:

  • What specific task does AI perform?
    (Not "helps with" but a concrete, verifiable action.)
  • What specific task does the human perform? (Not "reviews" but what you review, against what.)
  • Who owns the final output if a stakeholder questions it?

Example: "AI drafts a Sprint Review recap from the Jira export, structured by Sprint Goal alignment. In collaboration with the Product Owner, the Scrum Master selects which items to include, adds qualitative assessment, and decides what to share externally versus keep team-internal."

Common failure (rubber-stamping Assist): Practitioners define the AI's task but leave the human's task vague. "I review it" is not a task definition. When the human side is vague, Assist degrades into copy-paste: you classified it as Assist, but you are treating it as Automate without the audit cadence, eroding your critical thinking over time. Every time you skip the judgment step, the muscle weakens. The practitioners in our AI4Agile 2026 survey who worry about AI are not worried about replacement; they are worried about losing the skills that make their judgment worth having [2].

2. Inputs: What Goes In?

Most practitioners skip this element because they think the input is obvious. It is not. Inputs drift over time, and occasionally change overnight when tooling updates.

Purpose: Specify what data AI needs, in what format, and what must stay out.

What to decide:

  • Which data sources? (Jira export, meeting notes, Slack thread summaries, customer interview transcripts.)
  • What format? (CSV, pasted text, uploaded document, structured prompt template, or RAG.)
  • What must be anonymized or excluded before it enters any AI tool?

Example: "Input is a CSV export from Jira filtered to the current Sprint, plus the Sprint Goal text from the Sprint Planning notes. Customer names are replaced with segment identifiers before upload."

Common failure (set-and-forget Automate): Teams define inputs once and never revisit them.
If your Input specification is six months old and your tooling changed twice, you have an Automate workflow running on stale assumptions. Automate only works if you set rules and audit results. Otherwise, you get invisible drift.

3. Outputs: What Does "Good" Look Like?

The following five checks are the default quality bar inside the Outputs element of every canvas. Adapt them to your context to escape the most common lie in AI-assisted work: "I will know a good output when I see one."

  • Accuracy: Factual claims trace to a source. Numbers match the input data.
  • Completeness: Includes all mandatory items defined in your Output element.
  • Audience fit: Written for the specified audience (non-technical stakeholders, team-internal, leadership).
  • Tone: Neutral, no blame, no spin attempt, no marketing language where analysis was requested.
  • Risk handling: Uncertain items are flagged, not buried. Gaps are visible, not papered over.

Purpose: Define the format, structure, and quality criteria before you prompt, not after you see the result.

What to decide:

  • What format and length? (300 words, structured by Sprint Goal, three sections.)
  • What must be included? (Items completed, items not completed with reasons, risks surfaced.)
  • What quality standard applies? (Use the five criteria above as a starting point.)

Example: "250-to-350 words structured as: (A) Sprint Goal progress with items completed, (B) items not completed with reasons, (C) risks surfaced during the Sprint. Written for non-technical stakeholders."

Common failure (standards drift): Practitioners define output expectations after they see the output. They adjust their standard to match what AI produced. You would never accept a Definition of Done that said "we will know it when we see it." Do not accept that from your AI workflows either.

4. Validation: Who Checks, and Against What?

This is the element that exposes whether your Assist classification was honest or aspirational.
Purpose: Specify how the human verifies the output, separating automated checks from judgment calls.

What to decide:

  • What can be checked mechanically? (Formatting compliance, length, required sections present.)
  • What requires human judgment? (Accuracy of claims, appropriateness for the audience, context AI does not have.)
  • What is the validation standard? (Spot-check a sample? Verify every claim? Cross-reference against source data?)

Example: "Spot-check the greater of 3 items or 20% of items (capped at 8) against the Jira export. Verify the 'not completed' list is complete. Read the stakeholder framing out loud: would you say this in the meeting? If two or more spot checks fail, mark the output red and switch to the manual fallback."

Not every output is pass/fail. Use confidence levels to handle the gray zone:

  • Green: Validation checks pass. Safe to publish externally.
  • Yellow: Internal use only. If a yellow-status output reaches a stakeholder, that is a Failure Response event. Yellow means "human rewrite required," not "AI-ish is close enough."
  • Red: Stop rule triggers. Switch to manual fallback.

Common failure (rationalizing Avoid as Assist): When validation is difficult, the instinct is to ban AI from the task entirely. That is too binary. Weak validation is a design constraint, not a reason to avoid. Constrain the task so validation becomes feasible: require AI to produce a theme map with transcript IDs, enforce coverage across segments, spot-check quotes against originals. Reserve Avoid for trust-heavy, relationship-heavy, high social-consequence work: performance feedback, conflict mediation, sensitive stakeholder conversations.

5. Failure Response: What Happens When It Is Wrong?

Nobody plans for failure until 10 minutes before distribution, and then everyone wishes they had.

Purpose: Define the fallback before you need it.

What to decide:

  • What is the stop rule? (When do you discard the AI output and go manual?)
  • How far can an error propagate?
    (Does this output feed into another decision?)
  • Who owns escalation if the error is systemic?

Example: "If the recap misrepresents Sprint Goal progress, regenerate with corrected input. If inconsistencies are found less than 10 minutes before distribution, switch to the manual fallback: the Scrum Master delivers a verbal summary from the Sprint Backlog directly."

Common failure (set-and-forget Automate): Teams build AI workflows without a manual fallback. When the workflow breaks, nobody remembers how to do the task without AI. Every deployment playbook includes a rollback plan. Your AI workflows need the same.

6. Records: What Do You Log?

"We do not have time for that" is the number one objection. It is also how teams end up unable to explain their own workflows three months later.

Purpose: Make the workflow traceable, learnable, and transferable.

What to decide:

  • What do you store? (Prompt, Skill, input data, AI output, human edits, final version, who approved.)
  • Where do you store it? (Team wiki, shared folder, project management tool.)
  • Who owns the log?

Example: "Store the prompt text, the Jira export version, and the final stakeholder message in the team's Sprint Review folder. The Scrum Master owns the log."

Common failure (no traceability at all): The objection sounds reasonable until a stakeholder asks, "Where did this number come from?" and nobody can reconstruct the answer.

You do not need to log everything at the same level. Use a traceability ladder:

  • Level 1 (Personal): Prompt or Skill, plus final output, plus a three-bullet checklist you used to validate. Usually about 2 minutes once you have the habit.
  • Level 2 (Team): Add input source, versioning, and who approved. Usually about 5 minutes.
  • Level 3 (Regulated): Add redaction log, evidence links, and audit cadence. Required for compliance-sensitive workflows.

Start at Level 1. Move up when the stakes justify it.
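The spot-check rule used in the Validation examples ("the greater of 3 items or 20% of items, capped at 8") is concrete enough to encode. A minimal sketch, with a hypothetical helper name, for teams that want the rule in a validation script rather than a wiki page:

```java
class SpotCheck {
    // Sample size per the example rule: the greater of 3 items or 20% of
    // the items, capped at 8, and never more than the items available.
    static int sampleSize(int totalItems) {
        int twentyPercent = (int) Math.ceil(totalItems * 0.20);
        return Math.min(totalItems, Math.min(8, Math.max(3, twentyPercent)));
    }
}
```

For a 10-item Sprint the rule yields 3 spot checks; around 40 items it saturates at the cap of 8, which keeps validation effort bounded as backlogs grow.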
The Sprint Review Canvas: End to End

Applied to the Scrum Master's Sprint Review recap:

  • Task split: AI drafts; Scrum Master finalizes; Product Owner sanity-checks stakeholder framing.
  • Inputs: Jira export (current Sprint) + Sprint Goal + Sprint Backlog deltas and impediments.
  • Outputs: 250-to-350 words; sections: Goal progress / Not completed / Risks.
  • Validation: Spot-check max(3, 20% of items); check for missing risks; read framing aloud.
  • Failure response: Inconsistencies found: regenerate. Less than 10 min to deadline: manual verbal summary.
  • Records: Prompt/Skill + Jira export version + final message stored in the Sprint Review folder.

Why the A3 Handoff Canvas Matters Beyond Your Own Practice

In the AI4Agile Practitioner Report 2026, 54.3% of respondents named integration uncertainty as their biggest challenge in adopting AI [2]. It is not resistance, not tool quality, but uncertainty about how AI fits into existing workflows. The A3 Handoff Canvas addresses that uncertainty at the team level: it turns "we are experimenting with AI" into "we have a defined workflow for AI-assisted Sprint Review recaps, and anyone on the team can run it."

A filled-out A3 Handoff Canvas becomes an organizational asset. When a Scrum Master leaves, the canvases document how AI integrates into workflows. When new team members join, they see the boundaries between human judgment and AI on day one. When leadership asks "How is the team using AI?", canvases provide a credible answer.

The levels interact, though. A team with an excellent A3 Handoff Canvas but no organizational data classification policy will hit a ceiling on Inputs. An organization with a comprehensive AI policy but no team-level canvases will have governance on paper and chaos in practice. Also, 14.2% of practitioners report receiving no organizational AI support at all [2]. If that describes your situation, the A3 Handoff Canvas is not optional. It is your minimum viable governance until your organization catches up.
Conclusion: Try the A3 Handoff Canvas on One Workflow This Week

Pick one AI workflow you repeat every Sprint. Write down the six elements. Fill them out honestly. Then ask your team: Does this match how we work, or have we been running on implicit assumptions? If you get stuck on Validation or Failure Response, you found the weak point before it found you.

References

[1] Dell'Acqua, F., Ayoubi, C., Lifshitz, H., Sadun, R., Mollick, E., Mollick, L., Han, Y., Goldman, J., Nair, H., Taub, S., and Lakhani, K.R. (2025). "The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise." Harvard Business School Working Paper 25-043.

[2] Wolpers, S., and Bergmann, A. (2026). "AI for Agile Practitioners Report." Berlin Product People GmbH.

By Stefan Wolpers
