Spring Framework's JDBC core package, designed to simplify database interactions using JDBC, is a popular option for applications that persist data to a relational database. The central classes are JdbcClient, with its fluent API, and JdbcTemplate, with the older classic API. With these APIs alone, CRUD operations tend to be verbose. SimpleJdbcMapper mitigates this verbosity while staying out of the way, so you can keep using all the features of JdbcClient/JdbcTemplate.

How to Do CRUD Operations With SimpleJdbcMapper

Java
@Table(name = "product")
public class Product {

    @Id(type = IdType.AUTO_GENERATED)
    private Integer id;

    // maps to the 'sku' column in the database
    @Column
    private String sku;

    // maps to the 'product_name' column in the database
    @Column(name = "product_name")
    private String name;

    // maps to the 'description' column in the database
    @Column
    private String description;

    // maps to the 'available_date' column by default, using the
    // camelCase-to-underscore naming convention
    @Column
    private LocalDateTime availableDate;

    // maps to the 'price' column by default
    @Column
    private Double price;

    // no annotation, so this property is excluded from inserts/updates/queries
    private String someNonDatabaseProperty;

    // There are no mappings for relationships. SimpleJdbcMapper is not an ORM.
    ...
}

...
@Autowired
private SimpleJdbcMapper sjm;
...

Product product = new Product();
product.setSku("sku1");
product.setName("some product name");
product.setPrice(10.25);
product.setDescription("some description");
product.setAvailableDate(LocalDateTime.now());

// because the id type is AUTO_GENERATED, its value is assigned on insert
sjm.insert(product);

// find by id
product = sjm.findById(Product.class, product.getId());

// update the product
product.setPrice(11.50);
sjm.update(product);

// updateSpecificProperties() updates only the properties passed as arguments.
// Issues an SQL UPDATE for only the price and description columns.
product.setPrice(12.50);
product.setDescription("New description");
sjm.updateSpecificProperties(product, "price", "description");

// find all
List<Product> products = sjm.findAll(Product.class);

// delete by object
sjm.delete(product);

// delete by id
sjm.delete(Product.class, 5);

Querying

Spring's SimplePropertyRowMapper and BeanPropertyRowMapper do help reduce verbosity when querying, but they require that Java property names match the database column names. In most applications, DBAs determine the database naming convention, which generally does not match the Java property names. In that case, you have to write custom row mappers, or 'SELECT' statements that list every column with a column alias matching the corresponding Java property name. SimpleJdbcMapper simplifies this: the getBeanFriendlySqlColumns() method, as the name suggests, generates the column names and their corresponding column aliases so that the queries can be used with Spring's SimplePropertyRowMapper, BeanPropertyRowMapper, etc., to populate the models.

Java
String sql = "SELECT " + sjm.getBeanFriendlySqlColumns(Product.class)
    + " FROM product WHERE product_name = ?";

// Using the JdbcClient fluent API for the SQL above.
// Here JdbcClient uses SimplePropertyRowMapper behind the scenes.
List<Product> products = sjm.getJdbcClient()
    .sql(sql)
    .param("someProductName")
    .query(Product.class)
    .list();

// Using the JdbcTemplate API for the SQL above
List<Product> products = sjm.getJdbcTemplate()
    .query(
        sql,
        BeanPropertyRowMapper.newInstance(Product.class),
        "someProductName");

Spring row mappers like SimplePropertyRowMapper and BeanPropertyRowMapper use reflection (behind the scenes they rely on the BeanWrapperImpl class). For most use cases, their performance is good enough. Where performance is critical, you should write a custom row mapper.
Configuration

The SimpleJdbcMapper configuration is similar to that of JdbcTemplate/JdbcClient.

Properties files
# application.properties
# H2 database configuration. For other database configurations, see the SimpleJdbcMapper documentation at:
# https://github.com/spring-jdbc-crud/SimpleJdbcMapper#spring-bean-configuration-for-simplejdbcmapper
spring.datasource.jdbc-url=jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1
spring.datasource.driverClassName=org.h2.Driver
spring.datasource.username=sa
spring.datasource.password=

Java
@Component
public class SimpleJdbcMapperConfig {

    // see application.properties for the spring.datasource configuration
    @Bean
    @ConfigurationProperties(prefix = "spring.datasource")
    public DataSource sqlDataSource() {
        return DataSourceBuilder.create().build();
    }

    @Bean
    public SimpleJdbcMapper simpleJdbcMapper(DataSource dataSource) {
        // This configuration is for the H2 database.
        // See the SimpleJdbcMapper documentation for other database configurations.
        return new SimpleJdbcMapper(dataSource);
    }
}

Maven Details

SimpleJdbcMapper is available on Maven Central:

XML
<dependency>
  <groupId>io.github.spring-jdbc-crud</groupId>
  <artifactId>simplejdbcmapper</artifactId>
  <!-- check Maven Central or the SimpleJdbcMapper GitHub site and use the latest version -->
  <version>1.6.0</version>
</dependency>

Logging

Logging is done by the underlying Spring JDBC classes, so you can see the SQL being issued. You probably don't want to log the SQL parameters (the last entry below) in production unless you are troubleshooting something, because it will flood the logs with SQL parameter details.

Properties files
# log the SQL
logging.level.org.springframework.jdbc.core.JdbcTemplate=TRACE
# needed to log the INSERT statements
logging.level.org.springframework.jdbc.core.simple.SimpleJdbcInsert=TRACE
# log the parameters of the SQL statements
logging.level.org.springframework.jdbc.core.StatementCreatorUtils=TRACE

JdbcClient/JdbcTemplate

You can get the underlying JdbcClient/JdbcTemplate with the code below.
You can also create your own JdbcClient/JdbcTemplate and use it, since you already have the DataSource information.

Java
JdbcClient jdbcClient = sjm.getJdbcClient();
JdbcTemplate jdbcTemplate = sjm.getJdbcTemplate();
NamedParameterJdbcTemplate namedParameterJdbcTemplate = sjm.getNamedParameterJdbcTemplate();

Requirements

Note that you will need Java 21+ and Spring 6.1.4+ or Spring Boot 3.2.3+ to use the SimpleJdbcMapper library.

The library also provides other developer-friendly features. Its annotations allow automatic population of Created By, Updated By, Created On, and Updated On fields, so developers don't have to set those properties manually. It also provides optimistic locking via versioning, which is particularly useful in web applications to prevent multiple users from overwriting the same record with stale data.

Source Code and Instructions

SimpleJdbcMapper and the source code for this tutorial are both available on GitHub, along with instructions for running the code and links to the pertinent classes. The tutorial uses an embedded database, so no setup is required. When you run it, you will see all the SQL issued on the console.
Java has become one of the world's most versatile programming languages, chosen for its adaptability, stability, and platform independence. Its extensive ecosystem encompasses virtually every application type, from web development to enterprise solutions, game design, the Internet of Things (IoT), and beyond. With an estimated 51 billion active Java Virtual Machines (JVMs) globally, it is clear that Java powers a substantial portion of modern software infrastructure.

However, designing dynamic and visually engaging applications takes more than coding skills — it requires the right tools. Java Graphical User Interface (GUI) frameworks are essential tools that transform basic code into visually appealing, interactive applications. This article explores the best Java GUI frameworks, highlighting their unique strengths, limitations, and ideal use cases to help you choose the best fit for your next project.

What to Consider When Choosing a Java GUI Framework

Selecting the right GUI framework for Java is pivotal to creating applications that excel in functionality and user experience (UX), because each framework offers distinct features that cater to specific requirements. Here's a brief overview of some critical factors to consider:

Performance: Java GUI frameworks vary in their ability to handle resource-intensive applications. Some are optimized for faster execution and better memory management, while others may trade performance for ease of development.
Scalability: As your project grows, the GUI framework you choose should seamlessly support expansion. Some frameworks are particularly well-suited for applications that handle large datasets or high user volumes.
Cross-Platform Compatibility: Not all Java GUI frameworks perform equally across operating systems. While some are truly cross-platform, others may require additional adjustments.
Consider where your application will run and choose accordingly.
Ease of Use & Learning Curve: Frameworks vary in complexity. If you want faster adoption, look for strong documentation and community support.
Community & Support: An active user base and thorough documentation make development and debugging easier and ensure long-term maintainability.

8 Best Java GUI Frameworks for Modern Applications

Below is an overview of the top eight Java GUI frameworks, each with unique features, strengths, and ideal use cases.

1. Swing

Swing is one of Java's oldest and most widely used GUI frameworks. Built on top of the Abstract Window Toolkit (AWT), it provides a rich set of pre-built components such as buttons, tables, and lists.

Pros:
Highly customizable components for advanced UI designs
Platform-independent across operating systems
Part of the Java Standard Library, making integration easy

Cons:
Slower performance for highly graphical applications
Outdated look and feel without customization
Limited support for modern styling

Best Use Case:
Desktop applications requiring flexibility and cross-platform compatibility
Applications where a fully customizable UI is required

2. SWT (Standard Widget Toolkit)

Originally developed by IBM for the Eclipse IDE, SWT uses native OS widgets to provide a natural look and feel.

Pros:
Fast performance using native widgets
Platform-specific appearance
Strong support for productivity tools

Cons:
Less portable across platforms
Reliance on native libraries complicates distribution
More challenging to customize

Best Use Case:
Desktop applications that need to closely resemble native OS applications
Applications requiring native OS integration and high performance

3. JGoodies

JGoodies extends Swing with libraries such as JGoodies Forms and Binding to simplify layout management and data binding.
Pros:
Cleaner, more modern look than standard Swing
Powerful layout managers for complex UIs, including FormLayout
Simplified data binding and validation

Cons:
Requires additional libraries
Smaller community compared to JavaFX
Limited support for highly custom UI components

Best Use Case:
Business applications requiring advanced layouts and data validation
Projects where complex data binding and validation are required

4. JavaFX

JavaFX is a modern GUI framework designed for visually rich applications, with support for 3D graphics, media streaming, and many other advanced UI types.

Pros:
Strong multimedia and 3D support
Scene Builder simplifies UI design
High performance for complex visuals

Cons:
Steeper learning curve
Larger memory footprint than simpler frameworks
Limited support in legacy applications

Best Use Case:
Applications that require advanced graphics, animations, and media playback
High-performance desktop applications with modern UI design

5. JIDE

JIDE is an enterprise-grade GUI framework offering advanced components for data-intensive applications.

Pros:
Extensive enterprise-focused component library
High flexibility and customization
Ideal for complex data-driven UIs

Cons:
Overkill for small projects
Expensive licensing
Limited open-source community

Best Use Case:
Enterprise applications with complex UI requirements
Data-intensive applications with advanced interaction needs

6. Apache Pivot

Apache Pivot is an open-source GUI toolkit using XML-based layouts.

Pros:
Lightweight and cross-platform
Simple XML UI definitions
Small footprint, ideal for less resource-intensive applications

Cons:
Limited advanced UI components
Smaller community
Lower performance for complex visuals

Best Use Case:
Lightweight cross-platform applications needing easily maintainable UIs
Projects that prioritize simplicity over advanced visual features

Note: Apache Pivot moved to the Apache Attic in January 2025.

7. Hibernate

Hibernate is an ORM framework, not a GUI framework, but it complements GUI development through robust data persistence and retrieval.

Pros:
Strong database integration and support for data handling
Reduces boilerplate code
Ideal for data-driven applications

Cons:
No UI components; primarily a database-oriented tool
Requires database expertise
Not suitable for graphical applications

Best Use Case:
Data-driven applications with minimal UI requirements
Backend-heavy applications that need strong database interaction

8. Spring

Spring is a versatile framework used primarily for backend development, but it can support GUI applications through integrations.

Pros:
Highly scalable
Strong backend and integration support
Extensive documentation and community

Cons:
Not GUI-focused
Complex for small applications
Steeper learning curve for full-stack integration

Best Use Case:
Enterprise applications requiring robust backend services with GUI integration
Web applications integrated with Java backend services
Tabular Comparison of the Best Java GUI Frameworks

Let's look at the critical elements of each Java GUI framework in tabular form:

Framework | Key Features | Best Use Cases | Pros | Cons
Swing | Pre-built components | Cross-platform desktop applications | Highly customizable | Lower performance for graphical apps
SWT | Native OS integration | Native-looking desktop apps | Fast, natural OS feel | Less portable across platforms
JGoodies | Basic, lightweight components | Business applications, advanced layout customization, complex data binding and validation | Up-to-date, clean look; powerful layout managers; validation tools | Unsuitable for apps requiring widespread use of custom UI components
JavaFX | 3D graphics, media support | Visually rich apps | Advanced graphics and multimedia | Steeper learning curve
JIDE | Enterprise-grade components | Data-driven enterprise applications | Extensive library for complex UIs | Costly advanced components
Apache Pivot | XML-based design, cross-platform | Lightweight cross-platform applications | Simple UI definitions | Limited high-performance features
Hibernate | Database binding | Data-heavy backend applications | Excellent for data handling | Primarily a database framework
Spring | Backend integration, scalability | Complex, large-scale enterprise applications | Extensive backend support | Not GUI-focused; complex setup

Conclusion

Choosing the right Java GUI framework is a decisive factor in your project's success. Each framework offers strengths tailored to specific use cases. Aligning your choice with requirements such as performance, scalability, cross-platform compatibility, and ease of use will help you build a robust, user-centric application that meets both current and future needs.
For many enterprises, the "lift-and-shift" (rehost) strategy remains the most pragmatic first step into the cloud. It offers speed and immediate data center exit capabilities without the complexity of refactoring applications. However, doing this manually for hundreds of workloads introduces human error, security gaps, and "migration fatigue." To solve this, we need to treat migration not as a series of manual tasks, but as a manufacturing process. We need a Migration Factory.

This article outlines an architectural blueprint for automating large-scale migrations using AWS Application Migration Service (MGN), orchestrated by Step Functions and CI/CD pipelines.

The Core Problem: The Semi-Automated Trap

AWS MGN is a powerful tool, but out of the box, it is only "semi-automated." You still need to:

Install agents manually on source servers
Monitor replication progress in the console
Manually launch test instances
Switch traffic for cutover

When you multiply these steps by 500 servers, you get inconsistent configurations, missed security tags, and blown timelines. The solution is to wrap AWS MGN in an orchestration layer that handles lifecycle state management.

The Architecture: Event-Driven Orchestration

The architecture below relies on decoupling the definition of the migration (the runbook) from the execution (the pipeline).

High-Level Workflow

1. The Migration Runbook (Source of Truth)

Instead of disparate spreadsheets, we define the migration wave in a structured JSON or CSV runbook stored in S3. This file acts as the infrastructure contract.

Sample Runbook Structure:

JSON
[
  {
    "hostname": "db-server-01",
    "wave_id": "wave-2",
    "target_instance_type": "r5.large",
    "subnet_id": "subnet-0abc123",
    "security_groups": ["sg-web", "sg-db"],
    "cutover_time": "2025-12-20T02:00:00Z",
    "tags": {
      "CostCenter": "Finance",
      "Environment": "Prod"
    }
  }
]

2. State Management with Step Functions

AWS Step Functions acts as the factory floor manager.
It handles long-running processes that Lambda cannot, such as waiting for initial data replication (which can take days) or polling for "Ready for Testing" states.

Key State Transitions:

Agent Installation: Connects to the source via SSM or SSH and installs the MGN agent
Replication Loop: Polling loop checking dataReplicationInfo.state
Launch Configuration: Pushes runbook settings (instance type, security groups) to the MGN launch template via the API
Test/Cutover Trigger: Executes launch logic based on the schedule

Automation Deep Dive: The Infrastructure Pipeline

Once replication is complete and the server is launched, we shift from "migration" tools to "DevOps" tools.

The Terraform Handoff

A common mistake is leaving the migrated server as a "ClickOps" artifact. To ensure the new environment is manageable, the Migration Factory triggers a CodePipeline job immediately after cutover. This pipeline reads the final state of the migrated instance (AMI ID, private IP) and commits it to a Terraform state file.

Python/Boto3 Logic to Update Launch Templates:

Python
import boto3

mgn_client = boto3.client('mgn')

def update_launch_config(source_server_id, runbook_data):
    """
    Updates the AWS MGN launch configuration based on runbook metadata.
    """
    response = mgn_client.update_launch_configuration(
        sourceServerID=source_server_id,
        targetInstanceTypeRightSizingMethod='NONE',
        targetInstanceType=runbook_data['instance_type'],
        copyPrivateIp=True,
        copyTags=True,
        launchDisposition='STARTED'
    )
    return response['ResponseMetadata']['HTTPStatusCode'] == 200

Post-Launch Automation (The "Day 1" Scripts)

A server isn't "migrated" just because it boots. It must be integrated into the cloud ecosystem.
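That integration can be scripted through SSM Run Command rather than ad hoc SSH. A minimal boto3 sketch, with the caveat that the specific shell commands, package names, and instance ID below are illustrative assumptions, not taken from the article; only AWS-RunShellScript is a standard AWS-managed document:

```python
def day1_commands():
    """Illustrative post-cutover shell commands (package names are assumptions)."""
    return [
        # MGN replication agent cleanup
        "sudo systemctl disable --now aws-replication-agent || true",
        # Observability: install and start the CloudWatch Agent
        "sudo yum install -y amazon-cloudwatch-agent",
        "sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a start",
    ]

def run_day1(instance_id, region="us-east-1"):
    """Fire the Day 1 bootstrap on a migrated instance via SSM Run Command."""
    import boto3  # imported lazily so day1_commands() stays testable offline
    ssm = boto3.client("ssm", region_name=region)
    return ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # AWS-managed document
        Parameters={"commands": day1_commands()},
        Comment="Migration Factory Day 1 bootstrap",
    )
```

Because the command list is plain data, the factory can keep it in the runbook alongside the launch settings and vary it per wave.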
Using AWS Systems Manager (SSM), we automate the following "Day 1" tasks immediately post-cutover:

Agent Cleanup: Uninstall the MGN replication agent and legacy VMware tools
Observability: Install the CloudWatch Agent and Fluent Bit
Security Hardening: Join the domain controller and apply Group Policies
License Switching: For SQL Server, automate the switch from BYOL (Bring Your Own License) to AWS License Included (LI), if required, to optimize costs

Security and Governance Patterns

In a factory model, security must be baked in, not bolted on.

Isolation: The factory operates in a dedicated "Migration VPC" with private subnets. Replication traffic flows over Site-to-Site VPN or Direct Connect — never the public internet.
Encryption: All data in transit is encrypted via TLS 1.2. Data at rest (EBS volumes) is encrypted using AWS KMS keys managed by the factory.
RBAC: The automation pipeline uses IAM roles with least-privilege access. Developers trigger migrations by uploading a file to S3, never by logging into the console.

Results: The Efficiency Gains

Implementing a factory model yields measurable improvements over manual migration:

Speed: Provisioning time reduced by 50% compared to manual lift-and-shift
Reliability: Migration success rates typically exceed 99% due to the elimination of manual configuration errors
Cost: "Wait time" is eliminated. Servers are spun down immediately after testing, and cutovers are executed precisely on schedule, minimizing parallel run costs

Conclusion

Building a Migration Factory requires upfront investment in code and architecture, but for fleets larger than 50 servers, the ROI is immediate. By orchestrating AWS MGN with Step Functions and Terraform, you transform a chaotic data center exit into a predictable, boring, and successful engineering event. The goal is simple: one click to start, zero touches to finish.
Modern microservices face a common challenge: managing many tasks simultaneously without putting too much pressure on downstream systems. Tuning traditional thread pools often involves a lot of guesswork, which usually doesn't hold up in real-world situations. However, with the arrival of virtual threads in Java 21 and the growth of AI-powered engineering tools, we can create smart concurrency adapters that scale safely and intelligently.

This article provides a step-by-step guide to a practical proof of concept using Spring Boot that employs AI (OpenAI/Gemini) to assist in runtime concurrency decisions. It also integrates virtual threads and bulkheads to balance throughput against the safety of downstream systems.

Why Concurrency Decisions Need Intelligence, Not Just Thread Pools

Spring Boot microservices often execute parallel fan-out, meaning they make several downstream calls for each incoming HTTP request. In the past, developers adjusted, based on gut feeling:

Thread pools
Executor settings
Bulkheads and timeouts

This approach is fragile when traffic, latency, or downstream variability changes. Even with virtual threads, which remove strict limits on thread counts, services still need protections to avoid:

Overloaded databases
Thread scheduling conflicts
Retry storms
Poor tail latency

This is where AI can assist, by offering contextual suggestions instead of fixed configurations.

Solution Summary

Our proof of concept includes three key elements:

Spring Boot with virtual threads enabled. This utilizes Java 21's lightweight thread features to prevent blocking I/O from overwhelming the server.
AI-driven concurrency advisor. A modular component that interacts with OpenAI-compatible endpoints or Google's Gemini to suggest a maximum concurrency limit (maximum concurrent requests).
Bulkhead pattern implemented with semaphores.
This guarantees that only the recommended number of tasks run at the same time.

The objective: let AI help identify the concurrency level that a specific workload can handle.

Architecture

Here's how the request flows:

The client calls /api/aggregate?fanout=20&forceAi=true.
The controller sends the fan-out information to the AI Concurrency Advisor.
The advisor uses either the AI provider or a heuristic fallback.
It returns a JSON object containing maxConcurrency.
A semaphore bulkhead is established.
Tasks are processed on virtual threads.
Responses are gathered and sent back.

The advisor does not run threads — it merely suggests limits.

Implementation Details

Enabling Virtual Threads

The application.yml configuration in Spring Boot enables virtual threads:

YAML
spring:
  threads:
    virtual:
      enabled: true

This ensures that the framework processes request handling and asynchronous tasks on virtual threads by default.

AI Concurrency Advisor

We define an AiConcurrencyAdvisor interface. Implementations include:

OpenAI client
Gemini client
Heuristic fallback

Sample JSON prompt used in the OpenAI client:

JSON
{
  "model": "gpt-4.1-mini",
  "temperature": 0.1,
  "messages": [
    {"role": "system", "content": "You are a senior JVM performance engineer…"},
    {"role": "user", "content": "Operation: aggregate\nFanoutRequested: 50…"}
  ]
}

The service parses the JSON returned by the model and extracts a safe maxConcurrency value.

Bulkhead With Semaphore

Once a recommendation is received:

Java
Semaphore semaphore = new Semaphore(maxConcurrency);

Each downstream task acquires a permit before executing. This guarantees that only the recommended number of tasks run at the same time — even with an unlimited number of virtual threads.

Key Code Snippets

AI Advisor Interface

This abstraction makes AI optional, interchangeable, and secure.
Java
public interface AiConcurrencyAdvisor {
    AdvisorDecision recommend(AdvisorInput input, boolean forceAi);
}

Separates AI logic from business logic
Enables switching between Gemini, OpenAI, or a heuristic fallback
Keeps concurrency decisions testable and auditable

Advisor Input Model

The quality of AI decisions depends on the context you provide.

Java
public record AdvisorInput(
    String operation,
    int fanoutRequested,
    long expectedDownstreamLatencyMs,
    int cpuCores,
    Map<String, Object> hints
) {}

Rather than guessing concurrency limits, we supply:

Fan-out size
Latency expectations
CPU capacity
Workload hints

This mirrors how a senior engineer reasons about concurrency.

AI Decision Sanitization

Even AI recommendations must be constrained.

Java
int maxConcurrency = Math.max(1, Math.min(decision.maxConcurrency(), fanout));

Stops uncontrolled concurrency
Safeguards downstream systems
Guarantees AI output adheres to system rules

AI provides advice — the system makes the decision.

Service fan-out logic:

Java
try (ExecutorService vtExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    List<CompletableFuture<DownstreamResponse>> futures = new ArrayList<>(fanout);
    AtomicInteger idx = new AtomicInteger(0);
    for (int i = 0; i < fanout; i++) {
        futures.add(CompletableFuture.supplyAsync(() -> {
            boolean acquired = false;
            try {
                semaphore.acquire();
                acquired = true;
                int n = idx.incrementAndGet();
                return downstream.call("ds-" + n, id, latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return new DownstreamResponse("interrupted", "INTERRUPTED", 0);
            } finally {
                if (acquired) semaphore.release();
            }
        }, vtExecutor).orTimeout(3, TimeUnit.SECONDS));
    }
}

This approach combines virtual threads with a bulkhead, allowing blocking calls to scale safely.

Running the Project

Set optional environment variables for the AI provider.
Execute:

Shell
./gradlew bootRun

Test the endpoints:

Shell
curl "http://localhost:8080/api/aggregate?id=123&fanout=20"

Append &forceAi=true to enforce AI usage even if no key is set.

When to Use AI-Driven Concurrency

This approach is particularly beneficial when:

Downstream behavior is variable
Latency patterns are uncertain
Manual tuning is expensive
You need explicit backpressure decisions

AI suggestions must always be bounded and cross-checked against heuristics to guarantee safety when LLM responses are surprising.

Conclusion

This proof of concept shows how AI (Gemini/OpenAI) can help with Spring Boot concurrency design. It does not replace human judgment, but provides contextual recommendations based on workload characteristics. Paired with Java 21 virtual threads, this approach enables scalable, safe, and observable microservices.
Here's a problem I've seen happen far too often: your recommendation system is functioning, spitting out results in milliseconds, and meeting all its infrastructure SLAs. Everything looks rosy on the dashboards. Yet engagement has plummeted by 40% because your model has been useless for several weeks. According to your traditional error budget? You're golden. According to your product team? The system is broken.

ML systems fail in ways that classical SRE practices do not account for. A model does not "go down"; it gradually deteriorates. Data pipelines can be "working" while feeding garbage to the model. And you won't realize it until users start to complain or, worse, quietly depart.

The past few years spent breaking and fixing ML systems have taught me that we need to rethink how we define error budgets. Here's how it works.

Understanding the Limitations of Conventional Error Budgets

The challenge is that "reliability" in ML does not live on a one-dimensional spectrum. Your API can be functioning correctly while your model is not. Your model can be working correctly while your data pipeline feeds it stale features. Your aggregate numbers can look great while you treat some users unfairly.

What I've found is that you need to break down four different error budgets.
Mapping These to Actual Error Budgets

Before delving into each dimension, let me clarify how these map to conventional SRE error budgets — these are not merely health checks. For each dimension, you need:

• SLI (service level indicator): What you're measuring
• SLO (service level objective): Your target over time
• Error budget: How much you can miss the SLO before you take action

Here's what this looks like for model quality, with concrete numbers:

SLI: Accuracy of the model compared with the baseline, hourly
SLO: Accuracy ≥ 92% of baseline over a rolling 7 days
Error budget: 8% allowable degradation in 7 days
Burn rate: Monitor hourly; warn when burning above 10% of the budget daily

The main difference from a conventional error budget is that you're measuring degradation relative to a known-good state, as opposed to plain success or failure. The math is exactly the same in both cases — a time budget that gets spent whenever you miss your SLO.

Now, let's consider each dimension one by one:

1. Infrastructure Error Budget

These are your standard SRE metrics: uptime, latency, and request success rate. It's old news, but you should keep this as your baseline.

What I monitor: 99.95% availability, sub-150ms latency at p95, 99.9% request success rate

2. Model Quality Error Budget

This is where it gets interesting. You must specify at what point model degradation should start making noise.
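The rolling-window budget math from the model-quality example above fits in a few lines. A sketch using those numbers (92% accuracy floor, 8% weekly budget, 10% daily burn alert); the sample accuracy series is illustrative:

```python
WINDOW_HOURS = 7 * 24        # rolling 7-day SLO window
BUDGET_FRACTION = 0.08       # 8% of hours in the window may miss the SLO
DAILY_BURN_ALERT = 0.10      # warn when one day spends >10% of the weekly budget

def slo_met(current_acc, baseline_acc, floor=0.92):
    """SLI check: hourly accuracy vs. a known-good baseline."""
    return current_acc >= floor * baseline_acc

def daily_burn_rate(todays_hourly_acc, baseline_acc):
    """Fraction of the weekly error budget spent by today's bad hours."""
    budget_hours = BUDGET_FRACTION * WINDOW_HOURS  # ~13.4 allowable bad hours/week
    bad_hours = sum(1 for a in todays_hourly_acc if not slo_met(a, baseline_acc))
    return bad_hours / budget_hours

# Example: baseline accuracy 0.95; six hours today dipped to 0.85
# (0.85 < 0.92 * 0.95 = 0.874, so those six hours miss the SLO)
burn = daily_burn_rate([0.95] * 18 + [0.85] * 6, baseline_acc=0.95)
alert = burn > DAILY_BURN_ALERT
```

Six bad hours against a budget of roughly 13.4 burns well over the 10% daily threshold, so this day fires the warning.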
What I track:
• Model accuracy vs. baseline accuracy (typically up to 8% loss allowed)
• Percentage of low-confidence predictions
• Feature drift, detected via statistical tests on distributions

Here's how I quantify degradation:

Python
# Compare current performance with your known-good baseline
accuracy_degradation = (baseline_accuracy - current_accuracy) / baseline_accuracy
budget_burn_rate = accuracy_degradation / acceptable_degradation

Real example: accuracy decreased from 95% to 93%, and my threshold is 8% — that's a degradation of about 2.1%, consuming roughly a quarter of the budget.

For drift detection, I use the Kolmogorov-Smirnov test:

Python
# Check whether a feature's distribution has changed
from scipy.stats import ks_2samp
statistic, p_value = ks_2samp(baseline_features, current_features)
drift_alert = p_value < 0.05

One thing that bit me: tie your model accuracy metrics to business metrics. Rather than accuracy percentages, track something your PM cares about — for example, "click-through rate stays within 95% of baseline."

3. Data Quality Error Budget

Garbage in, garbage out. However, "garbage" needs a different definition for ML systems.

What matters:
• Feature completeness score (my target is 99%+)
• Feature freshness (how many features are stale?)
• Schema violations

Simple quality check:

Python
def simple_quality_check(missing_features, stale_features, total_features):
    missing_rate = missing_features / total_features
    stale_rate = stale_features / total_features
    data_quality_score = min(1 - missing_rate, 1 - stale_rate)
    return data_quality_score > 0.99  # meets the SLI?

Traditional data pipelines only cared about having a correct schema. With machine learning, you also need features that are fresh enough and distributions that look reasonably regular. I've been burned by pipelines that "worked" but delivered day-old data, making our model irrelevant.

4. Fairness Error Budget

Depending on your domain, fairness can be either desirable or mandatory. Either way, it should be tracked.
What I monitor:

• Differences in accuracy across demographic groups (my threshold is under 5%)
• False positive rate parity across segments

To calculate disparate impact:

Python

# Determine disparate impact
group_A_rate = predictions[group == 'A'].mean()
group_B_rate = predictions[group == 'B'].mean()
disparity = abs(group_A_rate - group_B_rate)
violation = disparity > 0.05  # flag if over 5%

There is no such dimension in traditional SRE because a traditional system is not involved in decisions about people. However, as soon as your machine learning system starts approving loans or ranking job candidates, you need to know whether it is treating people fairly.

Critical Caveats

Fairness metrics are extremely domain-specific and legally complex. The metrics presented here are only examples, and demographic parity is not the right goal for every problem. Before using fairness budgets:

• Discuss with lawyers how fairness is defined in your regulatory environment
• Coordinate with the product and policy teams on acceptable tradeoffs
• Confirm that you are permitted to store, process, or use sensitive attributes for monitoring purposes
• Do not use simplistic parity checks as the sole indicators of fairness

In regulated industries such as finance, healthcare, or hiring, you need expertise that goes beyond the capabilities of any framework.

How to Actually Implement This

Step 1: Determine How Reliability Applies in Your Business

Don't begin with metrics in mind. Begin with conversations instead. "What is a broken model in the eyes of my PM?" "What will make my users grumble?"
For an ML-driven search feature, you might choose:

• Infrastructure: Less than 200 ms (p95)
• Model quality: Relevance scores greater than 0.85 relative to human assessors
• Data quality: Less than 1% of queries missing critical features
• Fairness: Search diversity preserved across different user categories

Step 2: Establish Your Baseline

Run your system in a stable state for 30 days. Observe what "good" looks like.

Python

# Calculate your baseline during a stable period
baseline = {
    'accuracy': np.percentile(stable_metrics['accuracy'], 50),
    'p95_latency': np.percentile(stable_metrics['latency'], 95),
    'drift_threshold': calculate_drift_threshold(stable_features)
}

This becomes your north star. Everything else is measured against it.

Step 3: Define Ownership

This is crucial. Each dimension needs a clear owner with the authority to decide and act:

Infrastructure budget → SRE owns:
• Right to suspend deployments
• Authority to roll back changes
• Authority to scale infrastructure

Model quality budget → ML engineering owns:
• Authority to trigger retraining
• Authority to roll back to a previous model version
• Authority to increase monitoring frequency

Data quality budget → data engineering owns:
• Authority to halt data pipelines
• Authority to enable fallback data sources
• Right to reject upstream data

Fairness budget → ML + product + legal own together:
• Any action needs a multi-stakeholder decision
• Product evaluates business impact
• Legal specifies compliance requirements
• ML applies technical solutions

If budget constraints conflict, such that model quality is satisfactory but fairness is violated, the more constraining budget prevails. If you have depleted your fairness budget, you cannot keep serving predictions just because accuracy is satisfactory.

Step 4: Monitor Everything

Establish dashboards that track all four dimensions.
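The "more constraining budget prevails" rule is straightforward to encode. Here is a minimal sketch (the dimension names are from this article; the function is illustrative, not from any particular tool):

```python
# Sketch: enforcement is per dimension, and the most constraining
# budget prevails. A single exhausted budget fails the whole system.
def system_healthy(budget_ok):
    """budget_ok maps dimension name -> whether its budget still holds."""
    return all(budget_ok.values())

status = {
    'infrastructure': True,
    'model_quality': True,
    'data_quality': True,
    'fairness': False,  # fairness budget exhausted
}
print(system_healthy(status))  # False: good accuracy cannot override fairness
```

The point of keeping this boolean per dimension, rather than averaging, is that no amount of headroom in one budget can compensate for a violation in another.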
Here's how I calculate the composite health score:

Python

# Current health across dimensions
dimensions = {
    'infrastructure': 0.95,  # meeting 95% of SLO
    'model_quality': 0.88,   # at 88% of baseline
    'data_quality': 0.98,
    'fairness': 0.96
}

# Weight them according to what matters to your business
weights = {
    'infrastructure': 0.3,
    'model_quality': 0.35,
    'data_quality': 0.2,
    'fairness': 0.15
}

composite_score = sum(dimensions[d] * weights[d] for d in dimensions)

Critical note: The composite score is solely for executive visibility. Hard enforcement always happens per dimension. A 90% composite score does not excuse a violation in any single dimension: if you blow your fairness budget, you are in violation.

Step 5: Know What to Do When Budgets Blow Up

Record this list before you have a situation on your hands:

• Infrastructure budget exhausted: Stop deployments, roll back changes, check whether you need to scale
• Model quality budget exhausted: Kick off retraining, consider reverting to the previous model version, and look at what changed in your dataset
• Data quality budget exhausted: Check your upstream data sources, validate your ETL pipeline, turn on feature fallbacks if you have them
• Fairness budget exhausted: If it's bad, stop making predictions for the affected subgroups. Don't ship again until you figure out where the unfair bias was introduced and retrain.

A Real Example: Fraud Detection

Let me illustrate with a fraud prevention system I built for a fintech company.
Our error budgets:

• Infrastructure: 99.99% uptime, under 100ms at p95
• Model quality: Precision above 95%, recall above 90%, false positive rate below 2%
• Data quality: 99.5%+ feature completeness, <1% stale features
• Fairness: FPR differences across merchant types <3%

Here's what our monitoring code looked like:

Python

# Validate the health of each batch of predictions
def check_fraud_detection_health(predictions, features, ground_truth):
    # Did model quality degrade?
    current_precision = precision_score(ground_truth, predictions)
    precision_violation = (baseline_precision - current_precision) / baseline_precision > 0.02

    # Are features getting stale?
    stale_rate = features[features['age_hours'] > 24].shape[0] / len(features)
    data_violation = stale_rate > 0.01

    # Fairness issues across merchant types?
    fprs = calculate_fpr_by_category(predictions, ground_truth)
    fairness_violation = max(fprs.values()) - min(fprs.values()) > 0.03

    return any([precision_violation, data_violation, fairness_violation])

The interesting part: all of these dimensions are tested on every prediction batch. That helps you detect issues early, as data quality problems often become evident before they affect model performance.

A Few Things I've Learned

Use Rolling Windows Where Time-Based Budgets Are Required

Calendar-month budgets don't work well for ML. You may have one bad week while retraining a model, and that shouldn't condemn the rest of the month. I use 7-day rolling windows instead — still time budgets, but with a sliding window.

Python

from collections import deque

# Keep 7 days * 24 hours of hourly measurements
measurements = deque(maxlen=168)
measurements.append({'timestamp': now, 'accuracy': current_accuracy})

avg_accuracy = sum(m['accuracy'] for m in measurements) / len(measurements)
budget_ok = avg_accuracy >= target_accuracy

This provides a buffer for recovering from transient problems without declaring bankruptcy for the month.
You're still measuring reliability over time (the point of error budgets), but the window slides smoothly rather than resetting each month.

Budget According to What Is Happening

During a large product rollout, I'll tighten model quality budgets (we can't have the model embarrassing us during peak traffic) while relaxing latency requirements slightly. It's fine to adjust budgets based on context; just be sure to record the reasoning behind each adjustment as it happens.

Be Alert for Cascading Failures

"Garbage in, garbage out" applies here, too: bad input data leads to bad model output, which in turn triggers more retries and fallbacks, and thus more load on the infrastructure. This is where per-dimension budgets come in handy: they let you zero in on where the problem actually started.

Wrapping Up

Conventional error budgets account for infrastructure failures, such as servers becoming unavailable and requests timing out. They fail to account, however, for the ways ML fails: model drift, pipelines serving stale features, and predictions biased against particular user segments.

This framework catches these failures early. By monitoring model quality degradation over time, you address the issue before it affects users. By monitoring data freshness, you identify pipeline failures before they reach your predictions. By monitoring fairness, you identify bias before it turns into a compliance issue.

The actual reliability gains come from three sources:

• Earlier detection: You spot degradation trends before outages
• Root cause clarity: When quality drops, you know whether it's the infrastructure or the data
• Clear accountability: Every dimension has an owner with clear authority to act

Start with the infrastructure and model quality budgets. Get comfortable with tracking a baseline and calculating burn rate.
Once you're comfortable with that, add data quality tracking. Save fairness tracking for last: it is the most complex dimension and the most domain-dependent. Your specific metrics will differ from mine. A recommendation system can tolerate more accuracy variation than a fraud detector. However, the model of four dimensions, time-based budgets, and clearly stated ownership has proved effective across every ML system I have worked on. The aim is not to prevent all model deterioration. It is to detect it, understand why it happens, and have the authority to correct it before it shatters user trust.
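As a recap, the whole four-dimension check fits in a small skeleton. This is a hedged sketch: the metric names and thresholds below are illustrative, not prescriptive.

```python
# Illustrative skeleton of the four-dimension reliability check.
def check_all_budgets(metrics, slo):
    """Return per-dimension pass/fail; enforcement stays per dimension."""
    return {
        'infrastructure': metrics['availability'] >= slo['availability'],
        'model_quality': metrics['accuracy'] >= slo['accuracy_floor'],
        'data_quality': metrics['completeness'] >= slo['completeness'],
        'fairness': metrics['max_group_disparity'] <= slo['disparity_cap'],
    }

results = check_all_budgets(
    metrics={'availability': 0.9996, 'accuracy': 0.93,
             'completeness': 0.995, 'max_group_disparity': 0.02},
    slo={'availability': 0.9995,
         'accuracy_floor': 0.874,  # 92% of a 0.95 baseline
         'completeness': 0.99, 'disparity_cap': 0.03},
)
healthy = all(results.values())  # True only if every budget holds
```

Each boolean feeds its own budget and its own owner; the `all()` at the end exists only for a dashboard-level summary.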
GraphQL and REST are two of the most popular API frameworks in use today, with GraphQL seeing increasing adoption. GraphQL was developed by Facebook to address key challenges in REST APIs, particularly in reducing latency for mobile applications. The debate over which is better continues within the developer community, with strong arguments on both sides. The choice between GraphQL and REST depends on the type of application being built. GraphQL addresses latency issues caused by over-fetching and under-fetching, and provides built-in support for real-time updates. On the other hand, REST leverages HTTP constructs for caching, rate control, and telemetry, allowing these concerns to be handled at the platform layer rather than the application layer. Let's explore the pros and cons of GraphQL and what an architect should evaluate when making a decision for their project.

GraphQL Pros

Avoidance of Data Over-Fetching/Under-Fetching

One of GraphQL's biggest advantages is preventing over-fetching of data. In REST, a single API endpoint may return more data than needed for certain clients, especially in mobile applications with limited UI real estate. This issue is typically addressed in REST using the Backend-for-Frontend (BFF) pattern, but there is no standard BFF framework. GraphQL allows clients to specify exactly which fields they need, preventing unnecessary data retrieval.

Example

• A web app might request a user's name, address, and email.
• A mobile app might request only the user's name to optimize bandwidth.

Benefit

Conserves bandwidth and reduces latency.

Reduced Latency

REST does not have a built-in orchestration layer, meaning a client might need to call multiple REST APIs to fetch all required data. With GraphQL, a single endpoint can handle multiple entities at once, reducing network calls.
GraphQL achieves this through GraphQL servers like Apollo Server, Hasura, and GraphQL Mesh, which resolve queries, orchestrate backend calls, and return a unified response. This could be achieved in REST via the BFF pattern, but there is no standard implementation available.

REST Example

A client fetching user data and order history would need:

• /users/{id} → Fetch user details
• /orders/{userId} → Fetch order history

GraphQL Example

Plain Text

{
  user(id: 1) {
    name
    email
    orders {
      orderId
      totalPrice
    }
  }
}

Benefit

Reduces multiple network calls to a single request, lowering latency.

Real-Time Updates (Subscriptions)

One of GraphQL's biggest pros is its support for real-time updates via subscriptions, allowing clients to receive automatic updates when data changes.

Example Use Cases

• Live stock market updates
• Chat applications
• Real-time notifications

GraphQL subscriptions typically use WebSockets, though alternatives like Server-Sent Events (SSE) and polling are also possible.

Benefit

Efficient real-time data handling, better than REST-based polling.

Decoupling of Frontend from Backend Implementation

GraphQL actors (e.g., clients, servers, resolvers) interact with data as an entity graph, vis-à-vis a REST API, which is based on structured endpoints. Thus, GraphQL enables greater decoupling of the frontend from the backend implementation.

Benefit

Frontend teams can evolve independently of backend teams without requiring backend API changes.

Versioning

In REST, API versioning (/v1/users, /v2/users) is required when changing or deprecating fields. GraphQL eliminates versioning by marking deprecated fields without breaking existing clients:

Plain Text

type User {
  id: ID!
  name: String
  email: String @deprecated(reason: "Use contactEmail instead")
}

Benefit

Backend modifications do not force frontend updates.

GraphQL Cons

With that, let's look at GraphQL's cons.
One of the biggest issues is the impact on the HTTP constructs that web applications have traditionally relied upon for cross-cutting concerns like telemetry, caching, rate limiting, and security.

Caching

REST leverages URL-based caching at the CDN, API Gateway, reverse proxy, and browser levels.

Example

You can have separate caching policies for the following endpoints based on URL pattern, with the cache key generated from the URL pattern and query params:

• /locations?city=NewYork
• /restaurants?category=Italian

GraphQL, however, uses a single endpoint, making traditional URL-based caching ineffective. We could overcome this by using HTTP GET for queries (GET cannot be used for mutations). Caching can then be governed based on query params. However, this approach can become very complex because there may be several variations of the query for the same entities, and even within the data properties of entities. Without careful design, a caching policy built this way may become infeasible or less effective than in REST. In addition, developers must be careful not to use GET for large queries, as browsers enforce size limits on query parameters. This becomes a big challenge when an architect wants to delegate caching to one of the network layers (e.g., a CDN).

This issue can be partially resolved using GraphQL's persisted queries. In persisted queries, the client passes a query ID rather than the whole query. This addresses both query variability and the query-param length issue. However, the GraphQL server now needs to store the client query schema, introducing a degree of coupling between frontend and backend. Tools like Apollo GraphQL provide sophisticated caching capabilities at the client or server. But the CDN (edge) caching challenge remains, even though Cloudflare and Akamai now offer better support for GraphQL caching.
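The persisted-query flow just described can be sketched in a few lines. Everything here is illustrative: real implementations (for example, Apollo's automatic persisted queries) use query hashes and a negotiation protocol rather than a hand-maintained dictionary.

```python
# Sketch: the client sends a short, stable query ID; the server maps it
# back to the full GraphQL document. The ID also works as a cache key.
PERSISTED_QUERIES = {
    "getUserWithOrders_v1": (
        "{ user(id: 1) { name email orders { orderId totalPrice } } }"
    ),
}

def resolve_persisted(query_id):
    """Look up the full query registered for this ID."""
    query = PERSISTED_QUERIES.get(query_id)
    if query is None:
        raise KeyError(f"Unknown persisted query: {query_id}")
    return query

# A request like GET /graphql?id=getUserWithOrders_v1 stays short and
# CDN-cacheable, at the cost of the server knowing the client's queries.
full_query = resolve_persisted("getUserWithOrders_v1")
```

The tradeoff mentioned above is visible in the sketch: the server-side registry is exactly the coupling between frontend and backend that persisted queries introduce.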
Furthermore, not all GraphQL implementations support HTTP GET, relying on POST for both queries and mutations, which renders this approach infeasible. This could be worked around by sending query params (say, the "operation name") in a POST request, but that would be a non-standard implementation and an anti-pattern.

Summary

Due to GraphQL's single-endpoint structure, caching must often be handled at the application level (client and server) rather than being delegated to platform layers like CDNs or reverse proxies (e.g., NGINX), as is common with REST APIs.

Telemetry

REST relies on structured URLs (/users, /orders), making it easy to monitor performance per endpoint. Since GraphQL uses one endpoint (/graphql), API Gateways and CDNs cannot track performance per entity. These stats need to be captured at the GraphQL application layer, and many telemetry tools now provide query-level insights. CDN tools like Akamai provide telemetry for GraphQL by analyzing the operationName param in the request. Persisted queries also improve telemetry support. Yet telemetry as fine-grained as REST's is not possible at the CDN/Load Balancer/API Gateway layers.

Summary

GraphQL telemetry must be handled at the application level, unlike REST, where platform-level monitoring is possible.

Rate Limiting

Rate limiting can be done at the CDN or API Gateway layer for specific REST APIs at the URL level. REST APIs allow rate limiting per endpoint (e.g., /users limited to 1,000 requests per hour), and this can be implemented either at the application layer or the platform layer (CDN, API Gateway). GraphQL, by contrast, requires analysis of query complexity, as a single request can return vast amounts of data.
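To make the complexity analysis concrete, here is a naive sketch of a depth limit. Real GraphQL-aware gateways parse the query AST and also weigh per-field costs; counting brace nesting merely illustrates the idea.

```python
# Sketch: estimate query depth by brace nesting and reject deep queries.
def query_depth(query):
    depth = max_depth = 0
    for ch in query:
        if ch == '{':
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == '}':
            depth -= 1
    return max_depth

MAX_DEPTH = 5
q = "{ user(id: 1) { orders { items { product { reviews { text } } } } } }"
allowed = query_depth(q) <= MAX_DEPTH  # depth 6 -> rejected
```

A depth cap like this is the simplest form of GraphQL rate limiting; production gateways extend it with cost scoring per field and per client.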
Hence, rate limiting needs to be based on query depth or complexity, handled by GraphQL-aware gateways like Apollo or Hasura, which is a much more complicated setup than REST.

Summary

GraphQL rate limiting requires more advanced logic and is best implemented at the application layer, where more options are available.

Limited Platform Support

Cloud and CDN support for GraphQL is getting stronger, but gaps remain. For example, AWS API Gateway does not support input validation for HTTP APIs (commonly used for GraphQL) as it does for REST APIs, and it does not support resource policies for HTTP APIs for access control (for example, restricting access by IP).

Summary

Platform support for GraphQL is growing but is more mature for REST.

Security

Poorly designed GraphQL applications can introduce several security vulnerabilities. Schema introspection exposes API details that malicious actors can use to mount attacks; this can be mitigated by disabling introspection in production. Deep queries or query variations can cause Denial-of-Service (DoS) attacks by bypassing caching. Strictly persisted queries offer the most secure solution, but introduce a level of coupling between frontend and backend.

Authorization in REST can be implemented per endpoint, at the API Gateway or CDN layer. In GraphQL, this must be done at the application level to determine which queries in a request require authorization.

Summary

REST security can be enforced at platform layers (API Gateways and WAFs), but GraphQL requires resolver-level security and more careful design.

Error Handling

REST APIs return standard HTTP status codes:

• 200 OK → Success
• 404 Not Found → Invalid request
• 500 Internal Server Error → Server issue

GraphQL always returns HTTP 200, even for errors, embedding error messages inside the response body rather than using standard HTTP status codes.
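Because of this, a GraphQL client cannot rely on the status code alone; it has to inspect the response body. A minimal sketch of that check:

```python
# Sketch: a 200 response can still carry errors in the body, so the
# client checks the 'errors' array explicitly.
def graphql_failed(response_json):
    return bool(response_json.get("errors"))

ok = {"data": {"user": {"name": "Alice"}}}
failed = {"data": None,
          "errors": [{"message": "User not found", "code": "NOT_FOUND"}]}

print(graphql_failed(ok))      # False
print(graphql_failed(failed))  # True, despite HTTP 200
```

Monitoring tools need the same adaptation: alerting on 5xx counts alone will miss every GraphQL-level failure.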
This behavior requires additional handling in monitoring tools and client applications, which traditionally rely on HTTP status codes (404 Not Found, 500 Internal Server Error) to determine API failures.

Plain Text

{
  "data": null,
  "errors": [
    {
      "message": "User not found",
      "code": "NOT_FOUND"
    }
  ]
}

Summary

Traditional HTTP-based monitoring tools need customization for GraphQL error handling.

Conclusion

There is no one-size-fits-all framework. GraphQL is a powerful API framework with multiple benefits:

• Frontend-backend decoupling
• Reduced latency for mobile applications and data-intensive web applications
• Resolution of API versioning issues

However, using GraphQL forfeits several of the advantages REST accrues from HTTP constructs for caching, telemetry, and security, which are typically handled at the platform level — CDN, Load Balancer, API Gateway, or Reverse Proxy — in REST. With GraphQL, these concerns often shift to the application layer, requiring additional effort from developers.

Use GraphQL If

• Your app requires complex or multi-entity data fetching
• You need real-time updates
• You want frontend flexibility and backend decoupling

Use REST If

• You need platform-level caching, telemetry, and security
• You want simpler rate limiting and security enforcement
• You require standard HTTP-based error handling
• Your app has moderate data requirements

Finally, for many projects, GraphQL and REST can coexist, allowing you to leverage the strengths of each.
TL;DR: The A3 Framework

The A3 Framework categorizes AI delegation before you prompt: Assist (AI drafts, you actively review and decide), Automate (AI executes under explicit rules and audit cadences), or Avoid (stays entirely human when failure would damage trust or relationships). Most AI training teaches better prompting. The A3 Framework teaches the prior question: should you be prompting at all? Categorize first, then prompt.

The A3 Framework Origins

When agile practitioners try using AI for something important and it goes sideways (a wrong tone in a stakeholder email, a hallucinated dependency in a status report, or generic acceptance criteria that miss the point entirely), the problem typically is not the AI. The problem is ad hoc delegation on the human side.

When you decide in the moment whether to use AI (based on time pressure, curiosity, or convenience), you are gambling. Sometimes it pays off. Sometimes you send a message that damages a relationship you spent months building. (Remember the asymmetry between trust-building and trust-destruction?)

What is missing is a decision system: a way to categorize tasks before you open ChatGPT or Claude, so you know in advance what the AI is allowed to touch, what role you will play, and what stays entirely human. That system is the A3 Framework: Assist, Automate, Avoid.

Why Categorization Comes Before Prompting

Most AI training starts with prompts. It teaches how to write better instructions, structure context, and iterate. That is useful, but it skips the prior question: should you be prompting at all? The A3 Framework forces that question first. Before you type anything, you categorize the task into one of three buckets. Each bucket has different rules for AI involvement, different human responsibilities, and different failure modes. Once you know the category, the prompting decisions become obvious.

Applying the A3 Framework is neither bureaucracy nor governance overreach. It is professionalism.
Surgeons do not decide mid-operation whether to sterilize instruments. They have protocols that front-load decisions so they can focus on judgment when it matters. The A3 Framework does the same for knowledge work. Let us explore:

ASSIST: AI Drafts, You Decide

Assist creates the most value for agile practitioners. In Assist mode, AI generates options, drafts, or analyses. You retain full decision-making authority. The AI handles the blank page problem. You supply the judgment.

• For Scrum Masters: You might ask AI to generate three alternative Retrospective formats based on the symptoms you describe (e.g., low energy, recurring complaints, surface-level discussion). The AI proposes options you had not considered. You evaluate them against what you know about this specific team's dynamics, pick one, and adapt it in the room. The AI expanded your options; it did not choose for you.
• For Product Owners and Managers: You might ask AI to draft acceptance criteria for a work item, or to suggest edge cases you might have missed. The AI produces five criteria in seconds. You review them, realize two are redundant, one contradicts a technical constraint the AI does not know about, and one surfaces a gap in your thinking. You edit accordingly. The draft accelerated your work; the judgment remained yours.
• For Agile Coaches: You might ask AI to analyze patterns across six months of Retrospective notes or team health survey data. The AI surfaces themes: recurring blockers, declining engagement in certain ceremonies, and increasing mentions of a specific dependency. You validate those patterns against what you have observed directly, then decide whether they warrant intervention. The analysis was AI; the interpretation was human.

The failure mode in Assist is rubber-stamping. When you accept AI output without genuine review, you have outsourced the thinking you were supposed to retain. Assist requires active engagement, not passive acceptance.

Assist tagline: AI expands options. You own the outcome.

AUTOMATE: Execution Under Constraints

Automate is for tasks where AI handles end-to-end execution. You set rules and audit results. Automate is not abdication. It is delegation with guardrails.

• For Scrum Masters: Meeting summaries and action item extraction are classic automation targets. You configure a workflow: transcribe the Sprint Review, extract new market developments mentioned, format them as bullet points, and post to the team channel. The AI executes this repeatedly without your involvement. But you have established a "human in the loop before publish" checkpoint, and you audit a sample weekly to catch drift.
• For Product Owners and Managers: Release note drafts can be automated from merged pull requests and Jira transitions. The AI assembles what was shipped, formats it for stakeholders, and queues it for your review. You audit for accuracy and tone before sending, but you are not writing from scratch each Sprint.
• For Agile Coaches: Trend detection across survey responses or ticket metadata can surface possible bottlenecks without you reading every data point. The AI flags anomalies; you investigate and decide whether they warrant action.

The failure mode in Automate is set-and-forget. Automation without monitoring becomes invisible drift. The AI might start hallucinating dependencies, misclassifying sentiment, or producing outputs that no longer match your standards. Audit cadences are not optional.

Automate tagline: Delegate execution, not responsibility.

AVOID: Professionalism, Not Fear

Avoid is where mature practitioners earn their keep. Avoid covers tasks that are too risky, too sensitive, or too context-dependent for AI involvement at any level.

Performance feedback requires reading emotional cues, understanding history, and calibrating the message to the recipient.
AI does not know that this Developer had a difficult quarter, that they respond better to direct challenge than gentle suggestion, or that their confidence is fragile right now. You do. Conflict mediation depends on relationships, subtext, and real-time adaptation. If you let AI summarize a conflict conversation, you risk turning lived nuance into permanent misrepresentation that one or both parties will dispute. Sensitive stakeholder communication is where tone miscalibration creates actual damage. If you let AI draft a message to a stakeholder you already have a fragile relationship with, you are not saving time. You are gambling with trust. One wrong phrase and months of relationship-building evaporate. The failure mode in Avoid is rationalization. You tell yourself the AI will ‘just create a starting point’ for that delicate email, but the starting point anchors your thinking, and fragments end up in the final version. Avoid means avoid. Not ‘assist but carefully.’ Avoid tagline: Some work stays human because the cost of failure is trust. The Cultural Benefit: Making Delegation Discussable The hidden value of the A3 Framework is not individual productivity. It is team culture. When your team shares a common vocabulary for AI delegation, the conversation shifts. Instead of suspicious questions (“Who used AI on this? Did you actually think about it?”), you get productive questions: “Which category is this work in? What guardrails do we need?” Without a shared framework, AI delegation remains implicit. When the A3 Framework is explicit, it becomes discussable. Teams can establish norms: “Acceptance criteria are Assist. Always human-reviewed before refinement.” “Stakeholder escalations are Avoid. No exceptions.” A3 also creates accountability. When something goes wrong, the framework surfaces why. Did you miscategorize the task? Did you skip the review step in Assist? Did you automate without an audit? The categories make failure analyzable rather than mysterious. 
Implementing the A3 Framework This Week

You do not need permission to start using A3. Here is how to begin:

• Day 1: List ten tasks you performed last week. Categorize each as Assist, Automate, or Avoid, and compare these to how you actually handled them. Notice where you used AI in Avoid territory, or avoided AI in Assist territory.
• Day 2–3: Pick one Assist-category task and run it properly. Draft with AI, then review with actual judgment. Notice the difference between rubber-stamping and genuine evaluation.
• Day 4–5: Identify one Automate candidate. Design the workflow: trigger, action, checkpoint, audit schedule. Do not deploy yet. Just design.
• End of the week: Share A3 with one colleague. Explain the categories. Ask them to categorize a task with you. Notice how the conversation changes.

Conclusion: Start Categorizing Before Prompting

The agile practitioners who will thrive with AI are not those who use it most. They are the ones who know when to use it, when to constrain it, and when to keep it out entirely. The A3 Framework gives you that judgment.
A couple of years ago, I developed a class-based Data Quality (DQ) framework in Scala. After some time, I became part of an accelerator framework that dealt with many data engineering tasks, with data quality being one set of activities. After considering multiple factors, it was decided that the framework would be built using Python. Given that we wanted to handle large datasets and ensure scalability, we decided to implement the data quality framework in PySpark. Using the Scala-based DQ framework as a stepping stone and foundation, I developed an equivalent DQ framework in PySpark. The framework was successful, with the DQ functionality implemented as a Python library and packaged as a wheel. After completing the implementation — and after coming across libraries like AWS Deequ — I wanted to understand and overcome the Scala Spark / Python Spark duality: why maintain two versions of the same library, one in Scala and one in Python? I read the relevant documentation and implemented another PySpark DQ library. This time, I wrote it as a wrapper around the underlying and existing Scala Spark library. This too was a success. The next natural step was to compare the performance of the PySpark library with the Scala Spark library and the PySpark wrapper library. It made sense to do this comparison. Why? Because Spark is written in Scala, and when we use PySpark, we run code in a Python VM, which in turn starts a JVM and communicates with it — adding a small delay, over and above the overhead introduced by Python itself. After running a couple of tests, I found that the performance of the PySpark program was (almost) on par with the Scala Spark version. Then the penny dropped. The reason for the similar performance is primarily because the heavy lifting of the DQ logic was being done in Spark, with Python providing a thin wrapper. Even the native PySpark version of the library is essentially a thin wrapper, with the actual work being done by Spark. 
After that (long context), coming back to the flavor of the season — vibe coding. One day, after launching Copilot on a whim, I asked it the following question:

```
Generate a class to perform arithmetic operations, written in Scala. This class has to be exposed as Pyspark object. Generate the pyspark class such that it uses the Scala class
```

Copilot gave me the following response (as is, no edits):

To create a Scala class for arithmetic operations and expose it to PySpark, you'll need to follow these steps:

---

### **1. Scala Class for Arithmetic Operations**

Create a Scala class that performs basic arithmetic operations:

```scala
package com.example.arithmetic

class ArithmeticOperations extends Serializable {
  def add(a: Double, b: Double): Double = a + b
  def subtract(a: Double, b: Double): Double = a - b
  def multiply(a: Double, b: Double): Double = a * b
  def divide(a: Double, b: Double): Double = {
    require(b != 0, "Division by zero is not allowed")
    a / b
  }
}
```

Compile this class into a JAR file and make sure it's available in your Spark environment (e.g., via `--jars` or `spark.jars.packages`).

---

### **2. Expose Scala Class to PySpark via Py4J**

In PySpark, you can access the Scala class using the JVM gateway:

```python
from pyspark.sql import SparkSession

class PyArithmeticOperations:
    # JVM
    self.jvm_obj = spark._jvm.com.example.arithmetic.ArithmeticOperations()

    def add(self, a: float, b: float) -> float:
        return self.jvm_obj.add(a, b)

    def subtract(self, a: float, b: float) -> float:
        return self.jvm_obj.subtract(a, b)

    def multiply(self, a: float, b: float) -> float:
        return self.jvm_obj.multiply(a, b)

    def divide(self, a: float, b: float) -> float:
        return self.jvm_obj.divide(a, b)
```

---

### **3. Usage Example in PySpark**

```python
spark = SparkSession.builder \
    .appName("ArithmeticExample") \
    .config("spark.jars", "/path/to/arithmetic-operations.jar") \
    .getOrCreate()

arith = PyArithmeticOperations(spark)

print("Add:", arith.add(10, 5))
print("Subtract:", arith.subtract(10, 5))
print("Multiply:", arith.multiply(10, 5))
print("Divide:", arith.divide(10, 5))
```

---

Would you like help compiling the Scala code into a JAR or setting up the PySpark environment to test this?

While I have not compiled this response, I am fairly confident it will work as expected — after fixing the definition of the Python class (as generated, it lacks an `__init__` method that accepts the SparkSession, even though the usage example calls `PyArithmeticOperations(spark)`). From my perspective, this response is a positive example of vibe coding. It should be noted that even in this case, I knew exactly what I wanted the GenAI assistant to do. Since I had already written similar code, I was confident in evaluating the generated output. Remember that giving proper directions to the assistant is key. Also remember that it is very unlikely that a single prompt will work for the problem at hand. You will need to engage with the GenAI assistant and keep refining the instructions until you either get what you want — or at least get something that satisfies most of your needs, which you can then build upon.
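For reference, here is a minimal sketch of the fix to the generated Python class, assuming (as the usage example implies) that the constructor should accept the SparkSession and instantiate the Scala object through the Py4J JVM gateway. The Scala package path `com.example.arithmetic` comes from the generated response; everything else is my own phrasing:

```python
class PyArithmeticOperations:
    """Thin Python wrapper around the (hypothetical) Scala class
    com.example.arithmetic.ArithmeticOperations, reached via Py4J."""

    def __init__(self, spark):
        # The missing piece in the generated code: take the SparkSession
        # and instantiate the Scala object through the JVM gateway.
        self.jvm_obj = spark._jvm.com.example.arithmetic.ArithmeticOperations()

    def add(self, a: float, b: float) -> float:
        return self.jvm_obj.add(a, b)

    def subtract(self, a: float, b: float) -> float:
        return self.jvm_obj.subtract(a, b)

    def multiply(self, a: float, b: float) -> float:
        return self.jvm_obj.multiply(a, b)

    def divide(self, a: float, b: float) -> float:
        return self.jvm_obj.divide(a, b)
```

With this constructor in place, `PyArithmeticOperations(spark)` from the usage example works as written, since each method simply delegates to the JVM-side object.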
If you’ve worked with large-scale machine learning pipelines, you know that one of the most frustrating bottlenecks isn’t always the complexity of the model or the elegance of the architecture — it's writing the output efficiently. Recently, I found myself navigating a complex data engineering hurdle where I needed to write a massive Pandas sparse DataFrame — the high-dimensional output of a CountVectorizer — directly to Amazon S3. By massive, I mean tens of gigabytes of feature data stored in a memory-efficient sparse format that needed to be materialized as a raw CSV file. This legacy requirement existed because our downstream machine learning model was specifically built to ingest only that format, leaving us with a significant I/O challenge that threatened to derail our entire processing timeline. Two critical, non-negotiable requirements defined the success of this architecture:

- Parallel writes: Essential to drastically reduce data writing time and to consistently meet our demanding production SLA.
- Strict row-order preservation: To ensure that incremental batches append correctly and preserve the vital record-to-record relationship with the reference dataset, the output sequence must remain entirely deterministic and reproducible across every execution.

What initially appeared to be a standard to_csv call rapidly evolved into an exhaustive technical deep dive into why distributed frameworks often struggle to manage the memory overhead and serialization complexities inherent in sparse DataFrames. Here is how I solved it, navigating a series of trials and errors.

The Attempts That Looked Promising — But Failed

Before arriving at a successful write, I tried what the internet says works and learned why it doesn’t (at scale). In a distributed environment, these commonly recommended alternatives break when sparse data is involved.
| Attempt | Expected Outcome | Actual Result |
| --- | --- | --- |
| Write directly using `.to_csv()` | Slow but worked | Out-of-memory after several hours. |
| Convert to Dask DataFrame and write partitions | Parallel writing | Still sequential. |
| Convert to Spark DataFrame and write | Fast distributed write operation | Pandas sparse dtype unsupported. |
| Convert sparse → dense → parquet → CSV via Spark | Fast distributed write with little conversion overhead | Sparse-to-dense conversion caused an OOM error. |
| ThreadPoolExecutor / Dask Distributed | Multithreaded and hence faster writes | Still serialized writes. |

Each potential path I explored ultimately broke due to a combination of architectural and resource-based limitations:

- Sparse DataFrame incompatibility: Most high-level distributed frameworks are optimized for dense, structured data and lack native, performant support for the specialized Pandas sparse dtype.
- Massive serialization overhead: Even when we accepted the serialization cost, the writes remained sequential, which breached the production SLA.
- Persistent sequential write behavior: Despite being in a distributed environment, many libraries reverted to a single-threaded write process when dealing with the complexities of these specific data structures.
- Explosive memory requirements: Converting a sparse Pandas object into a Spark or Dask object often forces a conversion to a dense DataFrame, or a massive serialization overhead that crashes the executor.

Even with maximum Glue resources, writing a single CSV file became a runtime showstopper.

The Breakthrough: Chunk + Parallelize at the Spark RDD Level

The solution was to stop trying to make Spark "understand" the sparse DataFrame and start using Spark as a distributed orchestrator. Instead of passing the data to Spark, I kept the DataFrame in memory (on a high-memory worker type) and used Spark's parallelism to distribute the instruction of writing, rather than the data itself.
Instead of trying to write the entire DataFrame at once, the dataset was:

- Split into fixed-length row chunks (50,000 per chunk worked well, but this can be configured based on the data volume).
- Each chunk was written independently as a separate CSV file.
- Write operations were distributed using rdd.parallelize().

By adopting this architecture, we achieved several critical technical milestones that high-level abstractions couldn't provide:

- Optimized parallel execution: We achieved true distributed execution across Spark executors, bypassing the traditional bottlenecks of single-threaded Python writes.
- Elimination of dense conversion: The workflow entirely avoided the "densification" trap, ensuring the data remained in its efficient sparse state throughout the entire I/O lifecycle.
- Guaranteed row-order preservation: Because the write operations were strictly ordered, we maintained the perfect row alignment essential for our reference-file processing.
- Predictable memory footprint: Each batch was isolated to a specific row count, allowing us to maintain a stable, predictable memory overhead per executor, regardless of the total dataset size.
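The "densification trap" mentioned above can be quantified with a quick back-of-envelope sketch. The shapes and storage models below are hypothetical illustrations, not measurements from the actual pipeline: they assume 8 bytes per dense float64 cell and roughly 12 bytes per stored nonzero in a COO/CSR-like sparse layout.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    # Every cell materialized: rows x cols x bytes per value
    return n_rows * n_cols * itemsize

def sparse_bytes(nnz, itemsize=8, index_bytes=4):
    # Roughly one value plus one index entry per stored nonzero
    return nnz * (itemsize + index_bytes)

# Hypothetical CountVectorizer output: 10M rows x 50k vocabulary, 0.1% density
rows, cols, density = 10_000_000, 50_000, 0.001
nnz = int(rows * cols * density)

print(dense_bytes(rows, cols) / 1e12)  # 4.0 -- about 4 TB if densified
print(sparse_bytes(nnz) / 1e9)         # 6.0 -- about 6 GB kept sparse
```

At these (assumed) shapes, densifying inflates the footprint by a factor of several hundred, which is why forcing a dense conversion on an executor ends in an OOM long before the write even starts.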
Working Code Snippet

```python
from datetime import datetime
import math

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Initialize the Glue/Spark environment
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Assumes sparse_pandas_df (the in-memory sparse DataFrame), bucket, and
# prefix are already defined by the surrounding Glue job.
def write_batch(batch_id):
    start = batch_id * batch_size
    end = min(start + batch_size, len(sparse_pandas_df))

    # Isolate the slice using index-based slicing (fast for sparse data)
    batch_df = sparse_pandas_df.iloc[start:end]

    # Define a deterministic path to maintain row-order lineage
    csv_path = f"s3://{bucket}/{prefix}/chunk-{batch_id}.csv"

    # Write the chunk to S3 without headers or index (pandas needs s3fs
    # installed to write directly to an s3:// path)
    batch_df.to_csv(csv_path, index=False, header=False)

# Configuration based on data volume and memory availability
batch_size = 50000
num_batches = math.ceil(len(sparse_pandas_df) / batch_size)

# The pivot: parallelize the task range (integers), not the dataframe itself
spark.sparkContext.parallelize(range(num_batches)).foreach(write_batch)
```

Performance Gains

After implementing this pattern, writing 20–40 GB of sparse features dropped from hours to minutes:

| Method | Time |
| --- | --- |
| Single CSV write | 4–6 hours (if it didn't crash) |
| Chunked parallel writes | Minutes |

Even 50–80 GB workloads completed successfully, still maintaining the row order critical for this implementation.

Performance Results

As mentioned in the previous section, after migrating to this chunked-parallel write, the performance gains were huge. And even as data volume grows further, we can handle it through horizontal scaling, adding more executors.

Final Thoughts for Data Engineers

When working with AWS Glue or EMR, we often feel pressured to use the highest-level abstractions (like DynamicFrames or SparkSQL). However, this challenge reminded me that sometimes the most performant solution involves dropping down to the RDD level to orchestrate standard Python logic.
If you're trying to:

- Write huge sparse Pandas DataFrames
- Maintain row ordering
- Avoid dense conversions
- Use parallel I/O efficiently

then don’t rely on Spark conversion or threaded Dask writes. Instead, chunk your data, then let Spark distribute the writing workload, not the DataFrame.
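The chunk-and-write pattern can be exercised locally without Spark or Glue. The sketch below is a stdlib-only illustration (the helper names and local file paths are mine, not from the Glue job): it splits rows into fixed-length chunks, writes each chunk to a deterministically named file, and then verifies that concatenating the chunks in batch-ID order reproduces the original row order, which is what keeps the parallel writes order-safe.

```python
import csv
import math
import os
import tempfile

def chunk_ranges(n_rows, batch_size):
    # Deterministic (batch_id, start, end) triples for fixed-length chunks
    for batch_id in range(math.ceil(n_rows / batch_size)):
        start = batch_id * batch_size
        yield batch_id, start, min(start + batch_size, n_rows)

def write_chunks(rows, batch_size, out_dir):
    # Each chunk is written independently to a deterministically named file;
    # in Glue, this body is what foreach() would run per batch id.
    for batch_id, start, end in chunk_ranges(len(rows), batch_size):
        path = os.path.join(out_dir, f"chunk-{batch_id}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows[start:end])

def read_back(out_dir, num_batches):
    # Concatenating files in batch-id order restores the original row order
    rows = []
    for batch_id in range(num_batches):
        with open(os.path.join(out_dir, f"chunk-{batch_id}.csv"), newline="") as f:
            rows.extend(csv.reader(f))
    return rows

out_dir = tempfile.mkdtemp()
data = [[str(i), f"feature_{i}"] for i in range(12)]
write_chunks(data, batch_size=5, out_dir=out_dir)
print(sorted(os.listdir(out_dir)))    # ['chunk-0.csv', 'chunk-1.csv', 'chunk-2.csv']
print(read_back(out_dir, 3) == data)  # True: row order preserved
```

Because chunk boundaries depend only on the batch ID and a fixed batch size, every execution produces the same file layout, so the order guarantee holds no matter how the writes are distributed.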
Abstract

The integration of AI-enhanced microservices within the SAFe 5.0 framework presents a novel approach to achieving scalability in enterprise solutions. This article explores how AI can serve as a lean portfolio ally to enhance value stream performance, reduce noise, and automate tasks such as financial forecasting and risk management. The cross-industry application of AI, from automotive predictive maintenance to healthcare, demonstrates its potential to redefine processes and improve outcomes. Moreover, the shift towards decentralized AI models fosters autonomy within Agile Release Trains, eliminating bottlenecks and enabling seamless adaptation to changing priorities. AI-augmented DevOps challenges the traditional paradigms, offering richer, more actionable insights throughout the lifecycle. Despite hurdles in transitioning to microservices, the convergence of AI and microservices promises dynamic, self-adjusting systems crucial for maintaining competitive advantage in a digital landscape.

In the realm of enterprise solutions, scalability has always been a unicorn of sorts. As someone who’s traversed the treacherous waters of software engineering for over a decade (and then some), I’ve seen frameworks come and go like fashion trends — what might be hot one season is passé the next. Yet, the SAFe 5.0 framework has emerged as a dependable ally in managing portfolio and solution trains at scale. And now, with the integration of AI-enhanced microservices, we’re not just talking about surviving; we’re talking about thriving in complexity.

The Realization: AI as a Lean Portfolio Ally

Let’s rewind to a pivotal moment in my career. I was leading a project where we were elbow-deep in transforming legacy systems into modern, scalable architectures. The client wanted speed — who doesn’t? — but we were drowning in manual decision processes. That’s when it struck me: AI could be the key to unlocking leaner portfolio management.
It wasn’t just about minimizing headcount or streamlining processes; it was about enhancing them with real-time insights. AI-driven microservices can be a game-changer for Lean Portfolio Management within SAFe. By optimizing decision analytics and enhancing value stream performance, AI simplifies, rather than complicates. I know what you’re thinking: AI tools can add complexity. One client put this to the test, and we found AI helped reduce the noise. It sliced through the data smog to identify hidden value streams and automate mundane tasks like financial forecasting and risk management. This leaner, meaner approach to portfolio management was an eye-opener.

Cross-Industry Crossover: Lessons from Automotive to Healthcare

Interestingly, inspiration turns up in the unlikeliest of places. During a project for an automotive client focused on predictive maintenance, a light bulb went on: the automotive industry’s approach to monitoring vehicle health could be applied in healthcare. This isn't as far-fetched as it sounds. For healthcare providers, predictive health monitoring bolstered by AI-enhanced microservices can personalize treatment plans for patients. This cross-pollination is not just theoretical. While working on a client's claims center integration, we saw how AI-enhanced services from one sector can inform those in another: in this case, translating a successful predictive maintenance model — one that keeps vehicles from unexpected breakdowns — into a system that anticipates patient needs. The implications are massive: reduced wait times, tailored treatments, and improved outcomes. This unexpected connection underscored how AI can redefine not just technical processes, but the very fabric of inter-industry solutions.

Decentralized AI Models: Elevating Agile Release Trains (ARTs)

Now, let’s delve into the nuts and bolts, which is honestly the fun part for my inner tech geek.
Integrating decentralized AI models into SAFe’s ARTs can significantly enhance their autonomy. During a high-stakes project, we shifted from a centralized to a decentralized model, which allowed ARTs to self-optimize and adapt to shifting priorities seamlessly. It was like giving ARTs a brain of their own. Decentralized AI models reduce the bottlenecks you'd typically encounter in centralized systems. Think of the ARTs as small startups within the larger enterprise ecosystem, each capable of making swift, informed decisions. The absence of a single chokepoint of decision-making means these trains can run on time and at speed, even as they navigate the complexities of changing business needs. The key takeaway here is understanding the delicate balance between granting autonomy and ensuring alignment with overarching portfolio goals.

AI-Augmented DevOps: Challenging Traditional Paradigms

I admit, initially, I was skeptical about introducing AI into our existing DevOps practices. It’s easy to get comfortable with the ‘if it ain’t broke, don’t fix it’ mentality. However, after watching AI tools predict deployment risks and automate testing in my current role leading Mule Transformation programs, I became a believer. These tools didn’t just empower the team; they reshaped our approach to problem-solving. With AI augmenting our DevOps toolchain, we saw intelligent feedback loops forming — automated insights that were richer and more actionable. This experience taught me that sometimes we let tradition stifle innovation. Embracing AI within SAFe DevOps isn’t just beneficial; it's transformative. It challenges the perception that AI is only useful post-deployment, carving out its role in the entire lifecycle.

The Industry Reality: Bridging Gaps and Overcoming Hurdles

The demand for scalable enterprise solutions is undeniable, yet the journey isn’t without hurdles. At its core, the transition to microservices can be fraught with complexity and consistency challenges.
Enterprises often struggle to integrate AI into existing frameworks. In my experience, many lack robust methodologies, which hinders the entire scaling process. While working with C4E teams at Tata Consultancy Services, I witnessed firsthand the challenges of maintaining consistency across distributed systems. However, integrating AI-enhanced microservices provided a lifeline — delivering intelligent monitoring, adaptive resource allocation, and predictive maintenance. Here’s my advice: don’t shy away from acknowledging these gaps. Instead, leverage them to develop specialized integration tools and methodologies. Investing in AI training for Agile professionals doesn’t just close these gaps; it obliterates them.

Looking Ahead: AI and Microservices’ Convergence

If I were to predict the future, I’d wager it heavily hinges on the convergence of AI and microservices within scalable frameworks like SAFe 5.0. The potential for dynamic, self-adjusting systems is immense. We're talking about systems capable of anticipating and reacting to market fluctuations with minimal human input. This isn’t just a tech enthusiast's dream — it's an emerging reality. The maturity of AI technologies spells a future where enterprises aren’t just keeping up; they’re setting the pace. So, if there’s a single, actionable insight to glean from my journey, it’s this: enterprises need to actively pursue cross-industry collaborations, invest in AI-powered microservices, and hone their Agile professionals’ skill sets. Doing so isn’t just beneficial; it’s essential for staying competitive in an ever-evolving digital landscape.

Conclusion: More Than Just Tech

In integrating AI-enhanced microservices within the SAFe 5.0 framework, we’re not just embedding technology into structure; we’re embedding intelligence. This journey is about more than just adding another tool to our arsenal.
It’s about enriching enterprise solutions, giving them the agility and adaptability to not only face the challenges ahead, but thrive in them. That's the adventure we find ourselves on, and these insights were hard-won over cups of coffee and late-night debugging sessions. If you're on this path, embrace AI with open arms — because, believe me, it's not just the future; it's the present.
Justin Albano
Software Engineer,
IBM