The coregrid Radar is a weekly column where we curate key architectural insights and major releases. No need to search across multiple sources: everything is in one place, from AI-native systems to security and cryptography.
Observability & Reliability Engineering
Airbnb: From Vendors to Vanguard — Observability Ownership | Airbnb shares its transition from third-party observability vendors to an in-house platform, gaining tighter cost control, deeper workflow integration, and faster feature evolution. The key shift is treating observability as a strategic engineering capability rather than external tooling. Read the release
Upleveling Alert Development at Airbnb | Airbnb reframed alerting as a product discipline with standards, reviews, and quality gates to reduce noise and improve diagnostic value. The focus is systematic MTTR reduction through better alert design, not cultural fixes. Read the release
Zero-Code Observability for LLMs and Agents on Kubernetes | A practical approach to instrumenting LLM workloads without code changes, enabling automatic metrics, tracing of inference pipelines, and visibility into token usage and latency. Useful for teams operationalizing AI services at speed. Read the release
Monitoring MCP Servers with OpenLIT and Grafana Cloud | Model Context Protocol servers introduce a new operational surface for AI agents. This piece outlines how to monitor agent health, tool-call latency, and resource consumption using standardized observability pipelines. Read the release
Cloud Native & Kubernetes
How Reddit Migrated Petabyte-Scale Kafka to Kubernetes | A rare deep dive into migrating a massive stateful Kafka deployment to Kubernetes. Reddit details storage tuning, scheduling constraints, and capacity planning strategies required to run data-heavy systems reliably in a containerized environment. Read the release
Securing Production Debugging in Kubernetes | Guidance on enabling production debugging without compromising security, covering ephemeral containers, RBAC boundaries, and auditability. A strong reference for building compliant and controlled debugging workflows. Read the release
The Invisible Rewrite: Modernizing the Kubernetes Image Promoter | A behind-the-scenes rewrite of the Kubernetes image promotion tool focused on idempotency, supply chain integrity, and release transparency. A good example of improving reliability through internal tooling modernization. Read the release
Ingress2Gateway 1.0: Your Path to Gateway API | Ingress2Gateway simplifies migration to the Gateway API, which is emerging as the standard for L7 traffic management in Kubernetes. The shift enables more expressive and extensible network configurations at platform scale. Read the release
Running Agents on Kubernetes with Agent Sandbox | Agent Sandbox introduces a runtime model for long-lived AI agents inside Kubernetes clusters, combining isolation, resource control, and native integration with cluster primitives. Kubernetes continues to evolve as the default execution layer for AI orchestration. Read the release
Data Platforms & Distributed Systems
From ScyllaDB to Kafka: Real-Time Data at Scale | Natura’s architecture pairs ScyllaDB for low-latency storage with Kafka as a streaming backbone. The design highlights a clean separation between persistence and event distribution to maintain performance under sustained real-time load. Read the release
Lessons Learned Running Presto at Meta Scale | Meta outlines operational challenges in running Presto at hyperscale, including resource isolation, query skew, and multi-tenant workload management. Valuable insight for teams operating distributed SQL engines in large analytical environments. Read the release
MongoDB Query Plan Cache Explained | A technical breakdown of MongoDB’s query plan cache, detailing when caching improves performance and when re-planning can degrade it. Particularly relevant for high-throughput OLTP systems with dynamic query patterns. Read the release
Rate Limiting Strategies with Valkey and Redis | A comparison of token bucket, leaky bucket, and sliding window algorithms implemented with Redis and Valkey, analyzing trade-offs in accuracy, latency, and scalability. A practical guide for API gateways and edge protection layers. Read the release
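For readers who want a feel for two of the algorithms named above, here is a minimal in-memory sketch of a token bucket and a sliding window log. It is not taken from the linked article and does not talk to Redis or Valkey; the class names, parameters, and the injectable `clock` are assumptions made for illustration. A production version would keep this state in the datastore and update it atomically.

```python
import time
from collections import deque
from typing import Callable, Deque


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = capacity        # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class SlidingWindowLog:
    """Allows at most `limit` requests in any trailing `window_seconds` interval."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.events: Deque[float] = deque()  # timestamps of accepted requests

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

The trade-off shows up directly in the state each one keeps: the token bucket stores two numbers per client and tolerates bursts, while the sliding window log stores one timestamp per accepted request in exchange for an exact count.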
Gossip Protocol Explained | An engineering-focused explanation of gossip protocols for membership management, state propagation, and anti-entropy in distributed systems. Essential background for understanding service discovery and cluster coordination at scale. Read the release
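As a rough illustration of the state-propagation idea, the sketch below simulates push-style gossip: each round, every node sends its versioned key/value state to a few random peers, which keep the entry with the higher version. This is a toy model under stated assumptions, not the linked article's code; the function names, the `fanout` parameter, and the `"leader"` key are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Each node's state maps a key to (version, value); higher versions win.
State = Dict[str, Tuple[int, str]]


def merge(local: State, remote: State) -> None:
    """Merge remote entries into local, keeping the newer version of each key."""
    for key, (version, value) in remote.items():
        if key not in local or local[key][0] < version:
            local[key] = (version, value)


def gossip_round(nodes: List[State], fanout: int, rng: random.Random) -> None:
    """Every node pushes its full state to `fanout` randomly chosen peers."""
    for i, state in enumerate(nodes):
        peers = [j for j in range(len(nodes)) if j != i]
        for j in rng.sample(peers, min(fanout, len(peers))):
            merge(nodes[j], state)


def rounds_to_converge(n: int, fanout: int, seed: int,
                       max_rounds: int = 100) -> int:
    """Seed one node with an update and count rounds until all nodes agree."""
    rng = random.Random(seed)
    nodes: List[State] = [{} for _ in range(n)]
    nodes[0]["leader"] = (1, "node-0")   # hypothetical update known to one node
    for round_number in range(1, max_rounds + 1):
        gossip_round(nodes, fanout, rng)
        if all(state == nodes[0] for state in nodes):
            return round_number
    raise RuntimeError("did not converge within max_rounds")


if __name__ == "__main__":
    print(rounds_to_converge(n=16, fanout=2, seed=7))
```

Real implementations typically exchange digests rather than full state and add failure detection on top, but the version-based merge is the core of the anti-entropy behavior.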
Architecture & Control Planes
How GitHub Rebuilt Search for High Availability in GitHub Enterprise Server | GitHub redesigned its search architecture to improve failure isolation and enable controlled failover in enterprise deployments. The case illustrates how to reduce blast radius while maintaining search consistency and performance. Read the release
Configuration as a Control Plane: Designing for Safety and Reliability at Scale | This article frames configuration management as a dedicated control plane with versioning, validation, and progressive rollout strategies. A useful perspective for large distributed systems where configuration errors carry high risk. Read the release
Morgan Stanley Rethinks Its API Program for the MCP Era | Morgan Stanley adapts its API architecture to support AI-driven workflows and MCP integrations, strengthening governance and contract models. APIs are increasingly treated as programmable interfaces for agents, not just human consumers. Read the release
Crossplane and AI: The Case for API-First Infrastructure | Crossplane promotes API-first infrastructure as a foundation for AI automation, positioning declarative APIs as the control surface for autonomous infrastructure agents. A forward-looking view on programmable cloud resources. Read the release
Security & Cryptography
High-Performance Envelope Encryption at Ariso.ai with Vault | Ariso.ai scales sensitive workloads using Vault’s Transit secrets engine and envelope encryption to minimize cryptographic overhead while maintaining strong key isolation. A practical reference for high-throughput systems with strict data protection requirements. Read the release