A selection of architectural insights and releases we read this week.
AI Systems & LLM Infrastructure
🔹 Blink (CPU-free LLM Inference)
A radical shift in serving architecture: fully offloading the inference pipeline to the GPU and SmartNIC eliminates the CPU as a bottleneck, reducing latency and increasing throughput under heavy loads.
Read the release
🔹 ForkKV (Disaggregated KV Cache for Multi-LoRA)
A copy-on-write KV cache allows scaling multi-LoRA inference without a linear increase in memory, making multi-tenant LLM serving significantly more efficient.
Read the release
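The copy-on-write idea can be sketched in a few lines: base KV blocks are shared by every adapter, and a tenant pays for a private copy only when it first writes to a block. The `CowKVCache` class below is illustrative, not ForkKV's actual API.

```python
class CowKVCache:
    """Sketch of a copy-on-write KV cache: shared base blocks, per-tenant overlays."""

    def __init__(self, base_blocks):
        self.base = base_blocks   # shared KV blocks (here: lists of floats)
        self.tenants = {}         # tenant_id -> {block_idx: private copy}

    def fork(self, tenant_id):
        # Forking is O(1): no KV data is copied, only an empty overlay is created.
        self.tenants[tenant_id] = {}

    def read(self, tenant_id, idx):
        # Reads fall through to the shared base unless the tenant has written.
        return self.tenants[tenant_id].get(idx, self.base[idx])

    def append(self, tenant_id, idx, value):
        overlay = self.tenants[tenant_id]
        if idx not in overlay:
            overlay[idx] = list(self.base[idx])  # copy-on-write: first touch copies
        overlay[idx].append(value)

    def private_blocks(self, tenant_id):
        # Memory grows with blocks actually written, not with tenant count.
        return len(self.tenants[tenant_id])
```

This is why memory no longer scales linearly with the number of LoRA adapters: N tenants sharing a common prefix hold one copy of its KV blocks until they diverge.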
🔹 Holos Multi-Agent System
LLM agents are moving toward web-scale orchestration: the paper proposes an architecture in which agents coordinate as a distributed system with explicit roles, memory, and communication.
Read the release
🔹 Anthropic Three-Agent Harness
A practical implementation of multi-agent development: separating roles (planner, coder, verifier) increases the resilience of long-running tasks and reduces quality degradation.
Read the release
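The control flow of such a harness is worth seeing in isolation. In this hedged sketch each role is a plain function rather than an LLM call, so the loop itself is testable; the real harness wires these roles to models, and the function names are ours, not Anthropic's.

```python
def run_harness(planner, coder, verifier, task, max_attempts=3):
    """Planner/coder/verifier loop: the verifier gates the coder's output,
    and failed attempts feed back as context instead of silently degrading."""
    plan = planner(task)
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = coder(plan, feedback)   # coder sees the last failure, if any
        ok, feedback = verifier(candidate)  # verifier decides, never the coder
        if ok:
            return {"result": candidate, "attempts": attempt}
    return {"result": None, "attempts": max_attempts}
```

Separating verification from generation is what makes long-running tasks resilient: a bad attempt produces structured feedback for the next one rather than compounding errors.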
Distributed Systems & HPC
🔹 Reverse Address Translation in Multi-GPU Pods
The study shows that reverse address translation (RAT) can become a hidden bottleneck in scale-up GPU configurations, affecting inter-GPU communication latency, which is critical for LLM training.
Read the release
🔹 Alltoallv RMA in MPI
An analysis of persistent RMA implementations shows how to reduce the overhead of collective operations—key to optimizing communication-heavy HPC workloads.
Read the release
🔹 Minos (GPU Workload Profiling)
A framework systematically links the performance and power profiles of GPU tasks, paving the way for energy-aware scheduling in HPC clusters.
Read the release
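A minimal sketch of what "linking performance and power profiles" enables, assuming per-config (runtime, power) measurements for a task; `pick_config` and the profile shape are hypothetical, not Minos's API.

```python
def pick_config(profiles, max_slowdown=1.10):
    """Pick the GPU config minimizing energy = runtime * power,
    while staying within max_slowdown of the fastest config's runtime.

    profiles: {config_name: (runtime_seconds, power_watts)}
    """
    fastest = min(rt for rt, _ in profiles.values())
    # Keep only configs whose slowdown versus the fastest is acceptable.
    allowed = {name: (rt, pw) for name, (rt, pw) in profiles.items()
               if rt <= fastest * max_slowdown}
    # Among those, choose the lowest-energy config.
    return min(allowed, key=lambda name: allowed[name][0] * allowed[name][1])
```

With a profile like 10 s at 300 W versus 10.8 s at 220 W, an 8% slowdown buys a ~21% energy reduction, which is exactly the trade-off an energy-aware scheduler exploits.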
Cloud Native & Platform Engineering
🔹 OpenFaaS Runtime & Kubernetes Study
The study demonstrates that the choice of runtime and Kubernetes distribution significantly affects the cold-start latency and throughput of serverless functions; tuning requires a systemic approach rather than relying on defaults.
Read the release
🔹 Pinterest Spark Auto Memory Retries
An engineering pattern: automatic retries with memory adaptation reduce OOM errors by 96%, turning unstable batch pipelines into predictable ones.
Read the release
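The pattern itself is simple enough to sketch. Pinterest's system operates inside their Spark platform; here `run_job` and `OOMError` are stand-ins so the escalation logic is visible on its own.

```python
class OOMError(Exception):
    """Stand-in for an out-of-memory job failure."""

def run_with_memory_retries(run_job, base_memory_gb=4, factor=2, max_retries=3):
    """On an OOM failure, resubmit the job with more executor memory
    instead of failing the whole pipeline."""
    memory = base_memory_gb
    for attempt in range(max_retries + 1):
        try:
            return run_job(memory_gb=memory), memory
        except OOMError:
            if attempt == max_retries:
                raise  # escalation exhausted; surface the failure
            memory *= factor  # retry with more memory
```

The design choice worth noting: escalating only on OOM keeps the common case cheap, so pipelines do not pay for worst-case memory on every run.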
🔹 Autonomous AI SRE Agent (Elasticsearch)
An agent automates the end-to-end SRE cycle (deploy → monitor → heal), demonstrating a real transition to self-healing systems without human intervention.
Read the release
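The monitor → heal part of that cycle reduces to a control loop. In this sketch `check_health` and `remediate` are injected stubs (the real agent reasons over Elasticsearch signals with an LLM); only the loop structure is shown.

```python
def sre_loop(check_health, remediate, max_cycles=5):
    """Probe the service each cycle; on an unhealthy status, apply a
    remediation and verify on the next cycle, with no human in the loop."""
    actions = []
    for _ in range(max_cycles):
        status = check_health()
        if status == "healthy":
            return {"healed": True, "actions": actions}
        actions.append(remediate(status))  # act, then re-verify next cycle
    return {"healed": False, "actions": actions}  # escalation point
```

The bounded cycle count matters in practice: an autonomous agent needs a point at which it stops acting and escalates, rather than remediating forever.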
Data Infrastructure & Databases
🔹 Etsy Migration to Vitess
The migration of 1,000 MySQL shards (425 TB) to Vitess confirms its maturity as a control plane for massive sharding and zero-downtime online migrations.
Read the release
Developer Experience & Performance Engineering
🔹 GitHub Diff Performance Engineering
A deep dive into optimizing diff rendering shows that bottlenecks often lie in comparison algorithms and layout rather than I/O—an important lesson for UI engineering on large data sets.
Read the release
Security & Blockchain
🔹 Routing Attacks in Ethereum PoS
Network-layer attacks (routing hijacks, eclipse attacks) remain a real threat to PoS: consensus is only as secure as the network connectivity beneath it.
Read the release