What Can Make an HTTP Request Fail on Complex K8s LLM Infra (P.S. AIBrix)
TL;DR
Low disk space can make HTTP request routing fail because the WASM cache in the Envoy Gateway controller cannot be initialized. This produces a very weird failure pattern: a request with one additional header (routing-strategy) fails with "no healthy upstream", while the same request without that header works perfectly…
Background and Context
I have been working on an intelligent request routing project for LLM inference applications, built on top of AIBrix. I set up a local dev environment on my MacBook for faster development and debugging, to avoid pushing docker images to and pulling them from a remote repository on a remote k8s cluster. The AIBrix installation went fine and a curl test was also successful. However, it stopped working after I had lunch… Finding the root cause of the HTTP request failure was tricky because the error only said the request failed due to "no healthy upstream", yet all AIBrix components were up and running, including the vLLM app. And none of the components (aibrix controller manager, envoy gateway, aibrix gateway plugin, etc.) said what the actual problem was.
In this blog, I am going to walk through the troubleshooting process. It turned out to be a simple issue, but it was so hard to find because it didn't manifest directly; it was hidden like a ring lost on a beach.
AIBrix Infrastructure Overview
AIBrix is a scalable cloud-native infrastructure for LLM inference applications. While it provides many capabilities including auto-scaling, model serving, resource management, multi-tenancy, and cost optimization, this debugging session focuses specifically on its intelligent request routing feature.
The routing intelligence is one key feature among many others that AIBrix provides:
- Auto-scaling based on inference workload patterns
- Model serving with vLLM, TensorRT-LLM, and other framework integrations
- Resource management for optimal GPU/CPU allocation
- Multi-tenancy for serving multiple models and users
- Cost optimization through intelligent resource scheduling
- Observability for comprehensive LLM workload monitoring
Control Plane vs Data Plane Architecture
The system follows the standard separation of concerns used by modern service mesh and API gateway solutions (Istio, Consul Connect, AWS App Mesh, Linkerd, Kong):
Control Plane (configuration management):
- Envoy Gateway Controller: Translates Gateway API resources into Envoy configurations
- AIBrix Controller Manager: Manages AIBrix-specific resources and model discovery
Data Plane (traffic processing):
- Envoy Proxy Instance: Processes actual HTTP requests and applies routing rules
(the envoy-aibrix-system-aibrix-eg-903790dc deployment and its pod)
- aibrix-gateway-plugins: Provides LLM-specific routing intelligence
(the aibrix-gateway-plugins deployment and its pod)
- Model Services: Execute inference workloads
- vllm pods
Core Components
AIBrix Controller Manager (aibrix-controller-manager): Manages custom resources like ModelAdapters and handles auto-discovery of LLM model deployments. It watches for pods with specific labels and automatically creates routing configurations. This component runs the AIBrix control plane logic.
Envoy Gateway Controller (envoy-gateway): The upstream Envoy Gateway project controller that translates Kubernetes Gateway API resources (Gateway, GatewayClass, HTTPRoute) into Envoy proxy configurations. This is the standard Envoy Gateway controller that acts as a “compiler” converting high-level Gateway API resources into low-level Envoy configuration.
Envoy Proxy Instance (envoy-aibrix-system-aibrix-eg-903790dc): The actual Envoy proxy pod that handles incoming requests and processes all HTTP traffic. This is the data plane component that applies routing rules, performs load balancing, and communicates with upstream services. It’s created and managed by the Envoy Gateway controller based on Gateway resource configurations.
aibrix-gateway-plugins (aibrix-gateway-plugins): The core AIBrix routing intelligence engine that implements LLM-specific routing strategies. This component receives requests from the Envoy proxy via the external processing mechanism and makes intelligent routing decisions based on factors like current load, model warm-up state, cache efficiency, SLA requirements, and resource utilization.
Model Services (llama2-7b, mock-app): The actual LLM serving applications that process inference requests. These can be vLLM, TensorRT-LLM, or other model serving frameworks integrated with AIBrix.
Request Lifecycle in AIBrix
Regarding the HTTP request lifecycle: the request first reaches the Envoy proxy, and the Envoy proxy sends it to aibrix-gateway-plugins (an Envoy external processing service). After aibrix-gateway-plugins runs its routing algorithm, it returns the request to the Envoy proxy along with the target pod, and the Envoy proxy then forwards the request to the appropriate vLLM pod. Let's take a look at each of the components involved in the request lifecycle to better understand their roles.
AIBrix leverages Envoy Proxy rather than building a custom proxy because it needs:
- High-performance HTTP/gRPC traffic handling for LLM inference workloads
- Advanced load balancing with health checking and circuit breaking
- Production-grade observability with metrics, tracing, and logging
- Extensibility through filters and external processing for AI-specific routing logic
- Cloud-native integration with Kubernetes service discovery
This allows AIBrix to focus on LLM-specific optimizations while leveraging proven proxy infrastructure.
Request Processing Flow
The request processing involves multiple layers in the data plane, with control plane components managing configuration but not touching actual requests:
- Client Request → Port-forward (localhost:8888)
- Port-forward → Envoy Proxy Instance (envoy-aibrix-system-aibrix-eg-903790dc)
- Envoy Proxy → Route matching based on HTTPRoute configurations (managed by Envoy Gateway Controller)
- Route Decision Branch:
  - Direct Routes: Requests with the model: llama2-7b header → Direct to the model service (llama2-7b:8000)
  - Extension Processing: Requests with the routing-strategy header → EnvoyExtensionPolicy triggers external processing
- Envoy Proxy → aibrix-gateway-plugins service (via the external processing filter)
  - aibrix-gateway-plugins is an external processing service that Envoy calls over gRPC through the external processing filter. It runs the routing algorithm.
- aibrix-gateway-plugins → Envoy proxy
  - aibrix-gateway-plugins returns the selected pod to the Envoy proxy
- Envoy proxy → selected vLLM pod
  - Envoy forwards the request to the selected pod
Note: The Envoy Gateway Controller and AIBrix Controller Manager are control plane components that manage configuration but do not process actual requests. They only configure the data plane components that handle traffic.
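A trick that helped me while mapping this flow: every Envoy access-log entry carries an x-request-id (you will see it in the log entries later in this post), so you can grep for that id across components to see where a request stops. A minimal sketch; the deployment names and namespaces are the ones from my cluster, and whether aibrix-gateway-plugins actually logs the request id depends on its log configuration:
# Find the failing request's id in the Envoy access log (JSON entries, shown later in this post)
kubectl logs -n envoy-gateway-system deploy/envoy-aibrix-system-aibrix-eg-903790dc | grep no_healthy_upstream | tail -1
# Then check whether aibrix-gateway-plugins ever saw that request id
kubectl logs -n aibrix-system deploy/aibrix-gateway-plugins | grep <x-request-id-from-the-log>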
Envoy Gateway Controller
It reads Gateway API resources and generates Envoy configuration:
- Converts HTTPRoute resources into Envoy route configurations
- Manages the lifecycle of Envoy proxy pods
- Applies EnvoyExtensionPolicy configurations as Envoy external processing filters
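If you want to see what the controller actually produced, you can peek at the running Envoy proxy's configuration through its admin interface. A sketch, assuming the admin endpoint listens on port 19000 (the Envoy Gateway default; double-check on your deployment):
# in one terminal: forward the Envoy admin port from the proxy deployment
kubectl port-forward -n envoy-gateway-system deploy/envoy-aibrix-system-aibrix-eg-903790dc 19000:19000
# in another terminal: dump the routes Envoy is serving and look for the HTTPRoute-derived names
curl -s localhost:19000/config_dump | grep -o 'httproute/aibrix-system/[^"]*' | sort -u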
Envoy Proxy Instance (envoy-aibrix-system-aibrix-eg-903790dc)
It is the critical data plane component:
- Receives all incoming HTTP requests
- Applies routing rules generated by Envoy Gateway Controller
- For extension policy routes, forwards requests to external processing services
- Handles load balancing, health checking, and traffic management
aibrix-gateway-plugins
It provides the intelligence layer:
- Receives requests from Envoy via gRPC external processing API
- Maintains model metadata and performance metrics
- Implements various routing algorithms based on the routing-strategy header
- Returns routing decisions back to Envoy proxy
This architecture allows AIBrix to leverage standard Envoy Gateway functionality while adding intelligent routing capabilities through the extension mechanism.
Problem Description
The issue manifested as an HTTP request failure that was highly specific to the request headers. (Again, it was working originally; after I had food, it stopped working…)
This is the curl command that needs to succeed. routing-strategy is specified as a header; model and messages are specified in the body.
curl -v http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-H "routing-strategy: random" \
-d '{"model": "llama2-7b", "messages": [{"role": "user", "content": "test"}]}'
It failed, saying HTTP 503, “no healthy upstream”.
However, after testing different curl requests, I found that curl succeeded when the routing-strategy header was omitted… This is very weird.
Working request:
curl -v http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-H "model: llama2-7b" \
-d '{"model": "llama2-7b", "messages": [{"role": "user", "content": "test"}]}'
Result: HTTP 200!
Let's take a look at the difference between the failing and the successful request. The Envoy access logs from the proxy instance (envoy-aibrix-system-aibrix-eg-903790dc) show critical differences between working and failing requests. Let's examine both cases:
kubectl logs envoy-aibrix-system-aibrix-eg-903790dc-7c8dd7d4b9-64w68 -n envoy-gateway-system
Working request logs (with model: llama2-7b and without routing-strategy header):
{":authority":"localhost:8888","bytes_received":73,"bytes_sent":302,"connection_termination_details":null,"downstream_local_address":"127.0.0.1:10080","downstream_remote_address":"127.0.0.1:55222","duration":6,"method":"POST","protocol":"HTTP/1.1","requested_server_name":null,"response_code":200,"response_code_details":"via_upstream","response_flags":"-","route_name":"httproute/aibrix-system/llama2-7b-router/rule/0/match/1/*","start_time":"2025-08-20T23:21:58.905Z","upstream_cluster":"httproute/aibrix-system/llama2-7b-router/rule/0","upstream_host":"10.244.0.19:8000","upstream_local_address":"10.244.0.30:48658","upstream_transport_failure_reason":null,"user-agent":"curl/8.7.1","x-envoy-origin-path":"/v1/chat/completions","x-envoy-upstream-service-time":null,"x-forwarded-for":"10.244.0.30","x-request-id":"4000a69d-3db6-404c-8892-4f8023e008f0"}
Ignore the other fields and look at "route_name":"httproute/aibrix-system/llama2-7b-router/rule/0/match/1/*". The Envoy proxy finds the right route.
Failing request logs (with ‘routing-strategy’ header):
{":authority":"localhost:8888","bytes_received":0,"bytes_sent":19,"connection_termination_details":null,"downstream_local_address":"127.0.0.1:10080","downstream_remote_address":"127.0.0.1:47350","duration":0,"method":"POST","protocol":"HTTP/1.1","requested_server_name":null,"response_code":503,"response_code_details":"no_healthy_upstream","response_flags":"UH","route_name":"original_route","start_time":"2025-08-21T03:23:19.513Z","upstream_cluster":"original_destination_cluster","upstream_host":null,"upstream_local_address":null,"upstream_transport_failure_reason":null,"user-agent":"curl/8.7.1","x-envoy-origin-path":"/v1/chat/completions","x-envoy-upstream-service-time":null,"x-forwarded-for":"10.244.0.34","x-request-id":"34137566-5a0d-4835-878e-bfebcb2e67fd"}
Again, ignoring the other fields, look at "route_name":"original_route". This is the Envoy proxy falling back to the original route. In Envoy, original_route is a default route name used as a fallback when the HTTP request doesn't match any of the explicitly defined routes in the configuration. The original destination here is localhost (127.0.0.1) on port 8888, i.e. the address the client originally connected to.
Envoy receives the request and, based on its routing rules, forwards it to a different, internal service (vllm pods in our context). The source IP of the new connection from Envoy to the internal service is Envoy’s own IP address, not the original client’s IP.
So original_route will fail. We need the route that Envoy generates from the HTTPRoute, which in the working case is httproute/aibrix-system/llama2-7b-router/rule/0/match/1/*.
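Since each access-log entry is a single JSON line, a quick way to compare many requests at once is to project out just the interesting fields (a small helper, assuming jq is installed):
kubectl logs envoy-aibrix-system-aibrix-eg-903790dc-7c8dd7d4b9-64w68 -n envoy-gateway-system \
  | grep '^{' | jq -r '[.response_code, .route_name, .upstream_host // "-"] | @tsv'
This prints one line per request: 200 with the httproute-derived route and a real upstream for the working case, and 503 with original_route and no upstream for the failing case.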
What this tells us:
The working request shows that Envoy successfully:
- Matched a configured HTTPRoute: httproute/aibrix-system/llama2-7b-router/rule/0/match/1/*
- Found a healthy upstream cluster: httproute/aibrix-system/llama2-7b-router/rule/0
- Connected to the actual pod: 10.244.0.19:8000
- Received a response from the backend: via_upstream
The failing request shows that Envoy:
- Fell back to default routing: original_route (not a configured HTTPRoute)
- Used the fallback cluster: original_destination_cluster (not a real backend)
- Found no upstream: upstream_host: null
- Had no backend to connect to: no_healthy_upstream
This pattern suggested that requests with routing-strategy headers were not matching any of the configured HTTPRoutes and were falling back to a default route with no available upstream. The original_route is Envoy’s fallback mechanism when no configured routes match the request, indicating a routing configuration or processing issue rather than a backend availability problem.
Investigations…
Investigating the symptom: requests with the routing-strategy header fall back to original_route and fail.
Route Precedence Investigation
Reasoning: Since the routing behavior differed based on headers, we suspected that multiple HTTPRoutes were conflicting, causing requests to match the wrong route before reaching the extension processing pipeline.
Gateway API processes routes in a specific order, and I thought the issue might be related to route precedence - where a broader route was capturing requests before a more specific route could match them.
Let’s check the HTTPRoute configurations. There are three HTTPRoute resources.
kubectl get httproutes -A --sort-by=.metadata.creationTimestamp
NAMESPACE NAME HOSTNAMES AGE
aibrix-system aibrix-reserved-router-models-endpoint 12h
aibrix-system aibrix-reserved-router 8h
aibrix-system llama2-7b-router 6h
kubectl describe httproute -A
# aibrix-reserved-router - handles requests for extension processing
spec:
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - name: aibrix-gateway-plugins
      port: 50052

# llama2-7b-router - handles direct model requests
spec:
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
      headers:
      - name: model
        type: Exact
        value: llama2-7b
    backendRefs:
    - name: llama2-7b
      port: 8000
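For reference, the piece that steers routing-strategy requests toward extension processing is a header match. Expressed in Gateway API terms it would look roughly like the sketch below; this is only an illustration based on the generated Envoy matcher we will see in the error logs later (a catch-all regex on the routing-strategy header). The actual aibrix-reserved-router rules are generated by AIBrix and may differ:
# hypothetical illustration, not the actual AIBrix-generated spec
rules:
- matches:
  - path:
      type: PathPrefix
      value: /v1/chat/completions
    headers:
    - name: routing-strategy
      type: RegularExpression
      value: ".*"
  backendRefs:
  - name: aibrix-gateway-plugins
    port: 50052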
Let’s see llama2-7b-router status.
kubectl describe httproute llama2-7b-router -n aibrix-system
Status:
  Conditions:
    Type:     Accepted
    Status:   True
    Message:  Route is accepted
    Type:     ResolvedRefs
    Status:   True
    Message:  Resolved all the Object references for the Route
All routes showed “Accepted” status, indicating the route configurations were valid.
We also verified the Envoy proxy was receiving these configurations:
kubectl logs envoy-aibrix-system-aibrix-eg-903790dc-7c8dd7d4b9-64w68 -n envoy-gateway-system | grep -i "route\|config"
The proxy logs showed it was receiving route updates, but requests with routing-strategy headers were still not matching the expected routes.
So… the route configurations were not the problem: the Envoy proxy instance was receiving valid configurations, but something in the extension processing pipeline was causing requests to bypass the normal routing logic.
Header Matching Syntax Analysis
The Envoy proxy logs contained deprecation warnings about header matching syntax, suggesting that the header matching mechanism in the proxy might not be working correctly. It shouldn't have been the cause, since everything worked earlier!!! But at this point I wasn't trusting anything, so let's check…
Envoy proxy deprecation warnings:
kubectl logs envoy-aibrix-system-aibrix-eg-903790dc-7c8dd7d4b9-64w68 -n envoy-gateway-system
[WARNING] Deprecated field: type envoy.config.route.v3.HeaderMatcher Using deprecated option 'envoy.config.route.v3.HeaderMatcher.exact_match'
I hypothesized that this deprecated syntax might be causing header matching to fail within the Envoy proxy, particularly for complex header combinations.
kubectl edit httproute llama2-7b-router -n aibrix-system
# Changed from:
headers:
- name: model
  type: Exact
  value: llama2-7b
# Changed to:
headers:
- name: model
  value: llama2-7b
I tried removing the type: Exact field.
After applying the change, we verified that the Envoy proxy received the updated configuration. However, the routing behavior remained unchanged: requests with just the model header continued to work, while requests with the routing-strategy header still failed. So this was not the root cause. I was not haunted as badly as I thought…
Configuration Comparison between problematic local kind cluster and working remote cluster
Reasoning: Since we had access to a working cluster with identical AIBrix setup, we decided to compare configurations to identify any differences that might explain the behavior variance between the working and non-working environments.
kubectl get envoyproxy aibrix-custom-proxy-config -n aibrix-system -o yaml
kubectl get gateway aibrix-eg -n aibrix-system -o yaml
kubectl get gatewayclass aibrix-eg -o yaml
kubectl get httproutes -A -o yaml
EnvoyProxy configurations were identical
The Envoy proxy pod configurations, including image versions, resource allocations, and environment variables, were identical between clusters.
And all the other configurations were also the same… It was driving me crazy.
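To make the comparison less error-prone than eyeballing YAML, it helps to diff the resources between the two kubectl contexts directly (a sketch; kind-kind is the default kind context name, my-remote-cluster is a placeholder for yours, and expect some harmless noise from fields like resourceVersion and uid):
for r in gatewayclass gateway httproute envoyproxy; do
  echo "=== $r ==="
  diff <(kubectl --context kind-kind get "$r" -A -o yaml) \
       <(kubectl --context my-remote-cluster get "$r" -A -o yaml)
done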
Extension Policy Investigation (Actual Root Cause)
Reasoning: Since the previous investigations had ruled out basic configuration issues, I decided to examine the extension processing pipeline more closely. The fact that requests with routing-strategy headers were hitting “original_route” in the Envoy proxy suggested that the EnvoyExtensionPolicy mechanism was failing.
We checked the Envoy Gateway controller logs specifically for routing-related errors:
kubectl logs -n envoy-gateway-system deployment/envoy-gateway | grep -i "routing"
2025-08-21T03:22:49.283Z ERROR xds-translator failed to translate xds ir
"match":{"headers":[{"name":"routing-strategy","string_match":{"safe_regex":{"regex":".*"}}}],"prefix":"/v1"},
"name":"original_route",
"route":{"cluster":"original_destination_cluster","timeout":"120s"},
"typed_per_filter_config":{"envoy.filters.http.ext_proc/envoyextensionpolicy/aibrix-system/aibrix-gateway-plugins-extension-policy/extproc/0"
err:invalid RouteConfiguration.VirtualHosts[0]: embedded message failed validation
Oh OH OH Oh… There was indeed a special route being created for routing-strategy headers. It was associated with an EnvoyExtensionPolicy (aibrix-gateway-plugins-extension-policy). The extension policy configuration was failing validation, which caused requests to fall back to the "original_route" with no healthy upstream.
(EnvoyExtensionPolicy was a resource we had been overlooking… who would think it was the root cause…) Extension policy status check:
kubectl describe envoyextensionpolicy aibrix-gateway-plugins-extension-policy -n aibrix-system
Status:
  Conditions:
    Type:     Accepted
    Status:   False
    Reason:   Invalid
    Message:  Wasm: wasm cache is not initialized.
Oh Oh Oh, my boi.
The extension policy was failing specifically because the WASM cache could not be initialized. This failure was preventing the Envoy proxy from properly configuring the external processing filter that handles requests destined for the aibrix-gateway-plugins.
When a request lacks the routing-strategy header but includes ‘model: llama2-7b’, it matches the llama2-7b-router HTTPRoute directly. Envoy routes it straight to the vLLM pod using the configured backend reference, bypassing any extension processing entirely. This path works because it doesn’t depend on the broken WASM cache.
When a request includes the routing-strategy header, it should match the aibrix-reserved-router HTTPRoute, which is configured to trigger the EnvoyExtensionPolicy for external processing through aibrix-gateway-plugins. However, due to the WASM cache initialization failure, the extension policy cannot be properly instantiated in Envoy’s filter chain. This causes the request to fail route matching entirely, falling back to Envoy’s default “original_route” - a passthrough route with no configured upstream, resulting in the “no healthy upstream” error.
In summary:
With the model: llama2-7b header (no routing-strategy):
- Matches the llama2-7b-router HTTPRoute directly
- Route: httproute/aibrix-system/llama2-7b-router/rule/0/match/1/*
- Goes directly to the vLLM pod at 10.244.0.19:8000
- No extension processing involved
With the routing-strategy header:
- Should match the aibrix-reserved-router HTTPRoute
- Should trigger the EnvoyExtensionPolicy for external processing
- But fails due to the WASM cache issue
- Falls back to original_route (which has no valid upstream)
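To reproduce the two behaviors side by side without retyping curl commands, a small loop like this works (same endpoint and headers as used throughout this post):
for hdr in "model: llama2-7b" "routing-strategy: random"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8888/v1/chat/completions \
    -H "Content-Type: application/json" -H "$hdr" \
    -d '{"model": "llama2-7b", "messages": [{"role": "user", "content": "test"}]}')
  echo "$hdr -> HTTP $code"
done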
WASM Cache Analysis
Reasoning: With the WASM cache initialization failure identified, we needed to understand why the cache couldn’t be created. WASM (WebAssembly) is used by Envoy for extending functionality through external processing filters. If the cache can’t be initialized, extension policies will fail, and the Envoy proxy will fall back to default routing behavior.
Key diagnostic command:
kubectl logs -n envoy-gateway-system deployment/envoy-gateway | grep -i "wasm\|cache\|init"
Critical output:
2025-08-21T03:49:15.867Z ERROR gateway-api Failed to create Wasm cache directory
{"runner": "gateway-api", "error": "mkdir /var/lib/eg/wasm: no space left on device"}
This log entry immediately revealed the root cause - the system was running out of disk space and couldn’t create the WASM cache directory that the Envoy Gateway controller needed for extension policy processing.
Verification:
docker exec kind-control-plane df -h
Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 59G 59G 0 100% /
The kind cluster’s filesystem was completely full. We also checked the specific impact on the Envoy proxy:
kubectl logs envoy-aibrix-system-aibrix-eg-903790dc-7c8dd7d4b9-64w68 -n envoy-gateway-system | grep -i "wasm\|cache"
The proxy logs confirmed it was not receiving proper extension filter configurations due to the upstream WASM cache failure.
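Kubernetes itself also surfaces this condition: the kubelet sets DiskPressure on the node once the filesystem crosses its eviction thresholds, which is a quick thing to check (assuming the default kind node name kind-control-plane):
kubectl describe node kind-control-plane | grep -A 8 "Conditions:"
# or just the DiskPressure condition
kubectl get node kind-control-plane \
  -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'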
Root Cause
The issue was caused by disk space exhaustion in the kind cluster, which had a cascading effect through the entire request processing pipeline:
Failure Chain
- Disk space exhaustion: The kind cluster filesystem reached 100% capacity
- WASM cache initialization failure: The Envoy Gateway controller could not create the /var/lib/eg/wasm directory. Note that the Envoy Gateway controller is different from the data plane Envoy proxy.
- Extension policy failure: The EnvoyExtensionPolicy (aibrix-gateway-plugins-extension-policy) failed with "Wasm: wasm cache is not initialized"
- Envoy configuration corruption: The Envoy proxy instance (envoy-aibrix-system-aibrix-eg-903790dc) received invalid route configuration from the Envoy Gateway controller for requests that need extension processing (those with the routing-strategy header)
- Routing fallback: Requests with routing-strategy headers fell back to “original_route” which had no healthy upstream
- Selective impact: Only requests requiring extension processing were affected; direct routes continued working
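In hindsight, each link in that chain can be verified with a single command, which is roughly the checklist I would run first next time (commands and resource names are the ones used earlier in this post; adjust to your setup):
docker exec kind-control-plane df -h /                                              # 1. disk space on the kind node
kubectl logs -n envoy-gateway-system deployment/envoy-gateway | grep -i wasm        # 2. WASM cache errors in the controller
kubectl describe envoyextensionpolicy -n aibrix-system | grep -A 4 "Conditions:"    # 3. extension policy status
kubectl logs -n envoy-gateway-system deployment/envoy-gateway | grep -i "failed to translate"   # 4. xDS translation errors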
Extension Policy Configuration
The failing extension policy was configured as follows:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: aibrix-gateway-plugins-extension-policy
  namespace: aibrix-system
spec:
  targetRef:
    kind: HTTPRoute
    name: aibrix-reserved-router
  extProc:
  - backendRefs:
    - name: aibrix-gateway-plugins
      port: 50052
    messageTimeout: 60s
    processingMode:
      request:
        body: Buffered
      response:
        body: Streamed
This policy was designed to:
- Intercept requests matching the aibrix-reserved-router HTTPRoute
- Forward them to the aibrix-gateway-plugins for external processing
- Apply intelligent routing decisions based on the routing-strategy header
Request Flow Breakdown
Working flow (direct model requests):
- Request with the model: llama2-7b header → Envoy proxy
- Envoy matches the llama2-7b-router HTTPRoute
- Direct routing to model service (no extension processing required)
- Model service processes request and returns response
Broken flow (routing-strategy requests):
- Request with the routing-strategy header → Envoy proxy
- Envoy attempts to match the extension processing route
- Extension policy configuration is invalid due to WASM cache failure
- Request falls back to original_route with original_destination_cluster
- No healthy upstream available → 503 error
Solution
The solution involved addressing the root cause and restarting the affected components:
Disk cleanup:
docker system prune -f
docker image prune -a -f
docker volume prune -f
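Depending on where the space actually went, host-level pruning may not be enough, because the kind node keeps its own containerd image store. Two additional checks/cleanups (a sketch; crictl and its --prune flag ship with recent kind node images, but verify on yours):
# See what is consuming space on the Docker side
docker system df
# Remove unused images from the kind node's own containerd store
docker exec kind-control-plane crictl rmi --prune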
Verify space recovery:
docker exec kind-control-plane df -h
Restart Envoy Gateway controller (to reinitialize WASM cache):
kubectl rollout restart deployment/envoy-gateway -n envoy-gateway-system
kubectl rollout status deployment/envoy-gateway -n envoy-gateway-system
Restart Envoy proxy instance (to apply corrected configuration):
kubectl rollout restart deployment/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system
kubectl rollout status deployment/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system
Verification of fix:
kubectl describe envoyextensionpolicy aibrix-gateway-plugins-extension-policy -n aibrix-system
Output after fix:
Status:
  Conditions:
    Type:     Accepted
    Status:   True
    Message:  Policy has been accepted.
Verify Envoy proxy configuration:
kubectl logs envoy-aibrix-system-aibrix-eg-903790dc-7c8dd7d4b9-64w68 -n envoy-gateway-system | grep -i "config\|route"
The logs confirmed the proxy was now receiving valid route configurations with proper extension processing filters.
Functional test:
curl -v http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-H "routing-strategy: random" \
-d '{"model": "llama2-7b", "messages": [{"role": "user", "content": "test"}]}'
Result: HTTP 200, proper routing through the complete AIBrix pipeline:
- Request → Envoy proxy
- Envoy proxy → Extension processing → aibrix-gateway-plugins
- aibrix-gateway-plugins → Intelligent routing decision → Model service
- Model service → Response
Key Takeaways
This debugging session highlighted several important lessons about complex distributed systems:
-
Infrastructure dependencies cascade: A simple infrastructure issue (disk space) can cause failures in sophisticated features (extension processing) that appear unrelated. The failure path went from disk space → WASM cache → extension policy → Envoy configuration → routing behavior.
-
Component interaction complexity: Modern service mesh architectures involve multiple layers (Envoy Gateway controller, Envoy proxy instances, extension policies, external processing services) where each layer can fail independently and cause different symptoms.
-
Logging strategy matters: The critical error was buried in the Envoy Gateway controller logs, not in the more obvious places like the Envoy proxy access logs or the AIBrix application logs. Understanding which component logs contain the most actionable information is crucial for efficient debugging.
-
Extension mechanism fragility: Extension-based architectures (like EnvoyExtensionPolicy) add powerful capabilities but also introduce additional failure modes. When extensions fail, systems often fall back to unexpected default behaviors that can be difficult to diagnose.
-
Infrastructure-first debugging: Even in complex distributed systems with sophisticated routing logic, basic infrastructure issues (disk space, memory, network) should be checked early in the debugging process.
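As a concrete version of that last point, a tiny disk check over the kind nodes is now the first thing I run before trusting anything else (a sketch; assumes the kind CLI is installed and the cluster uses the default name):
for node in $(kind get nodes); do
  printf '%s: ' "$node"
  docker exec "$node" df -h / | awk 'NR==2 {print $5 " used"}'
done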
Some Complaints
It was another frustrating collapse in a k8s cluster. The root cause was something unexpected, and the clue was buried deep in the infrastructure. The same painful pattern has come up in many other issues I've run into. It is so hard to find and fix because each component in the infra is distributed, and when an issue occurs, it is not forwarded to the other components in the stack. When a problem X happens in component A, and component B interacts with A in a way affected by X, component B often reports something vague that confuses the operator instead of surfacing the root cause that happened in A. Vertically connecting the components so that any issue is traceable across the system seems hard to achieve. Maybe it is an architectural challenge…