Debugging Forever Terminating Pods in Kubernetes
Tracking down why a pod was stuck in “Terminating” state for 10+ hours.
The Issue
One of our LLaMA model pods had been stuck in the Terminating state for over 10 hours. When we tried to delete it manually, the command hung forever:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
llama-3-8b-instruct-785975c76d-psqzk 2/2 Terminating 0 30d
$ kubectl delete pod llama-3-8b-instruct-785975c76d-psqzk
pod "llama-3-8b-instruct-785975c76d-psqzk" deleted
# ^ This command hangs here forever and never completes
The delete command appeared to accept the request but never finished executing.
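As a quick sanity check, the pod's deletionTimestamp confirms whether the API server has actually registered the deletion (a minimal sketch, reusing the pod name from this incident):
$ kubectl get pod llama-3-8b-instruct-785975c76d-psqzk -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
# A non-empty timestamp means the delete was accepted and the API server is now waiting on the kubelet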
Pod Description
First, we examined the pod itself:
$ kubectl describe pod llama-3-8b-instruct-785975c76d-psqzk
Key observations from the pod description:
Name: llama-3-8b-instruct-785975c76d-psqzk
Namespace: default
Node: 10.0.1.14/10.0.1.14
Start Time: Wed, 14 May 2025 20:40:45 -0700
Status: Terminating (lasts 10h)
Termination Grace Period: 60s
IP: 10.0.1.83
Controlled By: ReplicaSet/llama-3-8b-instruct-785975c76d
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
DisruptionTarget True
Events: <none>
Notable details:
- Status: Terminating (lasts 10h) - stuck for an unusually long time
- Node: 10.0.1.14 - we'd need to investigate this node
- Conditions: Ready: False but ContainersReady: True - the containers were running fine
- Events: <none> - no recent events explaining the termination
- Termination Grace Period: only 60 seconds, yet the pod had been terminating for 10 hours
The pod itself looked healthy except for being stuck in termination. The containers were running normally, and there were no error events.
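Since stuck terminations are often caused by finalizers rather than node problems, it's worth ruling that out early. A minimal check, again using this incident's pod name:
$ kubectl get pod llama-3-8b-instruct-785975c76d-psqzk -o jsonpath='{.metadata.finalizers}{"\n"}'
# Empty output means no finalizers are blocking deletion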
Checking Standard Resources
Persistent Volumes:
$ kubectl get pv,pvc
# Only found unrelated storage - no volume mount issues
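To tie the storage check back to this specific pod, a rough way to see whether it mounts any PVC at all (a sketch; no matches means storage isn't what's holding up deletion):
$ kubectl get pod llama-3-8b-instruct-785975c76d-psqzk -o yaml | grep -A1 persistentVolumeClaim
# No matches: the pod mounts no PersistentVolumeClaims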
Recent Events:
$ kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN TYPE REASON OBJECT MESSAGE
2m23s Normal EnsuringLoadBalancer service/deepseek-llm-7b-chat Ensuring load balancer
19m Warning FailedScheduling pod/vllm-v6d-stability-test-79nvk 0/19 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 15 node(s) didn't match pod affinity rules...
21m Warning FailedScheduling pod/llama-3-8b-instruct-785975c76d-52btv 0/19 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 3 node(s) had untolerated taint {vci.vke.volcengine.com/node-type: vci}, 7 Insufficient nvidia.com/gpu...
21m Normal Scheduled pod/llama-3-8b-instruct-785975c76d-52btv Successfully assigned default/llama-3-8b-instruct-785975c76d-52btv to 10.0.1.7
The events revealed something interesting:
- Multiple pods were failing to schedule due to the node.kubernetes.io/unreachable taint
- A new pod, llama-3-8b-instruct-785975c76d-52btv, had been created and scheduled to node 10.0.1.7
- In other words, the ReplicaSet had already created a replacement, yet our original pod was still stuck
This suggested that Kubernetes had detected a problem with a node and was rescheduling workloads around it, while the original pod remained stuck in Terminating.
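To cut through the namespace-wide noise, events can also be filtered to the stuck pod itself (a sketch using this incident's pod name; note that events are only retained for about an hour by default, so a 10-hour-old eviction may no longer appear here):
$ kubectl get events --field-selector involvedObject.name=llama-3-8b-instruct-785975c76d-psqzk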
Node Investigation
We examined the node where the pod was running:
$ kubectl describe node 10.0.1.14
This command revealed the root cause:
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Conditions:
Ready Unknown
MemoryPressure Unknown
DiskPressure Unknown
PIDPressure Unknown
All conditions showed Unknown status with the message: “Kubelet stopped posting node status”.
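The same taints can be read straight off the node object, which is handy when checking several nodes at once (a minimal sketch):
$ kubectl get node 10.0.1.14 -o jsonpath='{range .spec.taints[*]}{.key}={.effect}{"\n"}{end}'
node.kubernetes.io/unreachable=NoExecute
node.kubernetes.io/unreachable=NoSchedule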
Node Timeline and Kubernetes Response
Node conditions revealed a precise timeline:
- 01:14:35: All health checks passing (ChronydIsHealthy, KubeletIsHealthy, etc.)
- 01:15:14: Last successful heartbeat from the kubelet
- 01:18:01: Communication completely lost; all conditions became Unknown
The node went from completely healthy to completely unreachable in 3 minutes. This was a sudden, catastrophic failure.
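The timeline can be reconstructed from the timestamps on the node's conditions; a hedged one-liner to dump them:
$ kubectl get node 10.0.1.14 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastHeartbeatTime}{"\t"}{.lastTransitionTime}{"\n"}{end}'
# lastHeartbeatTime is the last report from the kubelet; lastTransitionTime is when the condition flipped to Unknown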
Kubernetes Response to Node Failure:
- Node Controller Detection: after missing heartbeats for the node monitor grace period (typically 40 seconds), the node controller marked the node status as Unknown
- Taint Application: Kubernetes automatically applied the node.kubernetes.io/unreachable:NoExecute taint
- Pod Eviction Trigger: once the pod's toleration for that taint expired, the taint manager began evicting it from the unreachable node (see the toleration sketch below)
- Graceful Termination Attempt: the pod was marked for deletion with its 60-second grace period; normally the kubelet would then send SIGTERM to the containers
- ReplicaSet Response: the ReplicaSet detected the missing pod and created a replacement on a healthy node
- Stuck Termination: since the node couldn't respond, the kubelet never confirmed the termination
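The eviction step above is governed by tolerations that the DefaultTolerationSeconds admission plugin injects into pods that don't set their own. The sketch below shows the usual defaults (assumed here, since we didn't capture this pod's spec), which is why eviction kicks in a few minutes after a node goes unreachable:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300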
This explains why our kubectl delete command hung - Kubernetes was waiting for the kubelet on the unreachable node to confirm the pod deletion.
Confirming the Diagnosis
A simple node status check confirmed our theory:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.0.1.14 NotReady <none> 92d v1.28.3-vke.16
# ... other nodes all showing Ready
Node 10.0.1.14 was the only NotReady node in the entire cluster.
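In a larger cluster, a quick filter on the STATUS column makes unhealthy nodes easier to spot (a simple sketch):
$ kubectl get nodes --no-headers | awk '$2 != "Ready"'
10.0.1.14   NotReady   <none>   92d   v1.28.3-vke.16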
Understanding the Root Cause
What happened:
- The underlying VM/instance hosting node 10.0.1.14 suffered a sudden infrastructure failure - network connectivity loss, hardware failure, or a hypervisor issue
- The kubelet could no longer communicate with the Kubernetes API server
- Kubernetes marked the node as unreachable and tried to gracefully terminate its pods
- Since the node couldn't respond, the pod got stuck in the Terminating state
- The ReplicaSet had already created a replacement pod on a healthy node
Why this wasn’t obvious initially:
- The pod description showed healthy containers
- No resource pressure indicators
- No obvious error messages
- The events log didn’t contain relevant information about this specific pod
Why kubectl delete Hangs Forever
When we tried to delete the pod manually, the command hung indefinitely:
$ kubectl delete pod llama-3-8b-instruct-785975c76d-psqzk
pod "llama-3-8b-instruct-785975c76d-psqzk" deleted
# ^ Command appears to succeed but hangs here forever
Why this happens:
- API Server Accepts Request: the API server receives and accepts the delete request
- Graceful Deletion Process: Kubernetes starts the graceful deletion workflow
- Waiting for Kubelet: the API server waits for the kubelet on node 10.0.1.14 to confirm the pod has actually stopped
- Unreachable Node: since the node is unreachable, the kubelet never responds
- Infinite Wait: the command hangs forever waiting for confirmation that will never come
This is Kubernetes’ safety mechanism - it won’t remove a pod from the API server until it’s certain the pod has actually stopped running on the node.
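As an aside, the hang is purely the CLI waiting for finalization; kubectl delete accepts --wait=false, which returns as soon as the request is accepted, although the pod object still sits in Terminating (a sketch, not what we ultimately used):
$ kubectl delete pod llama-3-8b-instruct-785975c76d-psqzk --wait=false
# Returns immediately, but the pod stays stuck until the node recovers or it is force deleted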
The Only Solution: Force Delete
Since the node couldn’t respond, the only way to clean up the stuck pod was to force delete it:
$ kubectl delete pod llama-3-8b-instruct-785975c76d-psqzk --force --grace-period=0
pod "llama-3-8b-instruct-785975c76d-psqzk" force deleted
The --force --grace-period=0 flags tell Kubernetes:
- Skip waiting for graceful termination
- Remove the pod from the API server immediately
- Don’t wait for kubelet confirmation
This is safe once you've confirmed the node is truly unreachable and isn't coming back. One caveat: force deletion only removes the pod object from the API server, so if the node was merely network-partitioned, the containers could still be running on it until the kubelet reconnects and cleans them up.
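If the node is being decommissioned anyway, an alternative cleanup is to delete the Node object itself; the pod garbage collector then removes the pods that were bound to it without per-pod force deletes (a sketch, only appropriate once you're certain the machine is never coming back):
$ kubectl delete node 10.0.1.14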
Conclusion
What initially appeared to be a simple stuck pod revealed a complex chain of infrastructure failure and Kubernetes safety mechanisms. The detective work showed us that:
- The pod wasn’t the problem - it was a symptom of node failure
- Normal deletion commands fail when nodes are unreachable due to Kubernetes’ safety mechanisms
- Force deletion is the only solution but should only be used after confirming the node is truly unreachable
- Kubernetes handled the situation correctly by automatically creating replacement pods on healthy nodes
The key insight was understanding why the kubectl delete command hung forever - Kubernetes was protecting us from accidentally deleting a pod that might still be running by waiting for confirmation from an unreachable node. Once we understood this mechanism, the force delete approach became the clear and safe solution.
The next time you encounter a pod stuck in the Terminating state, remember to investigate the underlying node health before attempting any fixes. The pod might just be the messenger telling you about a deeper infrastructure issue.