I'm a Korean DevOps engineer preparing for international opportunities. Since English isn’t my first language, I’ve been practicing both my language skills and technical knowledge at the same time.
As part of my daily English practice, I asked ChatGPT (acting as my senior DevOps engineer) to give me one interview-style question each day. I try to answer in English based on what I know, and then improve my explanation through feedback and correction.
Here’s today’s question and my answer:
Question
Let’s say your production application is running on EKS, and suddenly, the application becomes unresponsive. No pods are restarting, and there are no obvious error logs. How would you start troubleshooting this issue?
My First Answer (Raw)
If there are no error logs and nothing special about metrics of resources, then there's a hig chance that traffic couldn't be routed to the application, i guess.
in such cases, it would be appropriate to check the reason from client to application.
First, you should check whether the domain resolves to the right ip using tools like dig or nslookup. it should point a load balancer, because it's the only way to expose application on eks.
if the domain points right loadbalancer. then you should check if the traffic can arrive to it. there's a chance that the traffic is blocked by security group or acl.
if you're sure that traffic can arrive to the loadbalancer, then you should check if the traffic can be forwarded from loadbalancer to a node. the security group of nodes should allow one of loadbalancer.
then you should check if the kube proxy can forward the traffic to pods. if you are using ingress, then the traffic is first routed to ingress controller, then is routed again to the appropriate service.
if there's no problem until here, then it must be pod's problem. since there are no problems of logs and metrics, it would be readiness problem, not liveness.
Refined Answer (with feedback)
If there are no error logs and nothing unusual in the resource metrics, there's a high chance that the traffic is not reaching the application.
In such cases, it's appropriate to troubleshoot from the outside in, from the client to the application.
First, I would check whether the domain resolves to the load balancer's current IP addresses using tools like dig or nslookup. In EKS, the domain should typically point to a load balancer (usually through a CNAME or a Route 53 alias record), since that's the standard way to expose services externally.
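The DNS check can also be scripted. This is a minimal sketch using only Python's standard library; the function names `resolve_ips` and `points_at` are my own, and the hostnames in the usage note are placeholders, not real endpoints.

```python
import socket


def resolve_ips(hostname: str) -> set[str]:
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def points_at(hostname: str, target: str) -> bool:
    """True if hostname and target currently resolve to at least one common IP."""
    return bool(resolve_ips(hostname) & resolve_ips(target))
```

For example, `points_at("app.example.com", "<load balancer DNS name>")` compares the application's domain with the load balancer's own DNS name. Load balancer IPs can change over time, so both names are resolved at the same moment instead of comparing against a saved IP.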
If the domain is pointing to the correct load balancer, then I would verify whether traffic is actually reaching it. It's possible that the traffic is blocked by a security group or a network ACL.
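One rough way to tell "blocked" from "reachable" is a plain TCP connect with a timeout. The sketch below assumes the load balancer listens on a TCP port such as 443; `probe_tcp` is a hypothetical helper name. A timeout often (not always) means packets are dropped somewhere on the path, for example by a security group or network ACL, while a refusal means the host answered but nothing is listening on that port.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but nothing listens here.
        return "refused"
    except OSError as exc:
        # DNS failure, unreachable network, and similar errors.
        return f"error: {exc}"
```

Running `probe_tcp("app.example.com", 443)` (placeholder hostname) from outside the VPC and again from inside it can show on which side the traffic stops.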
If I'm sure that the traffic reaches the load balancer, the next step is to check whether the load balancer can forward the traffic to the nodes. The nodes’ security groups must allow traffic from the load balancer's security group.
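The security group rule can be checked from the JSON that `aws ec2 describe-security-groups` prints. This sketch assumes one entry of the `SecurityGroups` list has already been loaded into a dict; `allows_from_group` is a hypothetical helper, and the group IDs in the usage note are made up. It only looks at rules whose source is another security group, not at CIDR-based rules.

```python
def allows_from_group(node_sg: dict, lb_sg_id: str, port: int) -> bool:
    """True if node_sg has an inbound rule for TCP `port` whose source is lb_sg_id."""
    for rule in node_sg.get("IpPermissions", []):
        proto = rule.get("IpProtocol")
        if proto == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif proto == "tcp":
            port_ok = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            continue  # UDP, ICMP, etc. are not relevant for this check
        sources = {pair.get("GroupId") for pair in rule.get("UserIdGroupPairs", [])}
        if port_ok and lb_sg_id in sources:
            return True
    return False
```

For instance, `allows_from_group(node_sg, "sg-lb", 31000)` asks whether the hypothetical load balancer group `sg-lb` may reach NodePort 31000 on the nodes.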
After that, I would check whether the traffic is successfully routed from the node to the pods.
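If I'm using a Service of type LoadBalancer, the load balancer usually sends traffic to a NodePort on the node, and the rules that kube-proxy programs there (iptables or IPVS) forward it to one of the pods. If I'm using Ingress, the traffic first goes to the Ingress controller, which picks the matching Service and forwards the request to one of its pods.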
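Whichever path the traffic takes, it can only reach pods that are listed as endpoints of the Service. The sketch below summarizes the output of `kubectl get endpoints <service> -o json` after it has been parsed into a dict; `endpoint_summary` is a hypothetical helper name.

```python
def endpoint_summary(endpoints: dict) -> dict:
    """Split a parsed Endpoints object into ready and not-ready pod IPs."""
    ready, not_ready = [], []
    # "subsets" is missing or null when the Service selector matches no pods.
    for subset in endpoints.get("subsets") or []:
        ready += [addr["ip"] for addr in subset.get("addresses") or []]
        not_ready += [addr["ip"] for addr in subset.get("notReadyAddresses") or []]
    return {"ready": ready, "not_ready": not_ready}
```

An empty `ready` list means the Service currently has nowhere to send traffic, even if every pod shows as Running.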
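If everything seems fine up to this point, then the issue may be at the pod level. Since there are no error logs and resource usage is normal, it’s more likely a readiness issue than a liveness one. A failing liveness probe would make the kubelet restart the container, and no pods are restarting. A failing readiness probe only removes the pod from the Service endpoints, so the pod keeps running without restarts while receiving no traffic.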
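The readiness-versus-liveness reasoning can be written down as a small check over the output of `kubectl get pod <name> -o json`, parsed into a dict. This is only a sketch: `diagnose_pod` and its return strings are my own naming, and it reads just the `phase`, `conditions`, and `containerStatuses` fields of the pod status.

```python
def diagnose_pod(pod: dict) -> str:
    """Classify a parsed Pod object by phase, restart count, and Ready condition."""
    status = pod.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    restarts = sum(
        cs.get("restartCount", 0) for cs in status.get("containerStatuses", [])
    )
    if status.get("phase") != "Running":
        return "not running"
    if restarts > 0:
        # Restarts point to crashes or a failing liveness probe.
        return "restarting (check liveness probe or crashes)"
    if conditions.get("Ready") != "True":
        # Running, never restarted, but removed from Service endpoints.
        return "running but not ready (check readiness probe)"
    return "ready"
```

The scenario in the question (no restarts, no error logs, no response) matches the "running but not ready" branch.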