#
Nginx-Ingress Timeout Failure
Incident Date: 2026/02/12
- Issue reported around 1:30 PM when nginx-ingress stopped working in the EKS cluster.
- All services, deployments, and apps were functioning correctly when accessed via port-forwarding.
- Nginx ingress controller logs showed traffic being received.
- DNS nameservers and CNAME records for all domains/subdomains were correct.
- Uptime Robot reported Connection Timeout Errors across all monitored domains/subdomains.
- Manual domain access (e.g., www.cbioportal.org) showed significant delays - some domains eventually resolved after a minute or more, while most timed out.
- No error logs were found at the load balancer, ingress, or application level.
- The issue manifested as a significant delay between DNS resolution to the load balancer and the request being returned to the user.
#
Remediation
#
Attempted Solutions
- Debugging nginx-ingress: No errors or anomalies found in logs.
- Port-forwarding verification: All services/deployments/apps confirmed working.
- DNS verification: All CNAME records confirmed correct.
- Purging and recreating nginx-ingress deployment including load balancer: Did not resolve the issue.
#
Final Solution
- Deployed Traefik as an alternative ingress controller.
- Traffic now served through Traefik proxy layer instead of nginx-ingress.
- All services returned to normal operation.
#
Root Cause
The exact cause of the nginx-ingress failure remains unknown. The issue was characterized by:
- No visible errors in logs at any level (load balancer/ingress/application).
- Significant latency between DNS resolution and request completion.
- nginx-ingress helm chart appeared to be the source, but specific failure point was not identified.
#
Future Prevention
- Currently investigating disaster recovery options and backup strategies.
- Considering implementing multiple ingress controller options for faster failover.