Troubleshooting¶
Common issues and solutions for NovaEdge deployments.
Quick Diagnostics¶
System Status¶
# Check all pods
kubectl get pods -n novaedge-system
# Check cluster status (operator mode)
kubectl get novaedgecluster -n novaedge-system
# Check CRDs
kubectl get crds | grep novaedge
# Using novactl
novactl status
Component Logs¶
# Controller logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller
# Agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent
# Operator logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-operator
Installation Issues¶
CRDs Not Found¶
Symptom: error: the server doesn't have a resource type "proxygateways"
Solution:
# Reinstall CRDs
make install-crds
# Or manually
kubectl apply -f config/crd/
# Verify
kubectl get crds | grep novaedge.io
Controller Not Starting¶
Symptom: Controller pod in CrashLoopBackOff
Diagnosis:
# Check pod events
kubectl describe pod -n novaedge-system -l app.kubernetes.io/name=novaedge-controller
# Check logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller --previous
Common Causes:
| Cause | Solution |
|---|---|
| RBAC issues | Check ServiceAccount and ClusterRole |
| Missing CRDs | Install CRDs first |
| Resource limits | Increase memory/CPU limits |
| Leader election | Check lease object in namespace |
Agents Not Connecting¶
Symptom: Agents not receiving configuration
Diagnosis:
# Check agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep -i "connect\|error"
# Test connectivity
kubectl exec -n novaedge-system <agent-pod> -- nc -zv novaedge-controller 9090
# Check controller service
kubectl get svc -n novaedge-system novaedge-controller
Solutions:
# Verify controller address
kubectl get deployment novaedge-agent -n novaedge-system -o yaml | grep CONTROLLER_ADDR
# Check network policies
kubectl get networkpolicies -n novaedge-system
Routing Issues¶
Route Not Matching¶
Symptom: 404 for requests that should match
Diagnosis:
# Check route configuration
kubectl get proxyroute <route-name> -o yaml
# Check gateway listeners
kubectl get proxygateway <gateway-name> -o yaml
# Test with curl
curl -v -H "Host: example.com" http://<vip>/path
Common Causes:
| Cause | Check |
|---|---|
| Hostname mismatch | Verify hostnames in route |
| Path mismatch | Check path.type (Exact vs PathPrefix) |
| Gateway not listening | Verify listener port and protocol |
| Route not attached | Check parentRefs in route |
Backend Not Receiving Traffic¶
Symptom: Route matches but backend returns errors
Diagnosis:
# Check backend status
kubectl get proxybackend <backend-name> -o yaml
# Check endpoints
kubectl get endpoints <service-name>
# Test backend directly
kubectl exec -n novaedge-system <agent-pod> -- curl -v http://<backend-ip>:<port>/health
VIP Issues¶
VIP Not Reachable¶
Symptom: Cannot reach VIP address
Diagnosis:
# Check VIP status
kubectl get proxyvip -o yaml
# Check agent logs for VIP
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep -i vip
# Verify VIP is bound (on active node)
ip addr show | grep <vip-address>
Solutions by Mode:
L2 Mode¶
# Check ARP
arp -a | grep <vip-address>
# Send test ARP
arping -I eth0 <vip-address>
# Verify interface
kubectl get proxyvip -o yaml | grep interface
BGP Mode¶
# Check BGP sessions
kubectl exec -n novaedge-system <agent-pod> -- novactl bgp status
# Verify peer config
kubectl get proxyvip -o yaml | grep -A10 bgp
# Check router
show ip bgp summary
show ip route <vip-address>
VIP Failover Not Working¶
Symptom: VIP stuck on failed node
Diagnosis:
# Check controller logs for failover
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller | grep -i failover
# Check VIP assignment
kubectl get proxyvip -o yaml | grep currentNode
# Verify health checks
kubectl describe proxyvip <vip-name>
TLS Issues¶
Certificate Not Found¶
Symptom: TLS handshake fails
Diagnosis:
# Check secret exists
kubectl get secret <tls-secret-name>
# Verify certificate data
kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
# Check gateway TLS config
kubectl get proxygateway -o yaml | grep -A10 tls
Certificate Expired¶
Symptom: TLS errors in browser/client
Solution:
# Check expiry
kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# Renew with cert-manager
kubectl delete certificate <cert-name>
kubectl apply -f certificate.yaml
SNI Not Matching¶
Symptom: Wrong certificate served
Solution:
- Verify hostname in certificate matches request
- Check listener hostnames configuration
- Ensure correct certificate order in certificateRefs
Policy Issues¶
Rate Limiting Not Working¶
Symptom: Rate limits not enforced
Diagnosis:
# Check policy attached
kubectl get proxypolicy -o yaml | grep targetRef
# Check policy config
kubectl describe proxypolicy <policy-name>
# Test rate limit
for i in {1..150}; do curl -s http://<vip>/api; done
JWT Validation Failing¶
Symptom: 401 for valid tokens
Diagnosis:
# Verify JWKS endpoint
curl -v <jwks-uri>
# Check issuer match
kubectl get proxypolicy -o yaml | grep issuer
# Decode token
echo <token> | cut -d. -f2 | base64 -d | jq
Common Causes:
| Cause | Solution |
|---|---|
| Wrong issuer | Match iss claim to policy |
| Wrong audience | Match aud claim to policy |
| JWKS unreachable | Check network from agent |
| Clock skew | Sync NTP on nodes |
Performance Issues¶
High Latency¶
Diagnosis:
# Check metrics
curl http://localhost:9090/metrics | grep request_duration
# Check backend health
kubectl get proxybackend -o yaml | grep -A10 healthCheck
# Check connection pool
curl http://localhost:9090/metrics | grep connections
Solutions: - Increase connection pool size - Use latency-aware LB (EWMA) - Enable keep-alive - Check backend capacity
Connection Timeouts¶
Diagnosis:
# Check timeout configuration
kubectl get proxybackend -o yaml | grep timeout
# Check circuit breaker
curl http://localhost:9090/metrics | grep circuit_breaker
Solutions:
Memory/CPU Issues¶
Diagnosis:
# Check resource usage
kubectl top pods -n novaedge-system
# Check OOM kills
kubectl describe pod <pod-name> | grep -i oom
Solutions:
Health Check Issues¶
All Backends Unhealthy¶
Symptom: No healthy endpoints
Diagnosis:
# Check health check config
kubectl get proxybackend -o yaml | grep -A10 healthCheck
# Test health endpoint manually
kubectl exec -n novaedge-system <agent-pod> -- curl -v http://<backend>:<port>/health
# Check agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep health
Common Causes:
| Cause | Solution |
|---|---|
| Wrong path | Verify healthCheck.path |
| Wrong port | Check service port mapping |
| Timeout too short | Increase healthCheck.timeout |
| Endpoint unreachable | Check network policies |
Multi-Cluster Issues¶
Remote Cluster Not Connecting¶
Diagnosis:
# Check remote cluster status
kubectl get novaedgeremotecluster -n novaedge-system
# Check agent logs on spoke
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep connect
# Verify mTLS certificates
kubectl get secret novaedge-agent-cert -n novaedge-system
Solutions: - Verify controller endpoint is reachable - Check firewall rules for gRPC port - Validate mTLS certificates - Check certificate expiry
Getting Help¶
Collect Debug Information¶
# Generate debug bundle
novactl debug bundle -o novaedge-debug.tar.gz
# Or manually
kubectl get all -n novaedge-system -o yaml > novaedge-resources.yaml
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller > controller.log
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent > agent.log
Check Documentation¶
Community Support¶
- GitHub Issues: Report bugs and request features
- Discussions: Ask questions and share solutions