Troubleshooting¶

Common issues and solutions for NovaEdge deployments.

Quick Diagnostics¶

System Status¶

# Check all pods
kubectl get pods -n novaedge-system

# Check cluster status (operator mode)
kubectl get novaedgecluster -n novaedge-system

# Check CRDs
kubectl get crds | grep novaedge

# Using novactl
novactl status
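
When the namespace runs many pods, it can help to reduce the listing to the pods that need attention. The following is a minimal sketch: `unhealthy_pods` is a hypothetical helper, not part of NovaEdge or novactl, and it assumes the default `kubectl get pods` column layout.

```shell
# Hypothetical helper: not part of NovaEdge or novactl.
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints the pods that are
# not Running/Completed, or that are Running with unready containers.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if (($3 != "Running" && $3 != "Completed") ||
        ($3 == "Running" && ready[1] != ready[2]))
      print $1, $3, $2
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -n novaedge-system --no-headers | unhealthy_pods
```

Empty output means every pod is either Running with all containers ready or Completed.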

Component Logs¶

# Controller logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller

# Agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent

# Operator logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-operator

Installation Issues¶

CRDs Not Found¶

Symptom: error: the server doesn't have a resource type "proxygateways"

Solution:

# Reinstall CRDs
make install-crds

# Or manually
kubectl apply -f config/crd/

# Verify
kubectl get crds | grep novaedge.io

Controller Not Starting¶

Symptom: Controller pod in CrashLoopBackOff

Diagnosis:

# Check pod events
kubectl describe pod -n novaedge-system -l app.kubernetes.io/name=novaedge-controller

# Check logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller --previous

Common Causes:

Cause	Solution
RBAC issues	Check ServiceAccount and ClusterRole
Missing CRDs	Install CRDs first
Resource limits	Increase memory/CPU limits
Leader election	Check lease object in namespace
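
For the RBAC and leader-election rows above, `kubectl auth can-i` with impersonation can show whether the controller's ServiceAccount is allowed to read the resources it watches. This is a sketch under assumptions: `check_rbac` is a hypothetical helper, and the ServiceAccount name in the usage example is a guess that should be confirmed with `kubectl get serviceaccounts -n novaedge-system`.

```shell
# Hypothetical helper: not part of NovaEdge or novactl.
# Prints whether ServiceAccount $2 in namespace $1 may get/list/watch resource $3.
check_rbac() {
  ns=$1; sa=$2; resource=$3
  for verb in get list watch; do
    # `kubectl auth can-i` prints yes/no and exits non-zero on "no".
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:${ns}:${sa}" -n "$ns" 2>/dev/null) || true
    echo "${verb} ${resource}: ${answer:-unknown}"
  done
}

# Usage (requires cluster access and permission to impersonate):
#   check_rbac novaedge-system novaedge-controller proxygateways
#   kubectl get lease -n novaedge-system   # shows the current leader-election holder
```

Any `no` in the output points at a missing rule in the ClusterRole or a missing binding.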

Agents Not Connecting¶

Symptom: Agents not receiving configuration

Diagnosis:

# Check agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep -i "connect\|error"

# Test connectivity
kubectl exec -n novaedge-system <agent-pod> -- nc -zv novaedge-controller 9090

# Check controller service
kubectl get svc -n novaedge-system novaedge-controller

Solutions:

# Verify controller address
kubectl get deployment novaedge-agent -n novaedge-system -o yaml | grep CONTROLLER_ADDR

# Check network policies
kubectl get networkpolicies -n novaedge-system

Routing Issues¶

Route Not Matching¶

Symptom: 404 for requests that should match

Diagnosis:

# Check route configuration
kubectl get proxyroute <route-name> -o yaml

# Check gateway listeners
kubectl get proxygateway <gateway-name> -o yaml

# Test with curl
curl -v -H "Host: example.com" http://<vip>/path

Common Causes:

Cause	Check
Hostname mismatch	Verify `hostnames` in route
Path mismatch	Check `path.type` (Exact vs PathPrefix)
Gateway not listening	Verify listener port and protocol
Route not attached	Check `parentRefs` in route
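
The difference between `Exact` and `PathPrefix` is a frequent source of unexpected 404s. The sketch below illustrates the distinction, assuming Gateway API-style semantics in which a prefix matches whole path segments; `path_matches` is a hypothetical function for illustration, and NovaEdge's actual matching rules should be confirmed against its route reference.

```shell
# Illustration only: assumes Gateway API-style matching, not verified against NovaEdge.
# Returns 0 when request path $3 matches pattern $2 under match type $1.
path_matches() {
  match_type=$1; pattern=$2; path=$3
  case "$match_type" in
    Exact)
      [ "$path" = "$pattern" ] ;;
    PathPrefix)
      prefix=${pattern%/}            # treat "/api/" and "/api" the same
      case "$path" in
        "$prefix"|"$prefix"/*) return 0 ;;   # same path, or a deeper segment
        *) return 1 ;;                       # e.g. /apiv2 does not match /api
      esac ;;
    *)
      return 2 ;;                    # unknown match type
  esac
}

# Usage:
#   path_matches Exact /api /api/users || echo "Exact does not match sub-paths"
#   path_matches PathPrefix /api /api/users && echo "PathPrefix matches sub-paths"
```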

Backend Not Receiving Traffic¶

Symptom: Route matches but backend returns errors

Diagnosis:

# Check backend status
kubectl get proxybackend <backend-name> -o yaml

# Check endpoints
kubectl get endpoints <service-name>

# Test backend directly
kubectl exec -n novaedge-system <agent-pod> -- curl -v http://<backend-ip>:<port>/health

VIP Issues¶

VIP Not Reachable¶

Symptom: Cannot reach VIP address

Diagnosis:

# Check VIP status
kubectl get proxyvip -o yaml

# Check agent logs for VIP
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep -i vip

# Verify VIP is bound (on active node)
ip addr show | grep <vip-address>

Solutions by Mode:

L2 Mode¶

# Check ARP
arp -a | grep <vip-address>

# Send test ARP
arping -I eth0 <vip-address>

# Verify interface
kubectl get proxyvip -o yaml | grep interface

BGP Mode¶

# Check BGP sessions
kubectl exec -n novaedge-system <agent-pod> -- novactl bgp status

# Verify peer config
kubectl get proxyvip -o yaml | grep -A10 bgp

# Check on the upstream router (router CLI commands, not shell; syntax varies by vendor)
show ip bgp summary
show ip route <vip-address>

VIP Failover Not Working¶

Symptom: VIP stuck on failed node

Diagnosis:

# Check controller logs for failover
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller | grep -i failover

# Check VIP assignment
kubectl get proxyvip -o yaml | grep currentNode

# Verify health checks
kubectl describe proxyvip <vip-name>

TLS Issues¶

Certificate Not Found¶

Symptom: TLS handshake fails

Diagnosis:

# Check secret exists
kubectl get secret <tls-secret-name>

# Verify certificate data
kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text

# Check gateway TLS config
kubectl get proxygateway -o yaml | grep -A10 tls

Certificate Expired¶

Symptom: TLS errors in browser/client

Solution:

# Check expiry
kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Renew with cert-manager
kubectl delete certificate <cert-name>
kubectl apply -f certificate.yaml
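
To catch certificates before they expire, `openssl x509 -checkend` can test whether a certificate is still valid a given number of seconds from now. A minimal sketch follows; `cert_valid_for` is a hypothetical wrapper and the 14-day window in the usage example is an arbitrary choice.

```shell
# Hypothetical helper: not part of NovaEdge or novactl.
# Returns 0 if the PEM certificate in file $1 is still valid $2 seconds from now.
# `openssl x509 -checkend N` exits non-zero when the certificate expires within N seconds.
cert_valid_for() {
  openssl x509 -in "$1" -noout -checkend "$2" >/dev/null
}

# Usage (requires cluster access):
#   kubectl get secret <tls-secret-name> -o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/tls.crt
#   cert_valid_for /tmp/tls.crt $((14 * 24 * 3600)) || echo "certificate expires within 14 days"
```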

SNI Not Matching¶

Symptom: Wrong certificate served

Solution:

- Verify that the hostname in the certificate matches the request
- Check the listener `hostnames` configuration
- Ensure the certificates are listed in the correct order in `certificateRefs`

Policy Issues¶

Rate Limiting Not Working¶

Symptom: Rate limits not enforced

Diagnosis:

# Check policy attached
kubectl get proxypolicy -o yaml | grep targetRef

# Check policy config
kubectl describe proxypolicy <policy-name>

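# Test rate limit (print only status codes; throttled requests typically return 429)
for i in {1..150}; do curl -s -o /dev/null -w "%{http_code}\n" http://<vip>/api; done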
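
To see at a glance whether any requests in a burst were throttled, the status codes can be tallied. The sketch below uses a hypothetical helper, `tally_status_codes`; it assumes the limiter answers throttled requests with a distinct status code (commonly 429, but check what the policy returns).

```shell
# Hypothetical helper: not part of NovaEdge or novactl.
# Counts HTTP status codes read from stdin, one per line.
tally_status_codes() {
  awk 'NF { count[$1]++ } END { for (code in count) print code, count[code] }' | sort
}

# Usage (replace <vip> with the VIP address):
#   for i in $(seq 1 150); do
#     curl -s -o /dev/null -w '%{http_code}\n' "http://<vip>/api"
#   done | tally_status_codes
```

If every request comes back with a 2xx code, the policy is probably not attached to the route or gateway that served the traffic.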

JWT Validation Failing¶

Symptom: 401 for valid tokens

Diagnosis:

# Verify JWKS endpoint
curl -v <jwks-uri>

# Check issuer match
kubectl get proxypolicy -o yaml | grep issuer

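# Decode token payload (JWTs use unpadded base64url, so convert and pad before decoding)
payload=$(echo "<token>" | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
echo "$payload" | base64 -d | jq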

Common Causes:

Cause	Solution
Wrong issuer	Match `iss` claim to policy
Wrong audience	Match `aud` claim to policy
JWKS unreachable	Check network from agent
Clock skew	Sync NTP on nodes
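
For the clock-skew row, comparing the token's `exp` claim with the node's clock shows how much validity the token has left. This is a sketch with hypothetical helpers (`jwt_payload`, `jwt_seconds_left`); it does not verify the signature and assumes `exp` is a plain integer claim in the payload.

```shell
# Hypothetical helpers: not part of NovaEdge or novactl. No signature verification.

# Prints the decoded payload of a compact JWT (header.payload.signature).
jwt_payload() {
  p=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')   # base64url -> base64
  while [ $(( ${#p} % 4 )) -ne 0 ]; do p="${p}="; done  # restore stripped padding
  printf '%s' "$p" | base64 -d
}

# Prints exp minus the current time in seconds (negative means already expired).
# $1: token, $2: current epoch seconds (defaults to this machine's clock).
jwt_seconds_left() {
  now=${2:-$(date +%s)}
  exp=$(jwt_payload "$1" | grep -o '"exp"[[:space:]]*:[[:space:]]*[0-9]*' | tr -cd '0-9') || true
  [ -n "$exp" ] || { echo "no exp claim found" >&2; return 1; }
  echo $(( exp - now ))
}

# Usage: run on the node hosting the agent so that node's clock is the one compared.
#   jwt_seconds_left "<token>"
```

A small negative or unexpectedly small result for a freshly issued token suggests the node clock is ahead of the issuer's.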

Performance Issues¶

High Latency¶

Diagnosis:

# Check metrics (run inside the pod, or port-forward the metrics port to localhost first)
curl http://localhost:9090/metrics | grep request_duration

# Check backend health
kubectl get proxybackend -o yaml | grep -A10 healthCheck

# Check connection pool
curl http://localhost:9090/metrics | grep connections

Solutions:

- Increase connection pool size
- Use latency-aware LB (EWMA)
- Enable keep-alive
- Check backend capacity
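
As background for the EWMA option, an exponentially weighted moving average gives recent latency samples more weight than older ones, so a backend that suddenly slows down loses traffic quickly. The sketch below only illustrates the arithmetic; `ewma` and the smoothing factor `alpha` are hypothetical and do not correspond to a documented NovaEdge setting.

```shell
# Illustration only: not NovaEdge's implementation.
# Reads one latency sample (ms) per line on stdin and prints the final EWMA.
# $1: alpha, the weight given to each new sample (0 < alpha <= 1).
ewma() {
  awk -v alpha="$1" '
    NR == 1 { avg = $1; next }                      # seed with the first sample
    { avg = alpha * $1 + (1 - alpha) * avg }        # blend new sample into average
    END { if (NR > 0) printf "%.1f\n", avg }'
}

# Usage:
#   printf '10\n10\n110\n' | ewma 0.2
```

A larger `alpha` reacts faster to latency spikes but is also noisier.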

Connection Timeouts¶

Diagnosis:

# Check timeout configuration
kubectl get proxybackend -o yaml | grep timeout

# Check circuit breaker
curl http://localhost:9090/metrics | grep circuit_breaker

Solutions:

# Increase timeouts
spec:
  timeout:
    connect: 10s
    request: 60s
    idle: 120s

Memory/CPU Issues¶

Diagnosis:

# Check resource usage
kubectl top pods -n novaedge-system

# Check OOM kills
kubectl describe pod <pod-name> -n novaedge-system | grep -i oom

Solutions:

# Increase resources
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

Health Check Issues¶

All Backends Unhealthy¶

Symptom: No healthy endpoints

Diagnosis:

# Check health check config
kubectl get proxybackend -o yaml | grep -A10 healthCheck

# Test health endpoint manually
kubectl exec -n novaedge-system <agent-pod> -- curl -v http://<backend>:<port>/health

# Check agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep health

Common Causes:

Cause	Solution
Wrong path	Verify `healthCheck.path`
Wrong port	Check service port mapping
Timeout too short	Increase `healthCheck.timeout`
Endpoint unreachable	Check network policies

Multi-Cluster Issues¶

Remote Cluster Not Connecting¶

Diagnosis:

# Check remote cluster status
kubectl get novaedgeremotecluster -n novaedge-system

# Check agent logs on spoke
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep connect

# Verify mTLS certificates
kubectl get secret novaedge-agent-cert -n novaedge-system

Solutions:

- Verify that the controller endpoint is reachable
- Check firewall rules for the gRPC port
- Validate mTLS certificates
- Check certificate expiry

Getting Help¶

Collect Debug Information¶

# Generate debug bundle
novactl debug bundle -o novaedge-debug.tar.gz

# Or manually
kubectl get all -n novaedge-system -o yaml > novaedge-resources.yaml
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-controller > controller.log
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent > agent.log

Check Documentation¶

Community Support¶

GitHub Issues: Report bugs and request features
Discussions: Ask questions and share solutions