Health Checks
Configure active and passive health checking for backend services.
Overview
NovaEdge supports two types of health checking:
flowchart TB
subgraph Active["Active Health Checks"]
A1["Periodic probes"]
A2["Configurable interval"]
A3["HTTP/TCP/gRPC"]
end
subgraph Passive["Passive Health Checks"]
P1["Monitor responses"]
P2["Detect failures"]
P3["Circuit breaking"]
end
Active --> Backend((Backend))
Passive --> Backend
Active Health Checks
Periodically probe backends to determine health.
HTTP Health Check
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: api-backend
spec:
  serviceRef:
    name: api-service
    port: 8080
  lbPolicy: RoundRobin
  healthCheck:
    protocol: HTTP
    interval: 10s
    timeout: 5s
    healthyThreshold: 2
    unhealthyThreshold: 3
    httpHealthCheck:
      path: /health
      expectedStatuses:
        - 200
        - 204
      headers:
        - name: X-Health-Check
          value: novaedge
TCP Health Check
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: redis-backend
spec:
  serviceRef:
    name: redis
    port: 6379
  healthCheck:
    protocol: TCP
    interval: 5s
    timeout: 2s
    healthyThreshold: 1
    unhealthyThreshold: 3
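To make the idea of a TCP probe concrete, here is a minimal Python sketch. It assumes a TCP check passes when a connection can be opened within the timeout; `tcp_probe` is a hypothetical helper for illustration and is not part of NovaEdge.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it only illustrates what a TCP health probe observes.
    """
    try:
        # create_connection raises OSError on refusal, unreachable host, or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the configuration above, a single probe would correspond roughly to `tcp_probe("<endpoint-ip>", 6379, timeout=2.0)`.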
gRPC Health Check
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: grpc-backend
spec:
  serviceRef:
    name: grpc-service
    port: 9090
  healthCheck:
    protocol: GRPC
    interval: 10s
    timeout: 5s
    healthyThreshold: 2
    unhealthyThreshold: 3
    grpcHealthCheck:
      serviceName: "my.service.Health"
Health Check Options
| Field | Default | Description |
|---|---|---|
| `protocol` | HTTP | Check protocol (HTTP, TCP, GRPC) |
| `interval` | 10s | Time between checks |
| `timeout` | 5s | Check timeout |
| `healthyThreshold` | 2 | Consecutive successes to mark healthy |
| `unhealthyThreshold` | 3 | Consecutive failures to mark unhealthy |
| `initialDelay` | 0s | Delay before first check |
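These options together bound how quickly a failed endpoint is noticed. The sketch below gives a rough worst-case estimate. It assumes probes start at a fixed `interval` and that a failing probe takes up to `timeout` to report; that scheduling model is an assumption, not documented NovaEdge behavior, and `worst_case_detection_seconds` is a hypothetical helper.

```python
def worst_case_detection_seconds(interval: float, timeout: float,
                                 unhealthy_threshold: int) -> float:
    """Rough upper bound on the time to mark a failed endpoint unhealthy.

    Assumed model: the endpoint fails just after a successful probe, so up to
    one interval passes before the first failing probe starts. The remaining
    (unhealthy_threshold - 1) probes follow one interval apart, and the last
    one takes up to `timeout` to fail.
    """
    return unhealthy_threshold * interval + timeout


# With the defaults from the table: 3 * 10s + 5s
print(worst_case_detection_seconds(10.0, 5.0, 3))  # 35.0
```

Under the same assumptions, lowering `interval` or `unhealthyThreshold` shortens detection time at the cost of more probe traffic or more false positives.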
HTTP Check Options
| Field | Default | Description |
|---|---|---|
| `path` | /health | Health check path |
| `host` | - | Override Host header |
| `expectedStatuses` | [200] | Expected status codes |
| `headers` | [] | Additional headers |
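The following Python sketch shows how these HTTP options could map onto a single probe. It is an illustration only: `http_probe` and its parameters are hypothetical names, and the sketch follows redirects automatically, which a real checker may not do.

```python
import urllib.error
import urllib.request

# Connect directly to the endpoint, ignoring any proxy settings in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_probe(url, expected_statuses=(200,), headers=None, host=None,
               timeout=5.0) -> bool:
    """Send one GET request and report whether the status code is expected.

    Hypothetical helper mirroring path/host/expectedStatuses/headers.
    """
    req = urllib.request.Request(url, headers=dict(headers or {}), method="GET")
    if host:
        # Equivalent of the `host` option: override the Host header.
        req.add_header("Host", host)
    try:
        with _opener.open(req, timeout=timeout) as resp:
            return resp.status in expected_statuses
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the code may still be listed as expected.
        return err.code in expected_statuses
    except OSError:
        # Connection failures and timeouts count as a failed probe.
        return False
```

A probe equivalent to the HTTP example earlier on this page would pass `expected_statuses=(200, 204)` and `headers={"X-Health-Check": "novaedge"}`.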
Passive Health Checks
Monitor actual traffic for failures.
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: api-backend
spec:
  serviceRef:
    name: api-service
    port: 8080
  passiveHealthCheck:
    enabled: true
    consecutiveErrors: 5
    interval: 30s
    baseEjectionTime: 30s
    maxEjectionPercent: 50
    consecutiveGatewayErrors: 3
Passive Check Options
| Field | Default | Description |
|---|---|---|
| `enabled` | true | Enable passive checks |
| `consecutiveErrors` | 5 | Errors before ejection |
| `interval` | 30s | Analysis interval |
| `baseEjectionTime` | 30s | Initial ejection duration |
| `maxEjectionPercent` | 50 | Max % of endpoints to eject |
| `consecutiveGatewayErrors` | 0 | 5xx errors before ejection |
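The interaction between `consecutiveErrors` and `maxEjectionPercent` can be sketched as follows. This is a simplified model, not NovaEdge's implementation: `PassiveTracker` is a hypothetical class, the percentage is rounded down, and re-admission after `baseEjectionTime` is left out.

```python
class PassiveTracker:
    """Hypothetical model of passive ejection based on consecutive errors."""

    def __init__(self, endpoints, consecutive_errors=5, max_ejection_percent=50):
        self.errors = {ep: 0 for ep in endpoints}
        self.ejected = set()
        self.consecutive_errors = consecutive_errors
        # Assumption: the cap is rounded down to a whole number of endpoints.
        self.max_ejected = len(self.errors) * max_ejection_percent // 100

    def record(self, endpoint, ok: bool) -> bool:
        """Record one response; return True if this call ejects the endpoint."""
        if ok:
            self.errors[endpoint] = 0  # a success resets the streak
            return False
        self.errors[endpoint] += 1
        if (self.errors[endpoint] >= self.consecutive_errors
                and endpoint not in self.ejected
                and len(self.ejected) < self.max_ejected):
            self.ejected.add(endpoint)
            return True
        return False
```

With two endpoints and `maxEjectionPercent: 50`, this model ejects at most one endpoint, so some capacity always remains even if both are failing.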
Circuit Breaker
Prevent cascading failures by temporarily removing failing backends.
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: api-backend
spec:
  serviceRef:
    name: api-service
    port: 8080
  circuitBreaker:
    consecutiveErrors: 5
    interval: 30s
    baseEjectionTime: 30s
    maxEjectionPercent: 50
    splitExternalLocalOriginErrors: true
Circuit Breaker States
stateDiagram-v2
[*] --> Closed: Start
Closed --> Open: Error threshold reached
Open --> HalfOpen: After ejection time
HalfOpen --> Closed: Success
HalfOpen --> Open: Failure
| State | Description |
|---|---|
| Closed | Normal operation, traffic flows |
| Open | Backend ejected, no traffic |
| Half-Open | Testing if backend recovered |
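The three states and their transitions can be sketched as a small state machine. This is a minimal illustration under stated assumptions: `CircuitBreaker` is a hypothetical class, time is passed in explicitly as `now` (seconds), and a single success in Half-Open is enough to close the breaker.

```python
import enum


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Hypothetical model of the Closed / Open / Half-Open transitions."""

    def __init__(self, consecutive_errors=5, base_ejection_time=30.0):
        self.consecutive_errors = consecutive_errors
        self.base_ejection_time = base_ejection_time
        self.state = State.CLOSED
        self.errors = 0
        self.opened_at = None

    def allow(self, now: float) -> bool:
        """Return True if traffic may be sent to the backend at time `now`."""
        if (self.state is State.OPEN
                and now - self.opened_at >= self.base_ejection_time):
            self.state = State.HALF_OPEN  # ejection time elapsed: test recovery
        return self.state is not State.OPEN

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of a request that was allowed through."""
        if self.state is State.OPEN:
            return  # no traffic is expected while open
        if ok:
            self.errors = 0
            self.state = State.CLOSED
            return
        self.errors += 1
        # Any failure while half-open, or reaching the threshold, opens the breaker.
        if self.state is State.HALF_OPEN or self.errors >= self.consecutive_errors:
            self.state = State.OPEN
            self.opened_at = now
```

A caller would check `allow(now)` before each request and report the outcome with `record(ok, now)`.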
Connection Limits
Protect backends from overload:
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: api-backend
spec:
  serviceRef:
    name: api-service
    port: 8080
  connectionLimits:
    maxConnections: 100
    maxPendingRequests: 50
    maxRequestsPerConnection: 1000
    maxRetries: 3
Connection Limit Options
| Field | Default | Description |
|---|---|---|
| `maxConnections` | 1000 | Maximum connections |
| `maxPendingRequests` | 100 | Maximum pending requests |
| `maxRequestsPerConnection` | 0 | Max requests per connection (0=unlimited) |
| `maxRetries` | 3 | Maximum retries |
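One plausible reading of how `maxConnections` and `maxPendingRequests` interact is sketched below. The exact admission semantics are an assumption, not something this page documents, and `ConnectionLimits` and `admit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ConnectionLimits:
    # Defaults taken from the options table.
    max_connections: int = 1000
    max_pending_requests: int = 100


def admit(limits: ConnectionLimits, active_connections: int,
          pending_requests: int) -> str:
    """Decide what happens to a new request under the assumed model.

    Returns "connect" if a connection slot is free, "queue" if the request
    can wait for one, and "reject" once both limits are reached.
    """
    if active_connections < limits.max_connections:
        return "connect"
    if pending_requests < limits.max_pending_requests:
        return "queue"
    return "reject"
```

Under this model, the example configuration (100 connections, 50 pending requests) starts rejecting requests once 100 connections are busy and 50 requests are already waiting.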
Retry Policy
Configure request retries:
apiVersion: novaedge.io/v1alpha1
kind: ProxyBackend
metadata:
  name: api-backend
spec:
  serviceRef:
    name: api-service
    port: 8080
  retryPolicy:
    retryOn:
      - 5xx
      - reset
      - connect-failure
      - retriable-4xx
    numRetries: 3
    perTryTimeout: 5s
    retryBackOff:
      baseInterval: 25ms
      maxInterval: 250ms
Retry Conditions
| Condition | Description |
|---|---|
| `5xx` | Retry on 5xx responses |
| `reset` | Retry on connection reset |
| `connect-failure` | Retry on connection failure |
| `retriable-4xx` | Retry on 409 |
| `refused-stream` | Retry on REFUSED_STREAM |
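The `retryBackOff` settings suggest a delay that grows from `baseInterval` up to `maxInterval`. The sketch below assumes the delay doubles on each retry and is then capped; the growth rule is an assumption, real proxies often add random jitter on top, and `backoff_ceiling` is a hypothetical helper.

```python
def backoff_ceiling(attempt: int, base_interval: float = 0.025,
                    max_interval: float = 0.250) -> float:
    """Upper bound in seconds on the delay before retry `attempt` (1-based).

    Assumed rule: double the base interval on every retry, capped at
    max_interval. Jitter is deliberately left out to keep the sketch
    deterministic.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base_interval * 2 ** (attempt - 1), max_interval)


for n in range(1, 6):
    print(n, backoff_ceiling(n))
```

With `baseInterval: 25ms` and `maxInterval: 250ms`, this rule reaches the cap on the fifth retry.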
Health Status
Check backend health status:
# View backend health
kubectl get proxybackend api-backend -o yaml
# Using novactl
novactl get backends
novactl describe backend api-backend
Example status:
status:
  endpoints:
    - address: 10.0.0.5:8080
      healthy: true
      lastCheck: "2024-01-15T10:30:00Z"
    - address: 10.0.0.6:8080
      healthy: false
      lastCheck: "2024-01-15T10:30:00Z"
      lastError: "connection refused"
  healthyCount: 1
  unhealthyCount: 1
  conditions:
    - type: Healthy
      status: "True"
      message: "At least one endpoint is healthy"
Health Check Flow
sequenceDiagram
participant HC as Health Checker
participant EP as Endpoint
participant LB as Load Balancer
loop Every interval
HC->>EP: HTTP GET /health
alt Healthy response
EP-->>HC: 200 OK
HC->>HC: success++
alt success >= healthyThreshold
HC->>LB: Mark healthy
end
else Unhealthy response
EP-->>HC: Error/Timeout
HC->>HC: failures++
alt failures >= unhealthyThreshold
HC->>LB: Mark unhealthy
end
end
end
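The flow above can be sketched as a pair of counters per endpoint. This is a minimal illustration, not NovaEdge's code: `EndpointHealth` is a hypothetical class, endpoints are assumed to start healthy, and each counter is reset by the opposite outcome because the thresholds count consecutive results.

```python
class EndpointHealth:
    """Hypothetical threshold tracker for one endpoint."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3, healthy=True):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy  # assumption: endpoints start out healthy
        self.successes = 0
        self.failures = 0

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if ok:
            self.successes += 1
            self.failures = 0  # a success breaks the failure streak
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0  # a failure breaks the success streak
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With the default thresholds, two failures followed by a success leave the endpoint healthy, while three failures in a row mark it unhealthy until two consecutive probes succeed.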
Metrics
| Metric | Description |
|---|---|
| `novaedge_health_check_total` | Total health checks |
| `novaedge_health_check_success_total` | Successful health checks |
| `novaedge_health_check_failure_total` | Failed health checks |
| `novaedge_endpoint_healthy` | Endpoint health status (1=healthy) |
| `novaedge_circuit_breaker_state` | Circuit breaker state |
| `novaedge_ejections_total` | Total ejections |
Troubleshooting
All Endpoints Unhealthy
# Check endpoint status
kubectl get proxybackend api-backend -o yaml
# Check health check path manually
kubectl exec -it <agent-pod> -- curl -v http://<endpoint>:8080/health
# Check agent logs
kubectl logs -n novaedge-system -l app.kubernetes.io/name=novaedge-agent | grep health
High Ejection Rate
# Check circuit breaker metrics
curl http://localhost:9090/metrics | grep circuit_breaker
# Review passive health settings
kubectl get proxybackend api-backend -o yaml | grep -A10 passiveHealthCheck
Best Practices
- Set appropriate thresholds - Lower thresholds detect failures sooner but raise the risk of false positives from transient errors
- Use dedicated health endpoints - Keep the check lightweight and avoid endpoints that do expensive work
- Match health check to traffic - Use HTTP checks for HTTP services, gRPC checks for gRPC services, and TCP checks for other TCP services
- Configure reasonable timeouts - Keep the health check timeout shorter than the interval
- Monitor health metrics - Alert on elevated failure rates
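Some of these practices can be checked mechanically before a configuration is applied. The sketch below validates a `healthCheck` block that has already been loaded into a Python dict. `parse_duration` and `check_health_config` are hypothetical helpers, and the duration format (a number followed by `ms`, `s`, `m`, or `h`) is an assumption based on the values used on this page.

```python
import re

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a duration such as "10s" or "25ms" to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def check_health_config(hc: dict) -> list:
    """Return a list of problems found in a healthCheck block (empty if none)."""
    problems = []
    # Missing fields fall back to the defaults from the options table.
    interval = parse_duration(hc.get("interval", "10s"))
    timeout = parse_duration(hc.get("timeout", "5s"))
    if timeout >= interval:
        problems.append("timeout should be shorter than interval")
    for key, default in (("healthyThreshold", 2), ("unhealthyThreshold", 3)):
        if hc.get(key, default) < 1:
            problems.append(f"{key} must be at least 1")
    return problems
```

For example, `check_health_config({"interval": "5s", "timeout": "5s"})` reports that the timeout should be shorter than the interval.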
Next Steps
- Load Balancing - Configure LB algorithms
- Observability - Monitor health checks
- Troubleshooting - Common issues